A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space

Jones, Alex; Wang, William Yang; Mahowald, Kyle

arXiv:2109.06324 (cs)

[Submitted on 13 Sep 2021]

Abstract: In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.

Comments:	15 pages, 8 figures, EMNLP 2021
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes:	I.2.7
Cite as:	arXiv:2109.06324 [cs.CL]
	(or arXiv:2109.06324v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.06324

Submission history

From: Alexander Jones
[v1] Mon, 13 Sep 2021 21:05:37 UTC (7,076 KB)
