LaBSE
| Developer(s) | Google Research |
|---|---|
| Initial release | 2020 |
| Repository | TensorFlow Hub |
| Operating system | Cross-platform |
| Type | Open-source machine learning / Natural language processing |
| License | Apache License 2.0 |
LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.[1]
It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing semantically comparable sentence vectors across more than one hundred languages.[2]
LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.[3][4]
Overview
LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.[5][6]
Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.[1]
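In practice the shared embedding space is queried with cosine similarity: translations land near each other, unrelated sentences do not. A minimal sketch with toy vectors (hypothetical 4-dimensional stand-ins for the model's 768-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for LaBSE's 768-dimensional embeddings.
en = np.array([0.90, 0.10, 0.00, 0.10])  # hypothetical embedding of "Hello"
de = np.array([0.85, 0.15, 0.05, 0.10])  # hypothetical embedding of "Hallo"
fr = np.array([0.00, 0.10, 0.90, 0.20])  # hypothetical unrelated sentence

print(cosine_similarity(en, de))  # high: near neighbors in the shared space
print(cosine_similarity(en, fr))  # low: semantically unrelated
```

Real embeddings can be obtained from the published checkpoints (e.g. via TensorFlow Hub or the `sentence-transformers/LaBSE` port) and compared the same way.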
Architecture
The system follows the structure of BERT-base (12 transformer layers, 12 attention heads, hidden size 768) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.[7][8]
Each sentence is tokenized with a joint multilingual WordPiece vocabulary covering 109 languages. The ℓ2-normalized final-layer representation of the [CLS] token serves as the fixed-size sentence embedding. Training uses a translation ranking loss that maximizes cosine similarity between parallel sentences and minimizes it for unrelated pairs.[9][10]
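The pooling step can be sketched as follows. The toy array stands in for the encoder's final hidden states, and both common pooling variants are shown, since downstream implementations of sentence encoders differ between [CLS] pooling and mean pooling:

```python
import numpy as np

# Toy stand-in for a transformer's final hidden states:
# 5 tokens ([CLS], three word pieces, [SEP]) x hidden size 8 (768 in BERT-base).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

# Two common pooling strategies for turning token states into one sentence vector:
cls_embedding = l2_normalize(hidden_states[0])        # [CLS] token representation
mean_embedding = l2_normalize(hidden_states.mean(0))  # mean over all tokens
```

After ℓ2 normalization, cosine similarity between two sentences reduces to a plain dot product of their embeddings.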
Training
LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.[11][12]
Optimization employed Adam with in-batch negatives and temperature-scaled cross-entropy. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.[1]
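The ranking objective with in-batch negatives can be sketched in NumPy. This is a simplified, unidirectional version: for each source sentence, its own translation (the diagonal of the similarity matrix) must outscore every other target in the batch; the published model additionally uses an additive margin and trains in both directions:

```python
import numpy as np

def translation_ranking_loss(src: np.ndarray, tgt: np.ndarray,
                             temperature: float = 0.05) -> float:
    """Temperature-scaled cross-entropy over in-batch negatives.
    src and tgt are l2-normalized embedding matrices of shape (batch, dim);
    row i of src is a translation of row i of tgt."""
    sims = src @ tgt.T / temperature          # (batch, batch) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))  # cross-entropy, diagonal targets

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Perfectly aligned pairs give a much lower loss than shuffled ones.
print(translation_ranking_loss(emb, emb))                # aligned: near zero
print(translation_ranking_loss(emb, emb[::-1].copy()))   # misaligned: high
```

Lower temperature sharpens the softmax, penalizing hard negatives more strongly; the specific value 0.05 here is illustrative, not taken from the paper.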
Applications
The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:
- Cross-lingual document and semantic search.
- Automatic evaluation of machine translation quality.
- Multilingual clustering, deduplication, and classification.
- Serving as a universal encoder for zero-shot learning tasks.
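With any encoder producing LaBSE-style normalized embeddings, the cross-lingual search use case reduces to nearest-neighbor ranking. A minimal sketch, with hypothetical low-dimensional vectors standing in for real model output:

```python
import numpy as np

def search(query_vec, corpus_vecs, corpus_texts, top_k=2):
    """Rank corpus sentences by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity of each corpus row
    order = np.argsort(-scores)[:top_k]  # indices of the best matches
    return [(corpus_texts[i], float(scores[i])) for i in order]

# Hypothetical 3-dimensional stand-ins for 768-dimensional LaBSE vectors.
corpus_texts = ["Hallo Welt", "Bonjour le monde", "Das Wetter ist schön"]
corpus_vecs = np.array([[0.90, 0.10, 0.00],
                        [0.88, 0.12, 0.02],
                        [0.10, 0.05, 0.95]])
query = np.array([0.92, 0.08, 0.01])  # hypothetical embedding of "Hello world"

results = search(query, corpus_vecs, corpus_texts)
```

Because the query and corpus can be in different languages, the same routine serves monolingual and cross-lingual retrieval alike; at scale, the brute-force ranking is typically replaced by an approximate nearest-neighbor index.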
Reception and impact
LaBSE has been cited extensively in academic literature on cross-lingual representation learning.[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.[14]
Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.[15][16][17]
See also
- Natural language processing
- Sentence embedding
- Cross-lingual information retrieval
- BERT
References
- ↑ 1.0 1.1 1.2 Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "tfhub.dev/google/LaBSE". TensorFlow Hub. https://tfhub.dev/google/LaBSE.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. https://research.google/blog/language-agnostic-bert-sentence-embedding/.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages". Transactions of the Association for Computational Linguistics. 2022. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00452/109468/Samanantar-The-Largest-Publicly-Available-Parallel.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Transactions of the Association for Computational Linguistics 8: 121–135. doi:10.1162/tacl_a_00343.
- ↑ "Notes on LaBSE". 2021-02-01. https://blog.ceshine.net/post/notes-labse/.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "Efficient and Effective Massively Multilingual Sentence Embedding (EMS)". arXiv:2205.15744 [cs.CL].
- ↑ "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. https://huggingface.co/blog/multilingual-sentence-embeddings.
External links
- LaBSE repository
- Language-Agnostic BERT Sentence Embedding – Google Research blog post by Yinfei Yang and Fangxiaoyu Feng.
- TensorFlow Hub – LaBSE
