LaBSE
| Developer(s) | Google Research |
|---|---|
| Initial release | 2020 |
| Repository | TensorFlow Hub |
| Operating system | Cross-platform |
| Type | Open-source machine learning / Natural language processing |
| License | Apache License 2.0 |
LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.[1]
It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing semantically comparable sentence vectors across more than one hundred languages.[2]
LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.[3][4]
Overview
LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.[5][6]
Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.[1]
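In practice the shared embedding space is queried with cosine similarity: translations land near each other, unrelated sentences do not. A minimal sketch with toy vectors (hypothetical 4-dimensional stand-ins for the model's 768-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for LaBSE's 768-dimensional embeddings.
en = np.array([0.90, 0.10, 0.00, 0.10])  # hypothetical embedding of "Hello"
de = np.array([0.85, 0.15, 0.05, 0.10])  # hypothetical embedding of "Hallo"
fr = np.array([0.00, 0.10, 0.90, 0.20])  # hypothetical unrelated sentence

print(cosine_similarity(en, de))  # high: near neighbors in the shared space
print(cosine_similarity(en, fr))  # low: semantically unrelated
```

Real embeddings can be obtained from the published checkpoints (e.g. via TensorFlow Hub or the `sentence-transformers/LaBSE` port) and compared the same way.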
Architecture
The system follows the structure of BERT-base (12 transformer layers, 12 attention heads, hidden size 768) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.[7][8]
Each sentence is tokenized with a joint multilingual WordPiece vocabulary covering 109 languages. The ℓ2-normalized final-layer representation of the [CLS] token serves as the fixed-size sentence embedding. Training uses a translation ranking loss that maximizes cosine similarity between parallel sentences and minimizes it for unrelated pairs.[9][10]
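The pooling step can be sketched as follows. The toy array stands in for the encoder's final hidden states, and both common pooling variants are shown, since downstream implementations of sentence encoders differ between [CLS] pooling and mean pooling:

```python
import numpy as np

# Toy stand-in for a transformer's final hidden states:
# 5 tokens ([CLS], three word pieces, [SEP]) x hidden size 8 (768 in BERT-base).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

# Two common pooling strategies for turning token states into one sentence vector:
cls_embedding = l2_normalize(hidden_states[0])        # [CLS] token representation
mean_embedding = l2_normalize(hidden_states.mean(0))  # mean over all tokens
```

After ℓ2 normalization, cosine similarity between two sentences reduces to a plain dot product of their embeddings.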
Training
LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.[11][12]
Optimization employed Adam with in-batch negatives and temperature-scaled cross-entropy. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.[1]
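The ranking objective with in-batch negatives can be sketched in NumPy. This is a simplified, unidirectional version: for each source sentence, its own translation (the diagonal of the similarity matrix) must outscore every other target in the batch; the published model additionally uses an additive margin and trains in both directions:

```python
import numpy as np

def translation_ranking_loss(src: np.ndarray, tgt: np.ndarray,
                             temperature: float = 0.05) -> float:
    """Temperature-scaled cross-entropy over in-batch negatives.
    src and tgt are l2-normalized embedding matrices of shape (batch, dim);
    row i of src is a translation of row i of tgt."""
    sims = src @ tgt.T / temperature          # (batch, batch) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))  # cross-entropy, diagonal targets

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Perfectly aligned pairs give a much lower loss than shuffled ones.
print(translation_ranking_loss(emb, emb))                # aligned: near zero
print(translation_ranking_loss(emb, emb[::-1].copy()))   # misaligned: high
```

Lower temperature sharpens the softmax, penalizing hard negatives more strongly; the specific value 0.05 here is illustrative, not taken from the paper.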
Applications
The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:
- Cross-lingual document and semantic search.
- Automatic evaluation of machine translation quality.
- Multilingual clustering, deduplication, and classification.
- Serving as a universal encoder for zero-shot learning tasks.
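With any encoder producing LaBSE-style normalized embeddings, the cross-lingual search use case reduces to nearest-neighbor ranking. A minimal sketch, with hypothetical low-dimensional vectors standing in for real model output:

```python
import numpy as np

def search(query_vec, corpus_vecs, corpus_texts, top_k=2):
    """Rank corpus sentences by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity of each corpus row
    order = np.argsort(-scores)[:top_k]  # indices of the best matches
    return [(corpus_texts[i], float(scores[i])) for i in order]

# Hypothetical 3-dimensional stand-ins for 768-dimensional LaBSE vectors.
corpus_texts = ["Hallo Welt", "Bonjour le monde", "Das Wetter ist schön"]
corpus_vecs = np.array([[0.90, 0.10, 0.00],
                        [0.88, 0.12, 0.02],
                        [0.10, 0.05, 0.95]])
query = np.array([0.92, 0.08, 0.01])  # hypothetical embedding of "Hello world"

results = search(query, corpus_vecs, corpus_texts)
```

Because the query and corpus can be in different languages, the same routine serves monolingual and cross-lingual retrieval alike; at scale, the brute-force ranking is typically replaced by an approximate nearest-neighbor index.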
Reception and impact
LaBSE has been cited extensively in academic literature on cross-lingual representation learning.[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.[14]
Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.[15][16][17]
See also
- Natural language processing
- Sentence embedding
- Cross-lingual information retrieval
- BERT
References
- ↑ 1.0 1.1 1.2 Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "tfhub.dev/google/LaBSE". TensorFlow Hub. https://tfhub.dev/google/LaBSE.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. https://research.google/blog/language-agnostic-bert-sentence-embedding/.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- ↑ "Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages". Transactions of the Association for Computational Linguistics. 2022. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00452/109468/Samanantar-The-Largest-Publicly-Available-Parallel.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Transactions of the Association for Computational Linguistics 8: 121–135. doi:10.1162/tacl_a_00343.
- ↑ "Notes on LaBSE". 2021-02-01. https://blog.ceshine.net/post/notes-labse/.
- ↑ Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- ↑ Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "Efficient and Effective Massively Multilingual Sentence Embedding (EMS)". arXiv:2205.15744 [cs.CL].
- ↑ "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. https://huggingface.co/blog/multilingual-sentence-embeddings.
External links
- LaBSE repository
- Language-Agnostic BERT Sentence Embedding – Google Research blog post by Yinfei Yang and Fangxiaoyu Feng.
- TensorFlow Hub – LaBSE
