Learned sparse retrieval

Learned sparse retrieval, also called sparse neural search, is an approach to text search that represents queries and documents as sparse vectors.[1] It borrows techniques from both lexical bag-of-words methods and vector embedding algorithms, and is claimed to perform better than either alone. The best-known sparse neural search systems are SPLADE[2] and its successor SPLADE v2.[3] Others include DeepCT,[4] uniCOIL,[5] EPIC,[6] DeepImpact,[7] TILDE and TILDEv2,[8] SPARTA,[9] SPLADE-max, and DistilSPLADE-max.[3]
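
As an illustration of how sparse query and document vectors can be matched, the following minimal sketch scores documents with a dot product over an inverted index. The term weights, document identifiers, and the function names build_inverted_index and score are assumptions made for this example; in a real system the weights would come from a trained model, and none of this is taken from a specific implementation.

```python
from collections import defaultdict

# Hypothetical sparse document vectors: term -> learned weight.
# In practice a trained model would produce these weights.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.9, "search": 0.4},
    "d2": {"dense": 1.1, "embedding": 0.8, "search": 0.3},
}


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def score(query_vector, index):
    """Rank documents by the dot product of query and document weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        # Only terms present in both the query and a document contribute.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_inverted_index(docs)
query = {"sparse": 1.0, "search": 0.5}  # hypothetical query weights
ranking = score(query, index)
print(ranking)
```

Because most entries of each vector are zero, only the posting lists of the query's terms are visited, which is the same access pattern a lexical inverted index uses.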

Some implementations of SPLADE have latency similar to that of Okapi BM25 lexical search while matching the result quality of state-of-the-art neural rankers on in-domain data.[10]

The SPLADE software is released under a Creative Commons NonCommercial license.[11]

SPRINT is a toolkit for evaluating neural sparse retrieval systems.[12]

Notes

  1. Nguyen, Thong; MacAvaney, Sean; Yates, Andrew (2023). "A Unified Framework for Learned Sparse Retrieval". In Kamps, Jaap; Goeuriot, Lorraine; Crestani, Fabio et al. (eds.). Advances in Information Retrieval. Lecture Notes in Computer Science. 13982. Cham: Springer Nature Switzerland. pp. 101–116. doi:10.1007/978-3-031-28241-6_7. ISBN 978-3-031-28241-6. https://link.springer.com/chapter/10.1007/978-3-031-28241-6_7.
  2. Formal, Thibault; Piwowarski, Benjamin; Clinchant, Stéphane (2021-07-11). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery. pp. 2288–2292. doi:10.1145/3404835.3463098. ISBN 978-1-4503-8037-9. https://doi.org/10.1145/3404835.3463098. 
  3. Formal, Thibault; Piwowarski, Benjamin; Lassance, Carlos; Clinchant, Stéphane (21 September 2021). "SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval". arXiv:2109.10086v1 [cs.IR].
  4. Dai, Zhuyun; Callan, Jamie (2020-04-20). "Context-Aware Document Term Weighting for Ad-Hoc Search". Proceedings of the Web Conference 2020. New York, NY, USA: ACM. pp. 1897–1907. doi:10.1145/3366423.3380258. ISBN 978-1-4503-7023-3. https://doi.org/10.1145/3366423.3380258.
  5. Lin, Jimmy; Ma, Xueguang (28 June 2021). "A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques". arXiv:2106.14807 [cs.IR].
  6. MacAvaney, Sean; Nardini, Franco Maria; Perego, Raffaele; Tonellotto, Nicola; Goharian, Nazli; Frieder, Ophir (2020-07-25). "Expansion via Prediction of Importance with Contextualization". Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20. New York, NY, USA: Association for Computing Machinery. pp. 1573–1576. doi:10.1145/3397271.3401262. ISBN 978-1-4503-8016-4. https://doi.org/10.1145/3397271.3401262. 
  7. Mallia, Antonio; Khattab, Omar; Suel, Torsten; Tonellotto, Nicola (2021-07-11). "Learning Passage Impacts for Inverted Indexes". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '21. New York, NY, USA: Association for Computing Machinery. pp. 1723–1727. doi:10.1145/3404835.3463030. ISBN 978-1-4503-8037-9. https://dl.acm.org/doi/10.1145/3404835.3463030. 
  8. Zhuang, Shengyao; Zuccon, Guido (13 September 2021). "Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion". arXiv:2108.08513 [cs.IR].
  9. Zhao, Tiancheng; Lu, Xiaopeng; Lee, Kyusong (28 September 2020). "SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval". arXiv:2009.13013 [cs.CL].
  10. Lassance, Carlos; Clinchant, Stéphane (2022-07-07). "An Efficiency Study for SPLADE Models". Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '22. New York, NY, USA: Association for Computing Machinery. pp. 2220–2226. doi:10.1145/3477495.3531833. ISBN 978-1-4503-8732-3. https://doi.org/10.1145/3477495.3531833. 
  11. "splade/LICENSE at main · naver/splade". https://github.com/naver/splade/blob/main/LICENSE.
  12. Thakur, Nandan; Wang, Kexin; Gurevych, Iryna; Lin, Jimmy (2023-07-18). "SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval". Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '23. New York, NY, USA: Association for Computing Machinery. pp. 2964–2974. doi:10.1145/3539618.3591902. ISBN 978-1-4503-9408-6. https://doi.org/10.1145/3539618.3591902.