Topic model

From HandWiki
{{Short description|Text-based topic extraction method}}
In [[Natural language processing|natural language processing]], a '''topic model''' is a type of [[Statistical model|probabilistic]], neural, or [[Matrix decomposition|algebraic model]] for discovering the abstract topics that occur in a collection of documents. Topic modeling is a frequently used [[Text mining|text mining]] tool for discovering hidden semantic features and structures in a text. The topics produced by topic models are generated through a variety of mathematical frameworks, including probabilistic generative models, matrix factorization methods based on word co-occurrence, and [[Cluster analysis|clustering algorithms]] applied to [[Word embedding|semantic embeddings]].<ref>{{Cite journal |last=Egger |first=Roman |last2=Yu |first2=Joanne |date=2022 |title=A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts |journal=Frontiers in Sociology |volume=7 |article-number=886498 |doi=10.3389/fsoc.2022.886498  |doi-access=free|issn=2297-7775 |pmc=9120935 |pmid=35602001}}</ref><ref>{{Cite journal |last=Muthusami |first=R. |last2=Mani Kandan |first2=N. |last3=Saritha |first3=K. |last4=Narenthiran |first4=B. |last5=Nagaprasad |first5=N. |last6=Ramaswamy |first6=Krishnaraj |date=2024-05-25 |title=Investigating topic modeling techniques through evaluation of topics discovered in short texts data across diverse domains |url=https://www.nature.com/articles/s41598-024-61738-4 |journal=Scientific Reports |language=en |volume=14 |issue=1 |page=12003 |doi=10.1038/s41598-024-61738-4 |issn=2045-2322 |pmc=11127968 |pmid=38796483}}</ref><ref>{{Cite journal |last=Churchill |first=Rob |last2=Singh |first2=Lisa |date=2022-11-10 |title=The Evolution of Topic Modeling |journal=ACM Comput. Surv. |volume=54 |issue=10s |pages=215:1–215:35 |doi=10.1145/3507900 |issn=0360-0300}}</ref>
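
These frameworks share a common starting point: a representation of which words occur in which documents. As a minimal illustration (the toy corpus and function name here are hypothetical), the document-term count matrix that probabilistic, factorization, and clustering approaches all build upon can be constructed as follows:

```python
from collections import Counter

def doc_term_matrix(docs):
    """Build a document-term count matrix from tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts.get(w, 0) for w in vocab])
    return vocab, rows

# toy corpus: one "dog" document and one "cat" document
docs = ["the dog chased the bone".split(),
        "the cat heard the meow".split()]
vocab, X = doc_term_matrix(docs)
# each row of X counts how often each vocabulary word occurs in one document
```

A probabilistic model such as LDA treats each row as draws from a mixture of topic-specific word distributions, while factorization methods decompose the matrix itself.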


Topic models are commonly used to organize and discover latent features in large collections of unstructured text and other forms of [[Big data|big data]].<ref>{{Cite journal |last=Murshed |first=Belal Abdullah Hezam |last2=Mallappa |first2=Suresha |last3=Abawajy |first3=Jemal |last4=Saif |first4=Mufeed Ahmed Naji |last5=Al-ariki |first5=Hasib Daowd Esmail |last6=Abdulwahab |first6=Hudhaifa Mohammed |date=2023-06-01 |title=Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis |journal=Artificial Intelligence Review |language=en |volume=56 |issue=6 |pages=5133–5260 |doi=10.1007/s10462-022-10254-w |issn=1573-7462 |pmc=9607740 |pmid=36320612}}</ref><ref>{{cite conference |last=Song |first=W. |last2=Zou |first2=L. |year=2016 |title=LDA-TM: A Two-step Approach to Twitter Topic Data Clustering |publisher=IEEE |pages=342–347 |doi=10.1109/ICCCBDA.2016.7529581 |book-title=Proceedings of the IEEE International Conference on Cloud Computing and Big Data Analysis}}</ref> Beyond text mining, topic models have also been used to uncover latent structures in fields such as [[Biology:Genetics|genetic information]], [[Biology:Bioinformatics|bioinformatics]], [[Computer vision|computer vision]], and [[Philosophy:Social network analysis|social networks]].<ref>{{cite journal|last1=Blei|first1=David|title=Probabilistic Topic Models|journal=Communications of the ACM|date=April 2012|volume=55|issue=4|pages=77–84|doi=10.1145/2133806.2133826|s2cid=753304}}</ref><ref>Cao, Liangliang, and Li Fei-Fei. "[http://www.ifp.illinois.edu/~cao4/papers/CaoFei-Fei_ICCV2007_final.pdf Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes]." 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.</ref>


==History==
  |title          = Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '98
  |chapter        = Latent semantic indexing
  |pages          = 159–168
  |year          = 1998
  |chapter-format = Postscript
  |doi            = 10.1145/275487.275505
  |isbn          = 978-0-89791-996-8
  |s2cid          = 1479546
  |access-date    = 2012-04-17
  |archive-date  = 2013-05-09
  |archive-url    = https://web.archive.org/web/20130509130907/http://www.cs.berkeley.edu/%7Echristos/ir.ps
}}</ref> Another one, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999.<ref name="hofmann1999">{{cite journal
  |last1      = Hofmann
  |first1      = Thomas
  |journal    = Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval
  |year        = 1999
  |url        = https://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf
  |archive-url  = https://web.archive.org/web/20101214074049/http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf
  |archive-date = 2010-12-14
}}</ref> Other topic models are generally extensions of LDA, such as [[Pachinko allocation]], which improves on LDA by modeling correlations between topics in addition to the word correlations that constitute topics. Hierarchical latent tree analysis ([https://web.archive.org/web/20190901175618/http://www.cse.ust.hk/~lzhang/paper/pspdf/liu-n-ecml14.pdf HLTA]) is an alternative to LDA that models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.
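
For intuition, inference in LDA-style models can be sketched with a collapsed Gibbs sampler, which resamples the topic of each word token conditioned on all other assignments. This is a minimal sketch only (the toy corpus and hyperparameters are hypothetical; production systems use optimized samplers or variational inference):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for an LDA-style topic model.
    Returns per-topic word counts, from which topic-word
    distributions can be read off."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})              # vocabulary size
    z = [[rng.randrange(k) for _ in d] for d in docs]  # topic of each token
    ndk = [[0] * k for _ in docs]                      # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]         # topic-word counts
    nk = [0] * k                                       # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                            # remove current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # sample a new topic from the collapsed conditional distribution
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

docs = [["dog", "bone", "dog"], ["cat", "meow", "cat"]] * 3
topic_words = lda_gibbs(docs, k=2)
```

After enough sweeps, words that co-occur in documents tend to concentrate in the same topic's count table.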


[[File:Topic model scheme.webm|thumb|600px|thumbtime=24|start=1|end=24|Animation of the topic detection process in a document-word matrix through [[biclustering]]. Every column corresponds to a document, every row to a word. A cell stores the frequency of a word in a document, with dark cells indicating high word frequencies. This procedure simultaneously groups documents that use similar words and words that occur in a similar set of documents; such groups of words are then called topics. More common topic models, such as LDA, group only documents, based on a more sophisticated probabilistic mechanism.]]


==Topic models for context information==


Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the ''Pennsylvania Gazette'' during 1728–1800. [[Biography:Tom Griffiths (cognitive scientist)|Griffiths]] & Steyvers used topic modeling on abstracts from the journal ''PNAS'' to identify topics that rose or fell in popularity from 1991 to 2001, whereas Lamba & Madhusudhan<ref name="Lamba 2019 477–505">{{Cite journal|last=Lamba|first=Manika|year=2019|title=Mapping of topics in DESIDOC Journal of Library and Information Technology, India: a study|journal=Scientometrics|volume=120|issue=2|pages=477–505|doi=10.1007/s11192-019-03137-5|s2cid=174802673|issn=0138-9130}}</ref> used topic modeling on full-text research articles retrieved from the DJLIT journal from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan<ref name="Lamba 2019 477–505"/><ref>{{Cite journal|last=Lamba|first=Manika|year=2019|title=Metadata Tagging and Prediction Modeling: Case Study of DESIDOC Journal of Library and Information Technology (2008-2017)|url=https://content.iospress.com/articles/world-digital-libraries-an-international-journal/wdl12103|journal=World Digital Libraries|volume=12|pages=33–89|doi=10.18329/09757597/2019/12103|issn=0975-7597|doi-broken-date=12 July 2025}}</ref><ref>{{Cite journal|last=Lamba|first=Manika|year=2019|title=Author-Topic Modeling of DESIDOC Journal of Library and Information Technology (2008-2017), India|url=https://www.proquest.com/openview/4416f54af3fe77e1c49c811af86990eb/1?pq-origsite=gscholar&cbl=54903|journal=Library Philosophy and Practice}}</ref><ref>{{Cite conference|last=Lamba|first=Manika|year=2018|title=Metadata Tagging of Library and Information Science Theses: Shodhganga (2013-2017)|url=https://etd2018.ncl.edu.tw/images/phocadownload/3-2_Manika_Lamba_Extended_Abstract_ETD_2018.pdf|conference=ETD2018: Beyond the boundaries of Rims and Oceans|place=Taipei, Taiwan}}</ref> applied topic modeling to different Indian resources such as journal articles and electronic theses and dissertations (ETDs). Nelson<ref>{{cite web |last1=Nelson |first1=Rob |title=Mining the Dispatch |url=https://dsl.richmond.edu/dispatch/ |website=Mining the Dispatch |publisher=Digital Scholarship Lab, University of Richmond |access-date=26 March 2021}}</ref> has analyzed change in topics over time in the ''Richmond Times-Dispatch'' to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829 to 2008. Mimno used topic modeling with 24 journals on classical philology and archaeology spanning 150 years to examine how topics in the journals change over time and how the journals become more similar or different over time.


Yin et al.<ref>{{Cite book|last=Yin|first=Zhijun|title=Proceedings of the 20th international conference on World wide web |chapter=Geographical topic discovery and comparison |year=2011|pages=247–256|doi=10.1145/1963405.1963443|isbn=978-1-4503-0632-4|s2cid=17883132}}</ref> introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.


Chang and Blei<ref>{{Cite journal|last=Chang|first=Jonathan|year=2009|title=Relational Topic Models for Document Networks|url=http://www.jmlr.org/proceedings/papers/v5/chang09a/chang09a.pdf|journal=Aistats|volume=9|pages=81–88}}</ref> included network information between linked documents in the relational topic model, to model the links between websites.
HLTA was applied to a collection of recent research papers published at major AI and Machine Learning venues. The resulting model is called [http://home.cse.ust.hk/~lzhang/topic/ai-tree.pdf The AI Tree]. The resulting topics are used to index the papers at [http://aipano.cse.ust.hk aipano.cse.ust.hk] to help researchers [http://home.cse.ust.hk/~lzhang/topic/aipanoIntro.pdf track research trends and identify papers to read], and help conference organizers and journal editors [https://slidetalk.net/Home/Viewer?Video=2626079 identify reviewers for submissions].


To improve the qualitative aspects and coherency of generated topics, some researchers have explored the efficacy of "coherence scores", or otherwise how computer-extracted clusters (i.e. topics) align with a human benchmark.<ref>{{Cite journal|last=Nikolenko|first=Sergey|year=2017|title=Topic modelling for qualitative studies|journal=Journal of Information Science|volume=43 |pages=88–102|doi=10.1177/0165551515617393 |s2cid=30657489 }}</ref><ref>{{Cite thesis|type=Honours thesis|last=Reverter-Rambaldi|first=Marcel|date=2022|title=Topic Modelling in Spontaneous Speech Data|publisher=Australian National University|doi=10.25911/M1YF-ZF55}}</ref> Coherence scores are metrics for optimising the number of topics to extract from a document corpus.<ref>{{Cite journal|last=Newman|first=David|year=2010|title=Automatic evaluation of topic coherence|journal=Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics|pages=100–108}}</ref>
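
As an illustration of one such metric (the toy corpus and function names below are hypothetical), the widely used UMass variant of a coherence score sums log co-occurrence ratios over ordered pairs of a topic's top words; higher (less negative) values indicate words that tend to appear in the same documents:

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic, given its ranked top words and a
    tokenized reference corpus. Assumes each top word occurs in the corpus."""
    doc_sets = [set(d) for d in docs]
    def doc_freq(*words):
        # number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

docs = [["dog", "bone"], ["dog", "bone"], ["dog", "cat"]]
# word pairs that co-occur score higher than pairs that rarely do
coherent = umass_coherence(["dog", "bone"], docs)
incoherent = umass_coherence(["bone", "cat"], docs)
```

Sweeping the number of topics and keeping the count that maximises average coherence is one common way such scores are used.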


==Algorithms==
  | s2cid = 753304
  | url = https://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext
| url-access = subscription}}</ref>
Several groups of researchers starting with Papadimitriou et al.<ref name=PRTV1998 /> have attempted to design algorithms with provable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include [[Singular value decomposition|singular value decomposition]] (SVD) and the [[Method of moments (statistics)|method of moments]]. In 2012, an algorithm based upon [[Non-negative matrix factorization|non-negative matrix factorization]] (NMF) was introduced that also generalizes to topic models with correlations among topics.<ref>{{Cite arXiv
| author1 = Sanjeev Arora |author2=Rong Ge |author3=Ankur Moitra
</ref>
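
The provable 2012 algorithm relies on structural assumptions (anchor words) and is not reproduced here; for contrast, the following is a minimal sketch of the classical multiplicative-update heuristic for NMF, factoring a toy document-term matrix V into document-topic weights W and topic-word weights H (all names and data are hypothetical):

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Non-negative factorization V ≈ W·H by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H); keeps entries non-negative
        Wt = list(map(list, zip(*W)))
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = list(map(list, zip(*H)))
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# toy 4-document x 4-term count matrix with two obvious "topics"
V = [[2, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 1, 2]]
W, H = nmf(V, k=2)
```

Each row of H can then be read as a topic's word weights, and each row of W as a document's topic mixture.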


Since 2017, neural networks have been leveraged in topic modeling to speed up inference,<ref>{{Cite journal |last1=Miao |first1=Yishu |last2=Grefenstette |first2=Edward |last3=Blunsom |first3=Phil |date=2017 |title=Discovering Discrete Latent Topics with Neural Variational Inference |url=https://proceedings.mlr.press/v70/miao17a.html |journal=Proceedings of the 34th International Conference on Machine Learning |language=en |publisher=PMLR |pages=2410–2419|arxiv=1706.00359 }}</ref> leading to advances such as vONTSS, which allows humans to incorporate domain knowledge via weakly supervised learning.<ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Jiang |first2=Xiaoyu |last3=Sengamedu Hanumantha Rao |first3=Srinivasan |last4=Iannacci |first4=Francis |last5=Zhao |first5=Jinjin |date=2023 |title=vONTSS: vMF based semi-supervised neural topic modeling with optimal transport |journal=Findings of the Association for Computational Linguistics: ACL 2023 |pages=4433–4457 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-acl.271|arxiv=2307.01226 }}</ref>


In 2018, a new approach to topic models was proposed based on the [[Stochastic block model|stochastic block model]].<ref name="gerlach2018">{{cite journal
| author1 = Martin Gerlach |author2=Tiago Pexioto |author3=Eduardo Altmann
| title = A network approach to topic models  
| article-number=eaaq1360
| journal=Science Advances
|  date = 2018 |doi = 10.1126/sciadv.aaq1360 | volume=4 | issue=7
|pmid=30035215 |pmc=6051742 |arxiv=1708.01677 |bibcode=2018SciA....4.1360G }}</ref>


Topic modeling has also leveraged LLMs, through contextual embeddings<ref>{{Cite book |last1=Bianchi |first1=Federico |last2=Terragni |first2=Silvia |last3=Hovy |first3=Dirk |chapter=Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence |date=2021 |pages=759–766 |title=Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) |chapter-url=http://dx.doi.org/10.18653/v1/2021.acl-short.96 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2021.acl-short.96}}</ref> and fine-tuning.<ref>{{Cite journal |last1=Xu |first1=Weijie |last2=Hu |first2=Wenxiang |last3=Wu |first3=Fanyou |last4=Sengamedu |first4=Srinivasan |date=2023 |title=DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM |journal=Findings of the Association for Computational Linguistics: EMNLP 2023 |pages=9040–9057 |location=Stroudsburg, PA, USA |publisher=Association for Computational Linguistics |doi=10.18653/v1/2023.findings-emnlp.606|arxiv=2310.15296 }}</ref>
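
A rough sketch of the embedding-based route: documents are embedded, the vectors are clustered, and each cluster is read as a topic. The vectors below are hypothetical stand-ins for the contextual embeddings an LLM encoder would produce:

```python
def kmeans(vectors, k, iters=50):
    """Plain k-means with farthest-first initialization; in embedding-based
    topic modeling, each resulting cluster of documents is read as a topic."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    centers = [list(vectors[0])]
    while len(centers) < k:  # seed the remaining centers far apart
        centers.append(list(max(vectors,
                                key=lambda v: min(d2(v, c) for c in centers))))
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):          # nearest-center assignment
            assign[i] = min(range(k), key=lambda c: d2(v, centers[c]))
        for c in range(k):                       # recenter on cluster means
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

# hypothetical 2-d embeddings of six documents forming two semantic groups
embs = [[0.1, 0.9], [0.2, 1.0], [0.15, 0.95],
        [0.9, 0.1], [1.0, 0.2], [0.95, 0.15]]
labels = kmeans(embs, k=2)
```

Systems in this family, such as BERTopic and Top2Vec, follow this pattern with learned embeddings, dimensionality reduction, and density-based clustering in place of plain k-means.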


== Applications of topic models ==


=== To analysis of music and creativity ===
=== To analysis of music and creativity ===
Topic models can be used to analyze continuous signals such as music. For instance, they have been used to quantify how musical styles change over time and to identify the influence of specific artists on later music creation.<ref>{{Cite journal |last1=Shalit |first1=Uri |last2=Weinshall |first2=Daphna |last3=Chechik |first3=Gal |date=2013-05-13 |title=Modeling Musical Influence with Topic Models |url=https://proceedings.mlr.press/v28/shalit13.html |journal=Proceedings of the 30th International Conference on Machine Learning |language=en |publisher=PMLR |pages=244–252}}</ref>


== See also ==
* [[Dynamic topic model]]
* [[Unsupervised learning]]
* [[Explicit semantic analysis]]
* [[Latent semantic analysis]]
* [[Non-negative matrix factorization]]
* [[Statistical classification]]
* [[Sentence embedding]]




==Further reading==
*{{cite book |first1=Mark |last1=Steyvers |first2=Tom |last2=Griffiths |chapter=Probabilistic Topic Models |chapter-url=http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf |format=PDF |editor1-first=T. |editor1-last=Landauer |editor2-first=D |editor2-last=McNamara |editor3-first=S. |editor3-last=Dennis |editor4-first=W. |display-editors=3 |editor4-last=Kintsch |title=Handbook of Latent Semantic Analysis |publisher=Psychology Press |year=2007 |isbn=978-0-8058-5418-3 |url=http://www.psypress.com/books/details/9780805854183/ |archive-url=https://web.archive.org/web/20130624013706/http://www.psypress.com/books/details/9780805854183/ |archive-date=2013-06-24 }}
*{{cite web|last1=Blei|first1=D.M.|last2=Lafferty|first2=J.D.|title=Topic Models|year=2009|url=https://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf}}
*{{cite web|last1=Blei|first1=D.M.|last2=Lafferty|first2=J.D.|title=Topic Models|year=2009|url=https://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf}}
*{{cite journal |last1=Blei |first1=D. |last2=Lafferty |first2=J. |title=A correlated topic model of ''Science'' |journal=Annals of Applied Statistics |volume=1 |issue=1 |pages=17–35 |year=2007 |doi=10.1214/07-AOAS114 |arxiv=0708.3601 |s2cid=8872108 }}
*{{cite journal |last1=Blei |first1=D. |last2=Lafferty |first2=J. |title=A correlated topic model of ''Science'' |journal=Annals of Applied Statistics |volume=1 |issue=1 |pages=17–35 |year=2007 |doi=10.1214/07-AOAS114 |arxiv=0708.3601 |s2cid=8872108 }}
*{{cite journal |last1=Mimno |first1=D. |title=Computational Historiography: Data Mining in a Century of Classics Journals |journal=Journal on Computing and Cultural Heritage |volume=5 |issue=1 |pages=1–19 |date=April 2012 |doi=10.1145/2160165.2160168 |s2cid=12153151 |url=https://www.perseus.tufts.edu/~amahoney/02-jocch-mimno.pdf }}
*{{cite journal |last1=Mimno |first1=D. |title=Computational Historiography: Data Mining in a Century of Classics Journals |journal=Journal on Computing and Cultural Heritage |volume=5 |issue=1 |pages=1–19 |date=April 2012 |doi=10.1145/2160165.2160168 |s2cid=12153151 |url=https://www.perseus.tufts.edu/~amahoney/02-jocch-mimno.pdf }}
*{{cite book | last1=Marwick | first1=Ben | year=2013| chapter=Discovery of Emergent Issues and Controversies in Anthropology Using Text Mining, Topic Modeling, and Social Network Analysis of Microblog Content | editor1-last=Yanchang | editor1-first=Zhao |editor2-last=Yonghua |editor2-first=Cen |title=Data Mining Applications with R |publisher=Elsevier |pages=63–93 |chapter-url=https://www.academia.edu/5508141}}
*{{cite book | last1=Marwick | first1=Ben | year=2013| chapter=Discovery of Emergent Issues and Controversies in Anthropology Using Text Mining, Topic Modeling, and Social Network Analysis of Microblog Content | editor1-last=Yanchang | editor1-first=Zhao |editor2-last=Yonghua |editor2-first=Cen |title=Data Mining Applications with R |publisher=Elsevier |pages=63–93 | doi=10.1016/B978-0-12-411511-8.00003-7 | isbn=978-0-12-411511-8 |chapter-url=https://www.academia.edu/5508141}}
*Jockers, M. 2010 [http://www.matthewjockers.net/2010/03/19/whos-your-dh-blog-mate-match-making-the-day-of-dh-bloggers-with-topic-modeling/ Who's your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling] Matthew L. Jockers, posted 19 March 2010
*Jockers, M. 2010 [http://www.matthewjockers.net/2010/03/19/whos-your-dh-blog-mate-match-making-the-day-of-dh-bloggers-with-topic-modeling/ Who's your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling] Matthew L. Jockers, posted 19 March 2010
*Drouin, J. 2011 [http://orgs.utulsa.edu/proust/?q=node/35 Foray Into Topic Modeling] Ecclesiastical Proust Archive. posted 17 March 2011
*Drouin, J. 2011 [http://orgs.utulsa.edu/proust/?q=node/35 Foray Into Topic Modeling] Ecclesiastical Proust Archive. posted 17 March 2011


==External links==
*{{cite web |first=David |last=Mimno |title=Topic modeling bibliography |url=https://mimno.infosci.cornell.edu/topics.html}}
*{{cite web |first=Megan R. |last=Brett |title=Topic Modeling: A Basic Introduction |publisher=Journal of Digital Humanities |url=http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/}}
*[https://www.youtube.com/watch?v=1wcX4fEdNUo Topic Models Applied to Online News and Reviews] Video of a Google Tech Talk presentation by Alice Oh on topic modeling with [[Latent Dirichlet allocation|LDA]]
*[https://www.youtube.com/watch?v=8nBE5Qm8y6I Modeling Science: Dynamic Topic Models of Scholarly Research] Video of a Google Tech Talk presentation by David M. Blei
*[http://vimeo.com/13597441 Automated Topic Models in Political Science] Video of a presentation by Brandon Stewart at the [http://toolsfortext.wordpress.com/ Tools for Text Workshop], 14 June 2010
*Shawn Graham, Ian Milligan, and Scott Weingart {{cite web |title=Getting Started with Topic Modeling and MALLET |publisher=The Programming Historian |url=http://programminghistorian.org/lessons/topic-modeling-and-mallet/ |access-date=2014-05-29 |archive-url=https://web.archive.org/web/20140828231754/http://programminghistorian.org/lessons/topic-modeling-and-mallet |archive-date=2014-08-28 }}
*Blei, David M. [https://web.archive.org/web/20121002061418/http://www.cs.princeton.edu/~blei/topicmodeling.html "Introductory material and software"]
* [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/skbayes/decomposition_models/gibbs_lda_cython.pyx code], [https://github.com/AmazaspShumik/sklearn-bayes/blob/master/ipython_notebooks_tutorials/decomposition_models/example_lda.ipynb demo] - example of using LDA for topic modelling

Latest revision as of 23:31, 13 April 2026

Short description: Text-based topic extraction method

In natural language processing, a topic model is a type of probabilistic, neural, or algebraic model for discovering the abstract topics that occur in a collection of documents. Topic modeling is a frequently used text mining tool for discovering hidden semantic features and structures in a text. The topics produced by topic models are generated through a variety of mathematical frameworks, including probabilistic generative models, matrix factorization methods based on word co-occurrence, and clustering algorithms applied to semantic embeddings.[1][2][3]

Topic models are commonly used to organize and discover latent features in large collections of unstructured text and other forms of big data.[4][5] Beyond text mining, topic models have also been used to uncover latent structures in fields such as genetic information, bioinformatics, computer vision, and social networks.[6][7]

History

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.[8] Another, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999.[9] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that documents cover a small number of topics and that topics often use a small number of words.[10] Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis (HLTA) is an alternative to LDA that models word co-occurrence using a tree of latent variables; the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.
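
The Dirichlet-prior intuition behind LDA can be made concrete by simulating its generative story. The sketch below is an illustrative toy simulation, not a fitted model; the vocabulary size, corpus size, and hyperparameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters, invented for the example.
n_topics, vocab_size, n_docs, doc_len = 3, 20, 5, 50
alpha = np.full(n_topics, 0.1)    # sparse document-topic Dirichlet prior
beta = np.full(vocab_size, 0.01)  # sparse topic-word Dirichlet prior

# Each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)  # shape (n_topics, vocab_size)

corpus = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                     # this document's topic mixture
    z = rng.choice(n_topics, size=doc_len, p=theta)  # a topic for each word slot
    doc = [rng.choice(vocab_size, p=topic_word[k]) for k in z]
    corpus.append(doc)
```

Because alpha and beta are small, each sampled document concentrates on a few topics and each topic on a few words; inference runs this story in reverse, recovering the mixtures from the observed corpus.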


Topic models for context information

Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001, whereas Lamba & Madhusudhan [11] used topic modeling on full-text research articles retrieved from the DJLIT journal from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan [11][12][13][14] applied topic modeling to different Indian resources such as journal articles and electronic theses and dissertations (ETDs). Nelson [15] has analyzed change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829 to 2008. Mimno used topic modelling with 24 journals on classical philology and archaeology spanning 150 years to examine how topics in the journals changed over time and how the journals became more different or similar over time.

Yin et al.[16] introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.

Chang and Blei[17] included network information between linked documents in the relational topic model, to model the links between websites.

The author-topic model by Rosen-Zvi et al.[18] models the topics associated with authors of documents to improve the topic detection for documents with authorship information.

HLTA was applied to a collection of recent research papers published at major AI and machine learning venues. The resulting model is called The AI Tree. The topics are used to index the papers at aipano.cse.ust.hk to help researchers track research trends and identify papers to read, and to help conference organizers and journal editors identify reviewers for submissions.

To improve the qualitative aspects and coherency of generated topics, some researchers have explored the efficacy of "coherence scores", which measure how well computer-extracted clusters (i.e., topics) align with a human benchmark.[19][20] Coherence scores are also used as metrics for optimising the number of topics to extract from a document corpus.[21]
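
As one concrete instance, the UMass coherence score can be computed from document co-occurrence counts alone. The toy corpus and word lists below are invented for illustration:

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence: sum over ordered pairs (w_i, w_j), i > j, of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts containing documents."""
    doc_sets = [set(doc) for doc in documents]
    def d(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((d(top_words[i], top_words[j]) + 1) / d(top_words[j]))
    return score

# Invented toy corpus of tokenized documents.
docs = [["dog", "bone", "park"], ["dog", "bark", "bone"],
        ["cat", "meow", "purr"], ["cat", "purr", "dog"]]

coherent = umass_coherence(["dog", "bone"], docs)     # the pair co-occurs often
incoherent = umass_coherence(["bone", "meow"], docs)  # the pair never co-occurs
```

A higher (less negative) score indicates a more coherent topic; sweeping the number of topics and keeping the count that maximizes average coherence is a common model-selection heuristic.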

Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for a maximum likelihood fit. A survey by D. Blei describes this suite of algorithms.[22] Several groups of researchers, starting with Papadimitriou et al.,[8] have attempted to design algorithms with provable guarantees: assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD) and the method of moments. In 2012 an algorithm based upon non-negative matrix factorization (NMF) was introduced that also generalizes to topic models with correlations among topics.[23]
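
A minimal sketch of the NMF route, using the classic Lee–Seung multiplicative updates on an invented toy document-term matrix. This illustrates plain Frobenius-norm NMF, not the correlated-topic extension of the 2012 algorithm cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy document-term count matrix: 6 documents x 8 terms, built
# from two obvious "topics" (terms 0-3 vs. terms 4-7).
V = np.array([[5, 4, 3, 4, 0, 0, 1, 0],
              [4, 5, 4, 3, 0, 1, 0, 0],
              [5, 3, 4, 4, 1, 0, 0, 0],
              [0, 0, 1, 0, 5, 4, 4, 3],
              [1, 0, 0, 0, 4, 5, 3, 4],
              [0, 1, 0, 0, 4, 4, 5, 3]], dtype=float)

k = 2                            # number of topics to extract
W = rng.random((V.shape[0], k))  # document-topic weights
H = rng.random((k, V.shape[1]))  # topic-term weights

eps = 1e-9
for _ in range(200):  # Lee-Seung multiplicative updates for min ||V - WH||_F
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Each row of H is a "topic"; its largest entries are the topic's top terms.
top_terms = [np.argsort(H[t])[::-1][:4] for t in range(k)]
```

The updates keep W and H non-negative by construction, which is what makes the factors interpretable as additive topic weights rather than the signed components an SVD would produce.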

Since 2017, neural networks have been leveraged in topic modeling to improve the speed of inference,[24] leading to further advancements such as vONTSS, which allows humans to incorporate domain knowledge via weakly supervised learning.[25]

In 2018, a new approach to topic models was proposed based on the stochastic block model.[26]

Topic modeling has also leveraged large language models (LLMs) through contextual embeddings[27] and fine-tuning.[28]
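
Embedding-based pipelines such as BERTopic follow a cluster-then-label recipe: embed the documents, cluster the vectors, and treat each cluster as a topic. A schematic, dependency-light sketch of the clustering step, where the 2-D points are invented stand-ins for real contextual embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "document embeddings": two well-separated groups in 2-D.
emb = np.vstack([rng.normal(loc=(0.0, 0.0), scale=0.1, size=(10, 2)),
                 rng.normal(loc=(5.0, 5.0), scale=0.1, size=(10, 2))])

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means; each final cluster plays the role of one topic."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(emb, k=2)
```

A production pipeline would substitute real LLM embeddings, typically a density-based clusterer instead of k-means, and a term-weighting step (e.g. class-based TF-IDF) to label each cluster with its most distinctive words.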

Applications of topic models

To quantitative biomedicine

Topic models are also used in other contexts. For example, uses of topic models in biology and bioinformatics research have emerged.[29] Recently, topic models have been used to extract information from datasets of cancer genomic samples.[30] In this case, topics are biological latent variables to be inferred.

To analysis of music and creativity

Topic models can be used to analyse continuous signals such as music. For instance, they have been used to quantify how musical styles change over time and to identify the influence of specific artists on later music creation.[31]

See also

References

  1. Egger, Roman; Yu, Joanne (2022). "A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts". Frontiers in Sociology 7. doi:10.3389/fsoc.2022.886498. ISSN 2297-7775. PMID 35602001. 
  2. Muthusami, R.; Mani Kandan, N.; Saritha, K.; Narenthiran, B.; Nagaprasad, N.; Ramaswamy, Krishnaraj (2024-05-25). "Investigating topic modeling techniques through evaluation of topics discovered in short texts data across diverse domains" (in en). Scientific Reports 14 (1): 12003. doi:10.1038/s41598-024-61738-4. ISSN 2045-2322. PMID 38796483. PMC 11127968. https://www.nature.com/articles/s41598-024-61738-4. 
  3. Churchill, Rob; Singh, Lisa (2022-11-10). "The Evolution of Topic Modeling". ACM Comput. Surv. 54 (10s): 215:1–215:35. doi:10.1145/3507900. ISSN 0360-0300. 
  4. Murshed, Belal Abdullah Hezam; Mallappa, Suresha; Abawajy, Jemal; Saif, Mufeed Ahmed Naji; Al-ariki, Hasib Daowd Esmail; Abdulwahab, Hudhaifa Mohammed (2023-06-01). "Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis" (in en). Artificial Intelligence Review 56 (6): 5133–5260. doi:10.1007/s10462-022-10254-w. ISSN 1573-7462. PMID 36320612. 
  5. Song, W.; Zou, L. (2016). "LDA-TM: A Two-step Approach to Twitter Topic Data Clustering". IEEE. pp. 342–347. doi:10.1109/ICCCBDA.2016.7529581. 
  6. Blei, David (April 2012). "Probabilistic Topic Models". Communications of the ACM 55 (4): 77–84. doi:10.1145/2133806.2133826. 
  7. Cao, Liangliang; Fei-Fei, Li (2007). "Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes". 2007 IEEE 11th International Conference on Computer Vision. IEEE.
  8. 8.0 8.1 Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). "Latent semantic indexing". Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems - PODS '98. pp. 159–168. doi:10.1145/275487.275505. ISBN 978-0-89791-996-8. http://www.cs.berkeley.edu/~christos/ir.ps. Retrieved 2012-04-17. 
  9. Hofmann, Thomas (1999). "Probabilistic Latent Semantic Indexing". Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. https://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf. 
  10. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). "Latent Dirichlet allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993. http://jmlr.csail.mit.edu/papers/v3/blei03a.html. 
  11. 11.0 11.1 Lamba, Manika; Madhusudhan, Margam (June 2019). "Mapping of topics in DESIDOC Journal of Library and Information Technology, India: a study". Scientometrics 120 (2): 477–505. doi:10.1007/s11192-019-03137-5. ISSN 0138-9130. 
  12. Lamba, Manika; Madhusudhan, Margam (June 2019). "Metadata Tagging and Prediction Modeling: Case Study of DESIDOC Journal of Library and Information Technology (2008-2017)". World Digital Libraries 12: 33–89. doi:10.18329/09757597/2019/12103. ISSN 0975-7597. https://content.iospress.com/articles/world-digital-libraries-an-international-journal/wdl12103. 
  13. Lamba, Manika; Madhusudhan, Margam (May 2019). "Author-Topic Modeling of DESIDOC Journal of Library and Information Technology (2008-2017), India". Library Philosophy and Practice. https://www.proquest.com/openview/4416f54af3fe77e1c49c811af86990eb/1?pq-origsite=gscholar&cbl=54903. 
  14. Lamba, Manika; Madhusudhan, Margam (September 2018). "Metadata Tagging of Library and Information Science Theses: Shodhganga (2013-2017)". ETD2018: Beyond the boundaries of Rims and Oceans. Taipei, Taiwan. https://etd2018.ncl.edu.tw/images/phocadownload/3-2_Manika_Lamba_Extended_Abstract_ETD_2018.pdf. 
  15. Nelson, Rob. "Mining the Dispatch". Digital Scholarship Lab, University of Richmond. https://dsl.richmond.edu/dispatch/. 
  16. Yin, Zhijun (2011). "Geographical topic discovery and comparison". Proceedings of the 20th international conference on World wide web. pp. 247–256. doi:10.1145/1963405.1963443. ISBN 978-1-4503-0632-4. 
  17. Chang, Jonathan (2009). "Relational Topic Models for Document Networks". Aistats 9: 81–88. http://www.jmlr.org/proceedings/papers/v5/chang09a/chang09a.pdf. 
  18. Rosen-Zvi, Michal (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence: 487–494. 
  19. Nikolenko, Sergey (2017). "Topic modelling for qualitative studies". Journal of Information Science 43: 88–102. doi:10.1177/0165551515617393. 
  20. Reverter-Rambaldi, Marcel (2022). Topic Modelling in Spontaneous Speech Data (Honours thesis). Australian National University. doi:10.25911/M1YF-ZF55.
  21. Newman, David (2010). "Automatic evaluation of topic coherence". Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: 100–108. 
  22. Blei, David M. (April 2012). "Introduction to Probabilistic Topic Models" (PDF). Comm. ACM 55 (4): 77–84. doi:10.1145/2133806.2133826. https://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext. 
  23. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). "Learning Topic Models—Going beyond SVD". arXiv:1204.1956 [cs.LG].
  24. Miao, Yishu; Grefenstette, Edward; Blunsom, Phil (2017). "Discovering Discrete Latent Topics with Neural Variational Inference" (in en). Proceedings of the 34th International Conference on Machine Learning (PMLR): 2410–2419. https://proceedings.mlr.press/v70/miao17a.html. 
  25. Xu, Weijie; Jiang, Xiaoyu; Sengamedu Hanumantha Rao, Srinivasan; Iannacci, Francis; Zhao, Jinjin (2023). "vONTSS: vMF based semi-supervised neural topic modeling with optimal transport". Findings of the Association for Computational Linguistics: ACL 2023 (Stroudsburg, PA, USA: Association for Computational Linguistics): 4433–4457. doi:10.18653/v1/2023.findings-acl.271. 
  26. Gerlach, Martin; Peixoto, Tiago; Altmann, Eduardo (2018). "A network approach to topic models". Science Advances 4 (7). doi:10.1126/sciadv.aaq1360. PMID 30035215. Bibcode: 2018SciA....4.1360G. 
  27. Bianchi, Federico; Terragni, Silvia; Hovy, Dirk (2021). "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 759–766. doi:10.18653/v1/2021.acl-short.96. http://dx.doi.org/10.18653/v1/2021.acl-short.96. 
  28. Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Findings of the Association for Computational Linguistics: EMNLP 2023 (Stroudsburg, PA, USA: Association for Computational Linguistics): 9040–9057. doi:10.18653/v1/2023.findings-emnlp.606. 
  29. Liu, L. et al. (2016). "An overview of topic modeling and its current applications in bioinformatics". SpringerPlus 5 (1): 1608. doi:10.1186/s40064-016-3252-8. PMID 27652181. 
  30. Valle, F.; Osella, M.; Caselle, M. (2020). "A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data". Cancers 12 (12): 3799. doi:10.3390/cancers12123799. PMID 33339347. 
  31. Shalit, Uri; Weinshall, Daphna; Chechik, Gal (2013-05-13). "Modeling Musical Influence with Topic Models" (in en). Proceedings of the 30th International Conference on Machine Learning (PMLR): 244–252. https://proceedings.mlr.press/v28/shalit13.html. 

Further reading