Biomedical text mining
Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.
Applying text mining approaches to biomedical text requires specific considerations common to the domain.
Availability of annotated text data
Large annotated corpora used in the development and training of general-purpose text mining methods (e.g., sets of movie dialogue, product reviews, or Wikipedia article text) are not specific to biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora. Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges and by biomedical informatics researchers. Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).
Machine learning-based methods often require very large data sets as training data to build useful models. Manual annotation of large text corpora is not realistically possible. Training data may therefore be products of weak supervision or purely statistical methods.
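As a minimal sketch of how weak supervision can stand in for manual annotation, the following Python example combines a few heuristic labeling functions by majority vote to produce noisy training labels. The labeling functions, suffix list, and label names are hypothetical and chosen only for illustration; they do not reproduce any particular published system.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def _words(sentence):
    # Lowercase and strip simple punctuation; a deliberately crude tokenizer.
    return [w.strip(".,;:()") for w in sentence.lower().split()]


def lf_drug_suffix(sentence):
    # Hypothetical heuristic: some drug names share recognizable suffixes.
    suffixes = ("statin", "cillin", "mab")
    return "DRUG" if any(w.endswith(suffixes) for w in _words(sentence)) else ABSTAIN


def lf_dose_unit(sentence):
    # Hypothetical heuristic: a dosage unit suggests a medication mention.
    units = {"mg", "ml", "mcg"}
    return "DRUG" if any(w in units for w in _words(sentence)) else ABSTAIN


def lf_denial(sentence):
    # Hypothetical heuristic: an explicit denial suggests no medication.
    return "NO_DRUG" if "denies taking" in sentence.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_drug_suffix, lf_dose_unit, lf_denial]


def weak_label(sentence, labeling_functions=LABELING_FUNCTIONS):
    """Return the majority label, or ABSTAIN on no votes or a tie."""
    votes = [lf(sentence) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between labels: stay silent rather than guess
    return ranked[0][0]
```

Labels produced this way are noisy, so in practice they would be used to train a model that can generalize beyond the heuristics rather than being treated as ground truth.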
Data structure variation
Like other text documents, biomedical documents contain unstructured data. Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.
Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.
Supporting clinical needs
Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians. This is a concern in environments where clinical decision support is expected to be informative and accurate.
Interoperability with clinical systems
New text mining systems must work with existing standards, electronic medical records, and databases. Methods for interfacing with clinical systems through standards such as LOINC have been developed but require extensive organizational effort to implement and maintain.
Specific subtasks are of particular concern when processing biomedical text.
Named entity recognition
Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
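The vocabulary-supported approach described above can be sketched as a longest-match dictionary lookup. The example below is a minimal illustration in Python; the small vocabulary and its entity labels are hypothetical stand-ins for a real resource such as a curated term list.

```python
import re

# Hypothetical toy vocabulary: token tuples mapped to entity types.
VOCAB = {
    ("tumor", "necrosis", "factor"): "PROTEIN",
    ("tp53",): "GENE",
    ("aspirin",): "CHEMICAL",
    ("breast", "cancer"): "DISEASE",
}
MAX_LEN = max(len(key) for key in VOCAB)


def tokenize(text):
    # Return (token, start, end) triples for alphanumeric runs.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def recognize_entities(text):
    """Greedy longest-match lookup of vocabulary terms in text."""
    tokens = tokenize(text)
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok[0].lower() for tok in tokens[i:i + n])
            if key in VOCAB:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                entities.append((text[start:end], VOCAB[key], start, end))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities
```

A lookup of this kind cannot recognize names missing from the vocabulary or resolve ambiguous terms, which is one reason statistical and deep learning methods are also used.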
Document classification and clustering
Biomedical documents may be classified or clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups. These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon k-means clustering.
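To illustrate the unsupervised case, the following Python sketch clusters a handful of toy documents with a plain k-means loop over bag-of-words counts. The documents are invented, and initializing centroids from the first k documents is a simplifying assumption made so that the result is deterministic; practical systems typically use weighted features and a more careful initialization.

```python
from collections import Counter


def vectorize(docs):
    """Turn documents into bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Return one cluster label per vector.

    Assumption for this sketch: centroids start at the first k vectors.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented example documents: two molecular biology, two clinical.
DOCS = [
    "protein kinase binds protein receptor",
    "patient discharge summary medication dose",
    "kinase receptor signaling protein",
    "patient medication dose adjusted at discharge",
]
```

Calling `kmeans(vectorize(DOCS), 2)` groups the two molecular biology documents together and the two clinical documents together. The cluster numbers themselves carry no meaning, which is the practical difference from classification, where categories are fixed in advance.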
Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.
Hedge cue detection
The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
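A very simple baseline for this task is a lexicon lookup. The Python sketch below flags sentences containing any cue from a short, hypothetical cue list; it does not determine the scope of a cue, and published approaches generally rely on trained models rather than a fixed list.

```python
import re

# Hypothetical, deliberately small cue list for illustration only.
HEDGE_CUES = [
    "may", "might", "could", "suggest", "suggests",
    "possibly", "likely", "appears to", "is thought to",
]


def find_hedge_cues(sentence):
    """Return the hedge cues found in the sentence, in order of appearance."""
    found = []
    for cue in HEDGE_CUES:
        # Word boundaries prevent matching inside longer words.
        pattern = r"\b" + re.escape(cue) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            found.append((m.start(), m.group().lower()))
    return [cue for _, cue in sorted(found)]


def is_hedged(sentence):
    # A sentence counts as hedged if it contains at least one cue.
    return bool(find_hedge_cues(sentence))
```

A downstream claim extractor could use `is_hedged` to avoid recording a speculative sentence as an established fact.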
Multiple researchers have developed methods to identify specific scientific claims from literature. In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document (a process known as argument mining, employing tools used in fields such as political science) and comparing claims to find potential contradictions between them.
Information extraction, or IE, is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as gene A inhibits gene B and gene C is involved in disease G. Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
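The step from text to a structured form can be sketched with a pattern-based extractor that emits subject, relation, object triples. The Python example below is a minimal illustration; the entity pattern, verb list, and relation names are hypothetical, and real systems combine named entity recognition with far more robust relation models.

```python
import re

# Hypothetical mapping from surface verbs to relation names.
RELATION_VERBS = {
    "inhibits": "INHIBITS",
    "activates": "ACTIVATES",
    "binds": "BINDS",
    "is involved in": "INVOLVED_IN",
}

# Toy entity pattern: a type word followed by a short identifier.
ENTITY = r"(?:gene|protein|disease)\s+[A-Za-z0-9]+"

# Longer verb phrases are listed first so they are tried first.
VERB_ALT = "|".join(
    re.escape(v) for v in sorted(RELATION_VERBS, key=len, reverse=True)
)
PATTERN = re.compile(
    "(" + ENTITY + r")\s+(" + VERB_ALT + r")\s+(" + ENTITY + ")",
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (subject, relation, object) triples found in text."""
    triples = []
    for m in PATTERN.finditer(text):
        relation = RELATION_VERBS[m.group(2).lower()]
        triples.append((m.group(1), relation, m.group(3)))
    return triples
```

Applied to the sentence "gene A inhibits gene B and gene C is involved in disease G.", this yields two triples that could populate a template or knowledge base entry.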
Information retrieval and question answering
Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as PubMed search allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.
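Keyword search of this kind is commonly built on an inverted index, which maps each term to the documents containing it. The following Python sketch shows the idea with invented documents and a simple match-all-terms query; it is not a description of how PubMed itself is implemented.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Invented example documents keyed by an arbitrary id.
DOCUMENTS = {
    1: "BRCA1 mutations in breast cancer",
    2: "Aspirin use and breast cancer risk",
    3: "BRCA1 and DNA repair",
}
INDEX = build_index(DOCUMENTS)
```

Here `search(INDEX, "breast cancer")` returns documents 1 and 2. Indexing controlled vocabulary terms such as MeSH headings alongside the text would follow the same pattern, with the headings treated as additional tokens.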
On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.
The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location.
|Corpus Name||Authors or Group||Contents||Freely Available||Citation|
|2006 i2b2 Deidentification and Smoking Challenge||i2b2||889 de-identified medical discharge summaries annotated for patient identification and smoking status features.||Yes, with registration|||
|2008 i2b2 Obesity Challenge||i2b2||1,237 de-identified medical discharge summaries annotated for presence or absence of comorbidities of obesity.||Yes, with registration|||
|2009 i2b2 Medication Challenge||i2b2||1,243 de-identified medical discharge summaries annotated for names and details of medications, including dosage, mode, frequency, duration, reason, and presence in a list or narrative structure.||Yes, with registration|||
|2010 i2b2 Relations Challenge||i2b2||Medical discharge summaries annotated for medical problems, tests, treatments, and the relations among these concepts. Only a subset of these data records are available for research use due to IRB limitations.||Yes, with registration|||
|2011 i2b2 Coreference Challenge||i2b2||978 de-identified medical discharge summaries, progress notes, and other clinical reports annotated with concepts and coreferences. Includes the ODIE corpus.||Yes, with registration|||
|2012 i2b2 Temporal Relations Challenge||i2b2||310 de-identified medical discharge summaries annotated for events and temporal relations.||Yes, with registration|||
|2014 i2b2 De-identification Challenge||i2b2||1,304 de-identified longitudinal medical records annotated for protected health information (PHI).||Yes, with registration|||
|2014 i2b2 Heart Disease Risk Factors Challenge||i2b2||1,304 de-identified longitudinal medical records annotated for risk factors for coronary artery disease.||Yes, with registration|||
|AIMed||Bunescu et al.||200 abstracts annotated for protein–protein interactions, as well as negative example abstracts containing no protein-protein interactions.||Yes|||
|BioC-BioGRID||BioCreAtIvE||120 full text research articles annotated for protein–protein interactions.||Yes|||
|BioCreAtIvE 1||BioCreAtIvE||15,000 sentences (10,000 training and 5,000 test) annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms.||Yes|||
|BioCreAtIvE 2||BioCreAtIvE||15,000 sentences (10,000 training and 5,000 test, different from the first corpus) annotated for protein and gene names. 542 abstracts linked to EntrezGene identifiers. A variety of research articles annotated for features of protein–protein interactions.||Yes|||
|BioCreative V CDR Task Corpus (BC5CDR)||BioCreAtIvE||1,500 articles (title and abstract) published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease interactions.||Yes|||
|BioInfer||Pyysalo et al.||1,100 sentences from biomedical research abstracts annotated for relationships, named entities, and syntactic dependencies.||No|||
|BioScope||Vincze et al.||1,954 clinical reports, 9 papers, and 1,273 abstracts annotated for linguistic scope and terms denoting negation or uncertainty.||Yes|||
|BioText Recognizing Abbreviation Definitions||BioText Project||1,000 abstracts on the subject of "yeast", annotated for abbreviations and their meanings.||Yes|||
|BioText Protein–Protein Interaction Data||BioText Project||1,322 sentences describing protein–protein interactions between HIV-1 and human proteins, annotated with interaction types.||Yes|||
|Comparative Toxicogenomics Database||Davis et al.||A database of manually-curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures.||Yes|||
|CRAFT||Verspoor et al.||97 full-text biomedical publications annotated with linguistic structures and biological concepts.||Yes|||
|GENIA Corpus||GENIA Project||1,999 biomedical research abstracts on the topics "human", "blood cells", and "transcription factors", annotated for parts of speech, syntax, terms, events, relations, and coreferences.||Yes|||
|FamPlex||Bachman et al.||Protein names and families linked to unique identifiers. Includes affix sets.||Yes|||
|FlySlip Abstracts||FlySlip||82 research abstracts on Drosophila annotated with gene names.||Yes|||
|FlySlip Full Papers||FlySlip||5 research papers on Drosophila annotated with anaphoric relations between noun phrases referring to genes and biologically related entities.||Yes|||
|FlySlip Speculative Sentences||FlySlip||More than 1,500 sentences annotated as speculative or not speculative. Includes annotations of clauses.||Yes|||
|IEPA||Ding et al.||486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins.||No|||
|JNLPBA corpus||Kim et al.||An extended version of version 3 of the GENIA corpus for NER tasks.||No|||
|Learning Language in Logic (LLL)||Nédellec et al.||77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions.||Yes|||
|Medical Subject Headings (MeSH)||National Library of Medicine||Hierarchically-organized terminology for indexing and cataloging biomedical documents.||Yes|||
|Metathesaurus||National Library of Medicine / UMLS||3.67 million concepts and 14 million concept names, mapped between more than 200 sources of biomedical vocabulary and identifiers.||Yes, with UMLS License Agreement|||
|MIMIC-III||MIT Lab for Computational Physiology||De-identified data associated with 53,423 distinct hospital admissions for adult patients.||Requires training and formal access request|||
|ODIE Corpus||Savova et al.||180 clinical notes annotated with 5,992 coreference pairs.||No|||
|OHSUMED||Hersh et al.||348,566 biomedical research abstracts and indexing information from MEDLINE, including MeSH (as of 1991).||Yes|||
|PMC Open Access Subset||National Library of Medicine / PubMed Central||More than 2 million research articles, updated weekly.||Yes|||
|RxNorm||National Library of Medicine / UMLS||Normalized names for clinical drugs and drug packs, with combined ingredients, strengths, and form, and assigned types from the Semantic Network.||Yes, with UMLS License Agreement|||
|Semantic Network||National Library of Medicine / UMLS||Lists of 133 semantic types and 54 semantic relationships covering biomedical concepts and vocabulary.||Yes, with UMLS License Agreement|||
|SPECIALIST Lexicon||National Library of Medicine / UMLS||A syntactic lexicon of biomedical and general English.||Yes|||
|Word Sense Disambiguation (WSD)||National Library of Medicine / UMLS||203 ambiguous words and 37,888 automatically extracted instances of their use in biomedical research publications.||Yes, with UMLS License Agreement|||
|Yapex||Franzén et al.||200 biomedical research abstracts annotated with protein names.||No|||
Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Sources of pre-trained embeddings specific to biomedical vocabulary are listed in the table below. The majority are results of the word2vec model developed by Mikolov et al., or of variants of word2vec.
|Set Name||Authors or Group||Contents and Source||Citation|
|BioASQword2vec||BioASQ||Vectors produced by word2vec from 10,876,004 English PubMed abstracts.|||
|bio.nlplab.org resources||Pyysalo et al.||A collection of word vectors produced by different approaches, trained on text from PubMed and PubMed Central.|||
|BioVec||Asgari and Mofrad||Vectors for gene and protein sequences, trained using Swiss-Prot.|||
|RadiologyReportEmbedding||Banerjee et al.||Vectors produced by word2vec from the text of 10,000 radiology reports.|||
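Word embeddings such as those listed above are typically compared with cosine similarity, so that related terms score close to 1. The Python sketch below uses tiny hand-written vectors as hypothetical placeholders; real pre-trained embeddings have hundreds of dimensions and would be loaded from one of the listed resources.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "kinase": [0.0, 0.9, 0.4],
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors=TOY_VECTORS):
    """Return the other word whose vector is closest to the given word's."""
    target = vectors[word]
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))
```

With these toy values, the nearest neighbor of "aspirin" is "ibuprofen" rather than "kinase", mirroring the way trained embeddings place terms used in similar contexts near one another.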
Gene cluster identification
Automatic extraction of protein interactions and associations of proteins to functional concepts (e.g., Gene Ontology terms) has been explored. The search engine PIE was developed to identify and return protein–protein interaction mentions from MEDLINE-indexed articles. Extraction of kinetic parameters and of the subcellular locations of proteins from text has also been addressed with information extraction and text mining technology.
Text mining can aid in gene prioritization, or identification of genes most likely to contribute to genetic disease. One group compared several vocabularies, representations and ranking algorithms to develop gene prioritization benchmarks.
Applications of phrase mining to disease associations
A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The text mining study validated existing relationships and informed previously unrecognized biological processes in cardiovascular pathophysiology.
Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include PubMed search, Europe PubMed Central search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI.
Some search engines, such as Essie, OncoSearch, PubGene, and GoPubMed, were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.
Medical record analysis systems
Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of each report is often free text that is difficult to search, leading to challenges with patient care. Numerous complete systems and tools have been developed to analyze these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but was later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.
Computational frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g., UMLS semantic types). The SparkText framework uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.
The following academic conferences and workshops host discussions and presentations in biomedical text mining advances. Most publish proceedings.
|Association for Computational Linguistics (ACL) annual meeting||in plenary session and as part of the BioNLP workshop|
|ACL BioNLP workshop|||
|American Medical Informatics Association (AMIA) annual meeting||in plenary session|
|Intelligent Systems for Molecular Biology (ISMB)||in plenary session and in the BioLINK and Bio-ontologies workshops|
|International Conference on Bioinformatics and Biomedicine (BIBM)|||
|International Conference on Information and Knowledge Management (CIKM)||within International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO)|
|North American Chapter of the Association for Computational Linguistics (NAACL) annual meeting||in plenary session and as part of the BioNLP workshop|
|Pacific Symposium on Biocomputing (PSB)||in plenary session|
|Practical Applications of Computational Biology & Bioinformatics (PACBB)|||
|Text REtrieval Conference (TREC)||formerly as part of TREC Genomics track; as of 2018 part of Precision Medicine Track|
A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software. Some journals, including the Journal of the American Medical Informatics Association (JAMIA) and the Journal of Biomedical Informatics, are popular publications for these topics.
- "A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts". PLOS Computational Biology 14 (2): e1005962. February 2018. doi:10.1371/journal.pcbi.1005962. PMID 29447159. Bibcode: 2018PLSCB..14E5962W.
- Danescu-Niculescu-Mizil, Cristian; Lee, Lillian (2011). Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. 76–87. ISBN 978-1-932432-95-4. Bibcode: 2011arXiv1106.3077D. https://dl.acm.org/citation.cfm?id=2021105.
- McAuley, Julian; Leskovec, Jure (2013-10-12). Hidden factors and hidden topics: understanding rating dimensions with review text. ACM. pp. 165–172. doi:10.1145/2507157.2507163. ISBN 978-1-4503-2409-0.
- "Natural language processing: algorithms and tools to extract computable information from EHRs and from the biomedical literature". Journal of the American Medical Informatics Association 20 (5): 805. 2013. doi:10.1136/amiajnl-2013-002214. PMID 23935077.
- "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text". Journal of the American Medical Informatics Association 18 (5): 552–6. 2011. doi:10.1136/amiajnl-2011-000203. PMID 21685143.
- "Evaluating temporal relations in clinical text: 2012 i2b2 Challenge". Journal of the American Medical Informatics Association 20 (5): 806–13. 2013. doi:10.1136/amiajnl-2013-001628. PMID 23564629.
- "Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1". Journal of Biomedical Informatics 58 Suppl: S11–9. December 2015. doi:10.1016/j.jbi.2015.06.007. PMID 26225918.
- "Towards comprehensive syntactic and semantic annotations of the clinical narrative". Journal of the American Medical Informatics Association 20 (5): 922–30. 2013. doi:10.1136/amiajnl-2012-001317. PMID 23355458.
- "Concept annotation in the CRAFT corpus". BMC Bioinformatics 13 (1): 161. July 2012. doi:10.1186/1471-2105-13-161. PMID 22776079.
- Holzinger, Andreas; Jurisica, Igor (2014), "Knowledge Discovery and Data Mining in Biomedical Informatics: The Future Is in Integrative, Interactive Machine Learning Solutions" (in en), Interactive Knowledge Discovery and Data Mining in Biomedical Informatics (Springer Berlin Heidelberg): pp. 1–18, doi:10.1007/978-3-662-43968-5_1, ISBN 9783662439678
- "Snorkel: Rapid Training Data Creation with Weak Supervision". Proceedings of the VLDB Endowment 11 (3): 269–282. November 2017. doi:10.14778/3157794.3157797. PMID 29770249. Bibcode: 2017arXiv171110160R.
- "Co Type". CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases. WWW '17. International World Wide Web Conferences Steering Committee. 2017-04-03. pp. 1015–1024. doi:10.1145/3038912.3052708. ISBN 9781450349130. http://dl.acm.org/citation.cfm?id=3038912.3052708.
- "Status of text-mining techniques applied to biomedical text". Drug Discovery Today 11 (7–8): 315–25. April 2006. doi:10.1016/j.drudis.2006.02.011. PMID 16580973.
- "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition 22 (1): 55–78. February 2019. doi:10.1007/s10032-019-00317-0. Bibcode: 2019arXiv190210031M.
- "A dataset of 200 structured product labels annotated for adverse drug reactions" (in En). Scientific Data 5: 180001. January 2018. doi:10.1038/sdata.2018.1. PMID 29381145. Bibcode: 2018NatSD...580001D.
- "Detecting hedge cues and their scope in biomedical text with conditional random fields". Journal of Biomedical Informatics 43 (6): 953–61. December 2010. doi:10.1016/j.jbi.2010.08.003. PMID 20709188.
- "Implementation and management of a biomedical observation dictionary in a large healthcare information system". Journal of the American Medical Informatics Association 20 (5): 940–6. 2013. doi:10.1136/amiajnl-2012-001410. PMID 23635601.
- "The Georges Pompidou University Hospital Clinical Data Warehouse: A 8-years follow-up experience". International Journal of Medical Informatics 102: 21–28. June 2017. doi:10.1016/j.ijmedinf.2017.02.006. PMID 28495345.
- Levy, Brian. "Health Care's Semantics Challenge". Great Valley Publishing Company. https://www.fortherecordmag.com/archives/0514p26.shtml.
- "Protecting patient privacy in clinical data mining". Journal of Healthcare Information Management 16 (4): 62–7. 2002. PMID 12365302.
- "Protecting patient privacy when sharing patient-level data from clinical trials". BMC Medical Research Methodology 16 Suppl 1 (S1): 77. July 2016. doi:10.1186/s12874-016-0169-4. PMID 27410040.
- "Confidentiality, electronic health records, and the clinician". Perspectives in Biology and Medicine 56 (1): 105–25. 2013. doi:10.1353/pbm.2013.0003. PMID 23748530.
- "What makes a gene name? Named entity recognition in the biomedical literature". Briefings in Bioinformatics 6 (4): 357–369. 2005-01-01. doi:10.1093/bib/6.4.357. ISSN 1467-5463. PMID 16420734.
- "Overview of the chemical compound and drug name recognition (CHEMDNER) task.". Proceedings of the Fourth BioCreative Challenge Evaluation Workshop 2: 6–37. http://www.biocreative.org/media/store/files/2013/bc4_v2_1.pdf.
- "Assessment of disease named entity recognition on a corpus of annotated sentences". BMC Bioinformatics 9 Suppl 3 (Suppl 3): S3. April 2008. doi:10.1186/1471-2105-9-s3-s3. PMID 18426548.
- "Deep learning with word embeddings improves biomedical named entity recognition". Bioinformatics 33 (14): i37–i48. July 2017. doi:10.1093/bioinformatics/btx228. PMID 28881963.
- "An effective general purpose approach for automated biomedical document classification". AMIA ... Annual Symposium Proceedings. AMIA Symposium: 161–5. 2006. PMID 17238323.
- "Clustering algorithms in biomedical research: a review" (in en-US). IEEE Reviews in Biomedical Engineering 3: 120–54. 2010. doi:10.1109/rbme.2010.2083647. PMID 22275205.
- "Biomedical text mining and its applications". PLOS Computational Biology 5 (12): e1000597. December 2009. doi:10.1371/journal.pcbi.1000597. PMID 20041219. Bibcode: 2009PLSCB...5E0597R.
- "Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles". Journal of Biomedical Informatics 43 (2): 173–89. April 2010. doi:10.1016/j.jbi.2009.11.001. PMID 19900574.
- Alamri, Abdulaziz; Stevensony, Mark (2015). Automatic identification of potentially contradictory claims to support systematic reviews. IEEE. doi:10.1109/bibm.2015.7359808. ISBN 978-1-4673-6799-8.
- "Application of text mining in the biomedical domain". Methods 74: 97–106. March 2015. doi:10.1016/j.ymeth.2015.01.015. PMID 25641519.
- "Can we replace curation with information extraction software?". Database 2016: baw150. 2016-01-01. doi:10.1093/database/baw150. PMID 28025341.
- "Linking genes to literature: text mining, information extraction, and retrieval applications for biology" (in En). Genome Biology 9 Suppl 2 (Suppl 2): S8. 2008. doi:10.1186/gb-2008-9-s2-s8. PMID 18834499.
- "Question answering for biology". Methods 74: 36–46. March 2015. doi:10.1016/j.ymeth.2014.10.023. PMID 25448292.
- Semantics Scholar. (2020) "Cut through the clutter:[Open Access] Download the Coronavirus Open Research Dataset". Semantics Scholar website Retrieved 30 March 2020
- Brennan, Patti. (24 March 2020). "Blog:How Does a Library Respond to a Global Health Crisis?". National Library of Medicine website Retrieved 30 March 2020.
- Brainard, Jeffrey (13 May 2020). "Scientists are drowning in COVID-19 papers. Can new tools keep them afloat?" (in en). Science | AAAS. https://www.sciencemag.org/news/2020/05/scientists-are-drowning-covid-19-papers-can-new-tools-keep-them-afloat.
- "Evaluating the state-of-the-art in automatic de-identification". Journal of the American Medical Informatics Association 14 (5): 550–63. 2007-09-01. doi:10.1197/jamia.m2444. PMID 17600094.
- "Identifying patient smoking status from medical discharge records". Journal of the American Medical Informatics Association 15 (1): 14–24. 2008-01-01. doi:10.1197/jamia.m2408. PMID 17947624.
- "Recognizing obesity and comorbidities in sparse data". Journal of the American Medical Informatics Association 16 (4): 561–70. 2009. doi:10.1197/jamia.M3115. PMID 19390096.
- "Community annotation experiment for ground truth generation for the i2b2 medication challenge". Journal of the American Medical Informatics Association 17 (5): 519–23. 2010. doi:10.1136/jamia.2010.004200. PMID 20819855.
- "Extracting medication information from clinical text". Journal of the American Medical Informatics Association 17 (5): 514–8. 2010. doi:10.1136/jamia.2010.003947. PMID 20819854.
- "Evaluating the state of the art in coreference resolution for electronic medical records". Journal of the American Medical Informatics Association 19 (5): 786–91. 2012. doi:10.1136/amiajnl-2011-000784. PMID 22366294.
- "Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus". Journal of Biomedical Informatics 58 Suppl: S20–9. December 2015. doi:10.1016/j.jbi.2015.07.020. PMID 26319540.
- "Annotating risk factors for heart disease in clinical narratives for diabetic patients". Journal of Biomedical Informatics 58 Suppl: S78–91. December 2015. doi:10.1016/j.jbi.2015.05.009. PMID 26004790.
- "Comparative experiments on learning information extractors for proteins and their interactions". Artificial Intelligence in Medicine 33 (2): 139–55. February 2005. doi:10.1016/j.artmed.2004.07.016. PMID 15811782.
- "The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions". Database 2017: baw147. 2017-01-01. doi:10.1093/database/baw147. PMID 28077563.
- "Overview of BioCreAtIvE: critical assessment of information extraction for biology". BMC Bioinformatics 6 Suppl 1: S1. 2005. doi:10.1186/1471-2105-6-S1-S1. PMID 15960821.
- "Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge" (in En). Genome Biology 9 Suppl 2 (Suppl 2): S1. 2008. doi:10.1186/gb-2008-9-s2-s1. PMID 18834487.
- "BioCreative V CDR task corpus: a resource for chemical disease relation extraction". Database 2016: baw068. 2016. doi:10.1093/database/baw068. PMID 27161011.
- "BioInfer: a corpus for information extraction in the biomedical domain" (in En). BMC Bioinformatics 8 (1): 50. February 2007. doi:10.1186/1471-2105-8-50. PMID 17291334.
- "The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes". BMC Bioinformatics 9 Suppl 11 (Suppl 11): S9. November 2008. doi:10.1186/1471-2105-9-s11-s9. PMID 19025695.
- "A simple algorithm for identifying abbreviation definitions in biomedical text". Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing: 451–62. 2003. PMID 12603049.
- Rosario, Barbara; Hearst, Marti A. (2005-10-06). "Multi-way relation classification". Multi-way relation classification: application to protein-protein interactions. Hlt '05. Association for Computational Linguistics. pp. 732–739. doi:10.3115/1220575.1220667. http://dl.acm.org/citation.cfm?id=1220575.1220667.
- Davis, Allan Peter; Grondin, Cynthia J; Johnson, Robin J; Sciaky, Daniela; McMorran, Roy; Wiegers, Jolene; Wiegers, Thomas C; Mattingly, Carolyn J (2019-01-08). "The Comparative Toxicogenomics Database: update 2019" (in en). Nucleic Acids Research 47 (D1): D948–D954. doi:10.1093/nar/gky868. ISSN 0305-1048. PMID 30247620.
- "A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools". BMC Bioinformatics 13 (1): 207. August 2012. doi:10.1186/1471-2105-13-207. PMID 22901054.
- "GENIA corpus--a semantically annotated corpus for bio-textmining". Bioinformatics 19 (Suppl 1): i180–i182. 2003-07-03. doi:10.1093/bioinformatics/btg1023. PMID 12855455.
- "GENIA Project". http://www.geniaproject.org/.
- "FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining" (in En). BMC Bioinformatics 19 (1): 248. June 2018. doi:10.1186/s12859-018-2211-5. PMID 29954318.
- Vlachos, Andreas; Gasperin, Caroline (2006). "Bootstrapping and evaluating named entity recognition in the biomedical domain". BioNLP '06: Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis: 138–145. doi:10.3115/1567619.1567652. https://dl.acm.org/citation.cfm?id=1567652.
- Gasperin, Caroline; Karamanis, Nikiforos; Seal, Ruth (2007). "Annotation of anaphoric relations in biomedical full text articles using a domain-relevant scheme". Proceedings of DAARC 2007: 19–24.
- Medlock, Ben; Briscoe, Ted (2007). "Weakly Supervised Learning for Hedge Classification in Scientific Literature". Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics: 992–999. http://anthology.aclweb.org/P/P07/P07-1125.pdf.
- "Mining MEDLINE: Abstracts, sentences, or phrases?" (in en). Pacific Symposium on Biocomputing 2002. World Scientific. 2001. pp. 326–337. doi:10.1142/9789812799623_0031. ISBN 9789810247775. https://archive.org/details/pacificsymposium00paci/page/326.
- Kim, Jin-Dong; Ohta, Tomoko; Tsuruoka, Yoshimasa; Tateisi, Yuka; Collier, Nigel (2004). "Introduction to the bio-entity recognition task at JNLPBA". Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications - JNLPBA '04: 70. doi:10.3115/1567594.1567610.
- "LLLchallenge". http://genome.jouy.inra.fr/texte/LLLchallenge/#task1.
- "Medical Subject Headings - Home Page". https://www.nlm.nih.gov/mesh/.
- "The Unified Medical Language System (UMLS): integrating biomedical terminology". Nucleic Acids Research 32 (Database issue): D267–70. January 2004. doi:10.1093/nar/gkh061. PMID 14681409.
- "Metathesaurus". https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html.
- "MIMIC-III, a freely accessible critical care database" (in en). Scientific Data 3: 160035. May 2016. doi:10.1038/sdata.2016.35. PMID 27219127. Bibcode: 2016NatSD...360035J.
- "Anaphoric relations in the clinical narrative: corpus creation". Journal of the American Medical Informatics Association 18 (4): 459–65. 2011. doi:10.1136/amiajnl-2011-000108. PMID 21459927.
- Hersh, William; Buckley, Chris; Leone, T. J.; Hickam, David (1994) (in en). OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. Springer London. pp. 192–201. doi:10.1007/978-1-4471-2099-5_20. ISBN 9783540198895.
- "Open Access Subset" (in en). https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.
- "Normalized names for clinical drugs: RxNorm at 6 years". Journal of the American Medical Informatics Association 18 (4): 441–8. 2011. doi:10.1136/amiajnl-2011-000116. PMID 21515544.
- "An upper-level ontology for the biomedical domain". Comparative and Functional Genomics 4 (1): 80–4. 2003. doi:10.1002/cfg.255. PMID 18629109.
- "The UMLS Semantic Network" (in en). https://semanticnetwork.nlm.nih.gov/.
- "Lexical methods for managing variation in biomedical terminologies". Proceedings. Symposium on Computer Applications in Medical Care: 235–9. 1994. PMID 7949926.
- "The SPECIALIST NLP Tools" (in en). https://lexsrv3.nlm.nih.gov/Specialist/Summary/lexicon.html.
- "Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation" (in en). BMC Bioinformatics 12 (1): 223. June 2011. doi:10.1186/1471-2105-12-223. PMID 21635749.
- "Word Sense Disambiguation (WSD) Test Collections". https://wsd.nlm.nih.gov/.
- "Protein names and how to find them". International Journal of Medical Informatics 67 (1–3): 49–61. December 2002. doi:10.1016/s1386-5056(02)00052-7. PMID 12460631.
- Mikolov T, Chen K, Corrado G, Dean J (2013-01-16). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
- "BioASQ Releases Continuous Space Word Vectors Obtained by Applying Word2Vec to PubMed Abstracts | bioasq.org". http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts.
- "bio.nlplab.org". http://bio.nlplab.org/.
- "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLOS ONE 10 (11): e0141287. 2015-11-10. doi:10.1371/journal.pone.0141287. PMID 26555596. Bibcode: 2015PLoSO..1041287A.
- "Intelligent Word Embeddings of Free-Text Radiology Reports". AMIA ... Annual Symposium Proceedings. AMIA Symposium 2017: 411–420. 2017. PMID 29854105. Bibcode: 2017arXiv171106968B.
- "Text Mining for Protein Docking". PLOS Computational Biology 11 (12): e1004630. December 2015. doi:10.1371/journal.pcbi.1004630. PMID 26650466. Bibcode: 2015PLSCB..11E4630B.
- "Protein-protein interaction predictions using text mining methods". Methods 74: 47–53. March 2015. doi:10.1016/j.ymeth.2014.10.026. PMID 25448298.
- "The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible". Nucleic Acids Research 45 (D1): D362–D368. January 2017. doi:10.1093/nar/gkw937. PMID 27924014.
- "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology 315 (4): H910–H924. October 2018. doi:10.1152/ajpheart.00175.2018. PMID 29775406.
- "MedMeSH summarizer: text mining for gene clusters". In Proceedings of the 2002 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics. 11 April 2002. pp. 548–565. doi:10.1137/1.9781611972726.32. ISBN 978-0-89871-517-0.
- "Comparative analysis of five protein-protein interaction corpora" (in en). BMC Bioinformatics 9 (Suppl 3): S6. April 2008. doi:10.1186/1471-2105-9-s3-s6. PMID 18426551.
- "PIE the search: searching PubMed literature for protein interaction information". Bioinformatics 28 (4): 597–8. February 2012. doi:10.1093/bioinformatics/btr702. PMID 22199390.
- "Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining". Bioinformatics 24 (16): i119–25. August 2008. doi:10.1093/bioinformatics/btn291. PMID 18689812.
- "Prioritization of candidate genes for cattle reproductive traits, based on protein-protein interactions, gene expression, and text-mining". Physiological Genomics 45 (10): 400–6. May 2013. doi:10.1152/physiolgenomics.00172.2012. PMID 23572538.
- "Analysis of biological processes and diseases using text mining approaches". Bioinformatics Methods in Clinical Research. Methods in Molecular Biology. 593. 2010. pp. 341–82. doi:10.1007/978-1-60327-194-3_16. ISBN 978-1-60327-193-6.
- "Multi-Dimensional, Phrase-Based Summarization in Text Cubes". IEEE Data Eng. Bull. 39 (3): 74–84. 2016. http://sites.computer.org/debull/A16sept/p74.pdf.
- "GeneView: a comprehensive semantic search engine for PubMed". Nucleic Acids Research 40 (Web Server issue): W585–91. July 2012. doi:10.1093/nar/gks563. PMID 22693219.
- "Biomedical literature: Testers wanted for article search tool". Nature 549 (7670): 31. September 2017. doi:10.1038/549031c. PMID 28880292. Bibcode: 2017Natur.549...31B.
- "Finding useful data across multiple biomedical data repositories using DataMed". Nature Genetics 49 (6): 816–819. May 2017. doi:10.1038/ng.3864. PMID 28546571.
- "Discovering and linking public omics data sets using the Omics Discovery Index". Nature Biotechnology 35 (5): 406–409. May 2017. doi:10.1038/nbt.3790. PMID 28486464.
- "Essie: a concept-based search engine for structured biomedical text". Journal of the American Medical Informatics Association 14 (3): 253–63. 2007-05-01. doi:10.1197/jamia.m2233. PMID 17329729.
- "OncoSearch: cancer gene search engine with literature evidence". Nucleic Acids Research 42 (Web Server issue): W416–21. July 2014. doi:10.1093/nar/gku368. PMID 24813447.
- "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics 28 (1): 21–8. May 2001. doi:10.1038/ng0501-21. PMID 11326270.
- "Linking microarray data to the literature". Nature Genetics 28 (1): 9–10. May 2001. doi:10.1038/ng0501-9. PMID 11326264.
- "GoPubMed: exploring PubMed with the Gene Ontology". Nucleic Acids Research 33 (Web Server issue): W783–6. July 2005. doi:10.1093/nar/gki470. PMID 15980585.
- Turchin, Alexander; Florez Builes, Luisa F. (2021-03-19). "Using Natural Language Processing to Measure and Improve Quality of Diabetes Care: A Systematic Review" (in en). Journal of Diabetes Science and Technology 15 (3): 553–560. doi:10.1177/19322968211000831. ISSN 1932-2968. PMID 33736486.
- "Clinical information extraction applications: A literature review". Journal of Biomedical Informatics 77: 34–49. January 2018. doi:10.1016/j.jbi.2017.11.011. PMID 29162496.
- "Towards a comprehensive medical language processing system: methods and issues". Proceedings: 595–9. 1997. PMID 9357695.
- "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications". Journal of the American Medical Informatics Association 17 (5): 507–13. 2010. doi:10.1136/jamia.2009.001560. PMID 20819853.
- "CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines". Journal of the American Medical Informatics Association 25 (3): 331–336. 2018. doi:10.1093/jamia/ocx132. PMID 29186491.
- Fries J, Wu S, Ratner A, Ré C (2017-04-20). "SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data". arXiv:1704.06360 [cs.CL].
- "SparkText: Biomedical Text Mining on Big Data Framework". PLOS ONE 11 (9): e0162721. 2016-09-29. doi:10.1371/journal.pone.0162721. PMID 27685652. Bibcode: 2016PLoSO..1162721Y.
- "NOBLE - Flexible concept recognition for large-scale biomedical natural language processing" (in en). BMC Bioinformatics 17 (1): 32. January 2016. doi:10.1186/s12859-015-0871-y. PMID 26763894.
- "BioNLP - ACL Anthology" (in en). https://aclanthology.coli.uni-saarland.de/venues/bionlp.
- "ISMB Proceedings". https://www.iscb.org/ismb-proceedings.
- "IEEE Xplore - Conference Home Page" (in en-US). https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1001586.
- "dblp: CIKM" (in en). https://dblp.uni-trier.de/db/conf/cikm/index.html.
- "PSB Proceedings". https://psb.stanford.edu/psb-online/.
- "dblp: Practical Applications of Computational Biology & Bioinformatics" (in en). https://dblp.org/db/conf/pacbb/index.html.
- "Text REtrieval Conference (TREC) Proceedings". https://trec.nist.gov/proceedings/proceedings.html.
- "Text-mining and information-retrieval services for molecular biology". Genome Biology 6 (7): 224. 2005. doi:10.1186/gb-2005-6-7-224. PMID 15998455.
- "Text mining for metabolic pathways, signaling cascades, and protein networks". Science's STKE 2005 (283): pe21. May 2005. doi:10.1126/stke.2832005pe21. PMID 15886388.
- "Text-mining approaches in molecular biology and biomedicine". Drug Discovery Today 10 (6): 439–45. March 2005. doi:10.1016/S1359-6446(05)03376-3. PMID 15808823.
- Biomedical Literature Mining Publications (BLIMP): a comprehensive and regularly updated index of publications on (bio)medical text mining
- Bio-NLP resources, systems and application database collection
- The BioNLP mailing list archives
- Corpora for biomedical text mining
- The BioCreative evaluations of biomedical text mining technologies
- Directory of people involved in BioNLP