Document-term matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document in a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix, where "features" may refer to other properties of a document besides terms.[1] It is also common to encounter the transpose, or term-document matrix, where documents are the columns and terms are the rows. Both are useful in the field of natural language processing and computational text analysis.[2]

While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as row normalizing (i.e. relative frequency/proportions) and tf-idf.

Terms are commonly single words separated by whitespace or punctuation on either side (a.k.a. unigrams). In such a case, this is also referred to as a "bag of words" representation because the counts of individual words are retained, but not the order of the words in the document.

General concept

When creating a dataset of terms that appear in a corpus of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each cell ij, then, is the number of times word j occurs in document i. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I dislike databases",

then the document-term matrix would be:

      I    like   dislike   databases
D1    1    1      0         1
D2    1    0      1         1

which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document. For this reason, document-term matrices are usually stored in a sparse matrix format.
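
The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names build_dtm and to_sparse are hypothetical, tokenization is assumed to be a plain whitespace split, and the vocabulary is ordered by first appearance, so the columns come out in a different order than in the table above.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Assumption: whitespace tokenization only; real pipelines usually also
    # lowercase the text and strip punctuation.
    tokenized = [doc.split() for doc in documents]

    # Corpus vocabulary, in order of first appearance.
    vocabulary = []
    seen = set()
    for tokens in tokenized:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)

    # One row per document; Counter returns 0 for terms the document lacks.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

def to_sparse(matrix):
    """Keep only the non-zero cells as a {(row, column): count} dictionary."""
    return {(i, j): value
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value != 0}

documents = ["I like databases", "I dislike databases"]
vocabulary, dtm = build_dtm(documents)
sparse = to_sparse(dtm)

print(vocabulary)  # ['I', 'like', 'databases', 'dislike']
print(dtm)         # [[1, 1, 1, 0], [1, 0, 1, 1]]
print(len(sparse)) # 6 of the 8 cells are non-zero
```

With only two short documents the sparse form saves little, but in a realistic corpus most cells are zero, which is why a dictionary of non-zero cells (or a dedicated sparse matrix type) is usually preferred.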

As a result of the power-law distribution of tokens in nearly every corpus (see Zipf's law), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). To give more weight to the words that are most distinctive of an individual document compared to the corpus as a whole, it is common to use tf-idf, which multiplies the term frequency by the inverse of the term's document frequency (the inverse is usually log-scaled).
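
The weighting schemes can be illustrated on the two-document example. The sketch below makes some assumptions that the text does not fix: the function names are hypothetical, the log count is taken as log(1 + count) so that zero counts stay at zero, and tf-idf uses the common variant count × log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants exist.

```python
import math

# Columns: I, like, dislike, databases
dtm = [
    [1, 1, 0, 1],  # D1 = "I like databases"
    [1, 0, 1, 1],  # D2 = "I dislike databases"
]

def relative_frequency(matrix):
    """Divide each count by the total number of tokens in its document."""
    return [[c / (sum(row) or 1) for c in row] for row in matrix]

def prop_max(matrix):
    """Divide each count by the largest count in its document."""
    return [[c / (max(row) or 1) for c in row] for row in matrix]

def log_count(matrix):
    """Assumed variant: log(1 + count), which maps a zero count to zero."""
    return [[math.log1p(c) for c in row] for row in matrix]

def tf_idf(matrix):
    """Assumed variant: count * log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

for name, scheme in [("relative frequency", relative_frequency),
                     ("prop max", prop_max),
                     ("log count", log_count),
                     ("tf-idf", tf_idf)]:
    print(name, scheme(dtm))
```

Under this tf-idf variant, "I" and "databases" occur in both documents and receive weight 0, while "like" and "dislike" each occur in one document and receive weight log 2, so only the distinguishing terms remain.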

History of the concept

The document-term matrix emerged in the earliest years of the computerization of text. The increasing capacity for storing documents created the problem of retrieving a given document in an efficient manner. While previously the work of classifying and indexing was accomplished by hand, researchers explored the possibility of doing this automatically using word frequency information.

One of the first published document-term matrices was in Harold Borko's 1962 article "The construction of an empirically based mathematically derived classification system" (page 282, see also his 1965 article[3]). Borko references two computer programs: "FEAT," which stood for "Frequency of Every Allowable Term," written by John C. Olney of the System Development Corporation, and the Descriptor Word Index Program, written by Eileen Stone, also of the System Development Corporation:

Having selected the documents which were to make up the experimental library, the next step consisted of keypunching the entire body of text preparatory to computer processing. The program used for this analysis was FEAT (Frequency of Every Allowable Term). It was written by John C. Olney of the System Development Corporation and is designed to perform frequency and summary counts of individual words and of word pairs. The output of this program is an alphabetical listing, by frequency of occurrence, of all word types which appeared in the text. Certain function words such as and, the, at, a, etc., were placed in a "forbidden word list" table, and the frequency of these words was recorded in a separate listing... A special computer program, called the Descriptor Word Index Program, was written to provide this information and to prepare a document-term matrix in a form suitable for in-put to the Factor Analysis Program. The Descriptor Word Index program was prepared by Eileen Stone of the System Development Corporation.[4]

Shortly thereafter, Gerard Salton published "Some hierarchical models for automatic document retrieval" in 1963, which also included a visual depiction of a document-term matrix.[5] Salton was at Harvard University at the time, and his work was supported by the Air Force Cambridge Research Laboratories and Sylvania Electric Products, Inc. In this paper, Salton introduced the document-term matrix by comparison to a kind of term-context matrix used to measure similarities between words:

If it is desired to generate document associations or document clusters instead of word associations, the same procedures can be used with slight modifications. Instead of starting with a word-sentence matrix C,... it is now convenient to construct a word-document matrix F, listing frequency of occurrence of word Wi in Document Dj... Document similarities can now be computed as before by comparing pairs of rows and by obtaining similarity coefficients based on the frequency of co-occurrences of the content words included in the given document. This procedure produces a document-document similarity matrix which can in turn be used for the generation of document clusters...[5]

In addition to Borko and Salton, F.W. Lancaster published a comprehensive review of automated indexing and retrieval in 1964. Although the work was published while he worked at Herner and Company in Washington D.C., the paper was written while he was "employed in research work at Aslib, on the Aslib Cranfield Project."[6] Lancaster credits Borko with the document-term matrix:

Harold Borko, of the System Development Corporation, has carried this operation a little further. A significant group of clue words is chosen from the vocabulary of an experimental collection. These are arranged in a document/term matrix to show the frequency of occurrence of each term in each document.... A correlation coefficient for each word pair is then computed, based on their co-occurrence in the document set. The resulting term/term matrix... is then factor analysed and a series of factors are isolated. These factors, when interpreted and named on the basis of the terms with high loadings which appear in each of the factors, become the classes of an empirical classification. The terms with high loadings in each factor are the clue words or predictors of the categories.

Choice of terms

One way to view the matrix is that each row represents a document as a vector. In the vectorial semantic model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
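
As a rough illustration of the last point, the sketch below treats every adjacent word pair (bigram) as a stand-in for a collocation and compares cosine similarities with and without those extra terms. The three example texts, the function names and the underscore-joined bigram labels are all hypothetical, and real collocation extraction would normally keep only statistically significant pairs. Each Counter plays the role of one sparse row of the document-term matrix.

```python
import math
from collections import Counter

def terms(text, use_bigrams=False):
    """Count unigrams and, optionally, adjacent word pairs as extra terms."""
    tokens = text.lower().split()
    result = list(tokens)
    if use_bigrams:
        # Assumption: every adjacent pair counts as a collocation candidate.
        result += [first + "_" + second for first, second in zip(tokens, tokens[1:])]
    return Counter(result)

def cosine(u, v):
    """Cosine similarity between two sparse term-count rows."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = "hot dog stand"
b = "hot dog vendor"
c = "dog in hot weather"

for use_bigrams in (False, True):
    row_a, row_b, row_c = (terms(d, use_bigrams) for d in (a, b, c))
    print(use_bigrams,
          round(cosine(row_a, row_b), 3),
          round(cosine(row_a, row_c), 3))
# False 0.667 0.577
# True 0.6 0.338
```

With unigrams only, the third text looks almost as similar to the first as the second does, because all three share "hot" and "dog". Once the pair "hot_dog" is a term of its own, the gap between the two similarities widens in this toy case.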

Applications

Improving search results

Latent semantic analysis (LSA, performing singular-value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
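
The synonym effect can be sketched on a toy corpus. To stay dependency-free, the example below does not call a library SVD routine; it approximates the leading right singular vectors with power iteration and deflation on the term-by-term Gram matrix, which is a swapped-in technique adequate only for tiny matrices. The corpus, the choice of two latent dimensions and all function names are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leading_right_singular_vectors(A, k, steps=500):
    """Approximate the top-k right singular vectors of A (power iteration + deflation)."""
    n_terms = len(A[0])
    # Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
    gram = [[sum(row[i] * row[j] for row in A) for j in range(n_terms)]
            for i in range(n_terms)]
    basis = []
    for _ in range(k):
        v = normalize([1.0] * n_terms)
        for _step in range(steps):
            v = normalize(matvec(gram, v))
        eigenvalue = sum(x * y for x, y in zip(v, matvec(gram, v)))
        basis.append(v)
        # Deflation: remove the component just found before looking for the next.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n_terms)]
                for i in range(n_terms)]
    return basis

def project(row, basis):
    """Map a term-count vector into the latent space spanned by the basis."""
    return [sum(x * b for x, b in zip(row, vec)) for vec in basis]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Columns: car, automobile, engine, flower, petal, garden
A = [
    [1, 1, 1, 0, 0, 0],  # "car automobile engine"
    [1, 0, 1, 0, 0, 0],  # "car engine"
    [0, 1, 1, 0, 0, 0],  # "automobile engine"
    [0, 0, 0, 1, 1, 0],  # "flower petal"
    [0, 0, 0, 1, 1, 1],  # "flower garden petal"
]
query = [1, 0, 0, 0, 0, 0]  # the single-word query "car"

basis = leading_right_singular_vectors(A, k=2)
latent_query = project(query, basis)
latent_docs = [project(row, basis) for row in A]

print("raw cosine, query vs 'automobile engine':", cosine(query, A[2]))
print("latent cosine, query vs 'automobile engine':",
      round(cosine(latent_query, latent_docs[2]), 3))
print("latent cosine, query vs 'flower petal':",
      round(cosine(latent_query, latent_docs[3]), 3))
```

In the raw term space the query "car" shares no term with the document "automobile engine", so their cosine similarity is zero. In the two-dimensional latent space the two end up pointing in nearly the same direction, because "car" and "automobile" co-occur with the same words elsewhere in the corpus, while the flower documents remain nearly orthogonal to the query.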

Finding topics

Multivariate analysis of the document-term matrix can reveal the topics or themes of the corpus. Specifically, latent semantic analysis and data clustering can be used; more recently, probabilistic latent semantic analysis, its generalization latent Dirichlet allocation, and non-negative matrix factorization have been found to perform well for this task.

See also

  • Bag of words model

Implementations

  • Gensim: Open source Python framework for Vector Space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text plus common transformations (tf-idf, LSA, LDA).

References

  1. "Document-feature matrix :: Tutorials for quanteda". https://tutorials.quanteda.io/basic-operations/dfm/. 
  2. "15 Ways to Create a Document-Term Matrix in R" (in en-US). https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r. 
  3. Borko, Harold (1965). "A Factor Analytically Derived Classification System for Psychological Reports". Perceptual and Motor Skills 20 (2): 393–406. doi:10.2466/pms.1965.20.2.393. ISSN 0031-5125. PMID 14279310. http://dx.doi.org/10.2466/pms.1965.20.2.393. 
  4. Borko, Harold (1962). "The construction of an empirically based mathematically derived classification system". Proceedings of the May 1-3, 1962, spring joint computer conference on - AIEE-IRE '62 (Spring). New York, New York, USA: ACM Press. pp. 279–289. doi:10.1145/1460833.1460865. ISBN 9781450378758. 
  5. Salton, Gerard (July 1963). "Some hierarchical models for automatic document retrieval". American Documentation 14 (3): 213–222. doi:10.1002/asi.5090140307. ISSN 0096-946X. http://dx.doi.org/10.1002/asi.5090140307. 
  6. Lancaster, F.W. (1964-01-01). "Mechanized Document Control: A Review of Some Recent Research". ASLIB Proceedings 16 (4): 132–152. doi:10.1108/eb049960. ISSN 0001-253X. https://doi.org/10.1108/eb049960.