Query likelihood model

From HandWiki
Revision as of 17:29, 6 March 2023 by S.Timg
Short description: Language model

The query likelihood model is a language model[1][2] used in information retrieval. A language model is constructed for each document in the collection, and documents are then ranked by the probability of each document given the query. This probability is interpreted as the likelihood that the document is relevant to the query.

Calculating the likelihood

Using Bayes' rule, the probability [math]\displaystyle{ P(d|q) }[/math] of a document [math]\displaystyle{ d }[/math] given a query [math]\displaystyle{ q }[/math] can be written as follows:

[math]\displaystyle{ P(d|q) = \frac{P(q|d) P(d)}{P(q)} }[/math]

Since the probability of the query [math]\displaystyle{ P(q) }[/math] is the same for all documents, this can be ignored. Further, it is typical to assume that the probability of documents is uniform. Thus, [math]\displaystyle{ P(d) }[/math] is also ignored.

[math]\displaystyle{ P(d|q) \propto P(q|d) }[/math]
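The rank equivalence above can be checked numerically. The sketch below uses hypothetical query likelihoods for three documents; with a uniform prior [math]\displaystyle{ P(d) }[/math] and a shared normaliser [math]\displaystyle{ P(q) }[/math], sorting by the posterior [math]\displaystyle{ P(d|q) }[/math] produces the same order as sorting by the likelihood [math]\displaystyle{ P(q|d) }[/math] alone:

```python
# Toy illustration with assumed (hypothetical) query likelihoods.
p_q_given_d = {"d1": 0.008, "d2": 0.002, "d3": 0.020}

p_d = 1 / len(p_q_given_d)                        # uniform document prior P(d)
p_q = sum(p * p_d for p in p_q_given_d.values())  # normaliser P(q), same for all d

# Posterior P(d|q) = P(q|d) P(d) / P(q)
posterior = {d: p * p_d / p_q for d, p in p_q_given_d.items()}

rank_posterior = sorted(posterior, key=posterior.get, reverse=True)
rank_likelihood = sorted(p_q_given_d, key=p_q_given_d.get, reverse=True)
assert rank_posterior == rank_likelihood  # identical rankings
```

Since [math]\displaystyle{ P(d) }[/math] and [math]\displaystyle{ P(q) }[/math] scale every document's score by the same factor, they can be dropped without changing the ranking.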

Documents are then ranked by the probability that a query is observed as a random sample from the document model. The multinomial unigram language model is commonly used to achieve this. We have:

[math]\displaystyle{ P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,q}} }[/math], where the multinomial coefficient is [math]\displaystyle{ K_q = L_q!/(tf_{t_1,q}!\,tf_{t_2,q}!\cdots tf_{t_N,q}!) }[/math] for query [math]\displaystyle{ q }[/math],

and [math]\displaystyle{ L_q = \sum_{1 \leq i \leq N} tf_{t_i,q} }[/math] is the length of query [math]\displaystyle{ q }[/math], where [math]\displaystyle{ tf_{t_i,q} }[/math] is the frequency of term [math]\displaystyle{ t_i }[/math] in the query and [math]\displaystyle{ N }[/math] is the size of the vocabulary [math]\displaystyle{ V }[/math].
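The multinomial coefficient [math]\displaystyle{ K_q }[/math] can be computed directly from the query's term frequencies. A minimal sketch (the function name is illustrative):

```python
from math import factorial
from collections import Counter

def multinomial_coefficient(query_terms):
    """K_q = L_q! / (tf_1! tf_2! ... tf_N!) from the query's term frequencies."""
    tf = Counter(query_terms)       # term frequencies tf_{t_i,q}
    length = sum(tf.values())       # query length L_q
    k = factorial(length)
    for count in tf.values():
        k //= factorial(count)      # divide by each tf_{t_i,q}!
    return k

# For the query "to be or not to be": L_q = 6, tf = {to: 2, be: 2, or: 1, not: 1},
# so K_q = 6! / (2! 2! 1! 1!) = 180.
```

Because [math]\displaystyle{ K_q }[/math] depends only on the query, it is identical for every document scored against that query.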

In practice the multinomial coefficient is usually removed from the calculation, because it depends only on the query's term frequencies and is therefore the same for every document scored against a given query [math]\displaystyle{ q }[/math]. The language model [math]\displaystyle{ M_d }[/math] should be the true language model underlying the word distribution of each retrieved document. In practice this model is unknown, so it is approximated by estimating, for each term (unigram) of the retrieved document, its probability of occurrence. Thus [math]\displaystyle{ P(t|M_d) }[/math] is the probability of term [math]\displaystyle{ t }[/math] being generated by the language model [math]\displaystyle{ M_d }[/math] of document [math]\displaystyle{ d }[/math]. Multiplying these probabilities over all terms of query [math]\displaystyle{ q }[/math] yields a score for document [math]\displaystyle{ d }[/math] in the interval [math]\displaystyle{ [0,1] }[/math]. Repeating the calculation for every document produces a ranking of the whole collection.[3]
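The scoring procedure described above can be sketched as follows, using maximum-likelihood (relative-frequency) estimates for [math]\displaystyle{ P(t|M_d) }[/math] and dropping the constant multinomial coefficient. The toy documents and query are hypothetical; note that a query term absent from a document drives the whole product to zero, which is why practical systems smooth the estimates (omitted here for clarity):

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q|M_d) under a maximum-likelihood unigram model of the document,
    with the constant multinomial coefficient K_q dropped."""
    tf = Counter(doc_terms)
    length = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= tf[t] / length   # P(t|M_d) estimated by relative frequency
    return p

# Hypothetical toy collection and query.
docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal shears are sold here".split(),
}
query = "shears click".split()

# Rank all documents by query likelihood.
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
```

Here "d1" scores (1/8) * (4/8) = 0.0625, while "d2" scores 0 because it never contains "click", so the ranking is ["d1", "d2"].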

References

  1. Ponte, Jay M.; Croft, W. Bruce (1998). "A language modeling approach to information retrieval". Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008. 
  2. Hiemstra, Djoerd (1998). "A linguistically motivated probabilistic model of information retrieval". Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34. 
  3. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). Introduction to Information Retrieval. Cambridge University Press. p. 241.