Second-order co-occurrence pointwise mutual information

From HandWiki
Short description: Semantic similarity measure

In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar.

For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning.

The method is called "second-order" because it doesn't look at the direct co-occurrence of the target words (which would be first-order), but at the co-occurrence of their neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI).

History

The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities. The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method.

Methodology

The SOC-PMI algorithm measures the similarity between two words, w1 and w2, in several steps.

Step 1: Score neighboring words with PMI

First, for each target word (w1 and w2), the algorithm identifies its "neighbor" words within a certain text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word ti and its neighbor w is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance.

The PMI between a target word ti and a neighbor word w is calculated as:

fpmi(ti,w)=log2fb(ti,w)×mft(ti)ft(w)

where:

  • fb(ti,w) is the number of times ti and w appear together in the context window.
  • ft(ti) is the total number of times ti appears in the corpus.
  • ft(w) is the total number of times w appears in the corpus.
  • m is the total number of tokens (words) in the corpus.

Step 2: Create a semantic 'signature' for each word

For each target word (w1 and w2), the algorithm creates a list of its most significant neighbors. This is done by taking the top β neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors, Xw, acts as a semantic "signature" for the word w.

Xw={Xiw}, for i=1,2,,β

The size of this list, β, is a parameter of the method.

Step 3: Compare the signatures

The algorithm then compares the signatures of w1 and w2. It looks for words that are present in both signatures. The similarity of w1 to w2 is calculated by summing the PMI scores of w2 with every word in w1's signature list.

The β-PMI summation function defines this score. The score for w1 with respect to w2 is:

f(w1,w2,β)=i=1β(fpmi(Xiw1,w2))γ

This sum only includes terms where the PMI value is positive. The exponent γ (with a value > 1) is used to give more weight to neighbors that are more strongly associated with w2.

This calculation is done in both directions:

  • The similarity of w1 with respect to w2:
f(w1,w2,β1)=i=1β1(fpmi(Xiw1,w2))γ
  • The similarity of w2 with respect to w1:
f(w2,w1,β2)=i=1β2(fpmi(Xiw2,w1))γ

Step 4: Calculate final similarity score

Finally, the total semantic similarity is the average of the two scores from the previous step.

Sim(w1,w2)=f(w1,w2,β1)β1+f(w2,w1,β2)β2

This score can be normalized to fall between 0 and 1. For example, using this method, the words cemetery and graveyard achieve a high similarity score of 0.986 (with specific parameter settings).

References