Markovian discrimination

From HandWiki
Short description: Technique for filtering spam e-mail

In probability theory, Markovian discrimination in spam filtering is a method used in CRM114 and other spam filters to model the statistical behavior of spam and non-spam more accurately than simple Bayesian methods allow.[1] A simple Bayesian model of written text contains only a dictionary of legal words and their relative probabilities, whereas a Markovian model adds relative transition probabilities that, given one word, predict what the next word will be. The approach is based on the theory of Markov chains by Andrey Markov, hence the name. In essence, a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.

Comparison with Bayesian Model and Spam Filtering Application

A simple Bayesian model represents a message as a bag of words, examining a dictionary of legitimate words together with their relative probabilities. A Markovian model enriches this representation with relative transition probabilities, which predict the next word from the preceding word. The fundamental distinction between the two filters lies in their scope: a Bayesian filter examines individual words in isolation, while a Markovian filter extends its analysis to phrases or complete sentences. Markovian filters used in spam filtering are not limited to the word level; they can also process letter or partial-word tokens. Additionally, weights can be assigned to these tokens based on their probability of appearing in spam or legitimate content, further enhancing the accuracy of the filter.[2]
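The distinction can be pictured with a toy sketch: the same smoothed log-likelihood comparison run once over single words (the Bayesian, bag-of-words view) and once over word pairs (the Markovian, transition view). The corpora, tokenizer, and smoothing here are invented for illustration and are far simpler than what CRM114 actually does:

```python
from collections import Counter
import math

def words(text):
    # deliberately naive whitespace tokenizer, for illustration only
    return text.lower().split()

def unigrams(text):
    # bag-of-words features: each word in isolation (Bayesian view)
    return words(text)

def bigrams(text):
    # transition features: each word paired with its successor (Markovian view)
    w = words(text)
    return list(zip(w, w[1:]))

def train(messages, featurize):
    # count how often each feature occurs in one class of messages
    counts = Counter()
    for m in messages:
        counts.update(featurize(m))
    return counts

def log_likelihood(text, counts, featurize, alpha=1.0):
    # Laplace-smoothed log-probability of the message's features under a class
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[f] + alpha) / (total + alpha * vocab))
               for f in featurize(text))

# tiny invented corpora
spam = ["win free money now", "free money offer"]
ham  = ["meeting notes attached", "see notes from the meeting"]

verdicts = {}
for name, featurize in (("unigram", unigrams), ("bigram", bigrams)):
    s = train(spam, featurize)
    h = train(ham, featurize)
    msg = "free money"
    verdicts[name] = log_likelihood(msg, s, featurize) > log_likelihood(msg, h, featurize)
```

Under the bigram featurizer, the pair ("free", "money") carries evidence that the two unigrams "free" and "money" cannot express separately, which is exactly the extra signal a Markovian filter exploits.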

Markov Models

There are two variants of Markov models: the visible Markov model and the hidden Markov model (HMM). The difference lies in the concept of the current word: with a visible Markov model, the current word is considered to contain the entire information about previous states of the language model, whereas a hidden Markov model obscures the relationship, assuming that the current word is only probabilistically linked to the previously read words.[3]

Application of Hidden Markov Models

To illustrate, in a visible Markov model, the word "the" should predict the subsequent word with a high degree of accuracy. In contrast, a hidden Markov model implies that the entire preceding text indicates the actual state and foresees the subsequent words, but it does not provide an absolute guarantee of that state or prediction. As spam filtering typically encounters the latter scenario, hidden Markov models are predominantly employed. Moreover, due to storage constraints, a related model, the Markov random field, proves particularly suitable, typically with a 'sliding window' or clique size ranging between four and six tokens.[4]
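A sliding window over the token stream can be sketched as follows: each token is paired with every earlier token inside the window, together with the gap between them. This loosely mirrors the sparse windowed features used by MRF-style filters such as CRM114, but the exact feature shape and the window size of 5 here are illustrative assumptions, not CRM114's actual feature set:

```python
def window_features(tokens, window=5):
    """Pair each token with each preceding token within the sliding
    window, recording the gap; one such triple per (earlier, later)
    combination. Illustrative sketch of MRF-style windowed features."""
    feats = []
    for i, tok in enumerate(tokens):
        for gap in range(1, window):
            if i - gap >= 0:
                feats.append((tokens[i - gap], gap, tok))
    return feats
```

For the stream `["a", "b", "c"]` with a window of 3, this emits the pairs (a, b), (b, c), and (a, c) annotated with their gaps, so a classifier can weight short-range and long-range co-occurrences differently, as the variable weighting schemes in reference [1] do.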

References

  1. Chhabra, S., Yerazunis, W. S., and Siefkes, C. (2004). Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), November 1–4, 2004. IEEE Computer Society, Washington, DC.
  2. Duarte, J. P., Leite, V. C., & Frutuoso e Melo, P. F. F. (January 2013). A comparison between Markovian models and Bayesian networks for treating some dependent events in reliability evaluations. 2013 International Nuclear Atlantic Conference, Recife, Brazil.
  3. Jurafsky, Daniel & Martin, James H. (2023). Speech and Language Processing. Stanford University.
  4. Yerazunis, W. S. (January 2004). The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It. Mitsubishi Electric Research Laboratories, TR2004-091.