w-shingling
In natural language processing a w-shingling is a set of unique shingles (therefore n-grams) each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity between documents. The symbol w denotes the quantity of tokens in each shingle selected, or solved for.
The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:
- (a,rose,is,a,rose,is,a,rose)
The set of all contiguous sequences of 4 tokens (Thus 4=n, thus 4-grams) is
- { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } Which can then be reduced, or maximally shingled in this particular instance to { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.
Resemblance
For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or
- [math]\displaystyle{ r(A,B)={{|S(A)\cap S(B)|}\over {|S(A)\cup S(B)|}} }[/math]
where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.
See also
- Bag-of-words model
- Concept mining
- k-mer
- MinHash
- N-gram
- Rabin fingerprint
- Rolling hash
- Vector space model
References
- Broder; Glassman; Manasse; Zweig (1997). "Syntactic Clustering of the Web". SRC Technical Note #1997-015. http://www.std.org/~msm/common/clustering.html.
- Manber (1993). "Finding Similar Files in a Large File System". http://webglimpse.net/pubs/TR93-33.pdf. Does not yet use the term "shingling".
- Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (7 July 2008). "w-shingling". Introduction to Information Retrieval. Cambridge University Press. ISBN 978-1-139-47210-4. http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html.
Original source: https://en.wikipedia.org/wiki/W-shingling.
Read more |