w-shingling

{{DISPLAYTITLE:w-shingling}}

{{no footnotes|date=March 2023}}

In natural language processing a w-shingling is a set of unique shingles (therefore n-grams) each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity between documents. The symbol w denotes the quantity of tokens in each shingle selected, or solved for.

The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

:(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (Thus 4=n, thus 4-grams) is

:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } Which can then be reduced, or maximally shingled in this particular instance to { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

:r(A,B)={

S(A)\cap S(B)
\over
S(A)\cup S(B)
}

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

See also

References

{{incomplete citations|date=March 2023}}

  • {{cite web |last1=Broder |last2=Glassman |last3=Manasse |last4=Zweig |year=1997 |url=http://www.std.org/~msm/common/clustering.html |title=Syntactic Clustering of the Web |work=SRC Technical Note #1997-015}}
  • {{cite web |last=Manber |year=1993 |url=https://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/manber94finding.pdf |title=Finding Similar Files in a Large File System}} Does not yet use the term "shingling".
  • {{cite book|last1=Manning|first1=Christopher D.|last2=Raghavan|first2=Prabhakar|last3=Schütze|first3=Hinrich|title=Introduction to Information Retrieval|url={{google books |plainurl=y |id=t1PoSh4uwVcC}}|date=7 July 2008|publisher=Cambridge University Press|isbn=978-1-139-47210-4|chapter-url=http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html |chapter=w-shingling }}

{{Natural language processing}}

Category:Natural language processing