Trigram search
Trigram search is a method of searching for text when the exact syntax or spelling of the target object is not precisely known[1] or when queries may be regular expressions.[2] It finds objects which match the maximum number of three consecutive character strings (i.e. trigrams) in the entered search terms, which are generally near matches.[3] Two strings with many shared trigrams can be expected to be very similar.[4] Trigrams also allow for efficiently creating search engine indexes for searches that are regular expressions or match the text inexactly. Indexes can significantly accelerate searches.[5][6] A threshold for number of trigram matches can be specified as a cutoff point, after which a result is no longer considered a match.[4] Using trigrams for accelerating searches is a technique used in some systems for code searching, in situations in which queries that are regular expressions may be useful,[5][2][7] in search engines such as Elasticsearch,[8] as well as in databases such as PostgreSQL.[4]
Examples
Consider the string "alice". The trigrams of the string would be "ali", "lic", and "ice," not including spaces.[5] Searching for this string in a database with a trigram-based index would involve finding which objects contain as many of the three trigrams as possible.
As a concrete example of using trigram search to search for a regular expression query, consider searching for the string ab[cd]e
, where the brackets denote that the third character in the string being searched for could be c
or d
. In this situation, one could query the index for objects that have the two trigrams abc
and bce
or the two trigrams abd
and bde
. Thus, finding this query would involve no string matching, and could just query the index directly, which can be faster in practice.[2]
See also
References
- ↑ Hardarson, Omar (1997). "Interactive coding of economic activity using trigram search in BLAISE III". International Blaise User Group. http://www.blaiseusers.org/1997/papers/hardar97.pdf. Note: This article discusses trigram search as a way of efficiently encoding certain kinds of economic data, and finds that this technique is particularly useful when the users of the system don't have a lot of context for the structure of the data.
- ↑ 2.0 2.1 2.2 Cox, Russ (January 2012). "Regular Expression Matching with a Trigram Index or How Google Code Search Worked". https://swtch.com/~rsc/regexp/regexp4.html.
- ↑ Adams, Elizabeth; Meltzer, Arnold (1 March 1993). "Trigrams as index element in full text retrieval: Observations and experimental results". Proceedings of the 1993 ACM conference on Computer science - CSC '93. pp. 433–439. doi:10.1145/170791.170891. ISBN 0897915585. https://dl.acm.org/doi/10.1145/170791.170891.
- ↑ 4.0 4.1 4.2 "F.33. pg_trgm" (in en). 2022-05-12. https://www.postgresql.org/docs/14/pgtrgm.html.
- ↑ 5.0 5.1 5.2 "Fast Search Using PostgreSQL Trigram Text Indexes" (in en). 2016-03-18. https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/.
- ↑ Zobel, Justin; Moffat, Alistair; Sacks-Davis, Ron (1993). "Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files". Conference on Very Large Databases (VLDB). http://www.vldb.org/conf/1993/P290.PDF. Note: This research paper does not use the term "trigram search" but does seem to be the first instance in the literature of using n-grams as indices, and is cited in the Russ Cox article as the first instance of the structure of a reverse index based on trigrams. The paper also cited successful performance results from using this style of search.
- ↑ "Big Grep" (in en). https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=507972. Also called BigGrep. Uses n-grams (not always 3),
- ↑ "N-gram tokenizer | Elasticsearch Guide [8.2 | Elastic"] (in en-us). https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html.
Original source: https://en.wikipedia.org/wiki/Trigram search.
Read more |