Gestalt Pattern Matching

Gestalt Pattern Matching,^[1] also known as Ratcliff/Obershelp Pattern Recognition,^[2] is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988.^[2]

Algorithm

The similarity of two strings [math]\displaystyle{ S_1 }[/math] and [math]\displaystyle{ S_2 }[/math] is twice the number of matching characters [math]\displaystyle{ K_m }[/math] divided by the total number of characters in both strings. The matching characters are those of the longest common substring (LCS) plus, recursively, the matching characters in the non-matching regions on either side of the LCS:^[2]

[math]\displaystyle{ D_{ro} = \frac{2K_m}{|S_1|+|S_2|} }[/math]^[3]

where the similarity metric can take a value between zero and one:

[math]\displaystyle{ 0 \leq D_{ro} \leq 1 }[/math]

A value of 1 means that the two strings match completely, whereas a value of 0 means that there is no match at all, not even one common character.
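The recursive definition above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function names are hypothetical, ties between equally long substrings are broken in favour of the one that starts earliest in the first string (an assumption), and two empty strings are given a similarity of 1.0 by convention (also an assumption).

```python
def longest_common_substring(s1: str, s2: str) -> tuple[int, int, int]:
    """Return (start in s1, start in s2, length) of the longest common substring.

    Ties are broken in favour of the earliest start in s1, then in s2
    (an assumption of this sketch).
    """
    best = (0, 0, 0)
    # prev[j] holds the length of the common suffix of s1[:i-1] and s2[:j].
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def matching_characters(s1: str, s2: str) -> int:
    """Compute K_m: the LCS length plus, recursively, the matches on both sides."""
    i, j, n = longest_common_substring(s1, s2)
    if n == 0:
        return 0
    left = matching_characters(s1[:i], s2[:j])
    right = matching_characters(s1[i + n:], s2[j + n:])
    return left + n + right


def ratcliff_obershelp(s1: str, s2: str) -> float:
    """Compute D_ro = 2 * K_m / (|S1| + |S2|)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # convention chosen for this sketch
    return 2.0 * matching_characters(s1, s2) / total
```

For two identical non-empty strings the whole string is the longest common substring, so the function returns 1.0; for two strings with no character in common it returns 0.0. The recursion depth grows with the number of matched blocks, so the sketch is intended for short strings.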

Sample

S₁	W	I	K	I	M	E	D	I	A
S₂	W	I	K	I	M	A	N	I	A

The longest common substring is WIKIM, with 5 characters. There is no non-matching region to its left. The non-matching regions to its right are EDIA and ANIA, whose longest common substring is IA, with length 2. The regions remaining to the left of IA, namely ED and AN, have no character in common, so the recursion stops. The similarity metric is therefore:

[math]\displaystyle{ \frac{2K_m}{|S_1|+|S_2|} = \frac{2 \cdot (|\text{''WIKIM''}|+|\text{''IA''}|)}{|S_1|+|S_2|} = \frac{2 \cdot (5 + 2)}{9 + 9} = \frac{14}{18} = 0.\overline{7} }[/math]
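The sample can be checked with Python's difflib, which is described under Applications as being based on this algorithm. The snippet below is a small sketch: passing `None` as the first argument to `SequenceMatcher` disables the junk filter, and for strings this short difflib's additional heuristics are not expected to play a role.

```python
from difflib import SequenceMatcher

matcher = SequenceMatcher(None, "WIKIMEDIA", "WIKIMANIA")

# Each block is (start in first string, start in second string, length).
# difflib always appends a zero-length sentinel block at the end.
blocks = matcher.get_matching_blocks()
for block in blocks:
    print(block)

# K_m is the total length of all matching blocks.
k_m = sum(block.size for block in blocks)
ratio = matcher.ratio()
print(k_m, ratio)
```

The two non-empty blocks correspond to WIKIM (length 5) and IA (length 2), so `k_m` is 7 and the ratio is 14/18, or about 0.778.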

Properties

Complexity

The execution time of the algorithm is [math]\displaystyle{ O(n^3) }[/math] in the worst case and [math]\displaystyle{ O(n^2) }[/math] in the average case. Changing the method of computation can improve the execution time significantly.^[1]

Commutative property

It can be shown that the Gestalt Pattern Matching algorithm is not commutative, that is, in general^[4]

[math]\displaystyle{ D_{ro}(S_1, S_2) \neq D_{ro}(S_2, S_1). }[/math]

Sample

For the two strings

[math]\displaystyle{ S_1 = \text{GESTALT PATTERN MATCHING} }[/math]

and

[math]\displaystyle{ S_2 = \text{GESTALT PRACTICE} }[/math]

the results are

[math]\displaystyle{ D_{ro}(S_1, S_2) = \frac{24}{40} }[/math] with the matching substrings GESTALT P, A, T, E, and

[math]\displaystyle{ D_{ro}(S_2, S_1) = \frac{26}{40} }[/math] with the matching substrings GESTALT P, R, A, C, I.
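The asymmetry can be reproduced with Python's difflib by swapping the argument order. This is only a sketch using difflib as a stand-in for the algorithm; the difference arises because, among equally long common substrings, difflib picks the one that starts earliest in its first argument, so the recursion splits the strings differently in the two directions.

```python
from difflib import SequenceMatcher

s1 = "GESTALT PATTERN MATCHING"
s2 = "GESTALT PRACTICE"

# Same two strings, opposite argument order.
forward = SequenceMatcher(None, s1, s2).ratio()
backward = SequenceMatcher(None, s2, s1).ratio()

print(forward, backward)
```

Here `forward` corresponds to 24/40 = 0.6 and `backward` to 26/40 = 0.65.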

Applications

The algorithm became a basis of the Python difflib library, which was introduced in version 2.1.^[1] Because of the unfavourable runtime behaviour of this similarity metric, three methods have been implemented: ratio(), quick_ratio() and real_quick_ratio(). The latter two return an upper bound on ratio() in less time.^[1] The fastest variant compares only the lengths of the two strings:^[5]

[math]\displaystyle{ D_{rqr} = \frac{2 \cdot \min(|S_1|, |S_2|)}{|S_1| + |S_2|} }[/math]

# Drqr Implementation in Python
def real_quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() very quickly."""
    l1, l2 = len(s1), len(s2)
    length = l1 + l2

    if not length:
        return 1.0

    return 2.0 * min(l1, l2) / length

The second upper bound is twice the number of characters of [math]\displaystyle{ S_1 }[/math] that also occur in [math]\displaystyle{ S_2 }[/math], counted with multiplicity but ignoring their order, divided by the total length of both strings:

[math]\displaystyle{ D_{qr} = \frac{2 \cdot \big | \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big |}{|S_1| + |S_2|} }[/math]

# Dqr Implementation in Python
import collections


def quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() relatively quickly."""
    length = len(s1) + len(s2)

    if not length:
        return 1.0

    intersect = collections.Counter(s1) & collections.Counter(s2)
    matches = sum(intersect.values())
    return 2.0 * matches / length

Trivially the following applies:

[math]\displaystyle{ 0 \leq D_{ro} \leq D_{qr} \leq D_{rqr} \leq 1 }[/math] and

[math]\displaystyle{ 0 \leq K_m \leq \big | \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big | \leq \min(|S_1|, |S_2|) \leq \frac{|S_1| + |S_2|}{2} }[/math].
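The chain of inequalities can be spot-checked with the three corresponding difflib methods. The snippet is a sketch: the sample pairs are chosen for illustration only, and a check on a few inputs is of course not a proof.

```python
from difflib import SequenceMatcher

# Illustrative sample pairs (chosen for this sketch).
pairs = [
    ("WIKIMEDIA", "WIKIMANIA"),
    ("GESTALT PATTERN MATCHING", "GESTALT PRACTICE"),
    ("abc", "xyz"),
]

results = []
for a, b in pairs:
    matcher = SequenceMatcher(None, a, b)
    d_ro = matcher.ratio()              # full Ratcliff/Obershelp-style ratio
    d_qr = matcher.quick_ratio()        # multiset-intersection upper bound
    d_rqr = matcher.real_quick_ratio()  # length-only upper bound
    results.append((d_ro, d_qr, d_rqr))
    print(a, b, d_ro, d_qr, d_rqr)
```

For the last pair the strings share no characters, so the first two values are 0 while the length-only bound is still 1, which shows how loose the fastest bound can be.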

References

[1] difflib — Helpers for computing deltas, in the Python documentation
[2] National Institute of Standards and Technology: Ratcliff/Obershelp pattern recognition
[3] Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms in spell check, May 2014 (PDF)
[4] How does Pythons SequenceMatcher work? at stackoverflow.com
[5] Borrowed from Python 3.7.0, difflib.py, lines 38–41 and 676–686

