Generalized suffix array

From HandWiki

In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings [math]\displaystyle{ S = S_1, S_2, S_3, ..., S_k }[/math] of total length [math]\displaystyle{ n }[/math], it is a lexicographically sorted array of all suffixes of each string in [math]\displaystyle{ S }[/math]. It is primarily used in bioinformatics and string processing.

Functionality

The functionality of a generalized suffix array is as follows:[1]

  • For a collection or set of strings, [math]\displaystyle{ S = S_1, S_2, S_3, ..., S_k }[/math].
  • It is a lexicographically sorted array of all suffixes of each string in the set [math]\displaystyle{ S }[/math].
  • In the array, each suffix is represented by an integer pair [math]\displaystyle{ (i, j) }[/math] which denotes the suffix starting from position [math]\displaystyle{ j }[/math] in [math]\displaystyle{ s_i }[/math].
  • In the case where different strings in [math]\displaystyle{ S }[/math] have identical suffixes, in the generalized suffix array, those suffixes will occupy consecutive positions. However, for convenience, the exception can be made where repeats will not be listed.

A generalized suffix array can be generated for a generalized suffix tree. When compared to a generalized suffix tree, while the generalized suffix array will require more time to construct, it will use less space than the tree.

Construction Algorithms and Implementations

Algorithms and tools for constructing a generalized suffix array include:

  • Fei Shi's (1996) algorithm which runs in [math]\displaystyle{ \mathcal{O}(N \log n) }[/math] worst case time and [math]\displaystyle{ \mathcal{O}(N) }[/math] space, where [math]\displaystyle{ N }[/math] is the sum of the lengths of all strings in [math]\displaystyle{ S }[/math] and [math]\displaystyle{ n }[/math] the length of the longest string in [math]\displaystyle{ S }[/math]. This includes sorting, searching and finding the longest common prefixes.[1]
  • The external generalized enhanced suffix array, or eGSA, construction algorithm which specializes in external memory construction, is particularly useful when the size of the input collection or data structure is larger than the amount of available internal memory[2]
  • gsufsort is an open-source, fast, portable and lightweight tool for the construction of generalized suffix arrays and related data structures like Burrows–Wheeler transform or LCP Array)[3]
  • Mnemonist, a collection of data structures implemented in JavaScript contains an implementation for a generalized suffix tree and can be found publicly on npm and GitHub.[4]

Solving the Pattern Matching Problem

Generalized suffix arrays can be used to solve the pattern matching problem:[5]

  • Given a pattern [math]\displaystyle{ P }[/math] and a text [math]\displaystyle{ T }[/math], find all occurrences of [math]\displaystyle{ P }[/math] in [math]\displaystyle{ T }[/math].
  • Using the generalized suffix array [math]\displaystyle{ G }[/math] of [math]\displaystyle{ T }[/math], then first, the suffixes that have [math]\displaystyle{ P }[/math] as a prefix need to be found.
  • Since [math]\displaystyle{ G }[/math] is a lexicographically sorted array of the suffixes of [math]\displaystyle{ T }[/math], then all such suffixes will appear in consecutive positions within [math]\displaystyle{ G }[/math]. Particularly important, since [math]\displaystyle{ G }[/math] is sorted, it makes identification of suffixes possible and easy using binary search.
  • Using binary search, first find the smallest index [math]\displaystyle{ i }[/math] in [math]\displaystyle{ G }[/math] such that [math]\displaystyle{ suff_G[i] }[/math] contains [math]\displaystyle{ P }[/math] as a prefix, or determine that no such suffix is present. In the case where the suffix is not found, [math]\displaystyle{ P }[/math] does not occur in [math]\displaystyle{ T }[/math]. Otherwise, find the largest index [math]\displaystyle{ j(\geq i) }[/math] which contains [math]\displaystyle{ P }[/math] as a prefix. The elements in the range [math]\displaystyle{ G[i..j] }[/math] indicate the starting positions of the occurrences of [math]\displaystyle{ P }[/math] in [math]\displaystyle{ T }[/math].
  • Binary search on [math]\displaystyle{ G }[/math] takes [math]\displaystyle{ \Theta (log n) }[/math] comparisons. [math]\displaystyle{ P }[/math] is compared with a suffix to determine their lexicographic order in each comparison that is done. Thus, this requires comparing at most [math]\displaystyle{ |P| = m }[/math] characters. Note that a [math]\displaystyle{ lcp }[/math] array is not required, but will offer the benefit of a lower running time.

The runtime of the algorithm is [math]\displaystyle{ \Theta (m log n) }[/math]. By comparison, solving this problem using suffix trees takes [math]\displaystyle{ \Theta (m) }[/math] time. Note that with a generalized suffix array, the space required is smaller compared to a suffix tree, since the algorithm only requires space for [math]\displaystyle{ n }[/math] words and the space to store the string. As mentioned above, by optionally keeping track of [math]\displaystyle{ lcp }[/math] information which will use slightly more space, the running time of the algorithm can be improved to [math]\displaystyle{ \Theta (m+log n) }[/math].

Other Applications

  • A generalized suffix array can be utilized to compute the longest common subsequence of all the strings in a set or collection. A naive implementation would compute the largest common subsequence of all the strings in the set in [math]\displaystyle{ \Theta (n^2) }[/math].[6]
  • A generalized suffix array can be utilized to find the longest previous factor array, a concept central to text compression techniques and in the detection of motifs and repeats[7]

See also

References

  1. 1.0 1.1 Shi, Fei (1996), Suffix arrays for multiple strings: A method for on-line multiple string searches., Lecture Notes in Computer Science, 1179, Springer Berlin Heidelberg, pp. 11–22, doi:10.1007/BFb0027775, ISBN 978-3-540-62031-0 
  2. Louza, Felipe; Telles, Guilherme; Hoffman, Steve; Ciferri, Cristina (2017), "Generalized enhanced suffix array construction in external memory", Algorithms for molecular biology : AM 12: p. 26, doi:10.1186/s13015-017-0117-9, PMID 29234460 
  3. Louza, Felipe, gsufsort, https://github.com/felipelouza/gsufsort 
  4. Plique, Guillaume, Mnemonist: Curated collection of data structures for the JavaScript language. 
  5. Aluru, Srinivas, Suffix Trees and Suffix Arrays, http://web.cs.iastate.edu/~cs548/suffix.pdf 
  6. Plique, Guillaume, Mnemonist: Generalized Suffix Array 
  7. Crochemore, Maxime; Grossi, Roberto; Kärkkäinen, Juha; Landau, Gad (2013), "A Constant-Space Comparison-Based Algorithm for Computing the Burrows–Wheeler Transform", Combinatorial Pattern Matching. CPM 2013. Lecture Notes in Computer Science, Lecture Notes in Computer Science (Springer, Berlin, Heidelberg) 7922: pp. 74–82, doi:10.1007/978-3-642-38905-4_9, ISBN 978-3-642-38904-7 

External links

Generalized enhanced suffix array construction in external memory