LCP array
{{more footnotes needed|date=September 2012}}
{{Infobox
| above = LCP array
| label1 = Type
| data1 = Array
| label2 = Invented by
| data2 = {{harvtxt|Manber|Myers|1993}}
| header3 = Time complexity and {{nowrap|space complexity}} {{nowrap|in big O notation}}
| headerstyle = background:lavender
| data4 =
{{aligned table|cols=3|row1header=y|col1header=y|fullwidth=y
| | Average | Worst case
| Space
|
|
| Construction
|
|
}}
}}
In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.
For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix between A[1] = aab and A[2] = ab is a which has length 1, so H[2] = 1 in the LCP array H. Likewise, the LCP of A[2] = ab and A[3] = abaab is ab, so H[3] = 2.
Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the suffix tree,{{sfn|Kasai|Lee|Arimura|Arikawa|2001}}{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}} speeds up pattern matching on the suffix array{{sfn|Manber|Myers|1993}} and is a prerequisite for compressed suffix trees.{{sfn|Ohlebusch|Fischer|Gog|2010}}
History
The LCP array was introduced in 1993, by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.{{sfn|Manber|Myers|1993}}
Definition
Let be the suffix array of the string of length , where is a sentinel letter that is unique and lexicographically smaller than any other character. Let denote the substring of ranging from to . Thus, is the th smallest suffix of .
Let denote the length of the longest common prefix between two strings and . Then the LCP array is an integer array of size such that is undefined and for every
Difference between LCP array and suffix array:
- Suffix array: Represents the lexicographic rank of each suffix of an array.
- LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
Example
Consider the string
class="wikitable" |
{{left header}} | i
| 1 || 2 || 3 || 4 || 5 || 6 || 7 |
---|
{{left header}} | S[i]
| b || a || n || a || n || a || $ |
and its corresponding sorted suffix array
class="wikitable"
! {{left header}} | i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
{{left header}} | A[i]
| 7 || 6 || 4 || 2 || 1 || 5 || 3 |
---|
Suffix array with suffixes written out underneath vertically:
class="wikitable"
! {{left header}} | i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
{{left header}} | A[i]
| 7 || 6 || 4 || 2 || 1 || 5 || 3 | ||||||
---|---|---|---|---|---|---|
{{left header}} | S[A[i], n][1]
| $ || a || a || a || b || n || n | ||||||
{{left header}} | S[A[i], n][2]
| || $ || n || n || a || a || a | ||||||
{{left header}} | S[A[i], n][3]
| || || a || a || n || $ || n | ||||||
{{left header}} | S[A[i], n][4]
| || || $ || n || a || || a | ||||||
{{left header}} | S[A[i], n][5]
| || || || a || n || || $ | ||||||
{{left header}} | S[A[i], n][6]
| || || || $ || a || || | ||||||
{{left header}} | S[A[i], n][7]
| || || || || $ || || |
Then the LCP array
class="wikitable" |
{{left header}} | i
| 1 || 2 || 3 || 4 || 5 || 6 || 7 |
---|
{{left header}} | H[i]
| undefined || 0 || 1 || 3 || 0 || 0 || 2 |
So, for example,
Efficient construction algorithms
LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values.
{{harvtxt|Manber|Myers|1993}} provide an algorithm to compute the LCP array alongside the suffix array in
Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of
{{harvtxt|Gog|Ohlebusch|2011}} provide two algorithms that although being theoretically slow (
{{As of|2012}}, the currently fastest linear-time LCP array construction algorithm is due to {{harvtxt|Fischer|2011}}, which in turn is based on one of the fastest suffix array construction algorithms (SA-IS) by {{harvtxt|Nong|Zhang|Chan|2009}}. {{harvtxt|Fischer|Kurpicz|2017}} based on Yuta Mori's DivSufSort is even faster.
Applications
As noted by {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} several string processing problems can be solved by the following kinds of tree traversals:
- bottom-up traversal of the complete suffix tree
- top-down traversal of a subtree of the suffix tree
- suffix tree traversal using the suffix links.
{{harvtxt|Kasai|Lee|Arimura|Arikawa|2001}} show how to simulate a bottom-up traversal of the suffix tree using only the suffix array and LCP array. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} enhance the suffix array with the LCP array and additional data structures and describe how this enhanced suffix array can be used to simulate all three kinds of suffix tree traversals. {{harvtxt|Fischer|Heun|2007}} reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for range minimum queries. Thus, every problem that can be solved by suffix tree algorithms can also be solved using the enhanced suffix array.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
Deciding if a pattern
The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and lowest common ancestor queries.{{sfn|Sadakane|2007}}{{sfn|Fischer|Mäkinen|Navarro|2009}} Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv LZ77 factorization in
The longest repeated substring problem for a string
The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.
= Find the number of occurrences of a pattern =
{{Cleanup|reason=this section is a straight-up copy of a [https://stackoverflow.com/a/11374737 StackOverflow answer] so it has the form of a reply to a question.|date=June 2016}}
In order to find the number of occurrences of a given string
- We use binary search against the suffix array of
T to find the starting and end position of all occurrences ofP . - Now to speed up the search, we use LCP array, specifically a special version of the LCP array (LCP-LR below).
The issue with using standard binary search (without the LCP information) is that in each of the
The LCP-LR array helps improve this to
At any point during the binary search algorithm, we consider, as usual, a range
M ...... M' ...... R
|
we know:
lcp(P,M)==k
The trick now is that LCP-LR is precomputed such that an
We already know (from the previous step) that
- Case 1:
k < \mathrm{lcp}(M,M') , i.e.P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R). - Case 2:
k > \mathrm{lcp}(M,M') , i.e.P has more prefix characters in common withM thanM has in common withM' . Consequently, if we were to compareP toM' , the common prefix would be smaller thank , andM' would be lexicographically larger thanP , so, without actually making the comparison, we continue in the left half(M,\dots,M') . - Case 3:
k = \mathrm{lcp}(M,M') . So M and M' are both identical withP in the firstk characters. To decide whether we continue in the left or right half, it suffices to compareP toM' starting from the(k+1) th character. - We continue recursively.
The overall effect is that no character of
We still need to precompute LCP-LR so it is able to tell us in
The key to this is to realize that only certain ranges
Moreover, there is a straightforward recursive algorithm to compute the
To sum up:
- It is possible to compute LCP-LR in
O(N) time andO(2N)=O(N) space from LCP. - Using LCP-LR during binary search helps accelerate the search procedure from
O(M\log N) toO(M+\log N) . - We can use two binary searches to determine the left and right end of the match range for
P , and the length of the match range corresponds with the number of occurrences for P.
= Suffix tree construction =
Given the suffix array
Let
File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 1.pdf
Start with
We need to distinguish two cases:
d(v)=H[i+1] : This means that the concatenation of the labels on the root-to-v path equals the longest common prefix of suffixesA[i] andA[i+1] .
In this case, insertA[i+1] as a new leafx of nodev and label the edge(v,x) withS[A[i+1]+H[i+1],n] . Thus the edge label consists of the remaining characters of suffixA[i+1] that are not already represented by the concatenation of the labels of the root-to-v path.
This creates the partial suffix treeST_{i+1} . File:Constructing the suffix tree of banana based on the suffix array and the LCP array - Case 2.pdfd(v) < H[i+1] : This means that the concatenation of the labels on the root-to-v path displays less characters than the longest common prefix of suffixesA[i] andA[i+1] and the missing characters are contained in the edge label ofv 's rightmost edge. Therefore, we have to split up that edge as follows:
Letw be the child ofv onST_i 's rightmost path.
- Delete the edge
(v,w) . - Add a new internal node
y and a new edge(v,y) with labelS[A[i]+d(v),A[i]+H[i+1]-1] . The new label consists of the missing characters of the longest common prefix ofA[i] andA[i+1] . Thus, the concatenation of the labels of the root-to-y path now displays the longest common prefix ofA[i] andA[i+1] . - Connect
w to the newly created internal nodey by an edge(y,w) that is labeledS[A[i]+H[i+1],A[i]+d(w)-1] . The new label consists of the remaining characters of the deleted edge(v,w) that were not used as the label of edge(v,y) . - Add
A[i+1] as a new leafx and connect it to the new internal nodey by an edge(y,x) that is labeledS[A[i+1]+H[i+1],n] . Thus the edge label consists of the remaining characters of suffixA[i+1] that are not already represented by the concatenation of the labels of the root-to-v path. - This creates the partial suffix tree
ST_{i+1} .
A simple amortization argument shows that the running time of this algorithm is bounded by
The nodes that are traversed in step
= LCP queries for arbitrary suffixes =
The LCP array
Because of the lexicographic order of the suffix array, every common prefix of the suffixes
Thus given a string
Notes
{{Reflist}}
References
- {{cite journal| doi=10.1016/S1570-8667(03)00065-0 | title=Replacing suffix trees with enhanced suffix arrays| year=2004| last1=Abouelhoda| first1=Mohamed Ibrahim| last2=Kurtz| first2=Stefan| last3=Ohlebusch| first3=Enno| journal=Journal of Discrete Algorithms| volume=2| pages=53–86 | doi-access=free}}
- {{cite journal| doi=10.1137/0222058 | title=Suffix Arrays: A New Method for On-Line String Searches| year=1993| last1=Manber| first1=Udi| last2=Myers| first2=Gene| journal=SIAM Journal on Computing| volume=22| issue=5| pages=935| citeseerx=10.1.1.105.6571| s2cid=5074629}}
- {{cite conference |doi = 10.1007/3-540-48194-X_17 | conference = Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching | year = 2001 | last1 = Kasai | first1 = T. | last2 = Lee | first2 = G. | last3 = Arimura | first3 = H. | last4 = Arikawa | first4 = S. | last5 = Park | first5 = K. | title = Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications | series = Lecture Notes in Computer Science | volume = 2089 | pages = 181–192 | isbn = 978-3-540-42271-6
}}
- {{cite conference| doi=10.1007/978-3-642-16321-0_34 | title=CST++| conference=String Processing and Information Retrieval| series=Lecture Notes in Computer Science| year=2010| last1=Ohlebusch| first1=Enno| last2=Fischer| first2=Johannes| last3=Gog| first3=Simon| isbn=978-3-642-16320-3| volume=6393| pages=322}}
- {{cite conference| title = Simple linear work suffix array construction
| url = http://dl.acm.org/citation.cfm?id=1759210.1759301
| year = 2003
| conference = Proceedings of the 30th international conference on Automata, languages and programming
| pages = 943–955
| last1 = Kärkkäinen | first1 = Juha
| last2 = Sanders | first2 = Peter | author2-link = Peter Sanders (computer scientist)
| access-date = 2012-08-28 }}
- {{cite conference| doi = 10.1007/978-3-642-22300-6_32 | last1 = Fischer | first1 = Johannes | title = Inducing the LCP-Array | conference = Algorithms and Data Structures | series = Lecture Notes in Computer Science | volume = 6844 | pages = 374–385 | year = 2011 | isbn = 978-3-642-22299-3 | arxiv = 1101.3448 }}
- {{cite conference| doi=10.1007/978-3-540-27810-8_32 | title=Two Space Saving Tricks for Linear Time LCP Array Computation| conference=Algorithm Theory – SWAT 2004| series=Lecture Notes in Computer Science| year=2004| last1=Manzini| first1=Giovanni| isbn=978-3-540-22339-9| volume=3111| pages=372}}
- {{cite conference| doi=10.1007/978-3-642-02441-2_17 | title=Permuted Longest-Common-Prefix Array| conference=Combinatorial Pattern Matching| series=Lecture Notes in Computer Science| year=2009| last1=Kärkkäinen| first1=Juha| last2=Manzini| first2=Giovanni| last3=Puglisi| first3=Simon J.| isbn=978-3-642-02440-5| volume=5577| pages=181}}
- {{cite conference| doi=10.1007/978-3-540-92182-0_14 | title=Space-Time Tradeoffs for Longest-Common-Prefix Array Computation| conference=Algorithms and Computation| series=Lecture Notes in Computer Science| year=2008| last1=Puglisi| first1=Simon J.| last2=Turpin| first2=Andrew| isbn=978-3-540-92181-3| volume=5369| pages=124}}
- {{cite conference| title = Fast and Lightweight LCP-Array Construction Algorithms
| url = http://www.siam.org/proceedings/alenex/2011/alx11_03_gogs.pdf
| year = 2011
| conference = Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2011
| pages = 25–34
| last1 = Gog | first1 = Simon
| last2 = Ohlebusch | first2 = Enno
| access-date = 2012-08-28 }}
- {{cite conference| doi=10.1109/DCC.2009.42 | title=Linear Suffix Array Construction by Almost Pure Induced-Sorting| conference=2009 Data Compression Conference| year=2009| last1=Nong| first1=Ge| last2=Zhang| first2=Sen| last3=Chan| first3=Wai Hong| isbn=978-0-7695-3592-0| pages=193}}
- {{cite conference | doi = 10.1007/978-3-540-74450-4_41 | last1 = Fischer | first1 = Johannes
| last2 = Heun | first2 = Volker
| title = A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array
| conference = Combinatorics, Algorithms, Probabilistic and Experimental Methodologies
| series = Lecture Notes in Computer Science
| volume = 4614
| pages = 459
| year = 2007
| isbn = 978-3-540-74449-8
}}
- {{Cite journal | doi = 10.1007/s11786-007-0024-4 | last1 = Chen | first1 = G. | last2 = Puglisi | first2 = S. J. | last3 = Smyth | first3 = W. F. | title = Lempel–Ziv Factorization Using Less Time & Space | journal = Mathematics in Computer Science | volume = 1 | issue = 4 | pages = 605 | year = 2008 | s2cid = 1721891 }}
- {{Cite journal | doi = 10.1016/j.ipl.2007.10.006 | last1 = Crochemore | first1 = M. | last2 = Ilie | first2 = L. | title = Computing Longest Previous Factor in linear time and applications | journal = Information Processing Letters | volume = 106 | issue = 2 | pages = 75 | year = 2008 | citeseerx=10.1.1.70.5720 | s2cid = 5492217 }}
- {{Cite conference | doi = 10.1109/DCC.2008.36 | last1 = Crochemore | first1 = M. | last2 = Ilie | first2 = L. | last3 = Smyth | first3 = W. F. | title = A Simple Algorithm for Computing the Lempel Ziv Factorization | conference = Data Compression Conference (dcc 2008) | pages = 482 | year = 2008 | isbn = 978-0-7695-3121-2 | hdl = 20.500.11937/5907 | hdl-access = free }}
- {{Cite journal| doi = 10.1007/s00224-006-1198-x | last1 = Sadakane | first1 = K. | title = Compressed Suffix Trees with Full Functionality | journal = Theory of Computing Systems | volume = 41 | issue = 4 | pages = 589–607 | year = 2007 |citeseerx=10.1.1.224.4152| s2cid = 263130 }}
- {{Cite journal | doi = 10.1016/j.tcs.2009.09.012 | last1 = Fischer | first1 = Johannes | last2 = Mäkinen | first2 = Veli | last3 = Navarro | first3 = Gonzalo | title = Faster entropy-bounded compressed suffix trees | journal = Theoretical Computer Science | volume = 410 | issue = 51 | pages = 5354 | year = 2009 | doi-access = free }}
- {{cite journal |last1=Fischer |first1=Johannes |last2=Kurpicz |first2=Florian |title=Dismantling DivSufSort |arxiv=1710.01896 |date=5 October 2017 |journal=Proceedings of the Prague Stringology Conference 2017}}
External links
{{Commons category}}
- [https://github.com/elventear/sais-lite-lcp Mirror of the ad-hoc-implementation of the code] described in {{harvtxt|Fischer|2011}}
- [https://github.com/simongog/sdsl/ SDSL: Succinct Data Structure Library - Provides various LCP array implementations, Range Minimum Query (RMQ) support structures and many more succinct data structures ]
- [https://github.com/carrotsearch/jsuffixarrays/blob/master/src/main/java/org/jsuffixarrays/Traversals.java Bottom-up suffix tree traversal emulated using suffix array and LCP array (Java)]
- [https://code.google.com/p/text-indexing/ Text-Indexing project] (linear-time construction of suffix trees, suffix arrays, LCP array and Burrows–Wheeler Transform)