substring index

{{Short description|Data structure}}

In computer science, a substring index is a data structure which gives substring search in a text or text collection in sublinear time. Once constructed from a document or set of documents, a substring index can be used to locate all occurrences of a pattern in time linear or near-linear in the pattern size, with no dependence or only logarithmic dependence on the document size.{{citation

| last1 = Barsky | first1 = Marina

| last2 = Stege | first2 = Ulrike

| last3 = Thomo | first3 = Alex

| contribution = Chapter 1: Structures for Indexing Substrings

| doi = 10.1007/978-3-031-01885-5_1

| isbn = 9783031018855

| pages = 1–15

| publisher = Springer International Publishing

| series = Synthesis Lectures on Data Management

| title = Full-Text (Substring) Indexes in External Memory

| year = 2012}}

The phrase full-text index is often used for substring indexes. But this is ambiguous, as it is also used for regular word indexes such as inverted files and document retrieval. See full text search.

General considerations

These data structures typically treat their text and pattern as strings over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in Unicode) but in practical applications for text retrieval it may be preferable to treat the (stemmed) words of a document as the symbols of its alphabet, because doing this reduces the lengths of both the text and pattern as measured in numbers of symbols.{{citation

| last = Risvik | first = Knut Magne

| editor-last = Farach-Colton | editor-first = Martin | editor-link = Martin Farach-Colton

| contribution = Approximate word sequence matching over sparse suffix trees

| doi = 10.1007/BFB0030781

| pages = 65–79

| publisher = Springer

| series = Lecture Notes in Computer Science

| title = Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20–22, 1998, Proceedings

| volume = 1448

| year = 1998| isbn = 978-3-540-64739-3

}}

Examples

Specific data structures that can be used as substring indexes include:

| last1 = Blumer | first1 = Anselm

| last2 = Blumer | first2 = J.

| last3 = Ehrenfeucht | first3 = Andrzej | author3-link = Andrzej Ehrenfeucht

| last4 = Haussler | first4 = David | author4-link = David Haussler

| last5 = McConnell | first5 = Ross M.

| editor-last = Paredaens | editor-first = Jan

| contribution = Building the minimal DFA for the set of all subwords of a word on-line in linear time

| doi = 10.1007/3-540-13345-3_9

| pages = 109–118

| publisher = Springer

| series = Lecture Notes in Computer Science

| title = Automata, Languages and Programming, 11th Colloquium, Antwerp, Belgium, July 16–20, 1984, Proceedings

| volume = 172

| year = 1984| isbn = 978-3-540-13345-2

}}

  • The suffix array, a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by binary search Augmenting a suffix array with an LCP array of the lengths of common prefixes of consecutive suffixes allows the search to be performed symbol-by-symbol, matching the search time of the suffix tree.{{citation

| last1 = Manber | first1 = Udi | author1-link = Udi Manber

| last2 = Myers | first2 = Gene | author2-link = Eugene Myers

| doi = 10.1137/0222058

| issue = 5

| journal = SIAM Journal on Computing

| mr = 1237156

| pages = 935–948

| title = Suffix arrays: a new method for on-line string searches

| volume = 22

| year = 1993}}

| last1 = Grossi | first1 = Roberto

| last2 = Vitter | first2 = Jeffrey Scott | author2-link = Jeffrey Vitter

| doi = 10.1137/S0097539702402354

| issue = 2

| journal = SIAM Journal on Computing

| mr = 2191449

| pages = 378–407

| title = Compressed suffix arrays and suffix trees with applications to text indexing and string matching

| url = https://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf

| volume = 35

| year = 2005| hdl = 1808/18962

}}

| last1 = Ferragina | first1 = Paolo

| last2 = Manzini | first2 = Giovanni

| doi = 10.1145/1082036.1082039

| issue = 4

| journal = Journal of the ACM

| mr = 2164632

| pages = 552–581

| title = Indexing compressed text

| volume = 52

| year = 2005}}

References