Shortest common supersequence#Shortest common superstring
In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.
A shortest common supersequence (SCS) is a common supersequence of minimal length. In the SCS problem, two sequences X and Y are given, and the task is to find a shortest possible common supersequence of these sequences. In general, an SCS is not unique.
For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily. For example, the longest common subsequence of X and Y is Z. By inserting the non-LCS symbols into Z while preserving their original order, we obtain a shortest common supersequence U. In particular, the equation holds for any two input sequences.
There is no similar relationship between shortest common supersequences and longest common subsequences of three or more input sequences. (In particular, LCS and SCS are not dual problems.) However, both problems can be solved in time using dynamic programming, where is the number of sequences, and is their maximum length. For the general case of an arbitrary number of input sequences, the problem is NP-hard.{{cite journal| author = David Maier| title = The Complexity of Some Problems on Subsequences and Supersequences| journal = J. ACM| volume = 25| year = 1978| pages = 322–336| doi = 10.1145/322063.322075| publisher = ACM Press| issue = 2| s2cid = 16120634| doi-access = free}}
Shortest common superstring
The closely related problem of finding a minimum-length string which is a superstring of a finite set of strings {{var|S}} = { {{var|s}}1,{{var|s}}2,...,{{var|s}}n } is also NP-hard.{{cite journal
|title=The shortest common supersequence problem over binary alphabet is NP-complete
|author=Kari-Jouko Räihä, Esko Ukkonen
|journal=Theoretical Computer Science
|volume=16
|number=2
|date=1981
|pages=187–198
|doi=10.1016/0304-3975(81)90075-x
|doi-access=
}} Several constant factor approximations have been proposed throughout the years, and the current best known algorithm has an approximation factor of 2.475.{{cite book
|author=Matthias Englert and Nicolaos Matsakis and Pavel Vesel
|title=Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing
|chapter=Improved approximation guarantees for shortest superstrings using cycle classification by overlap to length ratios
|date=2022
|pages=317–330
|doi=10.1145/3519935.3520001
|isbn=9781450392648
|s2cid=243847650
|url=http://wrap.warwick.ac.uk/164501/1/WRAP-Improved-approximation-guarantees-shortest-superstrings-cycle-classification-overlap-length-ratios-2022.pdf
|chapter-url=https://dl.acm.org/doi/10.1145/3519935.3520001
}}
However, perhaps the simplest solution is to reformulate the problem as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring {{var|S}}. One can then use the Approximation algorithm#Performance guarantees for weighted set-cover to obtain an O(log({{var|n}}))-approximation for the shortest superstring (note that this is not a constant factor approximation).
For any string {{var|x}} in this alphabet, define {{var|P}}({{var|x}}) to be the set of all strings which are substrings of {{var|x}}. The instance {{var|I}} of set cover is formulated as follows:
- Let {{var|M}} be empty.
- For each pair of strings {{var|s}}{{var|i}} and {{var|s}}{{var|j}}, if the last {{var|k}} symbols of {{var|s}}{{var|i}} are the same as the first {{var|k}} symbols of {{var|s}}{{var|j}}, then add a string to {{var|M}} that consists of the concatenation with maximal overlap of {{var|s}}{{var|i}} with {{var|s}}{{var|j}}.
- Define the universe of the set cover instance to be {{var|S}}
- Define the set of subsets of the universe to be { {{var|P}}({{var|x}}) | {{var|x}} ∈ {{var|S}} ∪ {{var|M}} }
- Define the cost of each subset {{var|P}}(x) to be |{{var|x}}|, the length of {{var|x}}.
The instance {{var|I}} can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings {{var|x}} for which the weighted set cover algorithm outputs {{var|P}}({{var|x}}).{{sfn|Vazirani|2001|p=20}}
= Example =
Consider the set {{var|S}} = { abc, cde, fab }, which becomes the universe of the weighted set cover instance. In this case, {{var|M}} = { abcde, fabc }. Then the set of subsets of the universe is
:
\begin{align}
\{ P(x) | x\in S\cup M \}
&= \{ P(x) | x\in \{ abc, cde, fab, abcde, fabc \} \} \\
&= \{ P(abc), P(cde), P(fab), P(abcde), P(fabc) \} \} \\
&= \{ \{a,b,c,ab,bc,abc\}, \{c,d,e,cd,de,cde\},\ldots, \{f,a,b,c,fa,ab,bc,fab,abc,fabc\} \} \} \\
\end{align}
which have costs 3, 3, 3, 5, and 4, respectively.
References
{{Reflist}}
- {{cite book | first1=Michael R. | last1=Garey | author1-link=Michael R. Garey | first2=David S. | last2=Johnson | author2-link=David S. Johnson | year=1979 | title=Computers and Intractability: A Guide to the Theory of NP-Completeness | publisher=W.H. Freeman | isbn=0-7167-1045-5 | zbl=0411.68039 | at=[https://archive.org/details/computersintract0000gare/page/ p. 228 A4.2: SR8] | title-link=Computers and Intractability: A Guide to the Theory of NP-Completeness }}
- {{cite book | last=Szpankowski | first=Wojciech | authorlink = Wojciech Szpankowski | title=Average case analysis of algorithms on sequences | others=With a foreword by Philippe Flajolet | series=Wiley-Interscience Series in Discrete Mathematics and Optimization | location=Chichester | publisher=Wiley | year=2001 | isbn=0-471-24063-X | zbl=0968.68205 }}
- {{Citation| last = Vazirani | first = Vijay V.
| authorlink = Vijay Vazirani
| title = Approximation Algorithms
| year = 2001
| publisher = Springer-Verlag
| isbn = 3-540-65367-8
| url = http://www.cc.gatech.edu/fac/Vijay.Vazirani/book.pd
}}
External links
- [http://nist.gov/dads/HTML/shortestCommonSuperseq.html Dictionary of Algorithms and Data Structures: shortest common supersequence]
- [https://www.geeksforgeeks.org/shortest-common-supersequence/ Shortest Common Supersequence]
- [https://leetcode.com/problems/shortest-common-supersequence/ 1092. Shortest Common Supersequence]
- [https://www.codingninjas.com/studio/problem-details/shortest-supersequence_4244493 Shortest Common Supersequence]
- [https://algorithm-visualizer.org/dynamic-programming/shortest-common-supersequence Shortest common supersequence]
- [https://p-library.com/o/scs/ Shortest Common Supersequence (SCS)]