Chvátal–Sankoff constants

In mathematics, the Chvátal–Sankoff constants are mathematical constants that describe the lengths of longest common subsequences of random strings. Although the existence of these constants has been proven, their exact values are unknown. They are named after Václav Chvátal and David Sankoff, who began investigating them in the mid-1970s.{{citation

| last1 = Chvatal | first1 = Václáv | author1-link = Václav Chvátal

| last2 = Sankoff | first2 = David | author2-link = David Sankoff

| journal = Journal of Applied Probability

| mr = 0405531

| pages = 306–315

| title = Longest common subsequences of two random sequences

| volume = 12

| year = 1975 | issue = 2 | doi=10.2307/3212444| jstor = 3212444 | s2cid = 250345191 }}.{{citation|title=Mathematical Constants|series=Encyclopedia of Mathematics and its Applications|first=Steven R.|last=Finch|publisher=Cambridge University Press|year=2003|isbn=9780521818056|contribution=5.20.2 Common Subsequences|pages=384–385|url=https://books.google.com/books?id=Pl5I2ZSI6uAC&pg=PA384}}.

There is one Chvátal–Sankoff constant $\gamma_k$ for each positive integer k, where k is the number of characters in the alphabet from which the random strings are drawn. The sequence of these numbers grows inversely proportionally to the square root of k. However, some authors write "the Chvátal–Sankoff constant" to refer to $\gamma_2$ , the constant defined in this way for the binary alphabet.{{citation

| last1 = Kiwi | first1 = M.

| last2 = Soto | first2 = J.

| doi = 10.1017/S0963548309009900

| issue = 4

| journal = Combinatorics, Probability and Computing

| mr = 2507735

| pages = 517–532

| title = On a speculated relation between Chvátal–Sankoff constants of several sequences

| volume = 18

| year = 2009| arxiv = 0810.1066

| s2cid = 10882010

}}.

Background

A common subsequence of two strings S and T is a string whose characters appear in the same order (not necessarily consecutively) both in S and in T. The problem of computing a longest common subsequence has been well studied in computer science. It can be solved in polynomial time by dynamic programming;{{citation

| last1 = Cormen | first1 = Thomas H. | author1-link = Thomas H. Cormen

| last2 = Leiserson | first2 = Charles E. | author2-link = Charles E. Leiserson

| last3 = Rivest | first3 = Ronald L. | author3-link = Ronald L. Rivest

| last4 = Stein | first4 = Clifford | author4-link = Clifford Stein

| contribution = 15.4

| edition = 2nd

| isbn = 0-262-53196-8

| pages = 350–355

| publisher = MIT Press and McGraw-Hill

| title = Introduction to Algorithms

| year = 2001}}. this basic algorithm has additional speedups for small alphabets (the Method of Four Russians),{{citation

| last1 = Masek | first1 = William J.

| last2 = Paterson | first2 = Michael S. | author2-link = Mike Paterson

| doi = 10.1016/0022-0000(80)90002-1

| issue = 1

| journal = Journal of Computer and System Sciences

| mr = 566639

| pages = 18–31

| title = A faster algorithm computing string edit distances

| volume = 20

| year = 1980| doi-access = free

| hdl = 1721.1/148933

| hdl-access = free

}}. for strings with few differences, for strings with few matching pairs of characters,{{citation

| last1 = Hunt | first1 = James W.

| last2 = Szymanski | first2 = Thomas G.

| issue = 5

| journal = Communications of the ACM

| mr = 0436655

| pages = 350–353

| title = A fast algorithm for computing longest common subsequences

| volume = 20

| year = 1977

| doi=10.1145/359581.359603| s2cid = 3226080

| doi-access = free

}}. etc. This problem and its generalizations to more complex forms of edit distance have important applications in areas that include bioinformatics (in the comparison of DNA and protein sequences and the reconstruction of evolutionary trees), geology (in stratigraphy), and computer science (in data comparison and revision control).{{citation|title=Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison|first1=David|last1=Sankoff|author1-link=David Sankoff|first2=Joseph B.|last2=Kruskal|author2-link=Joseph Kruskal|publisher=Addison-Wesley|year=1983|bibcode=1983twse.book.....S }}.

One motivation for studying the longest common subsequences of random strings, given already by Chvátal and Sankoff, is to calibrate the computations of longest common subsequences on strings that are not random. If such a computation returns a subsequence that is significantly longer than what would be obtained at random, one might infer from this result that the match is meaningful or significant.

Definition and existence

The Chvátal–Sankoff constants describe the behavior of the following random process. Given parameters n and k, choose two length-n strings S and T from the same k-symbol alphabet, with each character of each string chosen uniformly at random, independently of all the other characters. Compute a longest common subsequence of these two strings, and let $\lambda_{n,k}$ be the random variable whose value is the length of this subsequence. Then the expected value of $\lambda_{n,k}$ is (up to lower-order terms) proportional to n, and the kth Chvátal–Sankoff constant $\gamma_k$ is the constant of proportionality.

More precisely, the expected value $\operatorname{E}[\lambda_{n,k}]$ is superadditive: for all m and n, $\operatorname{E}[\lambda_{m+n,k}]\ge \operatorname{E}[\lambda_{m,k}]+\operatorname{E}[\lambda_{n,k}]$ . This is because, if strings of length m + n are broken into substrings of lengths m and n, and the longest common subsequences of those substrings are found, they can be concatenated together to get a common substring of the whole strings. It follows from a lemma of Michael Fekete{{citation

| last = Fekete | first = M. | authorlink = Michael Fekete

| doi = 10.1007/BF01504345

| issue = 1

| journal = Mathematische Zeitschrift

| language = German

| pages = 228–249

| title = Über die Verteilung der Wurzeln bei gewissen algebraischen Gleichungen mit ganzzahligen Koeffizienten

| volume = 17

| year = 1923| s2cid = 186223729 }}. that the limit

: $\gamma_k = \lim_{n\to\infty} \frac{\operatorname{E}[\lambda_{n,k}]}{n}$

exists, and equals the supremum of the values $\operatorname{E}[\lambda_{n,k}]/n$ . These limiting values $\gamma_k$ are the Chvátal–Sankoff constants.

Bounds

The exact values of the Chvátal–Sankoff constants remain unknown, but rigorous upper and lower bounds have been proven.

Because $\gamma_k$ is a supremum of the values $\operatorname{E}[\lambda_{n,k}]$ which each depend only on a finite probability distribution, one way to prove rigorous lower bounds on $\gamma_k$ would be to compute the exact values of $\operatorname{E}[\lambda_{n,k}]$ ; however, this method scales exponentially in n, so it can only be implemented for small values of n, leading to weak lower bound. In his Ph.D. thesis, Vlado Dančík pioneered an alternative approach in which a deterministic finite automaton is used to read symbols of two input strings and produce a (long but not optimal) common subsequence of these inputs. The behavior of this automaton on random inputs can be analyzed as a Markov chain, the steady state of which determines the rate at which it finds elements of the common subsequence for large values of n. This rate is necessarily a lower bound on the Chvátal–Sankoff constant.{{citation

| last1 = Dančík | first1 = Vlado

| last2 = Paterson | first2 = Mike | author2-link = Mike Paterson

| doi = 10.1002/rsa.3240060408

| issue = 4

| journal = Random Structures & Algorithms

| mr = 1368846

| pages = 449–458

| title = Upper bounds for the expected length of a longest common subsequence of two binary sequences

| volume = 6

| year = 1995}}. By using Dančík's method, with an automaton whose state space buffers the most recent h characters from its two input strings, and with additional techniques for avoiding the expensive steady-state Markov chain analysis of this approach, {{harvtxt|Lueker|2009}} was able to perform a computerized analysis with n = 15 that proved $\gamma_2\ge 0.788071$ .

Similar methods can be generalized to non-binary alphabets. Lower bounds obtained in this way for various values of k are:

class="wikitable"
k ! Lower bound on $\gamma_k$
2 \| 0.788071
3 \| 0.671697
4 \| 0.599248
5 \| 0.539129
6 \| 0.479452
7 \| 0.44502
8 \| 0.42237
9 \| 0.40321
10 \| 0.38656

class="wikitable"

! Lower bound on $\gamma_k$

| 0.788071

| 0.671697

| 0.599248

| 0.539129

| 0.479452

| 0.44502

| 0.42237

| 0.40321

| 0.38656

{{harvtxt|Dančík|Paterson|1995}} also used automata-theoretic methods to prove upper bounds on the Chvátal–Sankoff constants, and again {{harvtxt|Lueker|2009}} extended these results by computerized calculations. The upper bound he obtained was $\gamma_2\le 0.826280$ . This result disproved a conjecture of J. Michael Steele that $\gamma_2 = 2/(1+\sqrt 2)$ , because this value is greater than the upper bound.{{citation

| last = Lueker | first = George S.

| doi = 10.1145/1516512.1516519

| issue = 3

| journal = Journal of the ACM

| mr = 2536132

| at = A17

| title = Improved bounds on the average length of longest common subsequences

| volume = 56

| year = 2009| s2cid = 7232681

}}. Non-rigorous numerical evidence suggests that $\gamma_2$ is approximately $0.811$ , closer to the upper bound than the lower bound.{{citation

| last = Dixon | first = John D.

| arxiv = 1307.2796

| title = Longest common subsequences in binary sequences

| year = 2013| bibcode = 2013arXiv1307.2796D}}.

In the limit as k goes to infinity, the constants $\gamma_k$ grow inversely proportionally to the square root of k. More precisely,{{citation

| last1 = Kiwi | first1 = Marcos

| last2 = Loebl | first2 = Martin

| last3 = Matoušek | first3 = Jiří | author3-link = Jiří Matoušek (mathematician)

| doi = 10.1016/j.aim.2004.10.012

| doi-access = free

| issue = 2

| journal = Advances in Mathematics

| mr = 2173842

| pages = 480–498

| title = Expected length of the longest common subsequence for large alphabets

| volume = 197

| year = 2005| arxiv = math/0308234

}}.

: $\lim_{k\to\infty} \gamma_k \sqrt k = 2.$

Distribution of LCS lengths

There has also been research into the distribution of values of the longest common subsequence, generalizing the study of the expectation of this value. For instance, the standard deviation of the length of the longest common subsequence of random strings of length n is known to be proportional to the square root of n.{{citation

| last1 = Lember | first1 = Jüri

| last2 = Matzinger | first2 = Heinrich

| doi = 10.1214/08-AOP436

| issue = 3

| journal = The Annals of Probability

| mr = 2537552

| pages = 1192–1235

| title = Standard deviation of the longest common subsequence

| volume = 37

| year = 2009| arxiv = 0907.5137

| s2cid = 8143348

}}.

One complication in performing this sort of analysis is that the random variables describing whether the characters at different pairs of positions match each other are not independent of each other. For a more mathematically tractable simplification of the longest common subsequence problem, in which the allowed matches between pairs of symbols are not controlled by whether those symbols are equal to each other but instead by independent random variables with probability 1/k of being 1 and (k − 1)/k of being 0, it has been shown that the distribution of the longest common subsequence length is controlled by the Tracy–Widom distribution.{{citation

| last1 = Majumdar | first1 = Satya N.

| last2 = Nechaev | first2 = Sergei

| doi = 10.1103/PhysRevE.72.020901

| issue = 2

| journal = Physical Review E

| mr = 2177365

| pages = 020901, 4

| title = Exact asymptotic results for the Bernoulli matching model of sequence alignment

| volume = 72

| year = 2005| arxiv = q-bio/0410012

| bibcode = 2005PhRvE..72b0901M

| pmid = 16196539

| s2cid = 11390762

}}.

References

Category:Mathematical constants

Category:Probability theory

Category:Problems on strings