Autocorrelation (words)

{{short description|In combinatorics, the autocorrelation of a word is the set of periods of this word}}

In combinatorics, a branch of mathematics, the autocorrelation of a word is the set of periods of this word. More precisely, it is a sequence of values which indicate how much the end of a word looks likes the beginning of a word. This value can be used to compute, for example, the average value of the first occurrence of this word in a random string.

Definition

In this article, A is an alphabet, and w=w_1\dots w_n a word on A of length n. The autocorrelation of w can be defined as the correlation of w with itself. However, we redefine this notion below.

=Autocorrelation vector=

The autocorrelation vector of w is c=(c_0,\dots,c_{n-1}), with c_i being 1 if the prefix of length n-i equals the suffix of length n-i, and with c_i being 0 otherwise. That is c_i indicates whether w_{i+1}\dots w_n=w_1\dots w_{n-i}.

For example, the autocorrelation vector of aaa is (1,1,1) since, clearly, for i being 0, 1 or 2, the prefix of length n-i is equal to the suffix of length n-i. The autocorrelation vector of abb is (1,0,0) since no strict prefix is equal to a strict suffix. Finally, the autocorrelation vector of aabbaa is 100011, as shown in the following table:

class="wikitable"

!a

!a

!b

!b

!a

!a

!

!

!

!

!

!

a

|a

|b

|b

|a

|a

|

|

|

|

|

|1

|a

|a

|b

|b

|a

|a

|

|

|

|

|0

|

|a

|a

|b

|b

|a

|a

|

|

|

|0

|

|

|a

|a

|b

|b

|a

|a

|

|

|0

|

|

|

|a

|a

|b

|b

|a

|a

|

|1

|

|

|

|

|a

|a

|b

|b

|a

|a

|1

Note that c_0 is always equal to 1, since the prefix and the suffix of length n are both equal to the word w. Similarly, c_{n-1} is 1 if and only if the first and the last letters are the same.

=Autocorrelation polynomial=

The autocorrelation polynomial of w is defined as c(z)=c_0z^0+\dots+c_{n-1}z^{n-1}. It is a polynomial of degree at most n-1.

For example, the autocorrelation polynomial of aaa is 1+z+z^2 and the autocorrelation polynomial of abb is 1. Finally, the autocorrelation polynomial of aabbaa is 1+z^4+z^5.

Property

We now indicate some properties which can be computed using the autocorrelation polynomial.

=First occurrence of a word in a random string=

Suppose that you choose an infinite sequence s of letters of A, randomly, each letter with probability \frac{1}

A
, where |A| is the number of letters of A. Let us call E the expectation of the first occurrence of ?m in s? . Then E equals |A|^nc\left(\frac{1}
A
\right). That is, each subword v of w which is both a prefix and a suffix causes the average value of the first occurrence of w to occur |A|^
v
letters later. Here |v| is the length of v.

For example, over the binary alphabet A=\{a,b\}, the first occurrence of aa is at position 2^2(1+\frac 12)=6 while the average first occurrence of ab is at position 2^2(1)=4. Intuitively, the fact that the first occurrence of aa is later than the first occurrence of ab can be explained in two ways:

  • We can consider, for each position p, what are the requirement for w's first occurrence to be at p.
  • The first occurrence of ab can be at position 1 in only one way in both case. If s starts with w. This has probability \frac14 for both considered values of w.
  • The first occurrence of ab is at position 2 if the prefix of s of length 3 is aab or is bab. However, the first occurrence of aa is at position 2 if and only if the prefix of s of length 3 is baa. (Note that the first occurrence of aa in aaa is at position 1.).
  • In general, the number of prefixes of length n+1 such that the first occurrence of aa is at position n is smaller for aa than for ba. This explain why, on average, the first aa arrive later than the first ab.
  • We can also consider the fact that the average number of occurrences of w in a random string of length l is |A|^{l-n}. This number is independent of the autocorrelation polynomial. An occurrence of w may overlap another occurrence in different ways. More precisely, each 1 in its autocorrelation vector correspond to a way for occurrence to overlap. Since many occurrences of w can be packed together, using overlapping, but the average number of occurrences does not change, it follows that the distance between two non-overlapping occurrences is greater when the autocorrelation vector contains many 1's.

=Ordinary generating functions=

Autocorrelation polynomials allows to give simple equations for the ordinary generating functions (OGF) of many natural questions.

  • The OGF of the languages of words not containing w is \frac{c(z)}{z^n+(1-|A|z)c(z)}.
  • The OGF of the languages of words containing w is \frac{z^n}{(1-|A|z)(z^n+(1-|A|z)c(z))}.
  • The OGF of the languages of words containing a single occurrence of w, at the end of the word is \frac{z^n}{z^{n}+(1-|A|z)c(z)}.

References

  • {{cite book|last1=Flajolet and Sedgewick|title=Analytic Combinatorics|title-link= Analytic Combinatorics |date=2010|publisher=Cambridge University Press|location=New York|isbn=978-0-521-89806-5|pages=[https://archive.org/details/analyticcombinat00flaj_706/page/n71 60]-61}}
  • {{cite web|last1=Rosen|first1=Ned|title=Expected waiting times for strings of coin flips|url=https://www2.bc.edu/ned-rosen/public/CoinFlips.pdf|access-date=3 December 2017}}
  • {{cite journal|last1=Odlyzko|first1=A. M.|last2=Guibas|first2=L. J.|title=String overlaps, pattern matching, and nontransitive games|journal=Journal of Combinatorial Theory|date=1981|volume=Series A 30|issue=2|pages=183–208|doi=10.1016/0097-3165(81)90005-4|doi-access=}}

Category:Formal languages

Category:Combinatorics on words