Brzozowski derivative

{{Short description|Function defined on formal languages in computer science}}

File:Brzozowski derivative.gif

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u^{-1}S of a set S of strings and a string u is the set of all strings obtainable from a string in S by cutting off the prefix u. Formally:

:u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}.

For example,

:\text{c}^{-1}\{\text{cat}, \text{cow}, \text{dog}\} = \{\text{at}, \text{ow}\}.

The Brzozowski derivative was introduced under various different names since the late 1950s.{{cite journal | url=https://dl.acm.org/doi/10.1145/320924.320930 | author=George N. Raney | title=Sequential functions | journal=Journal of the ACM | volume=5 | number=2 | pages=177–180 | date=Apr 1958 | doi=10.1145/320924.320930 | s2cid=1611992 }}{{cite journal | url=http://www.cse.chalmers.se/~coquand/AUTOMATA/rs.pdf | author=Dana Scott and Michael O. Rabin | title=Finite Automata and Their Decision Problems | journal=IBM Journal of Research and Development | volume=3 | number=2 | pages=114–125 | date=Apr 1959 | doi=10.1147/rd.32.0114 }}{{cite book | url=https://ieeexplore.ieee.org/document/5397288 | doi=10.1109/FOCS.1961.26 | author=C.C. Elgot and J.D. Rutledge | contribution=Operations on finite automata | editor=Robert S. Ledley | title=Proc. AIEE 2nd Ann. Symp. on Switching, Circuit Theory, and Logical Design (SWCT), Detroit | publisher= | series= | volume= | pages=129–132 | date=Oct 1961 }}

Today it is named after the computer scientist Janusz Brzozowski who investigated its properties and gave an algorithm to compute the derivative of a generalized regular expression.{{cite journal| author=Janusz A. Brzozowski| title=Derivatives of Regular Expressions| journal=J ACM| year=1964| volume=11| issue=4| pages=481–494| doi=10.1145/321239.321249| s2cid=14126942| doi-access=free}}

Definition

Even though originally studied for regular expressions, the definition applies to arbitrary formal languages.

Given any formal language S over an alphabet \Sigma and any string u \in \Sigma^*, the derivative of S with respect to u is defined as:{{cite journal| author=Janusz A. Brzozowski| title=Derivatives of Regular Expressions| journal=J ACM| year=1964| volume=11| issue=4| pages=481–494| doi=10.1145/321239.321249| s2cid=14126942| doi-access=free}}

:u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}

The Brzozowski derivative is a special case of left quotient by a singleton set containing only u: \ u^{-1}S = \{u\} \;\backslash\; S.

Equivalently, for all u,v \in \Sigma^*:

:v \in u^{-1}S \;\Leftrightarrow\; uv \in S.

From the definition, for all u, v \in \Sigma^*:

:(uv)^{-1}S = v^{-1}(u^{-1}S)

since for all w \in \Sigma^*, we have {{nowrap|w \in (uv)^{-1}S \Leftrightarrow uvw \in S \Leftrightarrow vw \in u^{-1}S \Leftrightarrow w \in v^{-1}(u^{-1}S).}}

The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all a \in \Sigma, u \in \Sigma^*:

\begin{align}

(ua)^{-1}S

&= a^{-1}(u^{-1}S) \\

\varepsilon^{-1}S

&= S

\end{align}

A language S \subseteq \Sigma^* is called nullable if and only if it contains the empty string \varepsilon. Each language S is uniquely determined by nullability of its derivatives:

:w \in S \ \Leftrightarrow\ \varepsilon \in w^{-1}S

A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string w \in \Sigma^* denotes a node in the tree, with label true when w \in S and false otherwise. In this interpretation, the derivative with respect to a symbol a corresponds to the subtree obtained by following the edge a from the root. Decomposing a tree into the root and the subtrees a^{-1}S corresponds to the following equality, which holds for every language S \subseteq \Sigma^*:

:S = (\{\varepsilon\} \cap S) \cup \bigcup_{a \in \Sigma} a(a^{-1}S).

Derivatives of generalized regular expressions

When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the regular expression.

Given a finite alphabet A of symbols,Brzozowski (1964), p.481, required A to consist of the 2n combinations of n bits, for some n. a generalized regular expression R denotes a possibly infinite set of finite-length strings over the alphabet A, called the language of R, denoted L(R).

A generalized regular expression can be one of the following (where a is a symbol of the alphabet A, and R and S are generalized regular expressions):

  • "∅" denotes the empty set: L(∅) = {},
  • "ε" denotes the singleton set containing the empty string: L(ε) = {ε},
  • "a" denotes the singleton set containing the single-symbol string a: L(a) = {a},
  • "RS" denotes the union of R and S: L(RS) = L(R) ∪ L(S),
  • "RS" denotes the intersection of R and S: L(RS) = L(R) ∩ L(S),
  • R" denotes the complement of R (with respect to A*, the set of all strings over A): LR) = A* \ L(R),
  • "RS" denotes the concatenation of R and S: L(RS) = L(R) · L(S),
  • "R*" denotes the Kleene closure of R: L(R*) = L(R)*.

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

=Computation=

For any given generalized regular expression R and any string u, the derivative u−1R is again a generalized regular expression (denoting the language u−1L(R)).Brzozowski (1964), p.483, Theorem 4.1

It may be computed recursively as follows.Brzozowski (1964), p.483, Theorem 3.2

style="text-align:right;" | (ua)−1R= a−1(u−1R)      for a symbol a and a string u
style="text-align:right;" | ε−1R= R

Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a.

The latter can be computed as follows:Brzozowski (1964), p.483, Theorem 3.1

style="font-family: 'Nimbus Roman No9 L', 'Times New Roman', Times, serif; line-height: 1; white-space: nowrap; font-variant-numeric: lining-nums tabular-nums; font-kerning:none"
style="text-align:right;" | a−1a= ε
style="text-align:right;" | a−1b= ∅for each symbol ba
style="text-align:right;" | a−1ε= ∅
style="text-align:right;" | a−1= ∅
style="text-align:right;" | a−1(R*)= (a−1R)R*
style="text-align:right;" | a−1(RS)= (a−1R)S ∨ ν(R)a−1S
style="text-align:right;" | a−1(RS)= (a−1R) ∧ (a−1S)
style="text-align:right;" | a−1(RS)= (a−1R) ∨ (a−1S)
style="text-align:right;" | a−1R)= ¬(a−1R)

Here, {{math|ν(R)}} is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R{{'s}} language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:Brzozowski (1964), p.482, Definition 3.2

style="font-family: 'Nimbus Roman No9 L', 'Times New Roman', Times, serif; line-height: 1; white-space: nowrap; font-variant-numeric: lining-nums tabular-nums; font-kerning:none"
style="text-align:right;" | ν(a)= ∅for any symbol a
style="text-align:right;" | ν(ε)= ε
style="text-align:right;" | ν(∅)= ∅
style="text-align:right;" | ν(R*)= ε
style="text-align:right;" | ν(RS)= ν(R) ∧ ν(S)
style="text-align:right;" | ν(RS)= ν(R) ∧ ν(S)
style="text-align:right;" | ν(RS)= ν(R) ∨ ν(S)
style="text-align:right;" | ν(¬R)= εif ν(R) = ∅
style="text-align:right;" | ν(¬R)= ∅if ν(R) = ε

=Properties=

A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u−1R.Brzozowski (1964), p.483, Theorem 4.2

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by dR, all these languages can be obtained as derivatives of R with respect to strings of length less than dR.Brzozowski (1964), p.484, Theorem 4.3 Furthermore, there is a complete deterministic finite automaton with dR states that recognises the regular language given by R, as stated by the Myhill–Nerode theorem.

Derivatives of context-free languages

Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight was used to derive parsing algorithms for context-free languages.{{cite conference

| author1=Matthew Might

| author2=David Darais

| author3=Daniel Spiewak

| title=Parsing with derivatives: a functional pearl

| conference=Proceeding of the 16th ACM SIGPLAN international conference on Functional Programming (ICFP)

| year=2011

| pages=189–195

| doi=10.1145/2034773.2034801}}

Implementation of such algorithms have shown to have cubic time complexity,{{cite conference

| author1=Michael D. Adams

| author2=Celeste Hollenbeck

| author3=Matthew Might

| title=Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

| chapter=On the complexity and performance of parsing with derivatives

| conference=Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

| year=2016

| pages=224–236

| doi=10.1145/2908080.2908128| isbn=9781450342612

| doi-access=free

| arxiv=1604.04695

}}

corresponding to the complexity of the Earley parser on general context-free grammars.

See also

References