Brzozowski derivative

{{Short description|Function defined on formal languages in computer science}}

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative $u^{-1}S$ of a set $S$ of strings and a string $u$ is the set of all strings obtainable from a string in $S$ by cutting off the prefix $u$ . Formally:

: $u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}$ .

For example,

: $\text{c}^{-1}\{\text{cat}, \text{cow}, \text{dog}\} = \{\text{at}, \text{ow}\}.$

The Brzozowski derivative was introduced under various different names since the late 1950s.{{cite journal | url=https://dl.acm.org/doi/10.1145/320924.320930 | author=George N. Raney | title=Sequential functions | journal=Journal of the ACM | volume=5 | number=2 | pages=177–180 | date=Apr 1958 | doi=10.1145/320924.320930 | s2cid=1611992 }}{{cite journal | url=http://www.cse.chalmers.se/~coquand/AUTOMATA/rs.pdf | author=Dana Scott and Michael O. Rabin | title=Finite Automata and Their Decision Problems | journal=IBM Journal of Research and Development | volume=3 | number=2 | pages=114–125 | date=Apr 1959 | doi=10.1147/rd.32.0114 }}{{cite book | url=https://ieeexplore.ieee.org/document/5397288 | doi=10.1109/FOCS.1961.26 | author=C.C. Elgot and J.D. Rutledge | contribution=Operations on finite automata | editor=Robert S. Ledley | title=Proc. AIEE 2nd Ann. Symp. on Switching, Circuit Theory, and Logical Design (SWCT), Detroit | publisher= | series= | volume= | pages=129–132 | date=Oct 1961 }}

Today it is named after the computer scientist Janusz Brzozowski who investigated its properties and gave an algorithm to compute the derivative of a generalized regular expression.{{cite journal| author=Janusz A. Brzozowski| title=Derivatives of Regular Expressions| journal=J ACM| year=1964| volume=11| issue=4| pages=481–494| doi=10.1145/321239.321249| s2cid=14126942| doi-access=free}}

Definition

Even though originally studied for regular expressions, the definition applies to arbitrary formal languages.

Given any formal language $S$ over an alphabet $\Sigma$ and any string $u \in \Sigma^*$ , the derivative of $S$ with respect to $u$ is defined as:{{cite journal| author=Janusz A. Brzozowski| title=Derivatives of Regular Expressions| journal=J ACM| year=1964| volume=11| issue=4| pages=481–494| doi=10.1145/321239.321249| s2cid=14126942| doi-access=free}}

: $u^{-1}S = \{v \in \Sigma^* \mid uv \in S\}$

The Brzozowski derivative is a special case of left quotient by a singleton set containing only $u$ : $\ u^{-1}S = \{u\} \;\backslash\; S$ .

Equivalently, for all $u,v \in \Sigma^*$ :

: $v \in u^{-1}S \;\Leftrightarrow\; uv \in S.$

From the definition, for all $u, v \in \Sigma^*$ :

: $(uv)^{-1}S = v^{-1}(u^{-1}S)$

since for all $w \in \Sigma^*$ , we have {{nowrap| $w \in (uv)^{-1}S \Leftrightarrow uvw \in S \Leftrightarrow vw \in u^{-1}S \Leftrightarrow w \in v^{-1}(u^{-1}S)$ .}}

The derivative with respect to an arbitrary string reduces to successive derivatives over the symbols of that string, since for all $a \in \Sigma, u \in \Sigma^*$ :

$\begin{align}
(ua)^{-1}S
&= a^{-1}(u^{-1}S) \\
\varepsilon^{-1}S
&= S
\end{align}$

A language $S \subseteq \Sigma^*$ is called nullable if and only if it contains the empty string $\varepsilon$ . Each language $S$ is uniquely determined by nullability of its derivatives:

: $w \in S \ \Leftrightarrow\ \varepsilon \in w^{-1}S$

A language can be viewed as a (potentially infinite) boolean-labelled tree (see also tree (set theory) and infinite-tree automaton). Each possible string $w \in \Sigma^*$ denotes a node in the tree, with label true when $w \in S$ and false otherwise. In this interpretation, the derivative with respect to a symbol $a$ corresponds to the subtree obtained by following the edge $a$ from the root. Decomposing a tree into the root and the subtrees $a^{-1}S$ corresponds to the following equality, which holds for every language $S \subseteq \Sigma^*$ :

: $S = (\{\varepsilon\} \cap S) \cup \bigcup_{a \in \Sigma} a(a^{-1}S).$

Derivatives of generalized regular expressions

When a language is given by a regular expression, the concept of derivatives leads to an algorithm for deciding whether a given word belongs to the regular expression.

Given a finite alphabet A of symbols,Brzozowski (1964), p.481, required A to consist of the 2ⁿ combinations of n bits, for some n. a generalized regular expression R denotes a possibly infinite set of finite-length strings over the alphabet A, called the language of R, denoted L(R).

A generalized regular expression can be one of the following (where a is a symbol of the alphabet A, and R and S are generalized regular expressions):

"∅" denotes the empty set: L(∅) = {},
"ε" denotes the singleton set containing the empty string: L(ε) = {ε},
"a" denotes the singleton set containing the single-symbol string a: L(a) = {a},
"R∨S" denotes the union of R and S: L(R∨S) = L(R) ∪ L(S),
"R∧S" denotes the intersection of R and S: L(R∧S) = L(R) ∩ L(S),
"¬R" denotes the complement of R (with respect to A*, the set of all strings over A): L(¬R) = A* \ L(R),
"RS" denotes the concatenation of R and S: L(RS) = L(R) · L(S),
"R*" denotes the Kleene closure of R: L(R*) = L(R)*.

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

=Computation=

For any given generalized regular expression R and any string u, the derivative u⁻¹R is again a generalized regular expression (denoting the language u⁻¹L(R)).Brzozowski (1964), p.483, Theorem 4.1

It may be computed recursively as follows.Brzozowski (1964), p.483, Theorem 3.2

style="text-align:right;" \| (ua)⁻¹R	= a⁻¹(u⁻¹R)	for a symbol a and a string u
style="text-align:right;" \| ε⁻¹R	= R

Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a.

The latter can be computed as follows:Brzozowski (1964), p.483, Theorem 3.1

style="font-family: 'Nimbus Roman No9 L', 'Times New Roman', Times, serif; line-height: 1; white-space: nowrap; font-variant-numeric: lining-nums tabular-nums; font-kerning:none"
style="text-align:right;" \| a⁻¹a	= ε
style="text-align:right;" \| a⁻¹b	= ∅	for each symbol b≠a
style="text-align:right;" \| a⁻¹ε	= ∅
style="text-align:right;" \| a⁻¹∅	= ∅
style="text-align:right;" \| a⁻¹(R*)	= (a⁻¹R)R*
style="text-align:right;" \| a⁻¹(RS)	= (a⁻¹R)S ∨ ν(R)a⁻¹S
style="text-align:right;" \| a⁻¹(R∧S)	= (a⁻¹R) ∧ (a⁻¹S)
style="text-align:right;" \| a⁻¹(R∨S)	= (a⁻¹R) ∨ (a⁻¹S)
style="text-align:right;" \| a⁻¹(¬R)	= ¬(a⁻¹R)

Here, {{math|ν(R)}} is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R{{'s}} language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:Brzozowski (1964), p.482, Definition 3.2

style="font-family: 'Nimbus Roman No9 L', 'Times New Roman', Times, serif; line-height: 1; white-space: nowrap; font-variant-numeric: lining-nums tabular-nums; font-kerning:none"
style="text-align:right;" \| ν(a)	= ∅	for any symbol a
style="text-align:right;" \| ν(ε)	= ε
style="text-align:right;" \| ν(∅)	= ∅
style="text-align:right;" \| ν(R*)	= ε
style="text-align:right;" \| ν(RS)	= ν(R) ∧ ν(S)
style="text-align:right;" \| ν(R ∧ S)	= ν(R) ∧ ν(S)
style="text-align:right;" \| ν(R ∨ S)	= ν(R) ∨ ν(S)
style="text-align:right;" \| ν(¬R)	= ε	if ν(R) = ∅
style="text-align:right;" \| ν(¬R)	= ∅	if ν(R) = ε

=Properties=

A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u⁻¹R.Brzozowski (1964), p.483, Theorem 4.2

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by d_R, all these languages can be obtained as derivatives of R with respect to strings of length less than d_R.Brzozowski (1964), p.484, Theorem 4.3 Furthermore, there is a complete deterministic finite automaton with d_R states that recognises the regular language given by R, as stated by the Myhill–Nerode theorem.

Derivatives of context-free languages

Derivatives are also effectively computable for recursively defined equations with regular expression operators, which are equivalent to context-free grammars. This insight was used to derive parsing algorithms for context-free languages.{{cite conference

| author1=Matthew Might

| author2=David Darais

| author3=Daniel Spiewak

| title=Parsing with derivatives: a functional pearl

| conference=Proceeding of the 16th ACM SIGPLAN international conference on Functional Programming (ICFP)

| year=2011

| pages=189–195

| doi=10.1145/2034773.2034801}}

Implementation of such algorithms have shown to have cubic time complexity,{{cite conference

| author1=Michael D. Adams

| author2=Celeste Hollenbeck

| author3=Matthew Might

| title=Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

| chapter=On the complexity and performance of parsing with derivatives

| conference=Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)

| year=2016

| pages=224–236

| doi=10.1145/2908080.2908128| isbn=9781450342612

| doi-access=free

| arxiv=1604.04695

}}

corresponding to the complexity of the Earley parser on general context-free grammars.

References

Category:Formal languages