matching wildcards

{{short description|Algorithm to compare text strings using wildcard syntax}}

In computer science, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax.{{cite web| title=Wildcard characters| publisher=ScienceDirect| year=2018| url=http://help.sciencedirect.com/Content/st_wildcards.htm}} Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell{{cite book| last= Quigley| first=Ellie| title=UNIX Shell Programming QuickStart| publisher=InformIT.com| year=2005| url=http://www.informit.com/articles/article.aspx?p=350778&seqNum=4}} or Microsoft Windows command-line{{cite web| title=MS-DOS and Windows Wildcard Characters| date=31 May 2018| publisher=Microsoft Developer Network Library| url=https://msdn.microsoft.com/en-us/library/ms690414(v=vs.85).aspx}} or text editor or file manager, as well as the interfaces for some search engines{{cite web| title=Apache Lucene - Query Parser Syntax| publisher=Apache Lucene 2.9.4 Documentation| year=2006| url=https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Wildcard%20Searches}} and databases.{{cite web| title=SQL Wildcards| publisher=W3Schools| year=2018| url=https://www.w3schools.com/sql/sql_wildcards.asp}} Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.{{cite web| last=Goyvaerts| first=Jan| title=Welcome to Regular-Expressions.info| publisher=RegularExpressions.info| year=2018| url=https://www.regular-expressions.info/}}

The problem

A wildcard matcher tests a wildcard pattern p against an input string s. It performs an anchored match, returns true only when p matches the entirety of s.

The pattern can be based on any common syntax (see globbing), but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime:{{cite web |title=Wildcard Expansion |url=https://docs.microsoft.com/en-us/cpp/cpp/wildcard-expansion |website=docs.microsoft.com |date=8 February 2022 |language=en-us}}

  • No escape characters are defined
  • Wildcards: {{code|?}} matches exactly one occurrence of any character. {{code|*}} matches arbitrary many (including zero) occurrences of any character.

This article mainly discusses the Windows formulation of the problem, unless otherwise stated.

= Definition =

Stated in zero-based indices, the wildcard-matching problem can be defined recursively as:

:

\begin{aligned}

m_{00} &= (p_{0} = t_{0}) \\

m_{0j} &= (p_{j-1} = \text{‘*’}) \land m_{0,j-1}\\

m_{i0} & = \text{false} \\

m_{ij} &=

\begin{cases}

m_{i-1, j-1} & \text{for}\; p_{i-1} = t_{j-1} \lor p_{i-1} = \text{‘?’}\\

m_{i, j-1} \lor m_{i-1, j} & \text{for}\; p_{i-1} = \text{‘*’}\\

\text{false} & \text{for}\; p_{i-1} \neq t_{j-1}

\end{cases} & & \quad \text{for}\; 1 \leq i \le |p|, 1 \leq j \le |t|.

\end{aligned}

where mij is the result of matching the pattern p against the text t truncated at i and j characters respectively. This is the formulation used by Richter's algorithm and the Snippets algorithm found in Cantatore's collection. This description is similar to the Levenshtein distance.

= Related problems =

Directly related problems in computer science include:

  • Pattern matching with don't cares or gaps, an unanchored string search with only the equivalent of {{code|?}} defined.{{cite journal |last1=Iliopoulos |first1=Costas S. |last2=Rahman |first2=M. Sohel |title=Pattern Matching Algorithms with Don't Cares |journal=SOFSEM 2007: Theory and Practice of Computer Science, 33rd Conference on Current Trends in Theory and Practice of Computer Science |date=2007 |url=https://pdfs.semanticscholar.org/fd93/e240369270e3f2958de515dcb42a53647de2.pdf |archive-url=https://web.archive.org/web/20191217184544/https://pdfs.semanticscholar.org/fd93/e240369270e3f2958de515dcb42a53647de2.pdf |url-status=dead |archive-date=2019-12-17 |location=Harrachov, Czech Republic|s2cid=14538871 }}{{cite journal |last1=Clifford |first1=Peter |last2=Clifford |first2=Raphaël |title=Simple deterministic wildcard matching |journal=Information Processing Letters |date=January 2007 |volume=101 |issue=2 |pages=53–54 |doi=10.1016/j.ipl.2006.08.002}}
  • Pattern matching with wildcards, an unanchored string search with the equivalent of both wildcards defined. Has an exponential runtime unless a length-bound is given in the pattern matching with flexible wildcards variant.{{cite journal |last1=Wu |first1=Xindong |last2=Qiang |first2=Ji-Peng |last3=Xie |first3=Fei |title=Pattern Matching with Flexible Wildcards |journal=Journal of Computer Science and Technology |date=12 September 2014 |volume=29 |issue=5 |pages=740–750 |doi=10.1007/s11390-014-1464-3|s2cid=16824910 }}

History

Early algorithms for matching wildcards often relied on recursion, but the technique was criticized on grounds of performance and reliability{{cite magazine| last=Krauss| first=Kirk| title=Matching Wildcards: An Algorithm| magazine=Dr. Dobb's Journal| year=2008| url=http://www.drdobbs.com/architecture-and-design/matching-wildcards-an-algorithm/210200888}} considerations. Non-recursive algorithms for matching wildcards have gained favor in light of these considerations.

Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.

Recursive algorithms

The recursion generally happens on matching * when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.

  • Rich Salz' wildmat algorithm (sh-like syntax){{cite web| last=Salz| first=Rich| title=wildmat.c| publisher=GitHub| year=1991| url=https://github.com/trevor/transmission/blob/master/libtransmission/wildmat.c}}
  • Filip's algorithm{{cite web| author=Filip| title=Compare strings with wildcard| publisher=Stack Overflow| year=2014| url=https://stackoverflow.com/questions/23457305/compare-strings-with-wildcard}} and Vignesh Murugesan's algorithm{{cite web| last=Murugesan| first=Vignesh| year=2014| title=WildCard Matching algorithm| url=https://vigneshmurugesan.wordpress.com/2014/03/13/wildcard-matching-algorithm/}}
  • Martin Richter's algorithm{{cite web| author=Deadlock| title=Wildcard Matching Recursive Algorithm C++| publisher=Stack Overflow| year=2015| url=https://stackoverflow.com/questions/31962321/wildcard-matching-recursive-algorithm-c}} (identical to Snippets and related to the 7-zip algorithm)
  • C library fnmatch implementations (supports [...] and multibyte character sets):
  • Guido van Rossum's BSD libc fnmatch,{{cite web |last1=van Rossum |first1=Guido |title=freebsd/lib/libc/gen/fnmatch.c |website=GitHub |url=https://github.com/lattera/freebsd/blob/master/lib/libc/gen/fnmatch.c |accessdate=21 November 2019 |date=20 November 2019}} also part of Apple libc{{cite web| title=fnmatch.c| publisher=opensource.apple.com| year=1999| url=https://opensource.apple.com/source/Libc/Libc-167/gen.subproj/fnmatch.c.auto.html}}
  • Glibc fnmatch{{cite web |title=fnmatch_internal.c |url=https://github.com/bminor/glibc/blob/master/posix/fnmatch_loop.c |publisher=Beren Minor's Mirrors |date=21 November 2019}}

The general form of these algorithms are the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when ONE of the substrings return a positive match. For {{code|dowild("*X", "abcX")}}, it would greedily call {{code|dowild("X", "abcX")}}, {{code|dowild("X", "bcX")}}, {{code|dowild("X", "cX")}} and {{code|dowild("X", "X")}}. They usually differ by less-important things like support for features and by more important factors such as minor but highly effective optimizations. Some of them include:

  • The ABORT signal against over-recursion (Lars Mathiesen 1991). While it is correct to naively recurse by the entire rest of the strings (pattern and text) on * and making sure that ONE of the substrings return a positive match, the running time becomes exponential for rejecting a match with many * in the text. Lars Mathiesen changes the return to three classes, match, no-match, and ABORT (no match possible at all for asterisk recursion.) The ABORT value is returned when the text is consumed too early or when another asterisk match has failed, guaranteeing a linear performance with respect to the number of asterisks. (The overall complexity is additionally quadratic to the number of characters left to match.) Git/Rsync's wildmatch ABORT also covers invalid inputs.{{cite web |title=git/git: wildmatch.c |url=https://github.com/git/git/blob/master/wildmatch.c |website=GitHub|date=2020-01-20 }} The new INN uwildmat does the same.
  • Asterisk advancement in recursion. This wildmatch tweak is relatively more minor. It applies to when the recursion wants to match "*X" on "abcX": when an asterisk is followed by a literal like "X", it is obvious that only the last comparison with equal lengths would have a chance of producing a match. This is seen earlier in uwildmat in 2000{{cite web |title=uwildmat.c in trunk/lib – INN |url=https://inn.eyrie.org/trac/browser/trunk/lib/uwildmat.c |website=inn.eyrie.org |accessdate=27 November 2019}} and more implicitly in van Rossum's fnmatch for {{code|FNM_PATHNAME}}.

Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses into increasing either of the indexes, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well. On typical patterns (as tested by Cantatore) it is slower than the greedy-call implementations.

The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.

Non-recursive algorithms

The following are developed by critics of the recursive algorithms:

  • Kirk J. Krauss's wildcard-matching algorithm, used by IBM{{cite web| last=Krauss| first=Kirk| title=Matching Wildcards: An Improved Algorithm for Big Data| publisher=Develop for Performance| year=2018| url=http://www.developforperformance.com/MatchingWildcards_AnImprovedAlgorithmForBigData.html}}
  • Alessandro Cantatore's collection of wildcard matching algorithms{{cite web| last=Cantatore| first=Alessandro| title=Wildcard matching algorithms| year=2003| url=http://xoomer.virgilio.it/acantato/dev/wildcard/wildmatch.html}}
  • Dogan Kurt's iterative matcher and slower NFA matcher.{{cite web |last1=Kurt |first1=Dogan |title=Wildcard Matching Methods |url=http://dodobyte.com/wildcard.html}}
  • Siler's incorrect algorithm (fails {{code|MATCH("da*da*da*", "daaadabadmanda")}}){{cite web| author=Siler| title=Recursive solutions for glob pattern matching| publisher=Stack Overflow| year=2013| url=https://stackoverflow.com/questions/30823596/recursive-solutions-for-glob-pattern-matching}}

The following is not:

  • Jack Handy's incorrect algorithm{{cite web| last=Handy| first=Jack| title=Wildcard string compare (globbing)| publisher=Code Project| year=2005| url=https://www.codeproject.com/Articles/1088/Wildcard-string-compare-globbing}} (fails {{code|MATCH("*?", "xx")}})

The iterative functions above implement backtracking by saving an old set of pattern/text pointers, and reverting to it should a match fails. According to Kurt, since only one successful match is required, only one such set needs to be saved.

In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only has support for {{code|?}} and {{code|*}}.) The Russ Cox implementation of Thompson NFA can be trivially modified for such.{{cite web |last1=Cox |first1=Ross |title=Regular Expression Matching Can Be Simple And Fast |url=https://swtch.com/~rsc/regexp/regexp1.html}} Gustavo Navarro's {{abbr|BDM|Backward DAWG Match}}-based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes.{{cite journal |last1=Navarro |first1=Gonzalo |title=NR-grep: a fast and flexible pattern-matching tool |journal=Software: Practice and Experience |date=10 November 2001 |volume=31 |issue=13 |pages=1265–1312 |doi=10.1002/spe.411 |s2cid=3175806 |url=https://users.dcc.uchile.cl/~gnavarro/ps/spe01.pdf}} See also {{section link|regular expression|Implementations}}.

See also

References