Help:Searching/Regex#Regular expressions

{{about | Cirrus regular expressions usable in Advanced search

|the main Wikimedia page about Cirrus regex search | mw:Help:CirrusSearch#Regular expression searches

|regular expressions with AutoWikiBrowser | Wikipedia:AutoWikiBrowser/Regular expression}}

To perform a regex search, use the ordinary search box with the syntax insource:/{{var|regex}}/ or intitle:/{{var|regex}}/ (see Help:Searching#insource: and Help:Searching#intitle:).

= Use regexes responsibly =

Because regex searching scans each page character by character, it is generally much slower than an index search. You can—and should—add additional search terms when using insource:/{{var|regex}}/ to reduce the amount of text being processed. For example:

  • polish insource:/polish/ finds pages that match a case-insensitive stemmed keyword search for "polish" (including "polished" or "polishing"); then does a case-sensitive regex search within those pages. Only pages that match both filters are returned.
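  • insource:polish insource:/polish/ is similar, but starts with a case-insensitive search of the source markup instead of the rendered page (so it will find usages that appear only in the markup, such as a piped link that is displayed as "Poles", and will not find text that appears only through transclusion).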
  • intitle:, incategory:, and linksto: are excellent filters.{{huh|date=August 2023}}
  • hastemplate: is a good filter.{{huh|date=August 2023}}

Adding an index-based search term to reduce the amount of text being scanned is important simply to make your own regex search finish in a reasonable amount of time. Regex searches that take too long will "time out" and return only partial results. Overuse of slow regex searches might cause temporary throttling of the feature for yourself and/or everyone on Wikipedia. (However, you cannot affect the site performance of Wikipedia as a whole simply by abusing regex search.) Remember that a single regex search can take multiple seconds, and there are currently {{NUMBEROFUSERS}} registered users on Wikipedia. Use regex search responsibly.
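
As a minimal sketch of the advice above, the following Python helper refuses to build a regex query unless at least one index-based filter accompanies it. The function name build_filtered_query is hypothetical and not part of MediaWiki. It only assembles the text to paste into the search box and assumes the regex is already correctly escaped.

```python
def build_filtered_query(regex, *filters):
    """Join index-based filter terms with an insource regex term.

    `filters` are ordinary search terms such as 'polish' or
    'insource:polish'; they are placed first, followed by the regex term.
    """
    if not filters:
        raise ValueError("add at least one index-based filter to narrow the regex scan")
    return " ".join(filters) + " insource:/" + regex + "/"


print(build_filtered_query("polish", "insource:polish"))
# prints: insource:polish insource:/polish/
```

The order of the terms is only a readability choice in this sketch. What matters is that the query contains at least one term that the index can answer.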

= Metacharacters =

MediaWiki's regular expression syntax works like this:

  • Most characters represent themselves. For example, insource:/C-3p0/ will search for pages containing the literal string "C-3p0" (case-sensitive).
  • The following metacharacters are treated specially: . + * ? | { [ ] ( ) " \ # @ < ~. Any metacharacter can be escaped by preceding it with a backslash \. Preceding any other character with a backslash is harmless. For example, insource:/yes\.\no/ will search for pages containing the literal string "yes.no" (case-sensitive). Regex experts should note that \n does not mean "newline," \d does not mean "digit," and so on: In MediaWiki syntax, the only use of \ is to escape metacharacters.
  • / is special because it indicates the end of the regex. For example, insource:/yes/no/ is treated the same as insource:/yes/ no (because the keyword search for no/ ignores punctuation). The / character must be backslash-escaped everywhere it appears inside a regex – even inside square brackets or quotation marks.
  • . matches any single character. For example, insource:/yes.no/ is matched by yes/no, yes no, yesuno, etc.
  • ( ) group a sequence of characters into an atomic unit.
  • | goes between two sequences and matches either of them. For example, insource:/a(g|ch)e/ matches either age or ache.
  • + matches the preceding character or group one or more times. For example, insource:/ab+(cd)+/ is matched by abcd, abbbcd, abbcdcd, etc. insource:/a(g|ch)+e/ matches agge, achgchchggche, etc.
  • * matches the preceding character or group any number of times (including zero). For example, insource:/ab*(cd)*/ is matched by a, abbb, acdcd, etc.
  • ? matches the preceding character or group exactly zero or one times.
  • { } match the preceding character or group a fixed number of times. For example, insource:/[a-z]{2}/ matches exactly 2 lowercase letters in a row. insource:/[a-z]{2,4}/ matches any string of 2, 3, or 4 lowercase letters. insource:/[a-z]{2,}/ matches any string of 2 or more lowercase letters.
  • [ ] introduce a character class, which matches a single instance of any of the characters in the class. For example, insource:/[Pp]olish/ matches both Polish and polish. Characters inside square brackets generally don't have to be escaped, although escaping them remains harmless, and / still needs to be escaped everywhere. For example, insource:/[.\/\]\n]/ matches a single instance of ., /, ], or n.
  • Inside a character class, the character ^ (if it appears first of all) represents negation, and the character - (unless it appears first or last) represents a range. For example, insource:/[A-Za-z0-9_]/ matches any alphanumeric character or underscore, and insource:/[^A-Za-z]/ matches any non-alphabetic character.
  • < > stand for numbers treated as numbers, not characters. For example, insource:/AD <476-1453>/ is matched by AD 476, AD 477, ... AD 1452, AD 1453, but not AD 1474. (But it will also match the first six characters of AD 4760.)
  • ~ "looks ahead" and negates the next character or group. For example, insource:/crab~(cake)c/ should match the first five characters of crabclaw but not the first five characters of crabcake.{{huh|date=August 2023}}
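
The numeric interval < > is the least familiar item in the list above, so here is a rough Python emulation of the behaviour described for insource:/AD <476-1453>/. It is a sketch of the semantics as stated on this page, not of the search engine's actual implementation. The function name matches_numeric_interval is hypothetical, and the handling of leading zeros is not addressed.

```python
DIGITS = "0123456789"


def matches_numeric_interval(text, prefix, low, high):
    """Return True if `text` contains `prefix` followed by digits whose
    numeric value lies in [low, high].

    The match is unanchored, so the matched digits may be followed by
    more characters, including more digits.
    """
    start = text.find(prefix)
    while start != -1:
        pos = start + len(prefix)
        end = pos
        # Try every leading run of digits: "4760" offers 4, 47, 476, 4760.
        while end < len(text) and text[end] in DIGITS:
            end += 1
            if low <= int(text[pos:end]) <= high:
                return True
        start = text.find(prefix, start + 1)
    return False


for sample in ["AD 476", "AD 1453", "AD 1474", "AD 4760"]:
    print(sample, matches_numeric_interval(sample, "AD ", 476, 1453))
# prints:
# AD 476 True
# AD 1453 True
# AD 1474 False
# AD 4760 True
```

The last line shows why AD 4760 is reported as a match: its first six characters, AD 476, already satisfy the interval.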

There are a few additional quirks of the syntax:

  • The metacharacter @ is a synonym for .* (match any sequence of characters at all).
  • A search insource:/0/ fails, although insource:/1/ and insource:/\0/ both succeed.
  • " " are an escape mechanism, like square brackets or the backslash. For example, insource:/".*"/ means the same thing as insource:/\.\*/.
  • The character # is also a metacharacter and must be escaped.{{huh|date=August 2023}}
  • Regex experts should note that ^ does not mean "beginning of text" and $ does not mean "end of text." Searching from the beginning or end of a Wikipedia page is not generally useful.
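
When the text to search for is a literal string, the escaping rules above can be applied mechanically. The following Python sketch backslash-escapes every metacharacter listed on this page, plus the / that would otherwise end the regex. The names CIRRUS_METACHARACTERS and escape_for_insource are hypothetical, and the character set is taken from this page rather than from the search engine's source code.

```python
# Metacharacters listed on this page, plus "/" which ends the regex.
CIRRUS_METACHARACTERS = set('.+*?|{[]()"\\#@<~/')


def escape_for_insource(literal):
    """Backslash-escape each metacharacter so the string is matched literally."""
    return "".join(
        "\\" + ch if ch in CIRRUS_METACHARACTERS else ch
        for ch in literal
    )


print("insource:/" + escape_for_insource("yes/no") + "/")
# prints: insource:/yes\/no/
```

Because escaping a non-metacharacter is harmless, a stricter variant could escape more characters than this, but it could not safely escape fewer.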

==Workarounds for some character classes==

Although the PCRE escapes \n, \s, and \S are not supported, you may use these workarounds:

{| class="wikitable"
! scope="col" | PCRE
! scope="col" | MediaWiki
! scope="col" | Description
|-
| \n
| [^ -􏿽]
| A newline (a tabulation character can also be found{{ref|a}})
|-
| [^\n]
| [ -􏿽]
| Any character except a newline and tabulation
|-
| \s
| [^!-􏿽]
| A whitespace character: space, newline, or tabulation
|-
| \S
| [!-􏿽]
| Any character except whitespace
|}

{{note|a||To exclude the tabulation character as well, [https://codepoints.net/U+0009 copy it] and add it to the character set.}}

In these ranges, " " (space) is the character immediately following the ASCII control characters (U+0000 to U+001F), "!" is the character immediately following space, and "􏿽" is U+10FFFF, the last code point in Unicode. Thus, the range from " " to "􏿽" includes all characters except those control characters (of which articles may contain newlines and tabulation), while the range from "!" to "􏿽" additionally excludes the space.
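
The ranges can be checked outside the search box. Python's re module is a different regex engine from the one used by MediaWiki search, but it orders characters inside [ ] by code point in the same way, so the following sketch illustrates which characters each workaround class accepts. The variable names are illustrative only.

```python
import re

LAST = "\U0010FFFF"  # U+10FFFF, the last code point in Unicode

newline_or_tab = re.compile("[^ -" + LAST + "]")  # workaround for \n
whitespace = re.compile("[^!-" + LAST + "]")      # workaround for \s
non_whitespace = re.compile("[!-" + LAST + "]")   # workaround for \S

for name, ch in [("newline", "\n"), ("tab", "\t"), ("space", " "), ("letter", "a")]:
    print(
        name,
        bool(newline_or_tab.fullmatch(ch)),
        bool(whitespace.fullmatch(ch)),
        bool(non_whitespace.fullmatch(ch)),
    )
# prints:
# newline True True False
# tab True True False
# space False True False
# letter False False True
```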

Notes