Difference between revisions of "Regular expressions"

From Organic Design wiki
In [[W:computing|computing]], a '''regular expression''' is a [[W:String (computer science)|string]] that describes or matches a [[W:set (computer science)|set]] of strings according to certain [[W:syntax|syntax]] rules. Put simply, a regular expression is a string that describes a pattern.
  
There are two types of regular expression engine: text-directed and regex-directed. To find out which type an engine is, apply the regex <span class="regex">regex|regex not</span> to the string <span class="string">regex not</span>. If the resulting match is only <span class="match">regex</span>, the engine is regex-directed. If the result is <span class="match">regex not</span>, it is text-directed. The reason is that a regex-directed engine is "eager": it stops at the first alternative that produces a match instead of looking for the longest one. Perl and PHP use regex-directed engines, and they will always match at the earliest possible point in the string. See [http://perl.plover.com/Regex/article.html How Regexes Work] for details.
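
The test described above can be tried directly in Perl. The following is only a minimal sketch; the helper name <code>engine_type</code> is made up for this illustration and is not part of Perl.

<source lang="perl">
use strict;
use warnings;

# Hypothetical helper: runs the test from the text and names the engine type.
sub engine_type {
    my ($matched) = "regex not" =~ /(regex|regex not)/;
    return $matched eq "regex" ? "regex-directed" : "text-directed";
}

print engine_type(), "\n";    # Perl reports: regex-directed

# Listing the longer alternative first lets an eager engine find it.
my ($longer) = "regex not" =~ /(regex not|regex)/;
print "$longer\n";            # regex not
</source>

In an eager engine the order of the alternatives therefore matters, which is why longer alternatives are often listed first.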
 
Using a regular expression generally involves two stages of [[W:Parsing|parsing]]: in the first stage any variables in the pattern are substituted, and in the second stage the resulting pattern and the string are sent to the RE engine.
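
As a rough illustration of the two stages in Perl, the sketch below substitutes variables into a pattern before matching. The variable names and sample strings are invented for the example. <code>\Q...\E</code> (the inline form of quotemeta) is included because substituted text is otherwise treated as pattern syntax.

<source lang="perl">
use strict;
use warnings;

my $word  = "World";
my $input = "a.c";    # hypothetical input containing the metacharacter "."

# Stage 1 substitutes $word, so stage 2 hands /Hello World/ to the engine.
my $greeting = "Hello World" =~ /Hello $word/ ? "match" : "no match";

# Substituted text is still pattern syntax: "." matches any character here.
my $loose = "abc" =~ /\A$input\z/ ? "match" : "no match";

# \Q...\E escapes the substituted text so it is matched literally.
my $literal = "abc" =~ /\A\Q$input\E\z/ ? "match" : "no match";

print "$greeting, $loose, $literal\n";    # match, match, no match
</source>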
 
Regular expression engines are [[W:Lexical analysis|token engines]], where a token is either an [[W:Assertion (computing)|assertion]] or an atom. Every token tests some property of the string: zero-width tokens such as ''\b'' are termed assertions, whereas tokens with nonzero width, such as any [[W:ASCII|ASCII]] or binary character, are atoms. Internally, the token engine matches the string against the pattern from left to right, consuming characters as atoms match; when an attempt fails, it moves on to try again from the next position in the string.
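
To see that an assertion such as ''\b'' has zero width, compare the pattern with the span of text it actually consumes. This is a small sketch: <code>match_span</code> is a hypothetical helper, and it relies on Perl's built-in <code>@-</code> and <code>@+</code> arrays, which hold the start and end offsets of the last successful match.

<source lang="perl">
use strict;
use warnings;

# Hypothetical helper: returns the (start, end) offsets of the match, or an empty list.
sub match_span {
    my ($text, $re) = @_;
    return $text =~ $re ? ($-[0], $+[0]) : ();
}

# The two \b assertions test for word boundaries but consume nothing,
# so the match covers only the three atoms c, a, t.
my @span = match_span("a cat sat", qr/\bcat\b/);
print "@span\n";                # 2 5

# Here the atoms could match, but the assertions fail:
# there is no word boundary around "cat" inside "concatenate".
my @none = match_span("concatenate", qr/\bcat\b/);
print scalar(@none), "\n";      # 0
</source>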
 
;When a regexp can match a string in several different ways, we can use the following principles to predict which way the regexp will match:

*Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.
:
*Principle 1: In an alternation a|b|c..., the leftmost alternative that allows a match for the whole regexp will be the one used.
:
*Principle 2: The maximal matching quantifiers ?, *, + and {n,m} will in general match as much of the string as possible while still allowing the whole regexp to match.
:
*Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.
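
The four principles can be checked with a few small matches. The sketch below is illustrative only; the sample strings are invented and <code>captures</code> is a hypothetical helper that joins the captured groups with "|".

<source lang="perl">
use strict;
use warnings;

# Hypothetical helper: joins the captured groups of a match with "|".
sub captures {
    my ($text, $re) = @_;
    my @groups = $text =~ $re;
    return join "|", @groups;
}

# Principle 0: the earliest position wins, even though "dog" is listed first.
print captures("a cat and a dog", qr/(dog|cat)/), "\n";    # cat

# Principle 1: the leftmost alternative that lets the whole regexp match is used.
print captures("ab", qr/(a|ab)/), "\n";                    # a
print captures("ab", qr/^(a|ab)$/), "\n";                  # ab

# Principle 2: a greedy quantifier takes as much as it can
# while the whole regexp can still match.
print captures("aaaa", qr/(a+)a/), "\n";                   # aaa

# Principle 3: the leftmost greedy quantifier is served first.
print captures("aaaa", qr/^(a+)(a*)$/), "\n";              # aaaa|
print captures("aaaa", qr/^(a+)(a+)$/), "\n";              # aaa|a
</source>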
 
== An example ==
<source lang="perl">
"Hello World" =~ m/World/;  # matches
</source>
What is this Perl statement all about? "Hello World" is a simple double-quoted string. World is the regular expression, and the // enclosing it in m/World/ tells Perl to search a string for a match. The operator =~ associates the string with the regexp match and produces a true value if the regexp matched, or false if it did not. In our case, World matches the second word in "Hello World", so the expression is true.
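
Because the match produces a true or false value, it is normally used in a conditional. A brief sketch follows; the variable name and messages are arbitrary, and <code>!~</code> is Perl's negated form of <code>=~</code>.

<source lang="perl">
use strict;
use warnings;

my $greeting = "Hello World";

# The result of =~ can be tested directly.
if ($greeting =~ m/World/) {
    print "It matches\n";
}

# !~ is true when the regexp does not match;
# matching is case-sensitive by default, so "world" is not found.
if ($greeting !~ m/world/) {
    print "No lowercase 'world' here\n";
}
</source>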
 
Wikipedia provides a useful overview of [http://en.wikipedia.org/wiki/Regular_expression regular expressions] and their history.
  
== Tutorials ==
*[https://regex101.com/ Regex101.com] ''- awesome site for regex learning and testing''
*[http://perldoc.perl.org/perlrequick.html Perl regular expressions quick start]
*[http://perldoc.perl.org/perlretut.html Perl regular expressions tutorial]
*[http://www.regular-expressions.info/engine.html Regular expression info]
*[http://www.johndcook.com/blog/2008/01/14/tips-for-learning-regular-expressions/ Tips for learning regular expressions]
  
== Reference books ==
*[http://www.oreilly.com/catalog/regex/ Mastering regular expressions]
  
== Engines ==
*[http://www.pcre.org/ Perl compatible regular expressions (PCRE)]
*[http://laurikari.net/tre/ TRE Regular expression engine in C]
*[http://www.regexlab.com/en/deelx/download.htm DEELX Regular expression engine in C++]/[http://www.codeproject.com/useritems/deelx.asp DEELX]

== Perl ==
*[[Wikipedia:Regular expression]]
*[[w:Perl_6_rules|Wikipedia Perl 6 Rules]]
*[http://perl.plover.com/Regex/article.html How Regexes Work]
*[http://www.programmersheaven.com/2/Perl6-FAQ-Regex Perl 6 regex FAQ]
*[http://perldoc.perl.org/functions/quotemeta.html Quotemeta]

== See also ==
*[http://perldoc.perl.org/perlrun.html How to execute the Perl interpreter]
*[http://regexlib.com/DisplayPatterns.aspx Useful regular expression pattern library repository]
*[http://txt2re.com txt2re.com] ''- a useful regular expression generator tool''
*[http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ Regular expressions cheatsheet]
*[http://montreal.pm.org/tech/neil_kandalgaonkar.shtml A regular expression to check if a number is prime]
*[http://www.johndcook.com/blog/2010/10/20/good-old-regular-expressions/ Good old regular expressions] ''- two awesome regex examples from [http://www.amazon.com/gp/product/013937681X?ie=UTF8&tag=theende-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=013937681X The Unix Programming Environment] book''
[[Category:Tutorials]][[Category:PERL]]

Latest revision as of 07:17, 17 February 2021
