Regular expression

The educational technology and digital learning wiki

Draft

Definition

Regular expressions (regexps) provide a formalism for identifying patterns in text and in any other kind of code. For example, programmers use them to find and replace text in source code, scripts can use regexps to translate code from one format into another (e.g. HTML to wiki markup), and JavaScript programs may use regexps to check whether the data a user entered in an HTML form is valid.


In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

The following examples illustrate a few specifications that could be expressed in a regular expression:

  • the sequence of characters "car" in any context, such as "car", "cartoon", or "bicarbonate"
  • the word "car" when it appears as an isolated word
  • the word "car" when preceded by the word "blue" or "red"
  • a dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits

Regular expressions can be much more complex than these examples.

(Wikipedia, retrieved 16:52, 29 August 2008 (UTC)).
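The four specifications listed above can be sketched with Python's re module. This is only an illustration: the sample strings are made up, and the third pattern assumes that "blue" or "red" is separated from "car" by a single space.

```python
import re

# Hypothetical sample strings, chosen only for illustration
samples = ["car", "cartoon", "bicarbonate", "a red car", "blue car", "my car"]

# "car" in any context
anywhere = re.compile(r"car")
# "car" as an isolated word (\b marks a word boundary)
isolated = re.compile(r"\bcar\b")
# "car" preceded by the word "blue" or "red" (assumes one space in between)
colored = re.compile(r"\b(?:blue|red) car\b")
# a dollar sign, one or more digits, optionally a period and two more digits
price = re.compile(r"\$[0-9]+(?:\.[0-9]{2})?")

print([s for s in samples if anywhere.search(s)])   # all six samples
print([s for s in samples if isolated.search(s)])   # ['car', 'a red car', 'blue car', 'my car']
print([s for s in samples if colored.search(s)])    # ['a red car', 'blue car']
print(price.findall("Tickets cost $5 or $12.50 each"))  # ['$5', '$12.50']
```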

There exist several definitions, standards, and implementations of regexps. They share a common core. The most popular ones are (see the Wikipedia article for details):

  • POSIX Basic Regular expressions (BRE)
  • POSIX Extended Regular expressions (ERE)
  • Perl-derivative regular expressions

Note: Regular expressions, although useful, are difficult to learn, and mostly computer programmers use them. However, HTML and XML coders may consider learning some. E.g. if you plan to use JavaScript form validation code, it is a good thing to know some.

Examples

Removing HTML code

The following pattern matches the opening tags of both img and a elements (source: StackOverflow, http://stackoverflow.com/questions/3790681/regular-expression-to-remove-html-tags):

<(img|a)[^>]*>

The following pattern matches the opening span tag, which can then be removed by replacing it with an empty string:

<span[^>]*>
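As a minimal sketch, the two patterns above can be applied with Python's re.sub to delete the matched opening tags. The HTML snippet is a made-up example. Note that the first pattern would also match other tags whose name starts with "a" (such as abbr), and that the closing tags are left in place.

```python
import re

# Hypothetical HTML fragment used only for this demonstration
html = '<p>See <a href="page.html">this <span class="hl">page</span></a> <img src="x.png"></p>'

open_img_or_a = re.compile(r"<(img|a)[^>]*>")
open_span = re.compile(r"<span[^>]*>")

# Replace every match with the empty string, i.e. remove the opening tag
without_links = open_img_or_a.sub("", html)
without_span = open_span.sub("", html)

print(without_links)  # <p>See this <span class="hl">page</span></a> </p>
print(without_span)   # <p>See <a href="page.html">this page</span></a> <img src="x.png"></p>

# Caveat: "<a" followed by [^>]* also matches tags such as <abbr>
print(bool(open_img_or_a.search("<abbr>")))  # True
```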

Zip code and email address

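The following pattern matches a somewhat legal Swiss zip code, i.e. the prefix CH- followed by four or more digits (real Swiss postal codes have exactly four digits, so {4} instead of {4,} would be stricter):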

CH-[0-9]{4,}
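The difference between the loose quantifier {4,} and a strict {4} can be checked with a short Python sketch. The strict variant and the candidate strings are assumptions added for illustration, not part of the original example.

```python
import re

loose = re.compile(r"CH-[0-9]{4,}")   # pattern from the text: four or more digits
strict = re.compile(r"CH-[0-9]{4}")   # assumed stricter variant: exactly four digits

# fullmatch requires the whole string to match, not just a part of it
for candidate in ["CH-1211", "CH-12345", "CH-121", "1211"]:
    print(candidate,
          loose.fullmatch(candidate) is not None,
          strict.fullmatch(candidate) is not None)
# CH-1211 True True
# CH-12345 True False
# CH-121 False False
# 1211 False False
```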

The following pattern matches most common email addresses when it is used with case-insensitive matching, since it only lists uppercase letters (example from http://www.regular-expressions.info/email.html):

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
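A possible way to try out the email pattern in Python is sketched below. The helper name looks_like_email and the test addresses are hypothetical. The sketch also shows two limits of the pattern: it needs the case-insensitive flag, and the {2,4} quantifier rejects top-level domains longer than four letters.

```python
import re

EMAIL = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"

# The pattern lists only uppercase letters, so compile it case-insensitively
email_re = re.compile(EMAIL, re.IGNORECASE)
case_sensitive_re = re.compile(EMAIL)


def looks_like_email(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches the pattern."""
    return email_re.fullmatch(text) is not None


print(looks_like_email("john.doe@example.com"))   # True
print(looks_like_email("john.doe@example"))       # False, no top-level domain
print(looks_like_email("info@example.museum"))    # False, TLD longer than 4 letters
print(case_sensitive_re.fullmatch("john.doe@example.com") is not None)  # False
```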

Replacing Wiki text

Administrators of this wiki can use RegExps to make mass changes to pages.

Example one

The example below, taken from the French version of this wiki, shows how to replace

[[Flash CS4 - Composant bouton]]

by

[[Flash CS5 - Composant bouton]] ([[Flash CS4 - Composant bouton|CS4]])

Note how the square brackets [[ ]] have to be escaped with backslashes in the search pattern:

/\[\[Flash CS4 - Composant bouton\]\]/
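Outside the wiki's own replace tool, the same substitution could be sketched in Python as follows. The sample page text is made up, and re.escape is shown as an alternative to escaping the brackets by hand.

```python
import re

# Search pattern from the text, without the surrounding / delimiters
pattern = re.compile(r"\[\[Flash CS4 - Composant bouton\]\]")
replacement = "[[Flash CS5 - Composant bouton]] ([[Flash CS4 - Composant bouton|CS4]])"

# Hypothetical page content
page = "See [[Flash CS4 - Composant bouton]] for details."

updated = pattern.sub(replacement, page)
print(updated)
# See [[Flash CS5 - Composant bouton]] ([[Flash CS4 - Composant bouton|CS4]]) for details.

# re.escape builds an equivalent pattern from the literal text
escaped = re.compile(re.escape("[[Flash CS4 - Composant bouton]]"))
print(escaped.sub(replacement, page) == updated)  # True

# The new text contains "bouton|CS4]]", which the pattern does not match,
# so running the replacement a second time changes nothing
print(pattern.sub(replacement, updated) == updated)  # True
```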

Example two

Removing the pageby tag together with its arguments.

Search for:

/<pageby nominor="false" comments="false"\/>/

Search for (a bit dangerous, since .* is greedy and matches up to the last /> it can reach):

/<pageby.*\/>/

Replace with:

<!--  -->
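Why the second search pattern is dangerous can be seen in a small Python sketch. The page line is a made-up example, and the pattern using [^>]* is an assumed safer alternative, not part of the original recipe.

```python
import re

# Hypothetical wiki line with a second self-closing tag after the pageby tag
page = '<pageby nominor="false" comments="false"/> Intro text <br/> more text'

greedy = re.compile(r"<pageby.*/>")      # pattern from the text
safer = re.compile(r"<pageby[^>]*/>")    # assumed alternative: stop at the first >

# The greedy pattern swallows everything up to the last "/>" on the line
print(greedy.sub("<!--  -->", page))  # <!--  --> more text

# The alternative only removes the pageby tag itself
print(safer.sub("<!--  -->", page))   # <!--  --> Intro text <br/> more text
```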

Mass editing HTML files with Perl

Let's assume that you would like to add a line after each <body .....> tag.

First, let's find all files that contain a body tag:

find . -type f -print | xargs grep -l "<body\(.*\)>"

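Then do the replacement (that's hairier). The group uses [^>]* rather than .* so that it stops at the first > after body instead of running on to the last > on the line:

find . -type f -print | xargs perl -i~ -pe "s:<body([^>]*)>:<body\\1> <p>Something new</p>:g"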

Explanations:

  • -pe means to execute a one-line Perl command (i.e. the expression within the " .....") on every line of each file and to print the result
  • -i~ means to edit the original file in place, but to keep a backup copy with a "~" appended to its name
  • The pattern s/search regexp/replacement/g defines a search/replace operation
  • Since there is a / in the closing p tag, we use ":" instead of "/" as the delimiter
  • The final :g means to replace all occurrences on a line (not actually needed in our case).
  • The grouping parentheses () don't need to be escaped in Perl. I hate regexps :)
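For readers who prefer a script to a one-liner, the same idea can be sketched in Python. This is an illustration under assumptions: the function name add_after_body and the demo file are hypothetical, the whole file is read at once instead of line by line, and the pattern uses [^>]* so that the match stops at the first > after body.

```python
import re
import shutil
import tempfile
from pathlib import Path

BODY_TAG = re.compile(r"<body([^>]*)>")


def add_after_body(path: Path, extra: str = " <p>Something new</p>") -> bool:
    """Insert `extra` after each opening body tag; keep a backup named '<file>~'."""
    text = path.read_text(encoding="utf-8")
    # m.group(0) is the whole matched tag, so its attributes are preserved
    new_text, count = BODY_TAG.subn(lambda m: m.group(0) + extra, text)
    if count == 0:
        return False  # no body tag found, leave the file untouched
    shutil.copy2(path, path.with_name(path.name + "~"))  # like perl -i~
    path.write_text(new_text, encoding="utf-8")
    return True


# Demonstration on a throwaway file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_text('<html><body class="main"><h1>Title</h1></body></html>',
                    encoding="utf-8")
    changed = add_after_body(page)
    result = page.read_text(encoding="utf-8")
    backup = (Path(tmp) / "index.html~").read_text(encoding="utf-8")

print(result)
# <html><body class="main"> <p>Something new</p><h1>Title</h1></body></html>
```

To process a whole directory tree, the function could be called for each path returned by Path(".").rglob("*.html"), which plays the role of the find command above.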

Software

Links

Overviews

Tutorials

Programming languages

In JavaScript

Cheatsheets

Online tools