::Stuff for the multi-spec coder;

Coding, formats, standards, and other practical things.

An Introduction to Regular Expressions


An Effective Programming Concept


Rooted in theoretical computer science, regular expressions offer an effective and efficient tool for searching out the set of strings that match given criteria. Whether you are a programmer, an office assistant or a home user of a personal computer, sooner or later you face a situation that demands that you find a certain piece of text or a number in a sea of textual information. It may be a crowded directory, a lengthy word-processor document or an extensive body of program code; at some point you are faced with the tedious task of locating a certain string, number or alphanumeric character in a large accumulation of files. What do you do when faced with such a situation? You type that particular string and look for it with a find command. Of course, if you have any knowledge of wildcards like the question mark (?) and the asterisk (*), you use them to get closer and more exhaustive results, don't you? Well, that is basically the theme of 'regular expressions'! For example, the wildcard expression *.com is equivalent to the regular expression .*\.com when the whole name has to match.
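As a rough illustration of the wildcard comparison, here is a minimal sketch in Python; the language, the sample names and the variable names are assumptions made for this example and are not part of the article. It uses the standard fnmatch module for wildcard matching and the re module for the regular expression.

```python
import fnmatch
import re

# Hypothetical sample names, chosen only for illustration.
names = ["example.com", "example.org", "shop.example.com", "comet"]

# Wildcard matching: * stands for any run of characters.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.com")]

# The corresponding regular expression: .* is "any run of characters",
# and \. is a literal dot. fullmatch requires the whole name to match.
by_regex = [n for n in names if re.fullmatch(r".*\.com", n)]

# fnmatch can also translate a wildcard into a regex of its own.
translated = re.compile(fnmatch.translate("*.com"))
by_translated = [n for n in names if translated.match(n)]

print(by_wildcard)
print(by_regex)
print(by_translated)
```

All three lists should contain the same names. The exact regex text that fnmatch.translate produces differs between Python versions, so the sketch only compiles it and does not print it.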

Often abbreviated to regexp or regex in the singular and regexps, regexes or regexen in the plural, regular expressions are a gift of automata theory and formal language theory, two branches of theoretical computer science. A regular expression may be defined as a string that, following certain syntax rules, sets out criteria for matching a set of strings. Used extensively by text editors and utilities to search and manipulate bodies of text based on patterns, regular expressions have gradually become integral to string handling in most programming languages. For example, Perl and Tcl both ship with a powerful built-in regular expression engine. The Unix utilities, including the editor sed and the filter grep, were among the first to popularize the notion of regular expressions.

Also known as a pattern, a regular expression is a short and sweet description of a set of strings, and it is usually used to turn an otherwise tedious task into an easy job. Instead of listing every element of the set one by one in the search criteria, you give a crisp and concise description of the set. For example, suppose you wish to fetch results containing the three strings able, table and tablet. You can search for the three strings by tiresomely typing each one in turn, or you can use a pattern describing the set: t?ablet?. The pattern t?ablet? simply means that the target string consists of zero or one 't', followed by able, followed by zero or one 't'. In other words, the pattern matches each of the three strings (and, strictly speaking, ablet as well).
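The pattern t?ablet? can be tried out with a short sketch. Python and its re module are an assumption here, since the article does not tie itself to a language, and the helper matches_whole and the extra test words are made up for the illustration.

```python
import re

# Pattern from the text: optional 't', then 'able', then optional 't'.
PATTERN = re.compile(r"t?ablet?")


def matches_whole(word: str) -> bool:
    """Return True when the entire word matches the pattern."""
    return PATTERN.fullmatch(word) is not None


# The three target words plus two extra probes.
words = ["able", "table", "tablet", "ablet", "cable"]
results = {word: matches_whole(word) for word in words}
print(results)
```

The three target words match, "ablet" matches too because both 't's are optional, and "cable" does not match as a whole word.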

To take another example, say the five strings Pitcher, Pitter, Patter, Platter and Putter need to be found in a huge pool of text. A pattern that matches all five is P(i|l?a|u)t(t?|ch)er. It describes a string that starts with the letter 'P', followed by either i, l?a or u, then the letter 't', then either t? or ch, and finally er, where '?' stands for zero or one occurrence of the preceding character. The same pattern can also be written as P(i|la|a|u)t(t?|ch)er, with l?a spelled out as the two alternatives la and a; both forms match the same five strings. Note that both are slightly more permissive than the list itself and also accept a few other strings, such as Patcher.
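A small sketch can confirm that the two spellings of this pattern behave alike. It assumes Python's re module; the names COMPACT and EXPANDED and the extra probe words are hypothetical and serve only this example.

```python
import re

# The pattern from the text and its expanded spelling (l?a -> la|a).
COMPACT = re.compile(r"P(i|l?a|u)t(t?|ch)er")
EXPANDED = re.compile(r"P(i|la|a|u)t(t?|ch)er")

targets = ["Pitcher", "Pitter", "Patter", "Platter", "Putter"]
probes = ["Patcher", "Potter"]  # extra words, not from the article

# Every target should match under both spellings.
all_targets_match = all(
    COMPACT.fullmatch(w) and EXPANDED.fullmatch(w) for w in targets
)

# The two spellings should agree on every word we try.
agree = all(
    bool(COMPACT.fullmatch(w)) == bool(EXPANDED.fullmatch(w))
    for w in targets + probes
)

probe_results = {w: bool(COMPACT.fullmatch(w)) for w in probes}
print(all_targets_match, agree, probe_results)
```

"Patcher" is accepted because the pattern allows 'a' before 't' and 'ch' after it, while "Potter" is rejected because 'o' is not among the alternatives.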

The pattern ((great )*grand )?(father|mother) matches any ancestor: father, mother, grand father, grand mother, great grand father, great grand mother, great great grand father, great great grand mother, great great great grand father, great great great grand mother and so on.
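The nesting in the ancestor pattern is easier to see when it is run against a few phrases. The sketch below assumes Python's re module, and the helper count_greats is a hypothetical name introduced for illustration.

```python
import re
from typing import Optional

# Pattern from the text: any number of "great ", then "grand ",
# all of it optional, followed by "father" or "mother".
ANCESTOR = re.compile(r"((great )*grand )?(father|mother)")


def count_greats(phrase: str) -> Optional[int]:
    """Return how many times 'great ' occurs, or None if the phrase
    as a whole does not match the ancestor pattern."""
    if ANCESTOR.fullmatch(phrase) is None:
        return None
    return phrase.count("great ")


print(count_greats("father"))                    # matches, no 'great'
print(count_greats("grand mother"))              # matches, no 'great'
print(count_greats("great great grand mother"))  # matches, two 'great'
print(count_greats("great father"))              # no match: 'grand ' is missing
print(count_greats("grandfather"))               # no match: the pattern expects a space
```

Note that the pattern requires "grand " whenever "great " appears, and that it expects the words to be separated by single spaces exactly as written.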

The regular expression \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b, applied with case-insensitive matching, is a pattern that finds most email addresses in an entire document when used in a specialized text processing tool such as PowerGREP ( http://www.powergrep.com/ ).
A minor alteration, replacing the first \b with ^ and the last \b with $, turns the same pattern into a check of whether a user has entered a properly formatted email address. Bear in mind that {2,4} limits the top-level domain to two to four letters, so addresses with longer domains are rejected.
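The two uses of the email pattern, searching a document and validating a single input, can be sketched as follows. This assumes Python's re module with the IGNORECASE flag rather than PowerGREP; the sample text, the addresses and the names FIND_EMAIL, VALID_EMAIL and is_valid are invented for the example.

```python
import re

# The core of the pattern from the text, without its boundary markers.
EMAIL_BODY = r"[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# \b ... \b finds addresses anywhere inside a larger text.
FIND_EMAIL = re.compile(r"\b" + EMAIL_BODY + r"\b", re.IGNORECASE)

# ^ ... $ demands that the whole input is one address.
VALID_EMAIL = re.compile(r"^" + EMAIL_BODY + r"$", re.IGNORECASE)


def is_valid(candidate: str) -> bool:
    """Return True when the candidate looks like a single email address."""
    return VALID_EMAIL.match(candidate) is not None


text = "Write to alice@example.com or bob.smith@mail.example.org today."
found = FIND_EMAIL.findall(text)
print(found)

print(is_valid("alice@example.com"))      # well formed
print(is_valid("alice@example"))          # no top-level domain
print(is_valid("see alice@example.com"))  # extra text before the address
print(is_valid("user@example.museum"))    # rejected: domain longer than 4 letters
```

The last call shows the limitation of the {2,4} quantifier: a perfectly plausible address is turned away only because its top-level domain has six letters.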

All the examples above follow a strict syntax, and that syntax varies among tools and application areas. In most formalisms, if any regex matches a particular set, then an infinite number of other expressions match it too. Most formalisms also share a few basic operations for constructing regular expressions. The following are typical examples.

Alternation

Most formalisms use a vertical bar (|) to separate alternatives. For instance, tusk|task is a regex that matches tusk and task; it is more commonly written in the shortened form t(u|a)sk.

Grouping

Parentheses are used to define the scope and precedence of the operators explicitly. For example, tusk|task and t(u|a)sk are different patterns, but both describe the set containing tusk and task.
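Why the parentheses matter becomes clear once they are left out. The following sketch assumes Python's re module; the helper full and the ungrouped pattern tu|ask are additions made for illustration and do not appear in the article.

```python
import re


def full(pattern: str, word: str) -> bool:
    """Return True when the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


# Both spellings from the text accept exactly tusk and task.
same_set = all(
    full("tusk|task", w) and full("t(u|a)sk", w) for w in ["tusk", "task"]
)

# Without parentheses the bar splits the whole pattern in two,
# so tu|ask means "tu" or "ask", not "tusk" or "task".
ungrouped = {w: full("tu|ask", w) for w in ["tusk", "task", "tu", "ask"]}

print(same_set)
print(ungrouped)
```

Alternation has the lowest precedence of the operators shown here, which is why the group (u|a) is needed to confine the choice to a single letter.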

Quantification

Most of the time, a regex is needed for strings of varying length. Thus arises the need for quantifiers.
A quantifier after a character or group indicates how many times that character or group is expected to occur in the searched text. The most common quantifiers are ?, * and +; two of them, ? and *, are coincidentally also familiar as wildcard characters. Their meanings are as follows:
The question mark ('?') indicates zero (0) or one (1) occurrence of the preceding character or group. For example, favou?r matches both favor and favour.
The asterisk ('*') indicates zero (0), one (1) or many occurrences of the preceding character or group. For example, ya*hoo matches "yhoo", "yahoo", "yaahoo", "yaaahoo", etc.
The plus sign ('+') indicates at least one (1) occurrence of the preceding character or group. For example, ya+hoo matches "yahoo", "yaahoo", "yaaahoo", etc., but does not match "yhoo".
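The three quantifiers can be compared side by side on the same words. This is a minimal sketch assuming Python's re module; the pattern ya?hoo is added here for symmetry and, like the variable names, is not taken from the article.

```python
import re

candidates = ["yhoo", "yahoo", "yaahoo", "yaaahoo"]
patterns = ["ya?hoo", "ya*hoo", "ya+hoo"]

# For each pattern, collect the candidates it matches as a whole.
matched = {
    p: [w for w in candidates if re.fullmatch(p, w)] for p in patterns
}

for pattern, words in matched.items():
    print(pattern, words)

# The optional 'u' from the text: both spellings are accepted.
spellings = [w for w in ["favor", "favour"] if re.fullmatch(r"favou?r", w)]
print(spellings)
```

Only ? and * accept "yhoo", since both allow zero occurrences, and only * and + accept the longer words with repeated 'a'.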

It is also worth noting that these constructions can be combined to form arbitrarily complex expressions, much as arithmetic expressions are constructed from numbers and the operators +, -, * and /.

Metacharacters in Regular Expressions


A 'metacharacter' is a character with a special meaning; metacharacters are the essence of regular expressions.

Ordinarily, any single letter or character matches itself, and a group of characters matches an identical group of characters in the searched text. So "sequel" matches "sequel" without a hitch! However, there are cases where a single character, or a special sequence of characters, does not match itself but instead stands for a whole range of possible matches in the text. Those cases are the work of 'metacharacters'.

Metacharacters can be made to behave like normal characters by escaping them, which makes the system take them literally. This is easily achieved by preceding the metacharacter with a backslash ('\').
For instance, ^ is a metacharacter that matches the beginning of the string. If you precede ^ with a backslash, the sequence \^ matches the literal character ^. Likewise, \\ matches the character '\'. To put it in a more practical example, suppose you are searching for the string ^tentative. The regular expression that reduces the metacharacter to its literal value is \^tentative.
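Escaping can be demonstrated with the ^tentative example. The sketch assumes Python's re module; the sample text, the Windows-style path and the variable names are made up for the illustration. The function re.escape is a standard helper that adds the backslashes for you.

```python
import re

text = "a ^tentative plan"

# Unescaped, ^ is an anchor: 'tentative' would have to start the string.
anchored = re.search(r"^tentative", text)

# Escaped, \^ stands for a literal caret followed by the word.
literal = re.search(r"\^tentative", text)

# re.escape builds the escaped pattern from arbitrary literal text.
escaped = re.escape("^tentative")

# A literal backslash is written \\ in the pattern.
backslash = re.search(r"\\", r"C:\temp")

print(anchored)           # None: the text does not start with 'tentative'
print(literal.group())    # the literal text that was found
print(escaped)            # the pattern with the caret escaped
print(backslash.start())  # position of the backslash in the path
```

Raw strings (the r prefix) keep Python itself from interpreting the backslashes, so the pattern reaches the regex engine exactly as typed.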

Several types of metacharacters make regular expressions possible. Let us explore them one by one.

Metacharacters - line separators


To begin with, some metacharacters may be grouped as 'line separators' based on their use; they are more commonly known as anchors. For example, the character '^' is a metacharacter that matches the beginning of the input string or text. The dollar sign ('$') is another metacharacter that matches only the end of the input string or text. By default, line separators embedded within the input text are not recognized by either metacharacter; many tools offer a multi-line mode in which ^ and $ also match at each line break.
The following table lists these metacharacters, together with the dot, which matches any single character within a line:

Metacharacter   Meaning
^               Start of line
$               End of line
\A              Start of text
\Z              End of text
.               Any character in line

Examples of their usage are given in the table below:

Pattern        Matches
^tentative     The string 'tentative' only if it is at the beginning of a line
tentative$     The string 'tentative' only if it is at the end of a line
^tentative$    The string 'tentative' only if it is the only string in the line
tentat.ve      Strings like 'tentative', 'tentatave', 'tentatuve', 'tentat1ve' and so on
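How these metacharacters treat embedded line breaks depends on the tool. The sketch below shows the behaviour of Python's re module, which is an assumption of this example, along with the two-line sample text and the variable names. In Python, ^ and $ match only at the ends of the whole text unless the MULTILINE flag is given, while \A and \Z never match at inner line breaks.

```python
import re

# Two lines of sample text separated by a line break.
text = "tentative first\nsecond tentative"

# Default mode: ^ and $ see only the start and end of the whole text.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# MULTILINE mode: ^ and $ also match at every line break.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z stay tied to the whole text even in MULTILINE mode.
start_of_text = re.findall(r"\A\w+", text, re.MULTILINE)
end_of_text = re.findall(r"\w+\Z", text, re.MULTILINE)

# The dot matches any single character except a line break (by default).
dot_probe = {
    w: re.fullmatch(r"tentat.ve", w) is not None
    for w in ["tentative", "tentat1ve", "tentat\nve", "tentatve"]
}

print(start_default, end_default)
print(start_multiline, end_multiline)
print(start_of_text, end_of_text)
print(dot_probe)
```

Other engines draw these lines slightly differently, so it is worth checking the documentation of the tool at hand before relying on the exact meaning of $ or \Z.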
