In this part of the PHP tutorial, we will talk about regular expressions in PHP.
Regular expressions are used for text searching and more advanced text manipulation. They are built into tools like grep and sed, text editors like vi and emacs, and programming languages like Tcl, Perl, and Python. PHP has built-in support for regular expressions too.
In PHP, there are two modules for regular expressions: the POSIX Regex and the PCRE. The POSIX Regex is deprecated (its functions were removed in PHP 7). In this chapter, we will use PCRE examples. PCRE stands for Perl Compatible Regular Expressions.
Two things are needed when we work with regular expressions: regex functions and the pattern.
A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. The pattern is placed inside two delimiters. These are usually //, ##, or @@ characters. They inform the regex function where the pattern starts and ends.
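The role of the delimiters can be illustrated with a minimal sketch. The subject strings and variable names here are made up for the illustration. It runs the same pattern with three delimiter pairs, then shows why a different delimiter can be handy when the pattern itself contains slashes.

```php
<?php
// The same regular expression written with three different delimiter pairs.
$subject = "I saw Jane";
$results = [
    preg_match('/Jane/', $subject),
    preg_match('#Jane#', $subject),
    preg_match('@Jane@', $subject),
];
print_r($results);

// A delimiter that also occurs inside the pattern must be escaped,
// so choosing another delimiter keeps a path pattern readable.
$withSlashes = preg_match('/^\/usr\/bin/', "/usr/bin/php");
$withHashes  = preg_match('#^/usr/bin#', "/usr/bin/php");
echo "$withSlashes $withHashes\n";
```

All five calls report a match. The choice of delimiter is a matter of readability only.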
Here is a partial list of metacharacters used in PCRE.
. | Matches any single character. |
* | Matches the preceding element zero or more times. |
[ ] | Bracket expression. Matches a character within the brackets. |
[^ ] | Matches a single character, that is not contained within the brackets. |
^ | Matches the starting position within the string. |
$ | Matches the ending position within the string. |
| | Alternation operator. |
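As a rough illustration of the table above, the following sketch pairs each metacharacter with one pattern and one subject that it should match. The patterns and subjects are invented for this example and are not part of the original tutorial.

```php
<?php
// Each entry pairs a pattern from the table with a subject it should match.
$examples = [
    '/J.ne/'      => 'Jane',      // .    any single character
    '/Ja*ne/'     => 'Jne',       // *    zero or more of the preceding 'a'
    '/[JK]ane/'   => 'Kane',      // [ ]  one of the listed characters
    '/[^JK]ane/'  => 'Lane',      // [^ ] a character that is not listed
    '/^Jane/'     => 'Jane Doe',  // ^    start of the string
    '/Doe$/'      => 'Jane Doe',  // $    end of the string
    '/Jane|Kate/' => 'Kate',      // |    either alternative
];

foreach ($examples as $pattern => $subject) {
    $result = preg_match($pattern, $subject);
    echo "$pattern on '$subject': $result\n";
}
```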
PCRE functions
We define some PCRE regex functions. They all have a preg prefix.
preg_split() - splits a string by regex pattern
preg_match() - performs a regex match
preg_replace() - search and replace string by regex pattern
preg_grep() - returns array entries that match the regex pattern
php > print_r(preg_split("@\s@", "Jane\tKate\nLucy Marion"));
Array
(
    [0] => Jane
    [1] => Kate
    [2] => Lucy
    [3] => Marion
)
We have four names divided by whitespace: a tab, a newline and a space. The \s is a character class which stands for whitespace characters. The preg_split() function returns the split strings.
php > echo preg_match("#[a-z]#", "s");
1
The preg_match() function checks whether the 's' character is in the character class [a-z]. The class stands for all characters from a to z. It returns 1 for success.
php > echo preg_replace("/Jane/","Beky","I saw Jane. Jane was beautiful.");
I saw Beky. Beky was beautiful.
The preg_replace() function replaces all occurrences of the word 'Jane' with the word 'Beky'.
php > print_r(preg_grep("#Jane#", array("Jane", "jane", "Joan", "JANE")));
Array
(
    [0] => Jane
)
The preg_grep() function returns an array of words that match the given pattern. In this example, only one word is returned in the array. This is because by default, the search is case sensitive.
php > print_r(preg_grep("#Jane#i", array("Jane", "jane", "Joan", "JANE")));
Array
(
    [0] => Jane
    [1] => jane
    [3] => JANE
)
In this example, we perform a case insensitive grep. We put the i modifier after the right delimiter. The returned array has three words.
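Note that the last result keeps the original keys 0, 1 and 3, because preg_grep() preserves the keys of the input array. The sketch below shows one way to get a consecutively indexed list with array_values(), and how the PREG_GREP_INVERT flag returns the entries that do not match. The variable names are chosen for this illustration only.

```php
<?php
$names = array("Jane", "jane", "Joan", "JANE");

// preg_grep() keeps the keys of the input array: 0, 1 and 3 here.
$matched = preg_grep('#Jane#i', $names);

// array_values() re-indexes the result from zero.
$reindexed = array_values($matched);
print_r($reindexed);

// PREG_GREP_INVERT returns the entries that do NOT match the pattern.
$others = preg_grep('#Jane#i', $names, PREG_GREP_INVERT);
print_r($others);
```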
The dot metacharacter
The . (dot) metacharacter stands for any single character in the text (by default, any character except a newline).
<?php
$words = array("Seven", "even", "Maven", "Amen", "Leven");
$pattern = "/.even/";
foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}
?>
In the $words array, we have five words.
$pattern = "/.even/";
Here we define the search pattern. The pattern is a string. The regular expression is placed within delimiters. The delimiters are not optional. They must be present. In our case, we use forward slashes / / as delimiters. Note that we can use different delimiters if we want. The dot character stands for any single character.
if (preg_match($pattern, $word)) {
    echo "$word matches the pattern\n";
} else {
    echo "$word does not match the pattern\n";
}
We test all five words to see whether they match the pattern.
$ php single.php
Seven matches the pattern
even does not match the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern
The Seven and Leven words match our search pattern.
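One detail about the dot is worth a small sketch: in PCRE the dot does not match a newline character unless the s modifier is added after the closing delimiter. The subject string below is a made-up example.

```php
<?php
$subject = "a\nb"; // 'a', a newline, 'b'

// Without modifiers, the dot refuses to match the newline.
$default = preg_match('/a.b/', $subject);

// With the s modifier, the dot matches every character, newline included.
$dotAll = preg_match('/a.b/s', $subject);

echo "default: $default, with s modifier: $dotAll\n";
```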
Anchors
Anchors match positions of characters inside a given text. In the next example, we check whether a string is located at the beginning of a sentence.
<?php
$sentence1 = "Everywhere I look I see Jane";
$sentence2 = "Jane is the best thing that happened to me";
if (preg_match("/^Jane/", $sentence1)) {
    echo "Jane is at the beginning of the \$sentence1\n";
} else {
    echo "Jane is not at the beginning of the \$sentence1\n";
}
if (preg_match("/^Jane/", $sentence2)) {
    echo "Jane is at the beginning of the \$sentence2\n";
} else {
    echo "Jane is not at the beginning of the \$sentence2\n";
}
?>
We have two sentences. The pattern is ^Jane. The pattern asks: is the 'Jane' string located at the beginning of the text?
$ php begin.php
Jane is not at the beginning of the $sentence1
Jane is at the beginning of the $sentence2
php > echo preg_match("#Jane$#", "I love Jane");
1
php > echo preg_match("#Jane$#", "Jane does not love me.");
0
The Jane$ pattern matches a string in which the word Jane is at the end.
Exact word match
In the following examples, we show how to look for exact word matches.
php > echo preg_match("/mother/", "mother");
1
php > echo preg_match("/mother/", "motherboard");
1
php > echo preg_match("/mother/", "motherland");
1
The mother pattern fits the words mother, motherboard and motherland. Say we want to look just for exact word matches. We will use the aforementioned anchor characters ^ and $.
php > echo preg_match("/^mother$/", "motherland");
0
php > echo preg_match("/^mother$/", "Who is your mother?");
0
php > echo preg_match("/^mother$/", "mother");
1
Using the anchor characters, we get an exact word match for a pattern. The whole string must consist of the word, which is why the sentence "Who is your mother?" does not match.
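The anchors compare the pattern against the whole string. If the goal is to find a whole word inside a longer sentence, one possible approach (not used in the examples above) is the \b word boundary assertion of PCRE. The following is a minimal sketch with invented sample sentences.

```php
<?php
$sentences = [
    "Who is your mother?",
    "The motherboard is broken.",
    "mother",
];

// ^...$ requires the entire string to be the word.
$wholeString = preg_grep('/^mother$/', $sentences);

// \b...\b requires word boundaries around the word, anywhere in the string.
$wholeWord = preg_grep('/\bmother\b/', $sentences);

print_r($wholeString);
print_r($wholeWord);
```

The first call keeps only the bare word, while the second one also accepts the question but still rejects motherboard.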
Quantifiers
A quantifier after a token or group specifies how often that preceding element is allowed to occur.
? - 0 or 1 match
* - 0 or more
+ - 1 or more
{n} - exactly n
{n,} - n or more
{0,n} - n or less (the short form {,n} is not recognized as a quantifier by older PCRE versions)
{n,m} - range n to m
The above is a list of common quantifiers.
The question mark (?) indicates that there is zero or one of the preceding element.
<?php
$words = array("jar", "jazz", "jay", "java", "jet");
$pattern = "/ja.?/";
foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}
?>
We have five words in the $words array.
$pattern = "/ja.?/";
This is the pattern. The .? combination means zero or one arbitrary single character. Because the pattern is not anchored, it matches any word that contains ja: jar, jazz, jay and java match, and only jet does not.
The * metacharacter matches the preceding element zero or more times.
<?php
$words = array("Seven", "even", "Maven", "Amen", "Leven");
$pattern = "/.*even/";
foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}
?>
In the above script, we have added the * metacharacter. The .* combination means zero, one or more arbitrary characters.
$ php zeroormore.php
Seven matches the pattern
even matches the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern
Now the pattern matches three words: Seven, even and Leven.
php > print_r(preg_grep("#o{2}#", array("gool", "root", "foot", "dog")));
Array
(
    [0] => gool
    [1] => root
    [2] => foot
)
The o{2} pattern matches strings that contain two consecutive 'o' characters.
php > print_r(preg_grep("#^\d{2,4}$#", array("1", "12", "123", "1234", "12345")));
Array
(
    [1] => 12
    [2] => 123
    [3] => 1234
)
We have this ^\d{2,4}$ pattern. The \d is a shorthand character class. It stands for digits. So the pattern matches numbers that have 2, 3 or 4 digits.
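The + quantifier and the "n or less" form from the list are not covered by the examples above. A small sketch, assuming a made-up word list, could look like this. It uses the explicit {0,2} form rather than {,2}.

```php
<?php
$words = ["gl", "gol", "gool", "goool"];

// + requires at least one 'o' between 'g' and 'l'.
$oneOrMore = preg_grep('/^go+l$/', $words);

// {0,2} allows zero, one or two 'o' characters.
$atMostTwo = preg_grep('/^go{0,2}l$/', $words);

print_r($oneOrMore);
print_r($atMostTwo);
```

The anchors matter here: without ^ and $, the quantifiers would only constrain a part of the string.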
Alternation
The next example explains the alternation operator (|). This operator makes it possible to create a regular expression with several choices.
<?php
$names = array("Jane", "Thomas", "Robert", "Lucy",
    "Beky", "John", "Peter", "Andy");
$pattern = "/Jane|Beky|Robert/";
foreach ($names as $name) {
    // test the current name against the alternatives
    if (preg_match($pattern, $name)) {
        echo "$name is my friend\n";
    } else {
        echo "$name is not my friend\n";
    }
}
?>
We have 8 names in the $names array.
$pattern = "/Jane|Beky|Robert/";
This is the search pattern. It says that Jane, Beky and Robert are my friends. If you find any of them, you have found my friend.
$ php friends.php
Jane is my friend
Thomas is not my friend
Robert is my friend
Lucy is not my friend
Beky is my friend
John is not my friend
Peter is not my friend
Andy is not my friend
Output of the script.
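When alternation is combined with anchors, the anchors bind to the individual alternatives unless the alternatives are grouped with parentheses. The following sketch, with an invented list of names, shows the difference.

```php
<?php
$names = ["Jane", "Janet", "Beky", "Dr. Beky"];

// Reads as: starts with 'Jane' OR ends with 'Beky'.
$loose = preg_grep('/^Jane|Beky$/', $names);

// Reads as: the whole string is 'Jane' or 'Beky'.
$strict = preg_grep('/^(Jane|Beky)$/', $names);

print_r($loose);
print_r($strict);
```

The loose pattern accepts all four strings. The grouped pattern accepts only Jane and Beky.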
Subpatterns
We can use parentheses () to create subpatterns inside patterns.
php > echo preg_match("/book(worm)?$/", "bookworm");
1
php > echo preg_match("/book(worm)?$/", "book");
1
php > echo preg_match("/book(worm)?$/", "worm");
0
We have the following regex pattern: book(worm)?$. The (worm) is a subpattern. The ? character follows the subpattern, which means that the subpattern might appear zero or one time in the final pattern. The $ character is here for the exact end match of the string. Without it, words like bookstore and bookmania would match too.
php > echo preg_match("/book(shelf|worm)?$/", "book");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookshelf");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookworm");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookstore");
0
Subpatterns are often used with alternation. The (shelf|worm) subpattern makes it possible to create several word combinations.
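Subpatterns also capture the text they matched. The preg_match() function accepts an optional third argument that receives the captured parts. The sketch below reuses the pattern from above; the variable names $withWorm and $plain are chosen for this illustration.

```php
<?php
$pattern = '/book(shelf|worm)?$/';

// Index 0 holds the whole match, index 1 the text of the first subpattern.
preg_match($pattern, 'bookworm', $withWorm);
print_r($withWorm);

// When the optional subpattern does not take part in the match,
// no entry is reported for it.
preg_match($pattern, 'book', $plain);
print_r($plain);
```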
Character classes
We can combine characters into character classes with the square brackets. A character class matches any character that is specified in the brackets.
<?php
$words = array("sit", "MIT", "fit", "fat", "lot");
$pattern = "/[fs]it/";
foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}
?>
We define a character set with two characters.
$pattern = "/[fs]it/";
This is our pattern. The [fs] is the character class. Note that we work only with one character at a time. We either consider f, or s. Not both.
$ php chclass.php
sit matches the pattern
MIT does not match the pattern
fit matches the pattern
fat does not match the pattern
lot does not match the pattern
This is the outcome of the script.
We can also use shorthand metacharacters for character classes. \w stands for alphanumeric characters and the underscore, \d for digits, and \s for whitespace characters.
<?php
$words = array("Prague", "111978", "terry2", "mitt##");
$pattern = "/\w{6}/";
foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}
?>
In the above script, we test for words consisting of alphanumeric characters. The \w{6} says that six consecutive alphanumeric characters match. Only the word mitt## does not match, because it contains non-alphanumeric characters.
php > echo preg_match("#[^a-z]{3}#", "ABC");
1
The #[^a-z]{3}# pattern stands for three characters that are not in the class a-z. The "ABC" characters match the condition.
php > print_r(preg_grep("#\d{2,4}#", array("32", "234", "2345", "3d3", "2")));
Array
(
    [0] => 32
    [1] => 234
    [2] => 2345
)
In the above example, we have a pattern that matches strings containing two to four consecutive digits.
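The \w shorthand also accepts the underscore, which an explicit class such as [a-zA-Z0-9] does not. A short sketch with made-up sample strings illustrates the difference.

```php
<?php
$samples = ["abc_12", "abc-12", "ABCDEF"];

// \w covers letters, digits and the underscore.
$wordOnly = preg_grep('/^\w{6}$/', $samples);

// An explicit class lists exactly the characters we allow.
$alnumOnly = preg_grep('/^[a-zA-Z0-9]{6}$/', $samples);

print_r($wordOnly);
print_r($alnumOnly);
```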
Email example
Next we have a practical example. We create a regex pattern for checking email addresses.
<?php
$emails = array("luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com");
# regular expression for emails
$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/";
foreach ($emails as $email) {
    if (preg_match($pattern, $email)) {
        echo "$email matches \n";
    } else {
        echo "$email does not match\n";
    }
}
?>
Note that this example provides only one solution. It does not have to be the best one.
$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/";
This is the pattern. The first ^ and the last $ characters are here to get an exact pattern match. No characters before and after the pattern are allowed. The email is divided into five parts.
The first part is the local part. This is usually a name of a company, individual or a nickname. The [a-zA-Z0-9._-]+ lists all possible characters we can use in the local part. They can be used one or more times.
The second part is the literal @ character.
The third part is the domain part. It is usually the domain name of the email provider, like yahoo or gmail. The [a-zA-Z0-9-]+ is a character set providing all characters that can be used in the domain name. The + quantifier allows one or more of these characters.
The fourth part is the dot character. It is preceded by the escape character (\.). This is because the dot character is a metacharacter and has a special meaning. By escaping it, we get a literal dot.
The final part is the top level domain. The pattern is as follows: [a-zA-Z.]{2,5}. With this pattern, top level domains can have from 2 to 5 characters, like sk, net, or info. Longer ones, such as travel, are rejected. There is also a dot character in the class. This is because some top level domains have two parts, for example co.uk.
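As a point of comparison, PHP also ships a built-in validator, filter_var() with the FILTER_VALIDATE_EMAIL filter. The sketch below is not part of the original example. It runs the pattern from above and the built-in filter side by side on the same addresses. The $regexResults and $filterResults arrays are names introduced for this illustration.

```php
<?php
$emails = ["luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com"];
$pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/';

$regexResults = [];
$filterResults = [];

foreach ($emails as $email) {
    // preg_match() returns 1 on a match, 0 otherwise.
    $regexResults[$email] = preg_match($pattern, $email) === 1;

    // filter_var() returns the address on success and false on failure.
    $filterResults[$email] = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;

    echo $email,
        " regex: ", $regexResults[$email] ? "yes" : "no",
        ", filter: ", $filterResults[$email] ? "yes" : "no", "\n";
}
```

The two checks do not have to agree on every input, since the built-in filter follows its own, more detailed rules.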
Recap
Finally, we provide a quick recap of the regex patterns.
Jane            the 'Jane' string
^Jane           'Jane' at the start of a string
Jane$           'Jane' at the end of a string
^Jane$          exact match of the string 'Jane'
[abc]           a, b, or c
[a-z]           any lowercase letter
[^A-Z]          any character that is not an uppercase letter
(Jane|Becky)    matches either 'Jane' or 'Becky'
[a-z]+          one or more lowercase letters
^[98]?$         digits 9, 8 or empty string
([wx])([yz])    wy, wz, xy, or xz
[0-9]           any digit
[^A-Za-z0-9]    any symbol (not a number or a letter)
In this chapter, we have covered regular expressions in PHP.