
A brief description of regular expressions: POSIX and PCRE

Part 1: Regular expressions


I will start with the fact that PHP supports two regular expression standards: POSIX and, since version 4, Perl-compatible. The first standard is also used by the Apache server in mod_rewrite and by MySQL in queries (look up the word "REGEXP" in the MySQL manual and you may understand it at once; I will come back to it later). The second, as its name makes clear, is used in Perl. The two standards do not differ greatly: the second has special symbols that replace the most frequently used character classes (for example, digits are \d, and letters and digits are \w) and special pattern modifiers that let you control case sensitivity, anchoring to the ends of lines, and so on (in the POSIX functions case sensitivity is handled simply: there are the functions ereg and ereg_replace, and there are eregi (insensitive) and eregi_replace). Otherwise the two standards are compatible, and the techniques for writing patterns are identical.


If you have worked with Norton/Volkov/Windows Commander or Far, you know such a thing as wildcards. For example, delete c:\windows\*.* deletes all files from the specified directory. :) File names need no special refinements, so the system is simple: the symbol * means any set of characters, including an empty one (*.txt), the symbol ? means any single character (document?.txt), and there are a few more notations for letters and digits (to tell the truth, I have not used them for a long time, so I will not try to recall them).


Regular expressions take a different approach. The system is first of all universal and should be able to match strings against the most complex queries (strange that I say "should be able": the system already "is able". I hope the reader will forgive me such phrases; they all refer to the already working system of regular expressions). Now I will name the terms I will use from here on, to avoid stretched (literally and figuratively) definitions.


So, the task of the system is, besides exactly specified characters ("Vasja(.*)Pupkin"), to let the user search for a given number of given characters. In the Vasja Pupkin example, any number of any characters is allowed between the words. If we need to find six digits, we write "[0-9]{6}" (or, for example, from six to eight digits: "[0-9]{6,8}"). What is all this for? Unlike the wildcards of an operating system, the character set specifier and the count specifier are separate here. Instead of a character set you can use the notation for any character, a dot. You can specify a concrete character set (ranges such as the "0-9" mentioned above are supported). And you can specify "anything except the given character set".


The count specifier is called a "quantifier" in the official PHP documentation. The term is convenient and leaves no room for misreading. A quantifier can have a concrete value, either a single fixed one ("{6}") or a numeric interval ("{6,8}"), or an abstract one: "any number, including 0" ("*"), "any natural number", from 1 to infinity ("+": "document[0-9]+\.txt"), or "either 0 or 1" ("?"). By default the quantifier for a given character set equals one ("document[0-9]\.txt").
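The quantifier forms listed above can be tried out with a small sketch. It uses preg_match rather than ereg, on the assumption that a current PHP version is in use (the ereg family is no longer shipped with PHP 7 and later); the helper name matches_fully is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical helper: true when the whole subject matches the pattern body.
// The body must not contain "/" because "/" is used as the delimiter here.
function matches_fully(string $body, string $subject): bool
{
    return preg_match('/^(?:' . $body . ')$/', $subject) === 1;
}

// Fixed count and numeric interval.
$six_digits     = matches_fully('[0-9]{6}', '123456');
$seven_in_range = matches_fully('[0-9]{6,8}', '1234567');
$nine_too_many  = matches_fully('[0-9]{6,8}', '123456789');

// "+" needs at least one digit, "*" accepts none, "?" accepts zero or one.
$plus_needs_one   = matches_fully('document[0-9]+\.txt', 'document.txt');
$star_allows_zero = matches_fully('document[0-9]*\.txt', 'document.txt');
$optional_digit   = matches_fully('document[0-9]?\.txt', 'document7.txt');

// Without a quantifier the class stands for exactly one character.
$default_is_one = matches_fully('document[0-9]\.txt', 'document42.txt');
```

Only $nine_too_many, $plus_needs_one and $default_is_one come out false; the other four are true.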


Of course, for more flexible searching, these "character set + quantifier" pairs can be combined into metastructures.


Like any flexible tool, regular expressions are flexible, but not absolutely so: their area of application is limited. For example, if you need to replace one fixed string in a text with another, also fixed, string, use str_replace. The PHP developers plead with you not to use the complex functions ereg_replace or preg_replace for this, because calling them starts the interpretation of the pattern string, and that seriously consumes system resources. Unfortunately, this is a favorite rake for beginning PHP programmers to step on (even I, as fate would have it, first saw ereg_replace in the manual and only later str_replace).
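A minimal sketch of that advice, assuming a current PHP version (so preg_replace stands in for the regex side); the sample string is made up for the illustration.

```php
<?php
$text = 'Price: 100 USD, discount: 10 USD';

// A fixed string replaced with another fixed string: no pattern is needed.
$plain = str_replace('USD', 'EUR', $text);

// The same result through the regex engine, which first has to compile
// and interpret the pattern before it can search.
$regex = preg_replace('/USD/', 'EUR', $text);

// A pattern is justified only when the text to find is not known exactly,
// for example "any run of digits".
$masked = preg_replace('/[0-9]+/', 'N', $text);
```

Both $plain and $regex hold the same string, so the simpler call is the one to prefer; $masked shows a case where str_replace would not be enough.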


Use regular expression functions only when you do not know exactly what string is "there". Some examples: the search code of this site, in which service characters and short words are cut out of the search string, along with superfluous spaces (more precisely, all runs of spaces are compressed: " +" is replaced with a single space). With these functions I also check the email of a user who leaves a comment. A lot of useful things can be done, but it is important to keep in mind that regular expressions are not omnipotent. For example, it is better not to do complex replacements with them in a large text. After all, a combination such as "(.*)" means, in programming terms, iterating over all characters of the text. And if the pattern is not anchored to the beginning or the end of the string, the program also "moves" the pattern through the text, so we get a double iteration, or more precisely a quadratic one. It is not hard to guess that one more "(.*)" means a cubic iteration, and so on. Raise, say, 5 kilobytes of text to the third power and you get 125,000,000,000 (in words: one hundred twenty-five billion) operations. Of course, strictly speaking there will not be that many operations, perhaps four to eight times fewer, but the order of magnitude is what matters.


So, the principles, merits and drawbacks have been described, and now it is time to move on to practice. The next two issues (possibly published back to back) will be devoted to the two regular expression standards, POSIX and PCRE. This part covered the basic principles and concepts of how regular expressions work.

Part 2: POSIX


We continue our conversation. The previous issue was the introduction, the theory. Today comes what is essentially the main part of the story: the POSIX standard. In the next issue I will describe the differences, or more precisely the extensions, of the Perl-compatible standard. So, everything in order.

Character set



.              dot               any character

[<symbols>]    square brackets   a character class ("any of")

[^<symbols>]                     a negated character class ("any except")

-              dash              a range inside a character class ("[0-9]" means digits)


There is nothing much to explain here, except the following: do not use a character class to denote a single character (instead of "[ ]+", plain " +" will do just fine). Do not put a dot in a character class expecting it to mean "any character": inside square brackets a dot is a literal dot, so "[.a-z]" matches only a dot or a lowercase letter, and "[^.]" matches anything except a dot.
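The table above can be checked with a few one-line matches. This is only a sketch: it uses preg_match on the assumption of a current PHP version, and the "^" and "$" around each pattern merely pin the match to the whole sample string.

```php
<?php
// A dot outside a class stands for any single character.
$any_char = preg_match('/^d.g$/', 'dog');

// A class matches exactly one character out of the listed ones.
$in_class     = preg_match('/^[abc]$/', 'b');
$not_in_class = preg_match('/^[abc]$/', 'x');

// A negated class matches one character that is NOT listed.
$negated_hit  = preg_match('/^[^abc]$/', 'x');
$negated_miss = preg_match('/^[^abc]$/', 'b');

// A dash inside a class denotes a range.
$digit_in_range  = preg_match('/^[0-9]$/', '7');
$letter_in_range = preg_match('/^[0-9]$/', 'z');
```

preg_match returns 1 on a match and 0 otherwise, so $not_in_class, $negated_miss and $letter_in_range are 0 and the rest are 1.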



Quantifier


As I already wrote, this is the specifier of the number of given characters. A quantifier can specify either a concrete value or limits. If the number of characters found falls within the limits of the quantifier, the fragment of the expression is considered to match the string being parsed. Syntax: {<count>} or {<minimum>,<maximum>}


If you need to specify only the minimum, with no maximum, simply put the comma and leave out the second number: "{5,}" ("at least 5"). The most frequently used quantifiers have special notations:



*    asterisk (multiplication sign)    {0,}

+    plus                              {1,}

?    question mark                     {0,1}


In practice these symbols are used more often than braces.

Anchors

^    anchors the pattern to the beginning of the string

$    anchors the pattern to the end of the string


These symbols must stand at the very beginning and the very end of the pattern, respectively. So that the interpreter correctly understands the $ symbol at the end, it is advisable to put a backslash before it: ereg("foo\$", $bar)
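A short sketch of what the anchors change, written with preg_match on the assumption of a current PHP version. The last line shows that in a double-quoted PHP string "\$" produces a plain "$" in the pattern.

```php
<?php
// Without anchors the pattern may match anywhere in the string.
$unanchored = preg_match('/foo/', 'a foo b');

// "^" demands that the match starts at the beginning of the string.
$start_miss = preg_match('/^foo/', 'a foo b');
$start_hit  = preg_match('/^foo/', 'foo bar');

// "$" demands that the match ends at the end of the string.
$end_hit = preg_match('/foo$/', 'bar foo');

// Both anchors together: the whole string must be "foo".
$whole_miss = preg_match('/^foo$/', 'foo bar');

// Escaping the dollar sign in double quotes yields the same pattern text.
$same_pattern = ("/foo\$/" === '/foo$/');
```

$start_miss and $whole_miss are 0, the other matches return 1, and $same_pattern is true.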



Structure


Now comes a complex description, one I do not like myself. This thing is needed for complex searches. For example, you need the text to contain either only lowercase letters, or only uppercase letters, or only digits. The character class "[a-zA-Z0-9]" will not do. So we write this:



if (ereg("[a-z]+|[A-Z]+|[0-9]+", $text)) ...


The vertical bar is the "or" sign of regular expressions (an "and" sign naturally does not exist: the regular expression itself is the "and"). Patterns separated by a vertical bar are called alternative branches in the official documentation (this implies branching, i.e. the presence of nested alternative branches). The program compares the string against all branches, going through them from left to right, until the first match (this is important to take into account if you have a complex expression with nested branches). Ordinary parentheses are used to separate the levels and branches of this tree of alternatives from the rest of the pattern. If you need to search for the same uppercase letters / lowercase letters / digits inside a tag container:



if (ereg("<tag>([a-z]+|[A-Z]+|[0-9]+)</tag>", $text)) ...
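The role of the parentheses is easiest to see once anchors are added. The sketch below assumes a current PHP version and therefore uses preg_match; the helper name is_single_kind is hypothetical. Without parentheses, "^" belongs only to the first branch and "$" only to the last one, so the parentheses are what makes the anchors apply to every alternative.

```php
<?php
// Hypothetical helper: true when $text consists only of lowercase letters,
// only of uppercase letters, or only of digits.
function is_single_kind(string $text): bool
{
    return preg_match('/^([a-z]+|[A-Z]+|[0-9]+)$/', $text) === 1;
}

// Without the parentheses "^" is part of the first branch only,
// so a string that merely starts with lowercase letters still matches.
$loose = preg_match('/^[a-z]+|[A-Z]+|[0-9]+$/', 'abc!!');

// The alternatives inside a tag container; "#" is the delimiter here,
// so the "/" of the closing tag needs no escaping.
$found = preg_match('#<tag>([a-z]+|[A-Z]+|[0-9]+)</tag>#', 'x <tag>ABC</tag> y', $m);
```

Here is_single_kind('hello'), is_single_kind('HELLO') and is_single_kind('2024') are true, while 'Hello' and 'abc123' are rejected; $m[1] holds the text captured by the parentheses.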


That, it seems, is all of the complex part. Now for something simpler. Parentheses are called, in scientific terms, a subpattern (a nested pattern). They are used not only for complex pattern variants but also for flexible replacement of text fragments or for capturing them into a variable. For example, for the printed version of a text we repeat the addresses of links as text in brackets:



$text = ereg_replace("<a href=([^>]+)>[^<]+</a>", "\\0 [\\1]", $text);


The first pair of parentheses is the first nested pattern, and it can be obtained "on output" through the notation "\\n", where n is the number of the subpattern (since the backslash in PHP and many other languages is used for special characters, one more backslash must be put before it so that the processor takes it literally). Number zero is the whole matched string. In my own printed version of an article I do not write the link straight into the text but make a list of links at the end, roughly like this:



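if (preg_match_all("#<a href=([^>]+)>([^<]+)</a>#", $text, $match)) {
  for ($a = 0; $a < sizeof($match[0]); $a++) {
    $b = $a + 1;
    // put the reference number right after the link
    $text = str_replace($match[0][$a], $match[0][$a] . " [$b]", $text);
    // turn the captured address into a numbered list entry
    $match[1][$a] = "$b) " . $match[1][$a];
  }
  $text .= "<br><h2>Links used in this issue:</h2>" . implode("<br>", $match[1]);
}


The function ereg (and eregi), if you give it a variable as the third parameter, writes only the substrings of the first match there, as a one-dimensional array. To collect every link, the code above therefore uses preg_match_all from the PCRE family (see Part 3), which writes all the substrings into a two-dimensional array.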


That is, in fact, all. What remains is only to learn to compose patterns. I will give a few examples.


* Rewriting addresses with the Apache server (as I have already noted, Apache works with the POSIX standard).

* Search in a database: an SQL query is built from the user's search query. If you leave out the search statistics (how many results were found in total, how many for each word), it turns out that only 6-7 lines of code are needed. Highlighting of words in the search results is described there as well. By the way, an important remark: before cutting short words out of the string, I replace the spaces between words with double spaces. Why? Because strings that match the pattern must not run into each other.


I will explain in more detail. If the pattern has no anchors, the system walks through the text from left to right and, when a match is found, stores it in some variables and then jumps to the next character after the matched fragment. We search with the pattern "a space, two non-spaces, a space", and the spaces are single. The program finds "space, short word, space", replaces it with one space, and then jumps to... the first letter of the next word. That is not a space, so even if the next word is also short, it will not fit the pattern. That is why the spaces must first be replaced with double ones.
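The effect can be reproduced with a small sketch. It assumes a current PHP version (preg_replace instead of ereg_replace), treats words of one or two characters as "short", and uses two hypothetical helper names; it is not the site's actual search code.

```php
<?php
// Hypothetical helper: removes short words WITHOUT doubling the spaces first.
// Two short words in a row share one space, so the second one survives.
function strip_short_words_naive(string $query): string
{
    $padded = ' ' . $query . ' ';
    $result = preg_replace('/ [^ ]{1,2} /', ' ', $padded);
    return trim(preg_replace('/ +/', ' ', $result));
}

// Hypothetical helper: doubles the spaces first, so every word gets
// its own leading and trailing space and matches never run into each other.
function strip_short_words(string $query): string
{
    $padded = ' ' . str_replace(' ', '  ', $query) . ' ';
    $result = preg_replace('/ [^ ]{1,2} /', ' ', $padded);
    return trim(preg_replace('/ +/', ' ', $result));
}
```

For the query "to be or not" the naive version leaves "be not", because "be" loses its leading space to the match for "to"; the version with doubled spaces leaves only "not".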

* How to store news in files without running a loop over dates:



$handle = opendir($newsdir);

      while (($file = readdir($handle)) !== false) {

        // news files are named ddmmyy.txt
        if (is_file($file) && ereg("^[0-9]{6}\.txt\$", $file))

          print("<p align=justify><b>" .

          ereg_replace("^([0-9]{2})([0-9]{2})([0-9]{2})\.txt\$", "\\1.\\2.20\\3", $file) .

          "</b>" . implode(" ", file($file)) . "</p>");

      }

      closedir($handle);


* Checking that an email address is written correctly:



if (!eregi("^[a-z0-9._-]+@[a-z0-9._-]+\.[a-z]{2,4}\$", $email))

        print("Bad email: \"$email\"");
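For readers on a current PHP version, the same kind of check might be sketched with preg_match and the i modifier in place of eregi. The helper name looks_like_email is hypothetical, the pattern uses the same character classes as the check above with the "@" separator written out, and it remains a deliberately simple filter rather than a full address validator.

```php
<?php
// Hypothetical helper: a simple shape check, local part + "@" + host + "." + 2-4 letters.
function looks_like_email(string $email): bool
{
    return preg_match('/^[a-z0-9._-]+@[a-z0-9._-]+\.[a-z]{2,4}$/i', $email) === 1;
}

if (!looks_like_email('vasja.example.com')) {
    // Reached for this sample value, because it has no "@".
    $message = 'Bad email: "vasja.example.com"';
}
```

Note that the {2,4} limit on the last part rejects longer top-level domains, which is a property of the pattern itself.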


That is all for now. In the next issue: the PCRE standard, or more precisely the additional features it provides.



Part 3: PCRE


And so, at last, the series of issues about regular expressions comes to an end. We will talk about Perl-compatible regular expressions (PCRE).


Their most important advantage over POSIX, as readers have already pointed out to me, is the possibility of "non-greedy" search. The question mark in PCRE also acts as a minimizer of a quantifier: .*? will find the minimal suitable string. Nothing special, you think? No, it is a very special thing. For example, what example did I give in the last issue about the printed version of a text?



$text = ereg_replace("<a +href=([^>]+)>[^<]+</a>", "\\0 [\\1]", $text);


That is, there must be no tags inside the link (for example, "<a href=...><b...></b></a>"). If we do it like this:



$text = ereg_replace("<a +href=([^>]+)>.*</a>", "\\0 [\\1]", $text);


then we get... right, the text between the beginning of the first link and the end of the last one. Non-greedy search removes all these problems.



$text = preg_replace("/<a\s+href=(.*?)>.*?<\/a>/", "\\0 [\\1]", $text);


The program will pick the minimal suitable string for every link, i.e. only up to the "</a>" tag. I cannot find the words to describe the value of this PCRE feature: it is huge. :) Let us go further.
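The difference between the default quantifier and the minimized one can be seen side by side in the following sketch. The sample HTML and the two helper names are made up for the illustration; "#" is used as the pattern delimiter so that the "/" in the closing tag needs no escaping.

```php
<?php
$html = '<a href=one.html>first</a> and <a href=two.html><b>second</b></a>';

// Hypothetical helper: ".*" takes as much as it can, so a single match
// runs from the first "<a" to the LAST "</a>".
function annotate_links_greedy(string $html): string
{
    return preg_replace('#<a\s+href=([^>]+)>.*</a>#', '\0 [\1]', $html);
}

// Hypothetical helper: ".*?" takes as little as it can, so every link
// is matched separately, up to its own "</a>".
function annotate_links_minimal(string $html): string
{
    return preg_replace('#<a\s+href=(.*?)>.*?</a>#', '\0 [\1]', $html);
}

$greedy  = annotate_links_greedy($html);
$minimal = annotate_links_minimal($html);
```

$greedy gets only one annotation, " [one.html]", appended after the last link, while $minimal gets " [one.html]" after the first link and " [two.html]" after the second, even though the second link contains a nested tag.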


Digits can now be denoted not as "[0-9]" but simply as "\d", and non-digits ("[^0-9]") as "\D". Very convenient. Other notations:



\w    [a-zA-Z0-9_]

      \W    [^a-zA-Z0-9_]

      \s    whitespace (space, tab, line break)

      \S    any character that is not whitespace


I recommend having a look at the issues about search: these symbols are used there.
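A quick way to get a feel for these shorthand classes is to replace whatever they match with a visible marker. The sample strings below are made up for the illustration.

```php
<?php
// "\d" is a digit, "\D" is anything that is not a digit.
$digits     = preg_replace('/\d/', '#', 'a1 b22');
$non_digits = preg_replace('/\D/', '#', 'a1 b22');

// "\w+" is a run of word characters (letters and digits here).
$words = preg_replace('/\w+/', 'W', 'foo1 bar!');

// "\s+" is a run of whitespace; "\S+" is a run of non-whitespace.
$squeezed   = preg_replace('/\s+/', ' ', "a \t\n b");
$word_count = preg_match_all('/\S+/', "a \t\n b");
```

The results are "a# b##", "#1##22", "W W!", "a b" and 2 respectively.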


The pattern string, as you have already noticed, begins and ends with slashes. What the first slash is needed for, I do not know. The last one is needed to separate the pattern from the modifiers. The modifiers that I have figured out are these:

i    case-insensitive search

m    multiline mode. By default PCRE looks for matches with the pattern only inside one line, and the symbols "^" and "$" match only the beginning and the end of the text. When this modifier is set, "^" and "$" match the beginning and the end of individual lines.

s    the symbol "." (dot) also matches a line break (by default it does not)

A    anchors the pattern to the beginning of the text

D    forces the symbol "$" to match only at the very end of the text. It is ignored if the m modifier is set.

U    inverts the "greediness" of every quantifier (normally a quantifier followed by "?" stops being "greedy"; with U it is the other way around).


Naturally, case matters in modifiers. The rest about them can be read in the PHP manual.
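The three most common modifiers can be compared on one two-line sample string; the string itself is made up for the illustration.

```php
<?php
$text = "First line\nsecond LINE";

// i: "line" also matches "LINE".
$without_i = preg_match('/line$/', $text);
$with_i    = preg_match('/line$/i', $text);

// m: "^" also matches right after a line break.
$without_m = preg_match('/^second/', $text);
$with_m    = preg_match('/^second/m', $text);

// s: the dot is also allowed to cross a line break.
$without_s = preg_match('/First.*LINE/', $text);
$with_s    = preg_match('/First.*LINE/s', $text);
```

Each "without" variant returns 0 and each "with" variant returns 1.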



Now about the PCRE functions.


The function preg_match, like ereg, finds only the first match. If you need to find all matches and process the results somehow (but not directly through preg_replace), you need to use preg_match_all. The parameters of this function are the same.
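A minimal sketch of the difference between the two functions, on a made-up sample string:

```php
<?php
$text = 'tel 123, fax 456, zip 789';

// preg_match stops at the first match; $first is a one-dimensional array.
$first_count = preg_match('/\d+/', $text, $first);

// preg_match_all collects every match; $all[0] lists the full matches.
$all_count = preg_match_all('/\d+/', $text, $all);
```

After these calls $first is ["123"] and $first_count is 1, while $all[0] is ["123", "456", "789"] and $all_count is 3.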


Among the useful ones I will note the function preg_quote, which inserts backslashes before all service characters (for example, parentheses, square brackets, etc.) so that they are taken literally. If you accept any input from the user and check it through PCRE, it is better to escape the service characters in the incoming variable first (who knows what he will write there; he is, by definition, a malicious hacker).
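A sketch of how user input might be placed into a pattern safely. The helper name count_literal is hypothetical; the second argument of preg_quote names the delimiter, so that a "/" in the input is escaped as well.

```php
<?php
// Hypothetical helper: counts how many times $needle occurs in $haystack,
// treating every character of $needle literally.
function count_literal(string $needle, string $haystack): int
{
    $pattern = '/' . preg_quote($needle, '/') . '/';
    return (int) preg_match_all($pattern, $haystack);
}

// Without quoting, the dot in the input acts as "any character".
$unquoted = preg_match_all('/a.c/', 'a.c abc a.c');
$quoted   = count_literal('a.c', 'a.c abc a.c');
```

$unquoted is 3 because "abc" also fits the pattern, while $quoted is 2; input such as "(1+1)" or "a/b" is likewise matched character for character.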


That is all I can say about regular expressions. What remains is only the art of combining strings and writing algorithms.


As I recall, in one of the previous issues I described a mailing-list script built on classes. I have now added storage of addresses in files and subscription confirmation to it. Of course, the various address checks, retrieval of the list of active subscribers and so on all work on PCRE. Unfortunately, there was no time for testing and polishing, so the mailer is still "raw".