Regular expression to check for content between tags

Tagged with: php, regular expression

Regular expressions are something I often struggle with. I can usually draw the state machine that represents the pattern, but translating that machine into an actual regular expression is sometimes hard.

Today I wanted to write a regular expression to check whether a certain tag appears between the head tags of an HTML document. The regular expression itself was not difficult, but because I forgot the s modifier, which lets the dot also match newlines, it took quite a while to make it work…

Anyway, here is the pattern:

$this->assertPattern('#<head>(.)*<link rel="xy" href="some_url" />(.)*</head>#s', $html);
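
To see why the s modifier matters here, the following is a minimal sketch using plain preg_match instead of the test framework's assertPattern. The sample document in $html is a made-up assumption; the only point is that its head section spans several lines.

```php
<?php
// Hypothetical sample document; the <head> section spans several lines.
$html = "<html>\n<head>\n<title>Test</title>\n"
      . "<link rel=\"xy\" href=\"some_url\" />\n"
      . "</head>\n<body></body>\n</html>";

// Same pattern body as above, without delimiters and modifiers.
$body = '<head>(.)*<link rel="xy" href="some_url" />(.)*</head>';

// Without the s modifier the dot stops at line breaks, so nothing matches.
$withoutS = preg_match('#' . $body . '#', $html);

// With the s modifier the dot also matches newlines.
$withS = preg_match('#' . $body . '#s', $html);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```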

7 comments baked

  • nate

    Whenever I’m doing any HTML-related regular expressions, I always use the /msi modifiers, just to make sure I’m covering all the bases.

  • BurntSushi

    Be careful with the “m” modifier: it changes the meaning of the “^” and “$” characters. Instead of matching only at the beginning and end of the string, they also match at the beginning and end of each line in the string, respectively.

    If you want some help with regular expressions, check out the Regex Coach. It’s a great little utility to test drive expressions:

    http://weitz.de/regex-coach/

    Great resource:
    http://www.regular-expressions.info/

    And of course, the best book ever written on the topic:
    http://regex.info/

    Not only does it teach regular expressions, but it teaches how to write them efficiently (it really makes a difference on large samples), plus it goes extremely in depth about how the engine works… It makes understanding them a great deal easier. (And fun, for me at least.)

  • BurntSushi

    I just did some fine tuning on your regex, and for example, this should be more efficient:

    #.*?.*?#s

    Basically, the general idea is that you rarely want syntax looking like “.*” all alone. With the s modifier, that first matches everything, all the way to the end of the string. Then the engine has to backtrack through the target string until the rest of the pattern fits, and the whole process can end up being pretty inefficient. Adding “?” after the “*” makes it “ungreedy” (it is greedy by default). Ungreedy means it matches as little as possible: it advances one character at a time and tries the rest of the pattern after each step, instead of gobbling up the entire string all at once. So the engine doesn’t have to backtrack from the end of the string. The “?” can be applied to “*”, “+”, and “?”… So yes, you could have something like “a??”… Meaning it would first try to _not_ match the “a”.

    Also, I got rid of your capturing parentheses. It takes extra time for the regex engine to save the data between the parentheses. If you need to group something without capturing what it matches, you can use “(?:match-me-but-don’t-save-me)”.

    Also, if the ungreedy modifier isn’t available (in some flavors it isn’t), or if you’re doing something more heavy duty, you might have to resort to lookarounds, which can still be more efficient than heavy backtracking because they test a position instead of consuming the actual characters:

    #(?:(?:#s

    Obviously, any speed increase would be more evident as your HTML grows in size.

    Also, PHP string functions (without regex) are almost always more efficient… So you might be able to use strpos or something here too.

  • BurntSushi

    Ack, sorry for the triple post here, but your filter stripped out the regex because of the HTML. Here’s a quick text file:

    http://www.burntsushi.net/stuff/regex_htmltag_help.txt

  • Brendon Kozlowski

    If you don’t need the regular expression to store the value enclosed in your parentheses, use the following:

    (?:.)*

    The ?: tells the regex engine to group without capturing, so the matched text is not stored.

  • cakebaker

    @all: Thanks for your comments!

    @nate: Yes, the i modifier is useful when you parse HTML from the “wild”. In the concrete use case where I use this regex, I control the layout, so this modifier is not needed.

    @BurntSushi: Thank you for the links, the book tip and your explanations!

    Here are your regexes, which were stripped out:

    First regex:
    
    #<head>.*?<link rel="xy" href="some_url" />.*?</head>#s
    
    Second regex with look-around:
    
    #<head>(?:<(?!link)|[^<]*)*<link rel="xy" href="some_url" />(?:<(?!/head)|[^<]*)*</head>#s
    

    @Brendon: Thanks for this tip!

  • Nils Hitze

    @BurntSushi
    I don’t go anywhere without my RegexCoach :)
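
The variants suggested in the comments can be tried side by side. What follows is a minimal sketch under some assumptions: the two sample documents are made up, and the helper headContains is a hypothetical name for one possible strpos-based check, not something from the original discussion. It runs the greedy, ungreedy and look-around patterns against a document with the link in the head, and the strpos helper against both documents.

```php
<?php
// Hypothetical sample documents, used only for illustration.
$link   = '<link rel="xy" href="some_url" />';
$inHead = "<html>\n<head>\n<title>Test</title>\n" . $link . "\n</head>\n<body></body>\n</html>";
$inBody = "<html>\n<head>\n<title>Test</title>\n</head>\n<body>" . $link . "</body>\n</html>";

// The three regex variants discussed in the comments.
$patterns = array(
    'greedy'      => '#<head>.*<link rel="xy" href="some_url" />.*</head>#s',
    'ungreedy'    => '#<head>.*?<link rel="xy" href="some_url" />.*?</head>#s',
    'look-around' => '#<head>(?:<(?!link)|[^<]*)*<link rel="xy" href="some_url" />(?:<(?!/head)|[^<]*)*</head>#s',
);

// Hypothetical helper: the same check with plain string functions.
// Returns true if $needle starts after <head> and before the next </head>.
function headContains($html, $needle)
{
    $start = strpos($html, '<head>');
    if ($start === false) {
        return false;
    }
    $end = strpos($html, '</head>', $start);
    if ($end === false) {
        return false;
    }
    $pos = strpos($html, $needle, $start);
    return $pos !== false && $pos < $end;
}

// Run every regex variant against the document with the link in the head.
$results = array();
foreach ($patterns as $name => $pattern) {
    $results[$name] = preg_match($pattern, $inHead);
}

foreach ($results as $name => $result) {
    echo $name . ': ' . $result . "\n"; // each variant prints 1
}

var_dump(headContains($inHead, $link)); // bool(true)
var_dump(headContains($inBody, $link)); // bool(false)
```

On input this small the speed differences are not measurable, so the sketch only shows that the approaches agree; which one is faster on a large document would have to be measured.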

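The remarks in the comments about the m modifier, non-capturing groups and the ungreedy “??” can also be checked with a few lines of PHP. This is a minimal sketch; the sample strings and variable names are assumptions chosen for illustration.

```php
<?php
// Hypothetical sample string with two lines.
$text = "first line\nsecond line";

// Without m, ^ matches only at the start of the whole string.
preg_match_all('/^\w+/', $text, $withoutM);
// With m, ^ also matches at the start of each line.
preg_match_all('/^\w+/m', $text, $withM);

// A capturing group stores its match as an extra entry in $matches ...
preg_match('#<b>(.*?)</b>#', '<b>bold</b>', $capturing);
// ... while a (?:...) group only groups and stores nothing extra.
preg_match('#<b>(?:.*?)</b>#', '<b>bold</b>', $nonCapturing);

// Greedy a? takes the "a" if it can; ungreedy a?? first tries to match nothing.
preg_match('/a?/', 'a', $greedy);
preg_match('/a??/', 'a', $ungreedy);

echo implode(', ', $withoutM[0]) . "\n"; // first
echo implode(', ', $withM[0]) . "\n";    // first, second
echo count($capturing) . "\n";           // 2
echo count($nonCapturing) . "\n";        // 1
var_dump($greedy[0]);                    // string(1) "a"
var_dump($ungreedy[0]);                  // string(0) ""
```
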

© daniel hofstetter. Licensed under a Creative Commons License