• Hello,

    First off, thanks to Kadence Themes for the great theme! Compared to my experience with themes back in the WordPress 2.0 days (the last time I seriously deployed WordPress with a theme other than the default), this is a really solid and feature-filled theme (even the free version!). I’ve been able to set up my website with a lot less effort and munging about with custom code than I ever have, and the structure plus the CSS box has allowed me to do all my little visual tweaks not already part of the theme options in a very clean manner.

    I’d like to report a few minor bugs involving one of the theme filters, which I discovered while debugging a more serious issue (said issue turned out to be PHP’s configuration on my server, not the Virtue theme).

    I’m using Virtue 2.4.7 (free version, although if the same code is used in the paid version this may also be an issue there). The issue is at lib/cleanup.php:44 (quoted below):

    function kadence_addlightboxrel($content) {
    		   global $post;
    		   $pattern ="/<a(.*?)href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\")(.*?)>/i";
    		   $replacement = '<a$1href=$2$3.$4$5 rel="lightbox" $6>';
    		   $content = preg_replace($pattern, $replacement, $content);
    		   return $content;
    	}

    The pattern has a few problems with it that can cause an incorrect match, and hence some subtle bugs. The issues are:

    * The dot before the file extensions is not escaped, so it’s acting like the regex “any single character” match. This incorrectly matches <a href="page/notapng">...</a>.

    * (.*?) is too wide a match. It’s a lazy operator, so given multiple possible matches, (.*?) will match the shortest possible string. However, if it has only one possibility, even if that possibility crosses through several tags, it will match it. A few examples where this regex will match too much (one for each of the (.*?) used):

    Find us at <address>123 Main Street</address> (<a href="img/map.png">see the map</a>) (I know this example is silly in this day and age when we have interactive maps =P This could be a longer paragraph though that happens to link an image later on! Note how (.*?) not only matches too much, but also matches on a tag that isn’t just <a> but <address> because no whitespace is first found. However, since you add rel="lightbox" AFTER the match, this still works despite it matching incorrectly… but had you simply added the rel before, it would have broken.)

    Blah blah <a href="/not/an/image">link</a> and then we put an image like <div class="icon"><img src="img/icon-pdf.png" /></div> (This will insert rel="lightbox" into the <img> tag, not the <a> tag)

    This potato is <a href="/an/image.png" title="This potato is > 3 metres">the largest naturally grown spud</a>. (Will match the > inside the title attribute—if I’m not mistaken, title contains CDATA and therefore > is allowed there. This is one case where regexes just really fail at parsing HTML.)

    * Single quotes are “safer” than double quotes (because PHP allows more escape sequences in double quotes than in single quotes, so you’re more likely to miss the fact that PHP “consumes” a backslash and it doesn’t get passed to the PCRE/preg_ engine)

    * The starting and ending quotes don’t necessarily match. It will match this invalid HTML: <a href="/an/image.png'>. Solution, use a backreference in the pattern.

    * Image files could have a query string afterwards: image.png?v=123 (either because it’s generated via PHP/ImageMagick, or to trick the browser into reloading it just like how WordPress deals with Virtue’s CSS files). In the extreme case, they could have more path parts: /make-image.php/image.png/2048 – in this case a script is generating it. I think the first case is an edge case that should be dealt with, the second case is probably too complex to reliably detect if users decide to use that kind of image-generation script.)

    * Maybe it would be a good idea to catch PCRE errors? I had accidentally set my backtrace limit to 10000 in php.ini (the issue I mentioned earlier with my server), which caused preg_replace() to fail silently on a big/link-heavy page. preg_last_error() should perhaps be checked and if non-zero, an error should be triggered in order to alert the administrator/webmaster to a potential configuration issue?

    My suggested fix to the pattern (while keeping the regex solution) is below. It’s not 100% (it fails my third example, for instance), and I haven’t tested extensively (though it seems to work); it’s not what I recommend for this particular situation, but is the best I can do with a relatively readable regex solution.

    $pattern = '/<a(\s[^>]*?)href=(\'|")([^\'">]*?)\.(bmp|gif|jpeg|jpg|png)(\?[^\'">]*?)?(\2)([^>]*?)>/i';

    (https://regex101.com/ is a nice utility for quickly validating regexes against test cases, if you don’t already have a test framework for these functions.)

    However, it’s generally not a good idea to use regex to parse HTML, unless it’s a VERY simple situation you can control precisely (for example, if you know you always have <div class="main"> and </div> EXACTLY). In this case, you can’t control it precisely because it’s user-input HTML. My third example above illustrates one case where regex fails for HTML, and in general the big problem with regex is dealing with nested tags (in this case it’s detecting one tag, so not at issue).

    My recommended solution and the one I’d probably use if this were my own code is as follows for the entire function. It uses a DOM parser, so no work trying to parse any possible user-input HTML links, and no bugs related to that (unless the bug is in libXML, which is unlikely given its maturity). Unfortunately, the DOM parser in PHP before 5.4 adds a DOCTYPE, html, head, body, etc. elements that we have to deal with (5.4 adds an option not to add these so we can work more easily with only an HTML snippet, but WordPress requires only 5.2.4).

    function kadence_addlightboxrel($content) {
      $dom = new DOMDocument();
      $contentEntities = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'); // fix for UTF-8 not being correctly handled by DOMDocument
      @$dom->loadHTML('<div>' . $contentEntities . '</div>'); // div wrapper to allow easy extracting from <html><body> etc. afterwards
      $x = new DOMXPath($dom);
    
      foreach($x->query('//a') as $node)
      {
          $href = $node->getAttribute('href');
        // If it seems to be an image, add a lightbox rel
        if(preg_match('/(bmp|gif|jpe?g|png)((\?|&).*)?$/i', $href) === 1) {
          $node->setAttribute('rel', 'lightbox');
        }
      }
    
      // Get the HTML - we need to extract it out of the <html><head><body> etc. skeleton that DOMDocument adds
      // (PHP 5.4+ has a clean way of working with a fragment of an HTML document without adding this stuff, but WordPress supports all the way back to 5.2.4)
      $dom->removeChild($dom->doctype);
      $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild); // extract our wrapper <div> into the root of the document
      $newContent = $dom->saveHTML();
      $start = strpos($newContent, '<div>') + 5; // find the outer <div> (guaranteed since we manually added it)
      $end = strrpos($newContent, '</div>'); // find the outer </div> (guaranteed since we manually added it)
      return substr($dom->saveHTML(), $start, $end-$start); // remove <div> ... </div> - we search to avoid issues like trailing whitespace
    }

    This is working on my site at the moment (having patched it into Virtue), but I have not done any rigorous testing. That said, since it relies on a standard libxml implementation, it should at the minimum not have any issues with parsing complex HTML, unlike the regex.

    Best regards,

    Laogeodritt

Viewing 4 replies - 1 through 4 (of 4 total)
  • Hey,
    Thanks for the long write up. I’m going to dig into this a bit. Up till now I haven’t had anyone tell me they hit an issue. I did a quick search and found this: https://www.deferredprocrastination.co.uk/blog/2011/add-rel-lightbox-to-wordpress-images/ might have to look at it’s option of only using images that are added with the wordpress class.

    I feel like www.remarpro.com review didn’t like the Dom option but it’s been a while. I know I started there and had to move off for some reason can’t remember if it was a theme review or if I just didn’t get it working right. Thanks for the code though I will test.

    Kadence Themes

    Thread Starter laogeodritt

    (@laogeodritt)

    Hi,

    Hope the length wasn’t too tedious! I figured since I had done the investigations/attempted patches, might as well document it in case it’s useful for you.

    Like I mentioned, the major issue we had was a php.ini problem (too low backtrack limit), that caused posts to disappear off the page. Maybe one improvement for the Virtue theme would be that whenever preg_match or preg_replace are called, check if errors occurred by calling preg_last_error() and log those issues? (They don’t get logged to the error log by default.)

    Otherwise for the regex itself, I guess a lot of the failure cases are edge cases, and half of them will work anyway despite technically failing to match correctly (depends on if the “wrong” matched tag is first or last, since rel gets added at the end). It’s not exactly a blocker, but not too improbable to hit a bug in the wild.

    As for the DOM option, I’m really surprised at how bad PHP’s implementation of the DOM parser is pre-5.4, and admittedly I don’t know how badly it can affect performance (a high-traffic site should be using caching though anyway).

    Just to note, also, that in the linked article, (.*?) will likely have the same problem as it did here. That capture group is quite a dangerous one for capturing unexpected text!

    Best regards,

    Laogeodritt

    For the linked article, can you think of text that would be caught by this? It would pass all your tests from above.

    I’ll think about a good way to post an error. Seems crazy I haven’t had anyone tell me they hit the low backtrack limit before.

    Kadence Themes

    Thread Starter laogeodritt

    (@laogeodritt)

    A couple variations on my test cases earlier could get caught up with the linked article’s regex. Since it’s more specific, it’s less likely to happen “in the wild”, but the same idea of “the tag doesn’t match, but (.*?) can match the closing of a tag and opening of a new one” applies.

    The code in the linked article adds the rel after the detected href/anchor file type. This position is important to determine what cases will not quite match properly but substitute properly, versus those that will actually fail.

    <a href="(.*?).(jpg|jpeg|png|gif|bmp|ico)"><img(.*?)class="(.*?)wp-image-(.*?)" \/><\/a>

    A few examples:

    <a href="/not/an/image">text</a> more text... later on, an <a href="/image.jpg"><img src="/wp-content/upload/blah.jpg" class="wp-image-111" /></a> – this one technically matches ALL of this text (which is not correct), but still works because the rel gets added to the end of the second <a> opening tag.

    <a href="/image.jpg"><img src="image.jpg" /></a> shouldn't be lightboxed (the img doesn't have a class) but is, and then later there's <a href="/image2.jpg"><img src="/image2.jpg" class="wp-image-111" /></a> one of these that should be lightboxed but isn't.

    <a href="/image.jpg"><img src="image.jpg" class="not-the-right-class" /></a> shouldn't be lightboxed (the img has the wrong class) but is, and then later there's <a href="/image2.jpg"><img src="/image2.jpg" class="wp-image-111" /></a> one of these that should be lightboxed but isn't.

    You can find another example that forces the last (.*?) to match too widely again. The “simple” (but again not 100%) solution to fix this kind of regex would be to use a character class instead of the dot: for example, if you’re inside quotation marks, ([^"]*?) would cover a much larger number of situations (since it can never match outside the ending quotation mark). If you’re inside an HTML tag, ([^>]*?) will never go beyond the tag’s end (but if an attribute value contains a >, which is valid IIRC for some attributes, this one will break).

    Also note that this regex does not match single quotes around an attribute value, which are valid in HTML5.

    I would honestly recommend not matching only “wp-image-*” classes, though, as power users may write their own markup or include images uploaded elsewhere via FTP, rather than necessarily use the Media Manager. (For example, I personally use the Media Manager for images included in blog posts, but I manually upload images that go into Pages so I can better organize them, while still wanting the lightbox plugin to affect it.)

    As for the backtrack limit, honestly it’s not very likely for most users, since the default is 100,000 (kind of low, but not going to be hit unless someone has a gigantic article), and in I think PHP 5.3+ it’s 1 or 10 million (probably not even a concern for any realistic article size). It’s good practice to catch errors if they do happen (as in my case), but the low backtrack limit was just an oversight on my part when I was editing the php.ini settings.

    For the error, trigger_error with a warning might be good enough for technically experienced WordPress/PHP users. However, if you can somehow report it on the page or in the admin panel (and take measures so that the error doesn’t break that page entirely), that’d probably be better for the majority of users.

    Best regards,

    Laogeodritt

Viewing 4 replies - 1 through 4 (of 4 total)
  • The topic ‘Regex bugs in kadence_addlightboxrel()’ is closed to new replies.