Regex bugs in kadence_addlightboxrel()
-
Hello,
First off, thanks to Kadence Themes for the great theme! Compared to my experience with themes back in the WordPress 2.0 days (the last time I seriously deployed WordPress with a theme other than the default), this is a really solid and feature-filled theme (even the free version!). I’ve been able to set up my website with a lot less effort and munging about with custom code than I ever have, and the structure plus the CSS box has allowed me to do all my little visual tweaks not already part of the theme options in a very clean manner.
I’d like to report a few minor bugs involving one of the theme filters, which I discovered while debugging a more serious issue (said issue turned out to be PHP’s configuration on my server, not the Virtue theme).
I’m using Virtue 2.4.7 (free version, although if the same code is used in the paid version this may also be an issue there). The issue is at lib/cleanup.php:44 (quoted below):
function kadence_addlightboxrel($content) { global $post; $pattern ="/<a(.*?)href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\")(.*?)>/i"; $replacement = '<a$1href=$2$3.$4$5 rel="lightbox" $6>'; $content = preg_replace($pattern, $replacement, $content); return $content; }
The pattern has a few problems with it that can cause an incorrect match, and hence some subtle bugs. The issues are:
* The dot before the file extensions is not escaped, so it’s acting like the regex “any single character” match. This incorrectly matches
<a href="page/notapng">...</a>
.*
(.*?)
is too wide a match. It’s a lazy operator, so given multiple possible matches,(.*?)
will match the shortest possible string. However, if it has only one possibility, even if that possibility crosses through several tags, it will match it. A few examples where this regex will match too much (one for each of the(.*?)
used):Find us at <address>123 Main Street</address> (<a href="img/map.png">see the map</a>)
(I know this example is silly in this day and age when we have interactive maps =P This could be a longer paragraph though that happens to link an image later on! Note how(.*?)
not only matches too much, but also matches on a tag that isn’t just<a>
but<address>
because no whitespace is first found. However, since you addrel="lightbox"
AFTER the match, this still works despite it matching incorrectly… but had you simply added the rel before, it would have broken.)Blah blah <a href="/not/an/image">link</a> and then we put an image like <div class="icon"><img src="img/icon-pdf.png" /></div>
(This will insertrel="lightbox"
into the<img>
tag, not the<a>
tag)This potato is <a href="/an/image.png" title="This potato is > 3 metres">the largest naturally grown spud</a>.
(Will match the>
inside thetitle
attribute—if I’m not mistaken,title
contains CDATA and therefore>
is allowed there. This is one case where regexes just really fail at parsing HTML.)* Single quotes are “safer” than double quotes (because PHP allows more escape sequences in double quotes than in single quotes, so you’re more likely to miss the fact that PHP “consumes” a backslash and it doesn’t get passed to the PCRE/
preg_
engine)* The starting and ending quotes don’t necessarily match. It will match this invalid HTML:
<a href="/an/image.png'>
. Solution, use a backreference in the pattern.* Image files could have a query string afterwards:
image.png?v=123
(either because it’s generated via PHP/ImageMagick, or to trick the browser into reloading it just like how WordPress deals with Virtue’s CSS files). In the extreme case, they could have more path parts:/make-image.php/image.png/2048
– in this case a script is generating it. I think the first case is an edge case that should be dealt with, the second case is probably too complex to reliably detect if users decide to use that kind of image-generation script.)* Maybe it would be a good idea to catch PCRE errors? I had accidentally set my backtrace limit to 10000 in
php.ini
(the issue I mentioned earlier with my server), which causedpreg_replace()
to fail silently on a big/link-heavy page.preg_last_error()
should perhaps be checked and if non-zero, an error should be triggered in order to alert the administrator/webmaster to a potential configuration issue?My suggested fix to the pattern (while keeping the regex solution) is below. It’s not 100% (it fails my third example, for instance), and I haven’t tested extensively (though it seems to work); it’s not what I recommend for this particular situation, but is the best I can do with a relatively readable regex solution.
$pattern = '/<a(\s[^>]*?)href=(\'|")([^\'">]*?)\.(bmp|gif|jpeg|jpg|png)(\?[^\'">]*?)?(\2)([^>]*?)>/i';
(https://regex101.com/ is a nice utility for quickly validating regexes against test cases, if you don’t already have a test framework for these functions.)
However, it’s generally not a good idea to use regex to parse HTML, unless it’s a VERY simple situation you can control precisely (for example, if you know you always have
<div class="main">
and</div>
EXACTLY). In this case, you can’t control it precisely because it’s user-input HTML. My third example above illustrates one case where regex fails for HTML, and in general the big problem with regex is dealing with nested tags (in this case it’s detecting one tag, so not at issue).My recommended solution and the one I’d probably use if this were my own code is as follows for the entire function. It uses a DOM parser, so no work trying to parse any possible user-input HTML links, and no bugs related to that (unless the bug is in libXML, which is unlikely given its maturity). Unfortunately, the DOM parser in PHP before 5.4 adds a DOCTYPE, html, head, body, etc. elements that we have to deal with (5.4 adds an option not to add these so we can work more easily with only an HTML snippet, but WordPress requires only 5.2.4).
function kadence_addlightboxrel($content) { $dom = new DOMDocument(); $contentEntities = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'); // fix for UTF-8 not being correctly handled by DOMDocument @$dom->loadHTML('<div>' . $contentEntities . '</div>'); // div wrapper to allow easy extracting from <html><body> etc. afterwards $x = new DOMXPath($dom); foreach($x->query('//a') as $node) { $href = $node->getAttribute('href'); // If it seems to be an image, add a lightbox rel if(preg_match('/(bmp|gif|jpe?g|png)((\?|&).*)?$/i', $href) === 1) { $node->setAttribute('rel', 'lightbox'); } } // Get the HTML - we need to extract it out of the <html><head><body> etc. skeleton that DOMDocument adds // (PHP 5.4+ has a clean way of working with a fragment of an HTML document without adding this stuff, but WordPress supports all the way back to 5.2.4) $dom->removeChild($dom->doctype); $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild); // extract our wrapper <div> into the root of the document $newContent = $dom->saveHTML(); $start = strpos($newContent, '<div>') + 5; // find the outer <div> (guaranteed since we manually added it) $end = strrpos($newContent, '</div>'); // find the outer </div> (guaranteed since we manually added it) return substr($dom->saveHTML(), $start, $end-$start); // remove <div> ... </div> - we search to avoid issues like trailing whitespace }
This is working on my site at the moment (having patched it into Virtue), but I have not done any rigorous testing. That said, since it relies on a standard libxml implementation, it should at the minimum not have any issues with parsing complex HTML, unlike the regex.
Best regards,
Laogeodritt
- The topic ‘Regex bugs in kadence_addlightboxrel()’ is closed to new replies.