Bug Report: Attachment Body Regex is Broken for Large Attachments
Hi. We noticed a small bug in the “extract_attachment_text_by_calling_solr_tika” function in “wpsolr-index-solr-client.php”. A regex there extracts the body from the Solr response, and it fails for large attachments: a sample PDF of about 1 MB fails, while PDFs of around 500 KB work fine. The line of code that fails is this:
$attachment_text_extracted_from_tika = preg_replace( '/^.*?\<body\>(.*?)\<\/body\>.*$/i', '\1', $response );
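The “preg_replace” function fails with PREG_BACKTRACK_LIMIT_ERROR (error code 2). You can confirm this by calling preg_last_error() right after the above line of code. There appears to be a string size threshold above which this error occurs, most likely PHP's pcre.backtrack_limit setting being exceeded. Our temporary solution is to remove the ? inside the parentheses, which makes the capture group greedy. This apparently helps with the backtrack limit. Note that the greedy version captures up to the last </body> rather than the first, which gives the same result as long as the response contains a single body element.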
$attachment_text_extracted_from_tika = preg_replace( '/^.*?\<body\>(.*)\<\/body\>.*$/i', '\1', $response );
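To make the failure visible instead of silent, the call can be wrapped in a check on the return value and preg_last_error(). This is only a minimal sketch: the function name extract_body_checked is hypothetical and not part of WPSolr, and the pattern is the original one from the plugin line quoted above.

```php
<?php
/**
 * Hypothetical helper (not part of WPSolr): runs the original regex and
 * raises an exception when PCRE reports an error instead of returning null.
 */
function extract_body_checked( string $response ): string {
    $text = preg_replace( '/^.*?\<body\>(.*?)\<\/body\>.*$/i', '\1', $response );

    // preg_replace() returns null on a PCRE error; the reason is only
    // available through preg_last_error().
    if ( null === $text ) {
        $code   = preg_last_error();
        $reason = ( PREG_BACKTRACK_LIMIT_ERROR === $code )
            ? 'backtrack limit exceeded'
            : 'PCRE error';
        throw new RuntimeException( "Body extraction failed: $reason (code $code)", $code );
    }

    return $text;
}
```

With a check like this, a large attachment that trips the limit would end up in the caller's existing exception handling rather than being indexed without its body.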
I’m notifying WPSolr of this problem so that they can fix it, in whatever way they determine best, in a future release. Since preg_replace doesn’t throw an exception on this error, the surrounding exception catch never sees it. It simply returns null and the code proceeds. As a result, the body of the attachment is not full-text indexed by Solr, even though the WP item itself does make it to the index.
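One possible direction for a fix, offered only as a sketch and not as what WPSolr does or should do, is to avoid the regex entirely and locate the body with plain string functions, which do no backtracking. The function name extract_body_without_regex is hypothetical, and the sketch assumes the response contains literal <body> and </body> tags without attributes, as the original pattern does.

```php
<?php
/**
 * Hypothetical helper (not part of WPSolr): extracts the text between the
 * first <body> tag and the first </body> tag after it, case-insensitively.
 * Returns null when either tag is missing.
 */
function extract_body_without_regex( string $response ): ?string {
    $open  = '<body>';
    $close = '</body>';

    // Find the opening tag, ignoring case like the /i flag in the regex.
    $start = stripos( $response, $open );
    if ( false === $start ) {
        return null;
    }
    $start += strlen( $open );

    // Find the first closing tag after the opening tag.
    $end = stripos( $response, $close, $start );
    if ( false === $end ) {
        return null;
    }

    return substr( $response, $start, $end - $start );
}
```

Because the cost is a linear scan, the size of the attachment should not matter here. A caller would still need to decide what to do when null comes back, for example logging it so the missing body is noticed.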