Indexing PDF attachments not working.
-
1. Created new post.
2. Added a PDF attachment to the post.
3. Manually kicked off update to solr index. (document count increased)
4. Tried to search within the PDF for words *known* to be in there.
5. PDF is not encrypted.
6. No results on search.
Does wpsolr not really full-text index the attachments, only the attachment metadata?
(wpsolr v. 1.8)
-
Hi,
I confirm that attachment content is indexed.
1) Did you select “attachments” in WPSOLR setup’s document types to index?
2) Did the document count increase by 2 (one for the post, one for the PDF)?
Yes,
the index increased by 2, and the check box was marked.
Indexing PDF (and other formats!) is a capability of Solr, so I am hoping that we can get this working.
Other workarounds (creating custom field metadata in WordPress and then smashing all of the full text into it) are not scalable, and plugins using this technique tend to be rejected by WordPress.
I can see the post, and click on the attachment. The attachment (PDF) contains actual text: copy, paste, and search all work within it.
Offline I could open up the site for you to take a look if necessary.
What is the size of your PDF?
File size is 5.3 MB.
php.ini is currently set to 20M.
WordPress currently seems to have an 8 MB limit for media.
What happens when you query your index by hand (in Solr admin, for instance)?
Can you see a document containing your PDF content?
No.
But I cannot find a corresponding error.
Should the PDFs actually be uploaded into solr /data to be indexed?
There was a path error in your replacement solrconfig.xml with respect to v.4.3:
<lib dir="../../../contrib/ . . .
now seems to require only ../../
(However, these libraries seem to matter only for a clustering environment, so this does not seem to be the cause of the current problem.)
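On the question of whether PDFs get uploaded into Solr's data directory: as far as I understand Solr's extraction handler (Solr Cell, normally mounted at /update/extract), the file bytes are streamed to Solr in the request, Tika pulls out the text, and only the resulting fields end up in the index. The sketch below, in Python rather than the plugin's PHP, builds such a request with extractOnly=true so that the extracted text would be returned instead of indexed. The core URL and the function name are assumptions for illustration, not anything taken from WPSOLR.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical core URL; replace with the address of the real Solr core.
SOLR_CORE = "http://localhost:8983/solr/collection1"


def build_extract_request(pdf_bytes, core_url=SOLR_CORE):
    """Build (but do not send) a POST to Solr's extraction handler.

    extractOnly=true asks Solr to return the extracted text
    without adding anything to the index.
    """
    params = urlencode({"extractOnly": "true", "wt": "json"})
    return Request(
        url=f"{core_url}/update/extract?{params}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


req = build_extract_request(b"%PDF-1.4 placeholder bytes")
print(req.get_method(), req.full_url)
```

Sending it is one call, `urllib.request.urlopen(req)`. If the handler or its jars are not available, Solr should answer with an HTTP error rather than extracted text. In the stock Solr 4.x example solrconfig.xml the extraction jars are also loaded through `<lib>` directives, so the relative paths mentioned above may be worth a second look.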
Can you find any attachment in your index?
Try /select?q=*%3A*&fq=type%3Aattachment&wt=json&indent=true
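One thing to watch when copying this query: the separators have to reach Solr as literal `&` characters. The response posted below echoes parameter names such as `amp;fq` and `amp;wt`, which suggests the request went out with HTML-escaped `&amp;`, so the `type:attachment` filter and the JSON format were not applied (the result is XML and includes posts and pages). A minimal sketch in Python that builds the URL safely; the base URL is a placeholder, not the poster's actual server.

```python
from urllib.parse import urlencode

# Hypothetical address; replace with the real Solr core's select handler.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def attachment_query_url(base=SOLR_SELECT):
    """Return a select URL that lists only documents of type 'attachment'."""
    params = [
        ("q", "*:*"),               # match every document
        ("fq", "type:attachment"),  # then keep attachments only
        ("wt", "json"),
        ("indent", "true"),
    ]
    # urlencode percent-encodes the values and joins them with a plain '&'.
    return base + "?" + urlencode(params)


print(attachment_query_url())
```

If the filter is applied, the echoed params in the response header should read `fq`, `wt`, and `indent`, without the `amp;` prefix.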
<response><lst name=”responseHeader”><int name=”status”>0</int><int name=”QTime”>1</int><lst name=”params”><str name=”amp;fq”>type:attachment</str><str name=”q”>*:*</str><str name=”amp;wt”>json</str><str name=”amp;indent”>true</str></lst></lst><result name=”response” numFound=”17″ start=”0″><doc><str name=”id”>1</str><str name=”PID”>1</str><str name=”title”>Hello world!</str><arr name=”spell”><str>Hello world!</str><str>test documenttest document Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!</str><str>Uncategorized</str></arr><arr name=”autocomplete”><str>Hello world!</str><str>test documenttest document Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!</str><str>Uncategorized</str></arr><str name=”content”>test documenttest document Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>post</str><str name=”displaydate”>2015-01-30 19:02:41</str><str name=”displaymodified”>2015-02-03 23:39:31</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?p=1</str><int name=”numcomments”>0</int><arr name=”categories”><str>Uncategorized</str></arr><long name=”_version_”>1492189512981282816</long></doc><doc><str name=”id”>2</str><str name=”PID”>2</str><str name=”title”>Sample Page</str><arr name=”spell”><str>Sample Page</str><str>This is an example page. It’s different from a blog post because it will stay in one place and will show up in your site navigation (in most themes). Most people start with an About page that introduces them to potential site visitors. It might say something like this:
Hi there! I’m a bike messenger by day, aspiring actor by night, and this is my blog. I live in Los Angeles, have a great dog named Jack, and I like piña coladas. (And gettin’ caught in the rain.)
…or something like this:
The XYZ Doohickey Company was founded in 1971, and has been providing quality doohickeys to the public ever since. Located in Gotham City, XYZ employs over 2,000 people and does all kinds of awesome things for the Gotham community.
As a new WordPress user, you should go to your dashboard to delete this page and create new pages for your content. Have fun!</str></arr><arr name=”autocomplete”><str>Sample Page</str><str>This is an example page. It’s different from a blog post because it will stay in one place and will show up in your site navigation (in most themes). Most people start with an About page that introduces them to potential site visitors. It might say something like this:
Hi there! I’m a bike messenger by day, aspiring actor by night, and this is my blog. I live in Los Angeles, have a great dog named Jack, and I like piña coladas. (And gettin’ caught in the rain.)
…or something like this:
The XYZ Doohickey Company was founded in 1971, and has been providing quality doohickeys to the public ever since. Located in Gotham City, XYZ employs over 2,000 people and does all kinds of awesome things for the Gotham community.
As a new WordPress user, you should go to your dashboard to delete this page and create new pages for your content. Have fun!</str></arr><str name=”content”>This is an example page. It’s different from a blog post because it will stay in one place and will show up in your site navigation (in most themes). Most people start with an About page that introduces them to potential site visitors. It might say something like this:
Hi there! I’m a bike messenger by day, aspiring actor by night, and this is my blog. I live in Los Angeles, have a great dog named Jack, and I like piña coladas. (And gettin’ caught in the rain.)
…or something like this:
The XYZ Doohickey Company was founded in 1971, and has been providing quality doohickeys to the public ever since. Located in Gotham City, XYZ employs over 2,000 people and does all kinds of awesome things for the Gotham community.
As a new WordPress user, you should go to your dashboard to delete this page and create new pages for your content. Have fun!</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>page</str><str name=”displaydate”>2015-01-30 19:02:41</str><str name=”displaymodified”>2015-01-30 19:02:41</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?page_id=2</str><int name=”numcomments”>0</int><long name=”_version_”>1492189512991768576</long></doc><doc><str name=”id”>4</str><str name=”PID”>4</str><str name=”title”>IP Spreadsheet Test</str><arr name=”spell”><str>IP Spreadsheet Test</str><str>5</str></arr><arr name=”autocomplete”><str>IP Spreadsheet Test</str><str>5</str></arr><str name=”content”>5</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>document</str><str name=”displaydate”>2015-02-02 16:06:18</str><str name=”displaymodified”>2015-02-02 17:55:22</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?post_type=document&p=4</str><int name=”numcomments”>0</int><long name=”_version_”>1492189513024274432</long></doc><doc><str name=”id”>5</str><str name=”PID”>5</str><str name=”title”>IP</str><arr name=”spell”><str>IP</str><str/></arr><arr name=”autocomplete”><str>IP</str><str/></arr><str name=”content”/><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>attachment</str><str name=”displaydate”>2015-02-02 16:06:02</str><str name=”displaymodified”>2015-02-02 16:06:02</str><str name=”permalink”/><int name=”numcomments”>0</int><long name=”_version_”>1492189513026371584</long></doc><doc><str name=”id”>10</str><str name=”PID”>10</str><str name=”title”>test search fulltext one</str><arr name=”spell”><str>test search fulltext one</str><str>11</str></arr><arr name=”autocomplete”><str>test search fulltext 
one</str><str>11</str></arr><str name=”content”>11</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>document</str><str name=”displaydate”>2015-02-02 19:46:11</str><str name=”displaymodified”>2015-02-02 19:46:11</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?post_type=document&p=10</str><int name=”numcomments”>0</int><long name=”_version_”>1492189513040003072</long></doc><doc><str name=”id”>11</str><str name=”PID”>11</str><str name=”title”>test document</str><arr name=”spell”><str>test document</str><str/></arr><arr name=”autocomplete”><str>test document</str><str/></arr><str name=”content”/><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>attachment</str><str name=”displaydate”>2015-02-02 19:45:52</str><str name=”displaymodified”>2015-02-02 19:45:52</str><str name=”permalink”/><int name=”numcomments”>0</int><long name=”_version_”>1492189513042100224</long></doc><doc><str name=”id”>13</str><str name=”PID”>13</str><str name=”title”>test fulltext search attachment</str><arr name=”spell”><str>test fulltext search attachment</str><str>14</str></arr><arr name=”autocomplete”><str>test fulltext search attachment</str><str>14</str></arr><str name=”content”>14</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>document</str><str name=”displaydate”>2015-02-02 21:57:17</str><str name=”displaymodified”>2015-02-02 21:57:17</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?post_type=document&p=13</str><int name=”numcomments”>0</int><long name=”_version_”>1492189513043148800</long></doc><doc><str name=”id”>14</str><str name=”PID”>14</str><str name=”title”>test document</str><arr name=”spell”><str>test document</str><str/></arr><arr name=”autocomplete”><str>test document</str><str/></arr><str name=”content”/><str 
name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>attachment</str><str name=”displaydate”>2015-02-02 21:57:09</str><str name=”displaymodified”>2015-02-02 21:57:09</str><str name=”permalink”/><int name=”numcomments”>0</int><long name=”_version_”>1492189513055731712</long></doc><doc><str name=”id”>16</str><str name=”PID”>16</str><str name=”title”>test document</str><arr name=”spell”><str>test document</str><str/></arr><arr name=”autocomplete”><str>test document</str><str/></arr><str name=”content”/><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>attachment</str><str name=”displaydate”>2015-02-02 21:58:54</str><str name=”displaymodified”>2015-02-02 21:58:54</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?attachment_id=16</str><int name=”numcomments”>0</int><long name=”_version_”>1492189513065168896</long></doc><doc><str name=”id”>21</str><str name=”PID”>21</str><str name=”title”>Search Results</str><arr name=”spell”><str>Search Results</str><str>[solr_search_shortcode]</str></arr><arr name=”autocomplete”><str>Search Results</str><str>[solr_search_shortcode]</str></arr><str name=”content”>[solr_search_shortcode]</str><str name=”author”>admin</str><str name=”author_s”>https://intranet.mohavecountylibrary.us/?author=1</str><str name=”type”>page</str><str name=”displaydate”>2015-02-03 16:52:26</str><str name=”displaymodified”>2015-02-03 16:52:26</str><str name=”permalink”>https://intranet.mohavecountylibrary.us/?page_id=21</str><int name=”numcomments”>0</int><long name=”_version_”>1492189513067266048</long></doc></result></response>
Sorry for the long brick. I can edit it down after you look.
As far as I can see in your results, all attachments have an empty content body.
It could mean the php code can’t fetch the attachment files from disk.
A security issue, again?
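The same check can be done mechanically instead of by eye. A small sketch in Python, assuming the Solr response is available as well-formed XML (the copy pasted above has curly quotes from the forum software, so it would need straight quotes first). The function name and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def empty_attachments(solr_xml):
    """Return the ids of docs whose type is 'attachment' and whose content is empty."""
    root = ET.fromstring(solr_xml)
    ids = []
    for doc in root.iter("doc"):
        # Collect the single-valued <str name="..."> fields of this document.
        fields = {el.get("name"): (el.text or "") for el in doc if el.tag == "str"}
        if fields.get("type") == "attachment" and not fields.get("content", "").strip():
            ids.append(fields.get("id"))
    return ids


# Made-up sample with the same field layout as the response in this thread.
SAMPLE = """<response><result name="response" numFound="2" start="0">
<doc><str name="id">4</str><str name="content">5</str><str name="type">document</str></doc>
<doc><str name="id">5</str><str name="content"/><str name="type">attachment</str></doc>
</result></response>"""

print(empty_attachments(SAMPLE))
```

Any id this returns is an attachment that reached the index without extracted text.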
I will completely disable SELinux and see if this is the cause.
Alternatively, it may be file permissions on the attachment folders under WordPress, but ownership appears to be correct.
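One way to test the file-permission theory is to walk the uploads folder as the web server user and list anything that user cannot read. A minimal sketch in Python; the function name is made up, and os.access reports on whichever user runs the script, so running it as root would prove nothing.

```python
import os
import tempfile


def unreadable_files(root):
    """Walk root and return the paths the current user cannot read."""
    blocked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                blocked.append(path)
    return blocked


# Demo on a throwaway directory; point it at the real uploads folder instead.
with tempfile.TemporaryDirectory() as demo:
    open(os.path.join(demo, "sample.pdf"), "wb").close()
    print(unreadable_files(demo))
```

On a default WordPress install the folder to check would usually be wp-content/uploads under the site root (an assumption here, since the thread does not give the path). An empty list means plain file permissions are probably not what blocks the plugin.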
1. Disabled SELinux.
2. Relaxed permissions on WordPress directories.
Problem not fixed.
Web server error logs report an error on line 153 of a PHP script: “cannot send session cache limiter”.
Error on which PHP script?
Can you add the following line to the catch block of function get_attachment_body() in file class-wp-solr.php:
throw new Exception('Attachment error on file ' . get_the_guid( $post->ID ) . ": <br/>" . $e->getMessage());
It should show you the error nicely.
This broke the attempt to rebuild the Solr index entirely.
PHP Fatal error: Call to a member function getMessage() on a non-object in /var/www/html/wp-content/plugins/wpsolr-search-engine/class-wp-solr.php on line 846, referer: https://intranet.mohavecountylibrary.us/wp-admin/admin.php?page=solr_settings&tab=solr_operations
Can you verify the syntax?
throw new Exception('Attachment error on file ' . get_the_guid( $post->ID ) . ": <br/>" . $e->getMessage());
My bad.
After moving the trap to the correct block, I get the following error:
Error:
Attachment error on file https://intranet.mohavecountylibrary.us/?post_type=document&p=4:
Extract query file path/url invalid or not available