Post missing, search results’ order
-
Hi,
After updating to v1.65.225, I’ve noticed some things changed in search results after having to re-index our whole database.
The most worrying part is that some posts aren’t showing up in results at all. Here’s one example: https://macmagazine.com.br/post/2023/12/06/apple-tv-recebe-5-indicacoes-aos-spirit-awards-com-shrinking-the-changeling-e-mais/
Here’s the test for its ID: https://d.pr/i/nLWK9U
And a test search for “Spirit Awards”: https://d.pr/i/1DQofL
Besides that, I’ve also noticed that the search results’ order has changed a bit from the previous to the new version. Did it change the results’ weights or anything else I can change to make it more or less how it was before?
Thank you!
-
Hi @rfischmann
Thank you for this report.
In fact, we changed a lot in the algorithm in the new version 1.65, so despite deep testing, shortcomings are not excluded.
I would advise you to start by opening the publication with ID 984018 for editing via Edit Post. And just save it. This forces the update of the publication index. After that, try the search again (in Sandbox you can).
After that, write to me to see if the publication has appeared in the index. This step is for verification purposes, of course, I don’t want you to search for all non-indexed publications and manually index them. I need this to understand what’s going on.
I would also ask you to indicate your PHP version, the amount of script memory from php.ini and the current allowed time_limit. You can view this information in Tools/Website Health.
Thank you!
Hi, @epsiloncool!
I’ve edited post ID 984018 and yes, now it shows up on top: https://d.pr/i/20lJe0
I have no idea at all how many other posts would be affected, of course.
We’re running under PHP 8.2.18, 512MB memory limit, 300 seconds time limit.
Cheers!
Every Sunday, I save a weekly backup of our MySQL database and it’s always growing up by a few MBs. This time, it fell from 928MB to 679MB, and I’m sure it’s related to WP Fast Total Search’s latest update.
Its index database is now either very optimized or smaller, or it really didn’t recreate a full index of our posts.
Hi @rfischmann
Actually, the new indexing rule removed “revisions” from index. Yes, it makes the index smaller. You can check how much “revision” records were removed from the index on the “Indexing Engine Settings / Indexing Rules”, there is a message like “We also found?N?records that are not subject to any rule (they will have empty data in the index).” You can compare how much records were removed from the index and compare with backup size change.
According to missed posts I would like to ask you to find some more posts that are not searchable (but have to be searchable). If you find them, I would ask you to check some database records then (or ask you to make an Anydesk session for me so I can check some things). It’s really important, because your case can be either single case or repeatable problem.
Yesterday I made some more checks for indexing and still was unable to repeat your problem.
Thanks.
This, @epsiloncool? https://d.pr/i/FRfv0f
It does make total sense not to include revisions from the index, good idea!
That post that wasn’t correctly indexed was identified by us simply because we were looking for it specifically. I haven’t noticed other examples up until now, but I’ll run some tests and see if I can identify any others.
As for the “new” search results’ order, look at this example query: https://d.pr/i/OLWydu
Before the update, it would show “Os 5 artigos mais lidos no MacMagazine: de 12 a 19 de maio” first, then “Os 5 artigos mais lidos no MacMagazine: de 5 a 12 de maio” second, which both have the exact query in their title AND are the two latest/most recent posts. The first result there, “Retrospectiva 2009: os dez posts mais acessados no MacMagazine durante o ano”, doesn’t even have “mais lidos” in its title — which seems very weird to me.
Another example that I don’t understand the search results’ order at all. Here is our podcast’s tag: https://macmagazine.com.br/categoria/podcast/
If you run a search for “MacMagazine no Ar”, it will bring results in what seems to be a total random order. I’d expect at least the latest podcasts to be on top of the results.
It’s really hard to find a specific post that might have not been indexed out of nowhere.
One idea: where can I see how many posts has WP Fast Total Search indexed? Then, I can compare that number to the number of published posts we have.
The total posts (wp_posts records) processed and indexed is displayed here
But to say which posts has any index data and which are not, we need to make some DB queries I think.
Let me compose one and I will send you so we can make fast check
That number (307815 right now, in our case) doesn’t make much sense to me. We have 82,677 published posts…
Hi @rfischmann
Actually, WP and plugins store almost everything in the wp_posts table. Not only pages and posts but a lot of different records too. For example menu items, service information, custom post types etc etc
WPFTS tries to index all that info, fortunately most of those additional records are short and does not give any sensible index size increment.
Okay
Well, could we try to make a fast check of your database? I have constructed the MySQL query for this, and would ask you to run it in your phpMyAdmin to detect those posts that were indexed wrongly or not indexed.
select * from (select p.ID post_id, wi.id windex_id, p.post_title, p.post_type, p.post_status, wlog.
status
, wd_t.n wdt_n, dd_t.n_total wdt_total, dd_t.n_nw wdt_nw, length(p.post_title) title_len, (length(p.post_title) / wd_t.n) av_t, wd_c.n wdc_n, dd_c.n_total wdc_total, dd_c.n_nw wdc_nw, length(p.post_content) content_len, (length(p.post_content) / wd_c.n) av_c, (if(isnull(wi.id),128,0) + if(wlog.status
!= 3,64,0) + if((dd_t.n_nw != dd_t.n_total) or (dd_t.n_total != wd_t.n) or (dd_c.n_nw != dd_c.n_total) or (dd_c.n_total != wd_c.n), 32, 0) + if(((length(p.post_title) > 0) and (wd_t.n = 0)) or ((length(p.post_content) > 0) and (wd_c.n = 0)), 16, 0) + if(((isnull(wd_t.n) or (wd_t.n < 1)) and (isnull(wd_c.n) or (wd_c.n < 1)) and (p.post_type in ("post", "page"))), 8, 0)) warn_level from wp_posts p left join wpftsi_index wi on p.ID = wi.tid and wi.tsrc = "wp_posts" left join wpftsi_docs wd_t on wd_t.token = "post_title" and wd_t.index_id = wi.id left join wpftsi_docs wd_c on wd_c.token = "post_content" and wd_c.index_id = wi.id left join wpftsi_ilog wlog on wlog.index_id = wi.id left join ( select v.did, sum(if(isnull(w.word),0,1)) n_nw, count(*) n_total from wpftsi_vectors v left join wpftsi_words w on w.id = v.wid group by v.did ) dd_t on dd_t.did = wd_t.id left join ( select v.did, sum(if(isnull(w.word),0,1)) n_nw, count(*) n_total from wpftsi_vectors v left join wpftsi_words w on w.id = v.wid group by v.did ) dd_c on dd_c.did = wd_c.id ) tt where tt.warn_level > 0 order by tt.warn_level desc limit 1000This MySQL looks scary, but it does simple things: it finds some key data values in WPFTS index and in case something suspicious, it will show the post record in the result with extended data.
The main column to check is “warn_level”. The value there means how worse the data we found.
128 means “post does not have any index record at all”
64 means “indexing was broken because of internal error, status not equal to 3”
32 means “calculated number of words in the document does not equal to actually stored number in index”
16 means “post title or post content has some text data, but number of indexed words is 0”
We mainly interested for records with warn_level greater than 16. Sometimes (often) wp_posts records may have warn_level = 16 because title has only special characters (those are not indexed normally) or content have only images (those not indexed too).
If you found some posts with warn_level = 16, you can check them by post_id in WP Admin to see whether they have indexable text info or not.
Ideally if you can send me CSV or screenshot of this MySQL query result.
P.S. You may need to change “wp_posts” value in the query to actual name of your wp_posts DB table (sometimes wp_ prefix may be different).
Thanks.
Understood about the wp_posts table, @epsiloncool, thanks!
The MySQL query took almost 2 minutes to run and it seems to have generated up to the 1,000 results’ limit. I think there are lots of “oembed_cache” lines that are useless, I’m not sure.
Here’s the CSV file: https://d.pr/f/oNV1lK
Great, I see nothing too bad, but we can check what happen with some posts.
For example in the row 9 we can see post_id = 56355. According to CSV data, this post has big content size (4301 characters), but NO indexed words.
Could you open any Post in Edit Post mode, please, and then replace post_id in the URL to 56355 so you can see Edit Post page for this post. (“Criado para explorar o coprocessador M7 de iPhones 5s….”)
What can you say about it’s post content? Why it’s not indexed? In case it contains only images this it OK. But in case it has some text, it’s a problem.
Let me know, thanks!
Got it.
That’s really weird, @epsiloncool. That post_id is actually an image, it seems? https://d.pr/i/7NuAv1
The “Criado para explorar o coprocessador M7 de iPhones 5s, Nike+ Move já está na App Store” post’s ID is actually 419985, as far as I checked directly in its page: view-source:https://macmagazine.com.br/post/2013/11/10/criado-para-explorar-o-coprocessador-m7-de-iphones-5s-nike-move-ja-esta-na-app-store/
I’m a bit confused.
Yes, sorry. I said you windex_id, not post_id.
Actual post_id was 419985.
So as I can see on the post page it has a lot of texts, but no words in the index. That’s bad.
Do you see post_content in Sandbox tester for ID = 419985 ?
No content at all: https://d.pr/i/ZKzSQG
I opened that post in the editor and what I’ve noticed is that, as it’s from 2013, it still uses WordPress’ old/classic editor: https://d.pr/i/RWUjLJ
I made a small edit, first without converting to blocks, and that seems to be enough to index it: https://d.pr/i/QMLz6Y
I then converted it to blocks, didn’t seem to make much difference to its indexation: https://d.pr/i/sGoxMC
- The topic ‘Post missing, search results’ order’ is closed to new replies.