Poor relevance (possibly mysql-8)
-
We’ve been using YARPP for some years & it has worked well. We recently migrated to a new server & since then YARPP has been generating very poor matches, for instance we have lots of blogs about bees & this recent blog post doesn’t find any of them – previously we would have expected this to work.
One obvious change is that we’ve moved from MySQL-5.7 to MySQL-8, and the table collation has changed from
utf8 (general_ci)
toutf8mb4 (utf8mb3_general_ci)
, so my suspicion is that the FULLTEXT match is no longer working correctly – but I’m not sure how to start debugging this. Do you have any pointers?We’ve also upgraded from php-7.2 to php-8.1, but I feel less suspicious about that change in terms of this plugin.
- This topic was modified 7 months, 2 weeks ago by tomward.
The page I need help with: [log in to see the link]
-
Trying to fuzz my current install to understand more about the issue. Have switched the columns to
utf8mb3_general_ci
collation which changed the results slightly but not meaningfully.Now looking at the yarpp options – I have changed the settings to only “consider with extra weight” for the title & “do not consider” for everything else. I can see that the results re-generate when I change this setting. Looking at a different post which is titled “Butterfly updates”, the related posts I get are “Bindweeds”, “Woodcock Wood”, “Tree planting”, etc – but if I use the WordPress search box I get lots of results with post titles including “The Big Butterfly Count”, “Butterfly management & conservation”, “Big Butterfly Count 2021”, “Big Butterfly count – 2020”, etc. So something is clearly going wrong with the matching. I guess I’m going to try and isolate a particular SQL query that’s producing these poor matches next.
With yarpp still set to only consider titles, I pulled this query out by using a query monitor:
insert into blog_yarpp_related_cache (reference_ID,ID,score)
SELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_title) AGAINST ('butterfly updates')) * 3,4) AS score
FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post','page','attachment')
AND blog_posts.ID NOT IN (40775)
GROUP BY ID
HAVING score >= 2.000000
AND ID != 0
AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0
ORDER BY score DESC
LIMIT 5 on duplicate key update date = now()I then noticed that this subquery was giving me different results on mysql8 compared to mysql5.7
SELECT ROUND(0 + (MATCH (post_title) AGAINST ('butterfly updates')) * 3,4) AS score
FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post','page','attachment')
AND blog_posts.ID NOT IN (40775)
GROUP BY ID
HAVING score >= 2.000000
AND ID != 0
AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0
ORDER BY score DESCin particular, the
AND blog_posts.ID NOT IN (40775)
appeared to be behaving weirdly – with the clause present, the blog with id 40775 was being excluded, but also some other blogs with different ids. mysql-5.7 was returning 14 blogs, mysql-8 with the clause was returning 11 blogs, and mysql-8 with the clause removed was returning 15 blogs (the 14 plus the blog that should be excluded 40775). Amending this clause toAND blog_posts.ID NOT IN (SELECT 40775)
appears to give the same results as mysql-5.7. Whilst sort of interesting, this hasn’t fixed my relevance issues, though!The yarpp.php file says I’m using version 5.30.10, although the WordPress admin interface is offering me “beta (4.0.7b1)”.
- This reply was modified 7 months, 2 weeks ago by tomward.
Hi @tomward can you try dropping the existing YARPP related full-text indexes and recreating them? It may be possible that upgrading the DB + changing the collation may have affected its integrity. It would be worth recreating them from scratch.
To remove the indexes:
ALTER TABLE wp_posts DROP INDEX yarpp_title; ALTER TABLE wp_posts DROP INDEX yarpp_content;
To add back the indexes:
ALTER TABLE wp_posts ADD FULLTEXT
yarpp_title
(post_title
); ALTER TABLE wp_posts ADD FULLTEXTyarpp_content
(post_content
);Note: the above assumes your posts table is named “wp_posts”. Please update it according to the actual value in your database.
Hey @jeffparker, yeah, just tried dropping & recreating those indexes & I’m still seeing the same behaviour that I don’t understand – the different results for
NOT IN (id)
compared toNOT IN (SELECT id)
as well as poor relevance.(my table is
blog_post
so I edited the SQL accordingly)Having a poke around with settings more like what we usually use. With content matching, the weirdness around
NOT IN (SELECT id)
seems to go away, but the matches are really different to the previous server. These results may not mean very in isolation, but for interest, this querySELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) * 1 + (MATCH (post_title) AGAINST ('butterfly updates')) * 1 + COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) * 1 + COUNT(DISTINCT IF( terms.term_taxonomy_id IN (4344,112,6708,6707,58,6703,6246,132,6706,6705,6712,6709,6711,6710,4213,6704,5901,6713), terms.term_taxonomy_id, null )) * 3,4) AS score
FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post')
AND blog_posts.ID NOT IN ( 40775)
GROUP BY ID
HAVING score >= 2.000000
AND ID != 0
AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0 and COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) >= 1
ORDER BY score DESC;produces quite different top matches on the old database
40775 33720 71.9465
40775 21211 67.4809
40775 29723 26.0072
40775 39918 22.9360
40775 24623 21.6051
40775 34554 21.5452
40775 39009 21.0883
40775 726 18.6029
40775 15599 18.4169
40775 21522 18.4026
40775 35919 17.5108
40775 1863 17.4167
40775 34839 16.8549
40775 23774 16.7163
40775 24606 16.6431
40775 5681 14.9384
40775 30694 14.2196
40775 33751 14.2144
40775 34244 14.0386compared to the new database
40775 33706 102.2484
40775 11699 63.8313
40775 22225 33.9329
40775 39804 33.7642
40775 29721 32.4923
40775 34829 31.8152
40775 39001 27.5836
40775 36323 25.9082
40775 36239 25.4413
40775 24606 25.388
40775 35906 25.3144
40775 21492 25.0634
40775 40406 24.9763
40775 1848 23.8333
40775 5665 23.1519
40775 493 22.135
40775 33739 21.754
40775 20986 21.574
40775 38189 20.232e.g. in the old database results the top hit 33720 is about solar panels & butterflies – highly revelant – but in the new results 33706 is about swifts & their nests, no keywords present. It’s very weird!
wow weird, refining this query to try to put my finger on the symptom, I get this:
SELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) * 1 ) AS score FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID ) WHERE post_status IN ( 'publish', 'static' ) AND post_password ='' AND post_type IN ('post') AND blog_posts.ID IN ( 33706, 33720) GROUP BY ID HAVING score >= 2.000000 AND ID != 0 AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0 and COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) >= 1 ORDER BY score DESC;
old server:
40775 33720 68 40775 33706 2
new server:
40775 33706 101 40775 33720 101
The yarpp.php file says I’m using version 5.30.10, although the WordPress admin interface is offering me “beta (4.0.7b1)”.
…this is also weird. Do you want to try deleting the plugin all together and reinstalling? You should see 5.30.10 in the admin interface.
do you have any other plug-in that may modify Yarpp results?
Weird. The admin interface was saying 5.30.10, although it was giving this weird upgrade message. I just deleted & reinstalled it, the admin interface still says 5.30.10 although the upgrade message has gone away. The result quality is exactly the same, though. I don’t think there’s anything other plugins that would affect this, but will look into that next.
Haven’t managed to figure out any plugins that are affecting this, but have done a bit more fuzzing of the SQL & found some weird results. The query that I previously mentioned – derived from the query that yarrp uses:
SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post')
AND blog_posts.ID IN ( 33706, 33720)
GROUP BY ID
ORDER BY score DESC;gives this result
40775 33706 101
40775 33720 101But if I remove either the GROUP BY or the LEFT JOIN then I get results that look much more like what I expect:
SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post')
AND blog_posts.ID IN ( 33706, 33720)
ORDER BY score DESC;removed GROUP BY gives results which look like the GROUP BY is doing the wrong thing
40775 33720 101
40775 33720 101
40775 33720 101
40775 33720 101....
40775 33706 3
40775 33706 3
40775 33706 3
40775 33706 3....and this
SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
FROM blog_posts
WHERE post_status IN ( 'publish', 'static' )
AND post_password =''
AND post_type IN ('post')
AND blog_posts.ID IN ( 33706, 33720)
GROUP BY ID
ORDER BY score DESC;removing the left join gives the answer that I was hoping for all along
40775 33720 101
40775 33706 3oh right yeah, amending the score to have MAX means that I get the answer I expect
MAX(ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged'))) AS score
Maybe some mysql config setting has changed on my new database server (which means the group by aggregation is behaving differently)
- This reply was modified 7 months, 1 week ago by tomward.
I’m also having an issue with line 140 of YARRP_DB_Schema.php, my server does not like
SHOW TABLES LIKE "blog_yarpp_related_cache"
it prefers
SHOW TABLES LIKE 'blog_yarpp_related_cache'
so have amended thusly
$tabledata = $wpdb->get_col( "SHOW TABLES LIKE '" . $wpdb->prefix . YARPP_TABLES_RELATED_TABLE . "'" );
anyway, adding MAX seems to solve my problem – although it doesn’t play nicely with the taxonomy clause. If I configure the plugin to not take into account categories & tags then I can again get good quality matches with MAX like this in YARRP_Cache.php line 218-243:
$newsql .= 'MAX(ROUND(0'; ... $newsql .= ',4)) AS score';
(personally I can live without categories & tags for the time being)
- This reply was modified 7 months, 1 week ago by tomward.
Nb. I think this is a mysql bug https://bugs.mysql.com/bug.php?id=114666
- The topic ‘Poor relevance (possibly mysql-8)’ is closed to new replies.