• We’ve been using YARPP for some years & it has worked well. We recently migrated to a new server & since then YARPP has been generating very poor matches, for instance we have lots of blogs about bees & this recent blog post doesn’t find any of them – previously we would have expected this to work.

    One obvious change is that we’ve moved from MySQL-5.7 to MySQL-8, and the table collation has changed from utf8 (general_ci) to utf8mb4 (utf8mb3_general_ci), so my suspicion is that the FULLTEXT match is no longer working correctly – but I’m not sure how to start debugging this. Do you have any pointers?

    We’ve also upgraded from php-7.2 to php-8.1, but I feel less suspicious about that change in terms of this plugin.

    • This topic was modified 7 months, 2 weeks ago by tomward.

    The page I need help with: [log in to see the link]

Viewing 13 replies - 1 through 13 (of 13 total)
  • Thread Starter tomward

    (@tomward)

    Trying to fuzz my current install to understand more about the issue. Have switched the columns to utf8mb3_general_ci collation which changed the results slightly but not meaningfully.

    Now looking at the yarpp options – I have changed the settings to only “consider with extra weight” for the title & “do not consider” for everything else. I can see that the results re-generate when I change this setting. Looking at a different post which is titled “Butterfly updates”, the related posts I get are “Bindweeds”, “Woodcock Wood”, “Tree planting”, etc – but if I use the WordPress search box I get lots of results with post titles including “The Big Butterfly Count”, “Butterfly management & conservation”, “Big Butterfly Count 2021”, “Big Butterfly count – 2020”, etc. So something is clearly going wrong with the matching. I guess I’m going to try and isolate a particular SQL query that’s producing these poor matches next.

    Thread Starter tomward

    (@tomward)

    With yarpp still set to only consider titles, I pulled this query out by using a query monitor:

    insert into blog_yarpp_related_cache (reference_ID,ID,score)
    SELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_title) AGAINST ('butterfly updates')) * 3,4) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post','page','attachment')
    AND blog_posts.ID NOT IN (40775)
    GROUP BY ID
    HAVING score >= 2.000000
    AND ID != 0
    AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0
    ORDER BY score DESC
    LIMIT 5 on duplicate key update date = now()

    I then noticed that this subquery was giving me different results on mysql8 compared to mysql5.7

    SELECT ROUND(0 + (MATCH (post_title) AGAINST ('butterfly updates')) * 3,4) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post','page','attachment')
    AND blog_posts.ID NOT IN (40775)
    GROUP BY ID
    HAVING score >= 2.000000
    AND ID != 0
    AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0
    ORDER BY score DESC

    in particular, the AND blog_posts.ID NOT IN (40775) appeared to be behaving weirdly – with the clause present, the blog with id 40775 was being excluded, but also some other blogs with different ids. mysql-5.7 was returning 14 blogs, mysql-8 with the clause was returning 11 blogs, and mysql-8 with the clause removed was returning 15 blogs (the 14 plus the blog that should be excluded 40775). Amending this clause to AND blog_posts.ID NOT IN (SELECT 40775) appears to give the same results as mysql-5.7. Whilst sort of interesting, this hasn’t fixed my relevance issues, though!

    The yarpp.php file says I’m using version 5.30.10, although the WordPress admin interface is offering me “beta (4.0.7b1)”.

    • This reply was modified 7 months, 2 weeks ago by tomward.
    Plugin Author YARPP

    (@jeffparker)

    Hi @tomward can you try dropping the existing YARPP related full-text indexes and recreating them? It may be possible that upgrading the DB + changing the collation may have affected its integrity. It would be worth recreating them from scratch.

    To remove the indexes:

    ALTER TABLE wp_posts DROP INDEX yarpp_title;
    ALTER TABLE wp_posts DROP INDEX yarpp_content;

    To add back the indexes:

    ALTER TABLE wp_posts ADD FULLTEXT yarpp_title (post_title);
    
    ALTER TABLE wp_posts ADD FULLTEXT yarpp_content (post_content);

    Note: the above assumes your posts table is named “wp_posts”. Please update it according to the actual value in your database.

    Thread Starter tomward

    (@tomward)

    Hey @jeffparker, yeah, just tried dropping & recreating those indexes & I’m still seeing the same behaviour that I don’t understand – the different results for NOT IN (id) compared to NOT IN (SELECT id) as well as poor relevance.

    (my table is blog_post so I edited the SQL accordingly)

    • This reply was modified 7 months, 2 weeks ago by tomward.
    • This reply was modified 7 months, 2 weeks ago by tomward.
    Thread Starter tomward

    (@tomward)

    Having a poke around with settings more like what we usually use. With content matching, the weirdness around NOT IN (SELECT id) seems to go away, but the matches are really different to the previous server. These results may not mean very in isolation, but for interest, this query

    SELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) * 1 + (MATCH (post_title) AGAINST ('butterfly updates')) * 1 + COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) * 1 + COUNT(DISTINCT IF( terms.term_taxonomy_id IN (4344,112,6708,6707,58,6703,6246,132,6706,6705,6712,6709,6711,6710,4213,6704,5901,6713), terms.term_taxonomy_id, null )) * 3,4) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post')
    AND blog_posts.ID NOT IN ( 40775)
    GROUP BY ID
    HAVING score >= 2.000000
    AND ID != 0
    AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0 and COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) >= 1
    ORDER BY score DESC;

    produces quite different top matches on the old database

    40775 33720 71.9465
    40775 21211 67.4809
    40775 29723 26.0072
    40775 39918 22.9360
    40775 24623 21.6051
    40775 34554 21.5452
    40775 39009 21.0883
    40775 726 18.6029
    40775 15599 18.4169
    40775 21522 18.4026
    40775 35919 17.5108
    40775 1863 17.4167
    40775 34839 16.8549
    40775 23774 16.7163
    40775 24606 16.6431
    40775 5681 14.9384
    40775 30694 14.2196
    40775 33751 14.2144
    40775 34244 14.0386

    compared to the new database

    40775 33706 102.2484
    40775 11699 63.8313
    40775 22225 33.9329
    40775 39804 33.7642
    40775 29721 32.4923
    40775 34829 31.8152
    40775 39001 27.5836
    40775 36323 25.9082
    40775 36239 25.4413
    40775 24606 25.388
    40775 35906 25.3144
    40775 21492 25.0634
    40775 40406 24.9763
    40775 1848 23.8333
    40775 5665 23.1519
    40775 493 22.135
    40775 33739 21.754
    40775 20986 21.574
    40775 38189 20.232

    e.g. in the old database results the top hit 33720 is about solar panels & butterflies – highly revelant – but in the new results 33706 is about swifts & their nests, no keywords present. It’s very weird!

    Thread Starter tomward

    (@tomward)

    wow weird, refining this query to try to put my finger on the symptom, I get this:

    SELECT 40775 AS reference_ID, ID, ROUND(0 + (MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) * 1 ) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post')
    AND blog_posts.ID  IN ( 33706, 33720)
    GROUP BY ID
    HAVING score >= 2.000000
    AND ID != 0
    AND bit_or(terms.term_taxonomy_id IN (792,15,4612,1354,791,9)) = 0 and COUNT(DISTINCT IF( terms.term_taxonomy_id IN (919,920), terms.term_taxonomy_id, null )) >= 1
    ORDER BY score DESC;

    old server:

    40775    33720   68
    40775    33706   2

    new server:

    40775    33706   101
    40775    33720   101
    Plugin Author YARPP

    (@jeffparker)

    The yarpp.php file says I’m using version 5.30.10, although the WordPress admin interface is offering me “beta (4.0.7b1)”.

    …this is also weird. Do you want to try deleting the plugin all together and reinstalling? You should see 5.30.10 in the admin interface.

    do you have any other plug-in that may modify Yarpp results?

    Thread Starter tomward

    (@tomward)

    Weird. The admin interface was saying 5.30.10, although it was giving this weird upgrade message. I just deleted & reinstalled it, the admin interface still says 5.30.10 although the upgrade message has gone away. The result quality is exactly the same, though. I don’t think there’s anything other plugins that would affect this, but will look into that next.

    Thread Starter tomward

    (@tomward)

    Haven’t managed to figure out any plugins that are affecting this, but have done a bit more fuzzing of the SQL & found some weird results. The query that I previously mentioned – derived from the query that yarrp uses:

    SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post')
    AND blog_posts.ID IN ( 33706, 33720)
    GROUP BY ID
    ORDER BY score DESC;

    gives this result

    40775 33706 101
    40775 33720 101

    But if I remove either the GROUP BY or the LEFT JOIN then I get results that look much more like what I expect:

    SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
    FROM blog_posts left join blog_term_relationships as terms on ( terms.object_id = blog_posts.ID )
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post')
    AND blog_posts.ID IN ( 33706, 33720)
    ORDER BY score DESC;

    removed GROUP BY gives results which look like the GROUP BY is doing the wrong thing

    40775 33720 101
    40775 33720 101
    40775 33720 101
    40775 33720 101....
    40775 33706 3
    40775 33706 3
    40775 33706 3
    40775 33706 3....

    and this

    SELECT 40775 AS reference_ID, blog_posts.ID, ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged')) AS score
    FROM blog_posts
    WHERE post_status IN ( 'publish', 'static' )
    AND post_password =''
    AND post_type IN ('post')
    AND blog_posts.ID IN ( 33706, 33720)
    GROUP BY ID
    ORDER BY score DESC;

    removing the left join gives the answer that I was hoping for all along

    40775 33720 101
    40775 33706 3

    Thread Starter tomward

    (@tomward)

    oh right yeah, amending the score to have MAX means that I get the answer I expect

    MAX(ROUND(0 + MATCH (post_content) AGAINST ('solar cabbage brassicas caterpillars farms panels beneficial may plants butterflies chemical glucobrassicin eggs fields increase insects insect wildlife offer damaged'))) AS score

    Maybe some mysql config setting has changed on my new database server (which means the group by aggregation is behaving differently)

    • This reply was modified 7 months, 1 week ago by tomward.
    Thread Starter tomward

    (@tomward)

    I’m also having an issue with line 140 of YARRP_DB_Schema.php, my server does not like

    SHOW TABLES LIKE "blog_yarpp_related_cache"

    it prefers

    SHOW TABLES LIKE 'blog_yarpp_related_cache'

    so have amended thusly

            $tabledata = $wpdb->get_col( "SHOW TABLES LIKE '" . $wpdb->prefix . YARPP_TABLES_RELATED_TABLE . "'" );
    Thread Starter tomward

    (@tomward)

    anyway, adding MAX seems to solve my problem – although it doesn’t play nicely with the taxonomy clause. If I configure the plugin to not take into account categories & tags then I can again get good quality matches with MAX like this in YARRP_Cache.php line 218-243:

        $newsql .= 'MAX(ROUND(0';
    
           ...
    
        $newsql .= ',4)) AS score';

    (personally I can live without categories & tags for the time being)

    • This reply was modified 7 months, 1 week ago by tomward.
    Thread Starter tomward

    (@tomward)

    Nb. I think this is a mysql bug https://bugs.mysql.com/bug.php?id=114666

Viewing 13 replies - 1 through 13 (of 13 total)
  • The topic ‘Poor relevance (possibly mysql-8)’ is closed to new replies.