• Resolved franciscus

    (@franciscus)


    I have lots of pages that contain ellipsis “…” which is encoded as & # 8230; (without the spaces).
    Relevanssi indexes this as “8230”. I noticed this because it shows up in my “Stopword Candidates” list.
    I think that HTML entities like this should not be indexed.

    • This topic was modified 6 years, 4 months ago by franciscus.
Viewing 4 replies - 1 through 4 (of 4 total)
  • Plugin Author Mikko Saari

    (@msaari)

    You’re right, entities should not be indexed. Adding this to your theme functions.php should fix the issue, I believe:

    add_filter( 'relevanssi_post_content_before_tokenize', 'rlv_ellipsis_fix' );
    function rlv_ellipsis_fix( $content ) {
        return html_entity_decode( $content, ENT_QUOTES, 'UTF-8' );
    }

    Can you confirm that this fixes the problem and doesn’t cause any other complications?

    Thread Starter franciscus

    (@franciscus)

    Actually, this fix didn’t work for me because the “…” is in the post title, not the body.
    So I tried the ‘relevanssi_post_title_before_tokenize’ filter, and that didn’t work either: the ellipsis entity is actually produced by wptexturize, WordPress’s default ‘the_title’ filter function, which transforms the three dots (...) in the title into HTML entity 8230.
    Since your code applies the ‘the_title’ filter after ‘relevanssi_post_title_before_tokenize’, my filter’s work is for naught.
    I swapped the two apply_filters calls in your code, and then it worked as expected.
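
A minimal sketch of the title-side filter described here, mirroring the content filter suggested earlier in the thread. The callback name `rlv_title_entity_fix` is hypothetical, and the hook only has an effect if ‘relevanssi_post_title_before_tokenize’ runs after ‘the_title’, which in the version discussed required swapping the two calls in the plugin code:

```php
<?php
// Hypothetical callback name; the decoding call is the same one used for post content.
function rlv_title_entity_fix( $title ) {
    // Turn entities such as &#8230; back into real characters so the
    // digits "8230" are not left behind as a token.
    return html_entity_decode( $title, ENT_QUOTES, 'UTF-8' );
}

// Only register the hook when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_post_title_before_tokenize', 'rlv_title_entity_fix' );
}
```

When pasting into an existing functions.php, the opening `<?php` tag is already present and should be left out.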

    On an unrelated note, there seems to be an option to disable selected shortcodes, but I didn’t see a UI for that. Is that part of the premium version?

    And by the way, the debug statement
    relevanssi_debug_echo( "\tTitle, tokenized: " . implode( ' ', array_keys( $titles ) ) );
    should be
    relevanssi_debug_echo( "\tTitle, tokenized: " . implode( ' ', array_keys( $title_tokens ) ) );

    Plugin Author Mikko Saari

    (@msaari)

    Good catches there. Of course the relevanssi_post_title_before_tokenize should be the last thing before tokenizing.

    As I was testing this, I noticed something funny: wptexturize() doesn’t touch an actual ellipsis character (…), but does change three dots (...) to an ellipsis entity. I was using an actual ellipsis in my testing, so I was a bit puzzled here.

    In any case, I thought about this a bit, and that filter is not the correct solution. The correct solution is to add the html_entity_decode() to the punctuation remover. That way it can handle the punctuation correctly.

    I will add

    $a = html_entity_decode( $a, ENT_QUOTES );

    as the first step in relevanssi_remove_punct() (in /lib/common.php), before $a = preg_replace( '/<[^>]*>/', ' ', $a );. That should handle this in a neat way. I’m going to make the change in the next version of Relevanssi, but if you want it now, just patch the common.php.
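
To illustrate why the order matters, here is a small standalone sketch. `demo_remove_punct()` is a hypothetical stand-in, not Relevanssi's actual relevanssi_remove_punct(); only the tag-stripping regex and the html_entity_decode() call are taken from the thread, and the punctuation regex is an assumption made for the demo:

```php
<?php
// Hypothetical stand-in for a punctuation remover; NOT Relevanssi's real code.
function demo_remove_punct( $a, $decode_first ) {
    if ( $decode_first ) {
        // &#8230; becomes the real ellipsis character before anything else runs.
        $a = html_entity_decode( $a, ENT_QUOTES, 'UTF-8' );
    }
    // Strip HTML tags (the step the thread names as the current first step).
    $a = preg_replace( '/<[^>]*>/', ' ', $a );
    // Assumed for the demo: anything that is not a letter, digit or
    // whitespace is replaced with a space.
    $a = preg_replace( '/[^\p{L}\p{N}\s]+/u', ' ', $a );
    // Collapse runs of whitespace.
    return trim( preg_replace( '/\s+/', ' ', $a ) );
}

echo demo_remove_punct( '<b>Wait</b>&#8230; what', false ), "\n"; // Wait 8230 what
echo demo_remove_punct( '<b>Wait</b>&#8230; what', true ), "\n";  // Wait what
```

Without decoding, the `&`, `#` and `;` are stripped as punctuation and the bare digits survive as a token; with decoding first, the ellipsis is removed as ordinary punctuation.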

    And yeah, the shortcode disabling UI is a premium feature. It can be used in the free version by directly adjusting the relevanssi_disable_shortcodes option.
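
Adjusting the option directly might look like the sketch below. Only the option name comes from the thread; the value format (a comma-separated list of shortcode names) is an assumption that should be checked against the plugin source, and `gallery` and `my_shortcode` are placeholder names:

```php
<?php
// Assumption: the option stores a comma-separated list of shortcode names.
// 'gallery' and 'my_shortcode' are placeholders, not recommendations.
if ( function_exists( 'update_option' ) ) {
    update_option( 'relevanssi_disable_shortcodes', 'gallery,my_shortcode' );
}
```

This needs to run only once, for example from a temporary snippet, since the value is stored in the database.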

    Thread Starter franciscus

    (@franciscus)

    Yes, that solved the problem.
    Until you publish a fix, I added the call in a relevanssi_remove_punctuation filter with higher priority than yours, and that works as well.
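
A sketch of that interim workaround, assuming Relevanssi's own punctuation callback is attached to ‘relevanssi_remove_punctuation’ at the default priority of 10 (an assumption, not confirmed in the thread). The callback name `rlv_decode_entities_first` is hypothetical:

```php
<?php
// Hypothetical callback: decode entities before punctuation is removed.
function rlv_decode_entities_first( $a ) {
    return html_entity_decode( $a, ENT_QUOTES, 'UTF-8' );
}

// Only register the hook when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    // In WordPress, lower priority numbers run earlier, so 9 runs before
    // callbacks registered at the default priority of 10.
    add_filter( 'relevanssi_remove_punctuation', 'rlv_decode_entities_first', 9 );
}
```

Once the plugin itself decodes entities, the filter becomes redundant and can be removed.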

  • The topic ‘HTML entity is indexed’ is closed to new replies.