• Hi,

    I’ve installed the Relevanssi plugin on the site I’m working on and it has improved the search dramatically. Thanks!

    My client uses NinjaTables Pro to manage product spec data. They insert a shortcode into the product description, which is then expanded into the full table on the site front-end. Initially this table wasn’t being indexed at all. I found a forum post about a similar issue, which led me to add the following filter to the site:

    add_filter( 'relevanssi_custom_field_value', 'relevanssi_index_ninja_tables' );

    At the same time I activated Relevanssi debug and saw that the table shortcode is indeed now being expanded during indexing. The problem is that when the table terms are tokenised, most of them are ignored. The relevant part of the debug output is as follows (I’ve removed the majority of the content for brevity):

    Post content after relevanssi_post_content:
    The below table gives an overview of the specifications for the Isabellenhütte A-H1 series: [ninja_tables id="57757"] Resistance Values 1 to 500<br />1 to 100 mOhm<br />Ohm Tolerance 0.1<br />1 % Temperature Coefficient (20-60°C) <30 ppm/K Applicable Temperature Range -55 to +140 °C Power Rating 3<br />10 (on a heatsink) W Thermal Resistance to Ambient (Rth) <15 K/W Thermal Resistance to Aluminium Substrate (Rthi) <3 K/W Dielectric Withstanding Voltage 500 V AC/DC Inductance <10 nH Stability (Nominal Load) Deviation After 2000h <0.1 (T? = 80°C with heatsink)<br /><0.2 (T? = 95°C with heatsink)<br />T? = Terminal Temperature %

    The table content runs from just after the [ninja_tables id="57757"] shortcode to the end of the snippet.

    Content, tokenized:
    hole resistor series isabellenhütte offers terminal connection technology current sensing applications designed easy heat sink mounting kelvin connections allow high precision measurements low resistance values range 0001ω 100ω available inductance pulse power handling capabilities select rating 10w suitable free air maximum permanent 81a constant applicable temperature 55°c 140°c tolerance options ±01 depending required tcr ppm 20°c 60°c self heating typical include measurement equipment reference resistors laboratories sources laboratory supplies table gives overview specifications 500 100 mohm ohm coefficient

    The only table terms that have been tokenised for indexing are: 500, 100, mohm, ohm, coefficient. That is far fewer than the table contains. The table is full of technical words, so they wouldn’t be covered by the stop list. The last content term in the token list, ‘coefficient’, appears on row 3 of 10 in the table.

    I hope I have understood the indexing process correctly. I have a few questions:
    How are the terms chosen for tokenisation?
    Is there any way I can increase the number of terms that are tokenised from the expanded shortcode?
    Additionally is it possible to force select uppercase characters on tokenisation? E.g. the ‘ω’ characters should either be indexed as ‘Ω’ or ‘ohms’ (apologies for sneaking in this complete other issue into the end of the main problem!)

    Many thanks for any help you can provide with this issue. Please let me know if I can collect any further debug output to help find out what’s going on.

    Antony

Viewing 6 replies - 1 through 6 (of 6 total)
  • Plugin Author Mikko Saari

    (@msaari)

    Something’s wrong there indeed. I tried having Relevanssi index the post content, and everything was indexed, considering minimum word length and so on:

    table gives overview specifications isabellenhütte series resistance values 500 100 mohm ohm tolerance temperature coefficient 60°c ppm applicable range 140 power rating heatsink thermal ambient rth aluminium substrate rthi dielectric withstanding voltage inductance stability nominal load deviation 2000h 80°c 95°c terminal

    In your case slightly more of the table has been tokenized than appears, as each token appears only once in the list. So tokens that also appear in the post before the table are earlier in the content, like “ppm”, “60°c”, “terminal”, “range”, “power”, “rating” and so on.

    Some of the words are still missing, and this looks like a case of a problem I fixed recently. The problem is caused by “<3”, which to Relevanssi looks like the opening of an HTML tag. Which version of Relevanssi are you using? The latest version has a better tokenizer that isn’t confused by this, but if you’re using even a slightly older version, that would explain the problem.

    For the ohms problem,

    add_filter( 'relevanssi_punctuation_filter', 'rlv_ohms_fix' );
    function rlv_ohms_fix( $replacements ) {
        $replacements['ω'] = 'ohms';
        $replacements['Ω'] = 'ohms';
        return $replacements;
    }

    You have to decide whether you want the replacement to be ‘ohms’ or ‘ ohms’, i.e. whether “100ω” should become “100ohms” or “100 ohms”. Probably the latter. You may want to do something about the ° symbol as well: you now have some temperatures with a space between the number and the ° and some without. It’s probably best to add $replacements['°'] = ' '; in there as well.
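
    A minimal sketch of the variant described above, assuming the space-prefixed replacement is wanted and the degree sign is turned into a space. The function name rlv_units_fix is hypothetical; relevanssi_punctuation_filter is the hook already used in this thread. The function_exists() guard and the strtr() line are only there so the snippet can be tried outside WordPress; how Relevanssi applies the map internally is not shown.

```php
<?php
/**
 * Sketch: map unit symbols to searchable words.
 * rlv_units_fix is a hypothetical name, not part of Relevanssi.
 */
function rlv_units_fix( $replacements ) {
    // Leading space so "100ω" becomes "100 ohms" rather than "100ohms".
    $replacements['ω'] = ' ohms';
    $replacements['Ω'] = ' ohms';
    // Treat the degree sign as a separator so "55°c" and "55 °c" split the same way.
    $replacements['°'] = ' ';
    return $replacements;
}

// Guard only so the file also loads outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_punctuation_filter', 'rlv_units_fix' );
}

// Rough illustration of the effect of such a replacement map on a string.
echo strtr( '100ω at 55°c', rlv_units_fix( array() ) ), "\n";
// prints: 100 ohms at 55 c
```

    With the space in place, "ohms" is indexed as its own token, so a search for "ohms" can match regardless of the number in front of it.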

    Thread Starter antonywalton

    (@antonywalton)

    Hi Mikko,

    Thanks for the reply, that makes sense about the non-repeated tokens. We’re using version 4.11.0 of Relevanssi. Does this have the better tokeniser? Min word length is set at the default; I think I remember seeing that setting at 3.

    If this is the latest version, do I need to look for something else going on with the tokenisation?

    Thanks also for the replacement ohms filter, it looks fairly similar to some example code I found and adapted:

    add_filter( 'relevanssi_punctuation_filter', 'rlv_character_equivalency' );
    function rlv_character_equivalency( $array ) {
    	$array['ω'] = 'ohms';
    	$array['Ω'] = 'ohms';
    	return $array;
    }

    In the Relevanssi debug tab, the majority of the ohms entries, which haven’t been replaced by the filter, are listed under ‘Other taxonomies’ heading. Should I be using a slightly different filter for these? For the ° symbol, I might end up replacing it with something like ' degrees'. Also, should I move this specific issue into a new forum thread?

    Plugin Author Mikko Saari

    (@msaari)

    Yes, 4.11.0 should be the first corrected version. I can confirm there’s some weird bug in play. I don’t understand what’s going on in here, but the problem is the “<30”, that kills it. This’ll require a Relevanssi update to work. Once I figure out what’s going on in here, I’ll get you something to test.

    Do you have something like taxonomy terms with the ohm symbol in the name of the term? All taxonomy term content is passed through the same tokenizer, so those should be replaced as well. If you can specify where the symbols appear, I can be more specific.

    This one thread is fine.

    Plugin Author Mikko Saari

    (@msaari)

    Ah, found it. In lib/indexing.php there’s this line:

    $contents = wp_strip_all_tags( $contents );

    The problem is that wp_strip_all_tags() is way too aggressive and strips things that just look like tags. You can actually just remove that line and it shouldn’t have any unpleasant consequences, but it should fix this problem.
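
    The effect can be reproduced outside WordPress. This is only a sketch: plain PHP strip_tags() stands in for wp_strip_all_tags() here, on the assumption that the WordPress function relies on strip_tags() for the actual tag removal, and the sample strings are shortened from the debug output earlier in the thread.

```php
<?php
// Same text twice: once with a space after "<", once without.
$with_space    = 'Temperature Coefficient < 30 ppm/K Power Rating';
$without_space = 'Temperature Coefficient <30 ppm/K Power Rating';

// "<" followed by whitespace is kept as literal text.
var_dump( strip_tags( $with_space ) );

// "<30" is read as the start of a tag, so the text after it is dropped.
var_dump( strip_tags( $without_space ) );
```

    In the second call "ppm", "Power" and "Rating" never reach the tokenizer, which matches the token list stopping at "coefficient".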

    Thread Starter antonywalton

    (@antonywalton)

    Fast work!

    I’ve commented that line out of lib/indexing.php and rerun the indexer. The debug now lists more tokenised terms as:

    ±0.1 ±30 0.001 0.1 0.2 100 10nh 10w 140 2000hrs 500m 500v 81a air allow aluminium ambient applicable applications available capabilities connection connections constant current degreesc depending designed deviation dielectric easy equipment free handling heat heating heatsink high hole include inductance isabellenhütte kelvin laboratories laboratory low maximum measurement measurements mounting offers ohms options permanent power ppm precision pulse range rating reference required resistance resistor resistors rth rthi select self sensing series sink sources specification stability substrate suitable supplies table tcr technology temperature terminal thermal tolerance typical value values voltage withstanding

    So that’s way better, thanks.

    With the filter code snippet active, the debug for the product under discussion shows a couple of different results, depending on where the character appears.

    Taxonomy
    There are indeed taxonomy terms with the Ω symbol present. In the debug these are shown with the symbol replaced by the lowercase version of the character, ω. Is it useful to know which taxonomy is affected?

    Other taxonomies (with the relevanssi_punctuation_filter active):

    ±0.1 ±30 0.001 0.001ω 0.005ω 0.01ω 0.02ω 0.05ω 0.1 0.1ω 0.25ω 0.2ω 0.3ω 0.5ω 1.0 100 100r 100ω 10r0 10w 140 1r00 33r0 33ω 5r00 81a ah1 compliant components control current degreesc electronic hole instrumentation isabellenhuette low measurement ohms power ppm r001 r005 r010 r020 r050 r100 r200 r250 r300 r500 resistance resistor resistors scientific sensing terminal test

    As ‘ohms’ does appear in this token list, is the replacement actually happening and I’m just misreading the token list? My expectation would be to not see any ω characters.

    Content
    General content terms have the target replacements stripped out and disappear from the list of tokens. Is this the correct behaviour? This is shown in the content debug code above.

    Content terms that should have ‘ohms’: 100Ω, 0.001Ω, 500mΩ, 1mΩ (it looks like due to the Ω being stripped out, this term falls below the min word length limit).

    Hope this info helps.

    Plugin Author Mikko Saari

    (@msaari)

    The taxonomy should not matter. All taxonomies are handled the same. The replacement should convert “0.5ω” to “0.5ohms” if it’s working correctly. If you’re adding a space, then it’s “0.5 ohms”, but the “ω” should not remain in any case.

    Could it be that the ohm symbol in the token is different from the one in the function? I doubt that, but it’s possible, I guess. At that point the symbol should not be an HTML entity, but that may also be worth checking: if the token has the entity, and the function is looking for the character, there’s no match. In that case adding a replacement for the entity is also useful.
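
    One way to cover both possibilities, as a hedged sketch: Unicode has a dedicated ohm sign (U+2126) that looks identical to the Greek capital omega (U+03A9) but is a different character, so a replacement keyed on one will not match the other. The function name rlv_ohm_variants and the choice of entities are assumptions for illustration; whether entities are still present when the filter runs needs to be checked on the actual site.

```php
<?php
/**
 * Sketch: register every plausible spelling of the ohm symbol.
 * rlv_ohm_variants is a hypothetical name, not part of Relevanssi.
 */
function rlv_ohm_variants( $replacements ) {
    $variants = array(
        "\u{03C9}", // Greek small letter omega
        "\u{03A9}", // Greek capital letter omega
        "\u{2126}", // Ohm sign, a separate code point with the same glyph
        '&omega;',  // named entity, lowercase
        '&Omega;',  // named entity, uppercase
        '&#8486;',  // numeric entity for the ohm sign
    );
    foreach ( $variants as $variant ) {
        $replacements[ $variant ] = ' ohms';
    }
    return $replacements;
}

// Guard only so the file also loads outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_punctuation_filter', 'rlv_ohm_variants' );
}

// The two look-alike characters have different byte sequences.
echo bin2hex( "\u{2126}" ), ' vs ', bin2hex( "\u{03A9}" ), "\n";
```

    Running bin2hex() on a token copied from the debug output shows which of these characters it really contains.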

    Adding a filter function on relevanssi_remove_punctuation with a priority lower than 10 will let you see what the tokens look like before Relevanssi handles them, and adding a filter function with a priority higher than 10 lets you see what happens to the tokens afterwards, so that’s one way to debug this.
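
    A minimal sketch of that debugging approach, assuming the hook passes the text as a string in its first argument. The function names rlv_debug_before and rlv_debug_after are hypothetical, and priorities 9 and 11 are just examples of "lower than 10" and "higher than 10". The output goes to the PHP error log.

```php
<?php
// Logs the text before Relevanssi's own handler runs, then passes it on unchanged.
function rlv_debug_before( $text ) {
    error_log( 'relevanssi_remove_punctuation before: ' . $text );
    return $text;
}

// Logs the text after Relevanssi's own handler has run, again unchanged.
function rlv_debug_after( $text ) {
    error_log( 'relevanssi_remove_punctuation after: ' . $text );
    return $text;
}

// Guard only so the file also loads outside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_remove_punctuation', 'rlv_debug_before', 9 );
    add_filter( 'relevanssi_remove_punctuation', 'rlv_debug_after', 11 );
}
```

    Both functions return their input untouched, so they can be left in place while reindexing a single post and removed once the log shows where the ω survives.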

  • The topic ‘Tokenisation of expanded ninja tables shortcode terms’ is closed to new replies.