WordPress search algorithm (search rules; search results)
Plenty of plugins extend WordPress search in various ways, and a web search will quickly tell you where WordPress searches and how it presents its results. What I have not been able to find is much on how the default WordPress search actually works, that is, what the rules are. So I did some research, and offer it here in the hope it may help others. No doubt WordPress developers can expand and correct this where needed. I think a description like this would be helpful in the Codex!
This description applies to versions 3.9 and 4.0 at least.
TERMINOLOGY
What the user puts in the search box is a list of “search terms”. Search terms are either, loosely speaking, “words” or “phrases”, separated by spaces, tabs or (slightly surprisingly) commas. Phrases are indicated by enclosing them in double quotes, but users have to know this syntax, so hardly any searches will use phrases. (They will still be more common than the URL options mentioned at the end, because quoted phrases are at least a little known from Google searches.)
A search term “matches” an article when the same sequence of characters appears in that page or post; the article is then included in the results. This is also sometimes called a “search hit”.
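As a rough illustration of the splitting rules just described, here is a minimal Python sketch. It is not WordPress's own code (WordPress is written in PHP); the function name `split_search_terms` and the regular expression are my own assumptions, chosen only to mirror the behaviour described above.

```python
import re

# Hypothetical pattern, not taken from WordPress: either a double-quoted phrase
# (the closing quote is optional) or a run of characters that contains no
# space, tab, comma or double quote.
TERM_PATTERN = re.compile(r'"([^"]*)"?|([^\t ",]+)')


def split_search_terms(raw):
    """Split what the user typed into a list of words and phrases."""
    terms = []
    for match in TERM_PATTERN.finditer(raw):
        phrase, word = match.group(1), match.group(2)
        # group(1) is None when the unquoted-word alternative matched
        term = phrase if phrase is not None else word
        if term:  # skip empty phrases such as ""
            terms.append(term)
    return terms


print(split_search_terms('north "east side", west'))
```

The phrase keeps its internal spaces, while the comma and the spaces outside the quotes only act as separators.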
WHERE IT SEARCHES
Default WordPress searches only look at page or post titles and main content. They completely ignore excerpts, comments, tags, categories, custom fields and everything else.
Search terms can appear in either title or content independently. That is, a search for ‘north east’ will match an article called ‘The North West passage’ which contains ‘starting from the east…’ in the content.
HOW IT DIFFERS FROM GOOGLE
Most people’s benchmark for searching is probably Google, so it may be helpful to understand the main differences.
- WordPress searches aren’t backed by a thesaurus, so a search won’t find synonyms, different verb tenses or plurals (it may appear to find plurals sometimes when the singular is asked for, but that’s for a different reason – see below)
- it only returns articles that match all of the search terms, never articles that match just some of them
- very confusingly for users, WordPress finds results where the search term appears inside other words, not just as a whole word. For example, ‘love’ will find articles containing ‘she was beloved by him’, ‘it was a lovely’ (both potentially helpful) and ‘he was wearing gloves’ (distinctly unhelpful). Sometimes this helps, particularly when dealing with plurals or punctuation such as words with dashes (see below)
- punctuation is not ignored, except in some limited cases. Apostrophes are a particular problem: ‘andrews’ won’t match ‘Saint Andrew’s Church’ (but ‘andrew’ would, because partial words match). Dashes cause similar trouble: ‘low-energy’ (one word to WordPress) won’t match ‘low energy’, though ‘low energy’ would match ‘low-energy’ because each of the two search words partially matches the single “word” ‘low-energy’ in the article
- accented characters are treated as different from their unaccented counterparts. Thus ‘cafe’ won’t match ‘café’, whereas Google would
- if the PHP which WordPress runs in is installed with only US character handling functions (technically, without the multi-byte functions starting mb_…, which bizarrely is the default for PHP installs), case-insensitivity is limited to US characters, thus excluding accented characters and the like.
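Most of the differences listed above follow from treating a search as a set of plain substring tests. The sketch below is only a model of that idea in Python, not WordPress code; `article_matches` is a hypothetical helper. Note that Python's `str.lower()` handles accented letters, so this model does not reproduce the limited case handling described in the last bullet.

```python
def article_matches(terms, text):
    """Return True only if every term occurs somewhere in text as a substring.

    Simplified model: case is folded, but punctuation and accents are
    compared exactly as written.
    """
    haystack = text.lower()
    return all(term.lower() in haystack for term in terms)


# Partial words match, helpfully or not
print(article_matches(["love"], "he was wearing gloves"))
# Punctuation is significant: the apostrophe breaks 'andrews'
print(article_matches(["andrews"], "Saint Andrew’s Church"))
print(article_matches(["andrew"], "Saint Andrew’s Church"))
# A dash makes one "word" of low-energy
print(article_matches(["low-energy"], "low energy bulbs"))
print(article_matches(["low", "energy"], "low-energy bulbs"))
# Accents are significant (\u00e9 is a precomposed e-acute)
print(article_matches(["cafe"], "caf\u00e9"))
# All terms are required, not just some
print(article_matches(["north", "south"], "north only"))
```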
HOW IT SEARCHES
Search terms match an article if they appear anywhere in the content or title. If there is more than one term (word or phrase) the terms can be found in any order. Thus ‘north west’ will match an article containing ‘ten degrees west and forty degrees north’.
Searches are case insensitive. That is, ‘North’ matches ‘north’ and vice versa. (However, as noted above, this only works properly if PHP is installed with multi-byte character support.)
Phrases (multi-word search terms) match exactly, other than in case, so any punctuation, multiple spaces, leading spaces etc. will count.
Words must also match exactly, including punctuation, but because spaces, tabs and commas separate search terms, those characters can never be part of a single-word term.
Outside quotes, common words (“stop words”) are ignored completely. (Hence ‘the north street’ matches even if ‘the’ appears nowhere in the article.) By default, these are:
about,an,are,as,at,be,by,com,for,from,how,in,is,it,of,on,or,that,the,this,to,was,what,when,where,who,will,with,www
Blogs which have translation packages installed will provide their own list for the relevant language.
Any search term that is a single letter (that is, a to z, not any single character) is ignored.
Also, rather arbitrarily, if you have more than 9 terms (after stop words are removed), the whole search is treated as if it were in quotes, whether it is or not. I imagine this is because of the complexity of the database search it would otherwise generate.
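The filtering rules in this section can be sketched as follows. This is a Python model of the behaviour as described here, not WordPress's actual code. The helper name `filter_terms` and the `(text, quoted)` pair representation are hypothetical, and the sketch assumes that the single-letter rule, like the stop-word rule, applies only to unquoted terms.

```python
# Default English stop words, as listed above
STOP_WORDS = set(
    "about,an,are,as,at,be,by,com,for,from,how,in,is,it,of,on,or,that,the,"
    "this,to,was,what,when,where,who,will,with,www".split(",")
)

MAX_TERMS = 9  # above this, the whole query is treated as one phrase


def filter_terms(terms, raw_query):
    """terms: list of (text, quoted) pairs; raw_query: what the user typed."""
    kept = []
    for text, quoted in terms:
        if not quoted:
            # Assumption: only unquoted single letters a-z are dropped
            if len(text) == 1 and text.isascii() and text.isalpha():
                continue
            if text.lower() in STOP_WORDS:
                continue
        kept.append(text)
    if len(kept) > MAX_TERMS:
        # Too many terms: fall back to the whole input as a single term
        return [raw_query]
    return kept


print(filter_terms(
    [("the", False), ("north", False), ("street", False)],
    "the north street",
))
```

With more than nine surviving terms the function returns a one-element list holding the raw query, which models the "as if it were in quotes" behaviour.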
ABOUT PLURALS AND TENSES
It may sometimes appear that singular words match their plurals or present tenses match past tenses etc. However, this is purely a consequence of partial word matches. Hence ‘rabbit’ will find ‘rabbits’, ‘disorder’ will find ‘disorders’ and ‘disordered’, and ‘child’ will find ‘children’. However ‘rabbits’ won’t find ‘rabbit’ (plural will never match singular – except for sheep!) and ‘knife’ won’t find ‘knives’ (singular is not a stem of plural).
Partial matches are a consequence of how the underlying search is done in the database. The terms simply have wildcards attached at either end and the database is told to search for that. In technical terms this is the MySQL LIKE operator, used as content LIKE '%searchterm%'.
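To see the wildcard behaviour in action, here is a small self-contained Python sketch that uses an in-memory SQLite database as a stand-in for MySQL. The table name `posts`, the columns `title` and `content`, and the helper functions are simplified assumptions for illustration, not the real WordPress schema or query. SQLite's LIKE is case insensitive for ASCII letters only by default, which happens to resemble the limited case handling mentioned earlier.

```python
import sqlite3


def like_pattern(term):
    """Wrap a term in % wildcards, escaping LIKE's own special characters."""
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%" + escaped + "%"


def search(conn, terms):
    """Return titles of posts where every term is in the title or the content."""
    if not terms:
        return []
    clause = "(title LIKE ? ESCAPE '\\' OR content LIKE ? ESCAPE '\\')"
    where = " AND ".join([clause] * len(terms))
    params = []
    for term in terms:
        pattern = like_pattern(term)
        params.extend([pattern, pattern])  # once for title, once for content
    sql = "SELECT title FROM posts WHERE " + where + " ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, content TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [
    ("Pets", "Our rabbits love carrots"),
    ("Kitchen", "Sharpen your knives and wear gloves"),
    ("The North West passage", "starting from the east"),
])

print(search(conn, ["rabbit"]))         # singular is a substring of the plural
print(search(conn, ["knife"]))          # not a substring of 'knives'
print(search(conn, ["love"]))           # also hits 'gloves'
print(search(conn, ["north", "east"]))  # one term in title, one in content
```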
These partial matches make WordPress search relatively forgiving, and therefore hide many of the other limitations. However, they also produce completely unexpected false positives (like ‘gloves’ for ‘love’) and, worse, many false negatives: users who notice that ‘glove’ matches ‘gloves’, but not why, will expect ‘knife’ to match ‘knives’ and so will not realise that pages are potentially missing from the results.
SPECIAL CASES
There are a couple of options (‘exact’ and ‘sentence’) that can be put in the URL to change the standard behaviour, but unless the search form has been modified to add them, it’s unlikely they’d ever be used (and they aren’t very helpful either). ‘exact’ leaves out the wildcards, so the entire title or content must equal the search term, not just contain it. ‘sentence’ treats the whole search input as a single search term irrespective of quotes and separators (loosely speaking, as if the whole thing were in quotes; this is what the ‘more than 9 terms’ case above also does).
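For anyone who wants to try these options by hand, the following Python sketch builds such a URL. The site address, the helper name `build_search_url` and the value `1` for each flag are assumptions for illustration; `s` is the standard WordPress search parameter.

```python
from urllib.parse import urlencode


def build_search_url(site, query, exact=False, sentence=False):
    """Build a search URL, optionally adding the exact/sentence flags."""
    params = {"s": query}
    if exact:
        params["exact"] = "1"  # assumed value; any non-empty value should do
    if sentence:
        params["sentence"] = "1"
    return site + "?" + urlencode(params)


print(build_search_url("https://example.com/", "north west", sentence=True))
```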