Word stemming

Availability

Infradox XS 26.4 or later

Last article update

7 January 2016 - excluding words from the stemming algorithm

Status

Deployed

Related articles

The metadata repository 

Extracting unique words for tag lists and suggestions

Infradox XS version 26.4

Introduction

Word stemming is an algorithm (currently for English only) that converts words to their so-called stemmed versions. Searching for the stemmed version of words may improve search results by removing the difference between singular and plural forms, and by reducing verbs to their root. For example, the stemmed version of walk, walks, walked and walking is walk. If you use word stemming, files containing the word walk will be found when users search for walk, walks, walked or walking.

For example:

  • A file has the keywords boys, girls, walking, dog, park
  • The stemmed words column will have the words boi, girl, walk (if only the affected words are stored, which is a configuration setting as explained below)
  • The words a, in and the are stop words 
  • A user searches for boys and girls walking a dog in the park. The actual search will be converted to boi girl walk dog park, so the file will be found
  • A user searches for boy and girl walked in the park with a dog. The actual search will be boi girl walk park dog, so the file will be found

This example uses the full stemming algorithm, which should be used if you have stemmed words in your metadata (see Processing stemmed words below). Alternative options are also available, and these do not require stemmed words to be stored and indexed. If you do not enable processing of stemmed words, your metadata should at least have keywords in the singular form (see Configuring searches with Word stemming below).
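
The conversion in the example above can be sketched in a few lines of Python. This is only an illustration, not the Infradox implementation: toy_stem is a hypothetical, heavily simplified stand-in for the real stemming algorithm that handles just the words used here, and the stop word list is an assumption that also includes "and" and "with", since those words are dropped in the converted searches above.

```python
# Illustrative sketch only; not the actual Infradox stemming code.
STOP_WORDS = {"a", "in", "the", "and", "with"}  # "and"/"with" are assumed


def toy_stem(word):
    """Very rough stand-in for a real English stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Strip one common suffix, keeping at least three characters.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # The full algorithm turns a trailing y into i (boy -> boi).
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


def convert_search(query):
    """Drop stop words and stem the remaining words."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    return " ".join(toy_stem(w) for w in words)


print(convert_search("boys and girls walking a dog in the park"))
# boi girl walk dog park
print(convert_search("boy and girl walked in the park with a dog"))
# boi girl walk park dog
```

Both searches reduce to the same set of stemmed words, which is why either one finds the file.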

Disabling stemming when searching

Search results may appear less accurate depending on the options that you use and what your metadata looks like. You can add a checkbox to the search panel that allows the end user to switch stemming on or off. You can also disable stemming by adding the _stemming=0 parameter to a search URL.
For example, use /search?s=dogs&_stemming=0 to search without stemming, or /search?s=dogs&_stemming=1 to switch it back on again.
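
If you generate search links from your own scripts, the parameter can be appended as in the following minimal Python sketch. The helper name search_url is hypothetical; only the s and _stemming parameters come from the URLs shown above.

```python
from urllib.parse import urlencode


def search_url(query, stemming=True):
    """Build a relative search URL with stemming switched on or off."""
    params = {"s": query, "_stemming": 1 if stemming else 0}
    return "/search?" + urlencode(params)


print(search_url("dogs", stemming=False))  # /search?s=dogs&_stemming=0
print(search_url("dogs", stemming=True))   # /search?s=dogs&_stemming=1
```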

Processing stemmed words

To store the stemmed versions of words - so that you can use the full stemming algorithm - go to Site configuration and click Word stemming in the menu.

Select the fields that you want to process by clicking on the field names. Fields that will be processed are displayed in green.

Next, select the target field in which you want to store the processed words. Currently you can select either Category or Subcategory because, besides Keywords and Caption, these are the only indexed fields that can hold large numbers of words. The target field should normally not be visible on the client-facing pages. You can hide fields by editing their properties in the metadata repository.

Storing all the words

The stemming algorithm will not create a stemmed version of a word:

  • if the original and stemmed version are the same
  • if a word is shorter than 3 characters
  • if the word is part of the ignore list 
  • if the word doesn't meet the stemming criteria (e.g. it doesn't start with a letter or it contains certain characters)

If the option "store all words" is not selected, then only the stemmed words are stored in the target field. This makes sense if the fields that you are stemming are indexed fields, because the unstemmed or ignored original words are then already searchable, regardless of whether a stemmed version is stored.

If, however, you have selected fields that are not indexed, then you should select the "store all words" option. In this case, words for which there is no stemmed version are also added to the target field, which should of course be an indexed field.

So if, for instance, a file has the following caption:

  • An elderly woman wearing a hat, walking in a street in Amsterdam.

Then the stemming algorithm will produce the following two words which will be stored in the target field:

  • walk,wear

So if the target field is indexed and the source field is not, this file can only be found by (variations of) these two words.

With the option "store all words" selected, the stemming algorithm will produce the following output:

  • amsterdam,elderli,hat,street,walk,wear,woman
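
The effect of the "store all words" option can be sketched as follows, using the keywords from the introduction (boys, girls, walking, dog, park). This is an illustration under stated assumptions, not the Infradox implementation: toy_stem is a hypothetical, heavily simplified stemmer, IGNORE_LIST is a placeholder, and the "certain characters" rule is reduced to a check on the first character.

```python
# Illustrative sketch only; not the actual Infradox stemming code.
IGNORE_LIST = set()  # hypothetical ignore list, empty here


def toy_stem(word):
    """Very rough stand-in for a real English stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


def stemmed_version(word):
    """Return the stemmed word, or None if no stemmed version is created."""
    w = word.lower()
    if len(w) < 3 or w in IGNORE_LIST or not w[0].isalpha():
        return None
    stem = toy_stem(w)
    return None if stem == w else stem


def target_field(words, store_all):
    """Build the comma-separated value stored in the target field."""
    out = set()
    for word in words:
        stem = stemmed_version(word)
        if stem is not None:
            out.add(stem)
        elif store_all:
            # No stemmed version exists, so keep the original word.
            out.add(word.lower())
    return ",".join(sorted(out))


keywords = ["boys", "girls", "walking", "dog", "park"]
print(target_field(keywords, store_all=False))  # boi,girl,walk
print(target_field(keywords, store_all=True))   # boi,dog,girl,park,walk
```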

Some other things to consider

  • Any changes that you make will only affect new files and files that you change after saving your changes.
    If you need to reprocess all files then you can contact support.
  • The target column will be overwritten with the stemmed words every time a file is processed.

Configuring searches with Word stemming

To enable Word stemming for searches, go to Site configuration and click Search settings in the menu (on the left). Next, open the section Word stemming. 
This function offers several options. The most suitable option depends on your metadata, keywording style and which fields are indexed. You can use the link Test your stemming settings to open a dialog in which you can see how your settings will affect search queries. Note that you can select a different option without having to close the dialog, and that you can enter multiple queries (on separate lines) to test several queries at once.

  • Use full stemming algorithm only
    This option should be selected only if your data has stemmed versions of words, either because you have enabled processing of words (as described above) or because you supply metadata with stemmed versions.
  • Search stemmed words only
    This option uses an adjusted version of the stemming algorithm so that it can be used if your data doesn't have stemmed versions of words. For example, the full stemming algorithm converts boy and boys to boi, and it converts both luxurious and luxury to luxuri. Such conversions are only useful if your data has stemmed words too. If you use this option (or one of the two options that follow), the stemming algorithm will not process such words. It will still convert verbs to their stemmed version and plural words to singular, e.g. flowers will be converted to flower. This is useful if you always add singular keywords (which is recommended) to describe your files.
  • Keep word and add its stemmed version
    This option can also be used if you don't store stemmed words. Searches are expanded by adding the stemmed version of what the user searches for. For example, if a user searches for ducks swimming water, the actual search will be (duck or ducks) (swim or swimming) water.
  • Convert word to stemmed version with wildcard
    With this option, searches are converted to the stemmed version of words but a wildcard (asterisk) is added to each stemmed word. You can use this if you don't store stemmed words. Searching with wildcards does affect search performance.
    Example: ducks swimming water will be converted to duck* swim* water. As a result, files that have the word water, a word that starts with duck and a word that starts with swim will be found.
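
The differences between the options above can be sketched in Python as follows. This is an illustration only, not the Infradox implementation: toy_stem is a hypothetical, heavily simplified stemmer, the mode names full, adjusted, keep and wildcard are labels invented here for the four options, and stop word handling is left out.

```python
# Illustrative sketch only; not the actual Infradox search code.
MODES = {"full", "adjusted", "keep", "wildcard"}  # hypothetical labels


def toy_stem(word, full=False):
    """Very rough stand-in for a real English stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undo a doubled final consonant (swimm -> swim).
            if suffix != "s" and w[-1] == w[-2] and w[-1] not in "aeioulsz":
                w = w[:-1]
            break
    # Only the full algorithm turns a trailing y into i (boy -> boi).
    if full and w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


def rewrite(query, mode):
    """Rewrite a search query according to the selected option."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    terms = []
    for word in query.lower().split():
        stem = toy_stem(word, full=(mode == "full"))
        if mode == "keep" and stem != word:
            terms.append(f"({stem} or {word})")
        elif mode == "wildcard" and stem != word:
            terms.append(stem + "*")
        else:
            terms.append(stem)
    return " ".join(terms)


print(rewrite("boys flowers", "full"))              # boi flower
print(rewrite("boys flowers", "adjusted"))          # boy flower
print(rewrite("ducks swimming water", "keep"))      # (duck or ducks) (swim or swimming) water
print(rewrite("ducks swimming water", "wildcard"))  # duck* swim* water
```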

Exclude words from being stemmed

In version 27.1 a list has been added that allows you to specify words that you want to exclude from being stemmed. You can enter several words separated by commas. Consider this example:

  • The word "snowing" has been added to the exclude list; the word "walking" has not.
  • The selected option is Keep word and add its stemmed version.
  • Someone searches for "snowing walking".
  • The actual search will be "snowing+(walking+or+walk)".

As you can see, the word snowing has not been affected by the algorithm.
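
The exclude list example can be sketched in Python as follows. This is an illustration only, not the Infradox implementation: toy_stem is a hypothetical, heavily simplified stemmer, and the function names are invented for this sketch. The output format follows the actual search shown in the example.

```python
# Illustrative sketch only; not the actual Infradox search code.
def toy_stem(word):
    """Very rough stand-in for a real English stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def parse_exclude_list(text):
    """Split a comma-separated exclude list into a set of words."""
    return {w.strip().lower() for w in text.split(",") if w.strip()}


def keep_and_add(query, excluded):
    """Keep each word and add its stemmed version, unless excluded."""
    terms = []
    for word in query.lower().split():
        stem = word if word in excluded else toy_stem(word)
        terms.append(word if stem == word else f"({word}+or+{stem})")
    return "+".join(terms)


excluded = parse_exclude_list("snowing")
print(keep_and_add("snowing walking", excluded))  # snowing+(walking+or+walk)
```

With an empty exclude list, the same search would also expand snowing in this sketch.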
