
Engineering Village 2 - Minor Anomalies With Truncation

:: In a previous post, I waxed eloquent about the new changes to EV2, unveiled last month by Engineering Information Inc. One of the (very welcome) upgrades I noted was the ability to truncate to a single character using the wildcard character, "?". For example, I expected a search on equation? to return records containing either equation or equations.

Upon closer inspection, however, I discovered that the wildcard is used to replace a single character only, rather than allow for zero-to-one character replacement. From the EV2 site:

Use wildcard (?) to replace a single character.

This morning, I was helping a chem eng student search for the phrase, osmotic virial equation, using Easy Search. The phrase returned 117 records in a combined Compendex and Inspec search. Assuming the "?" would return both equation and equations in the results, we searched the phrase, osmotic virial equation?, in Easy Search. The resulting set was 80 records, much to my surprise. We checked the 80 records, and discovered that each of them had the word "equations" somewhere in the record. However, the remaining 37 records did not, confirming that the "?" is always searching for one extra character, not zero or one extra character.
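To make the distinction concrete, here is a minimal sketch in Python that mimics the two possible readings of "?" with regular expressions. It is only an illustration of the semantics; the function name `matches` and the `zero_or_one` flag are my own inventions and say nothing about how EV2's search engine is actually built.

```python
import re

def matches(pattern: str, word: str, zero_or_one: bool) -> bool:
    """Match `word` against a pattern in which '?' is a wildcard.

    zero_or_one=False: '?' stands for exactly one character
                       (the behaviour observed in EV2).
    zero_or_one=True:  '?' stands for zero or one character
                       (the behaviour I had assumed).
    """
    wildcard = ".?" if zero_or_one else "."
    regex = "".join(wildcard if ch == "?" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, word) is not None

words = ["equation", "equations"]

# Hypothetical illustration only, not EV2 output.
exactly_one = [w for w in words if matches("equation?", w, zero_or_one=False)]
zero_or_one = [w for w in words if matches("equation?", w, zero_or_one=True)]

print(exactly_one)  # ['equations']
print(zero_or_one)  # ['equation', 'equations']
```

Under the exactly-one reading, the singular form drops out of the result set, which is consistent with the 80-versus-117 record counts above.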

To add to this equation (no pun intended, I think!), EV2 has an autostemming feature that can be turned on or off - it defaults to on - which seems to mimic the truncation symbol, "*". EV2 describes the truncation function as:

Use truncation (*) to search for words that begin with the same letters.
comput* returns computer, computers, computerize, computerization

EV2 describes the autostemming function as:

Terms are automatically stemmed, except in the author field, unless the "Autostemming off" feature is checked.
management returns manage, managed, manager, managers, managing, management
I don't see a difference between the two functions, which I believe could cause some confusion for the user.
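To see where the two could diverge in principle, here is a toy sketch in Python. It assumes a deliberately crude suffix-stripping stemmer (`toy_stem`, with a hand-picked suffix list chosen to fit the help-text example); this is my own illustration, not EV2's stemming algorithm, and the vocabulary is made up.

```python
# Toy suffix list, longest first. An assumption for illustration only.
SUFFIXES = ("ement", "ers", "ing", "er", "ed", "e", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def truncation_hits(prefix: str, vocabulary: list[str]) -> list[str]:
    """Truncation: every word that begins with the given letters."""
    return [w for w in vocabulary if w.startswith(prefix)]

def stemming_hits(term: str, vocabulary: list[str]) -> list[str]:
    """Stemming: every word that reduces to the same root as the term."""
    root = toy_stem(term)
    return [w for w in vocabulary if toy_stem(w) == root]

vocab = ["manage", "managed", "manager", "managers",
         "managing", "management", "mango", "manual"]

print(stemming_hits("management", vocab))  # the six manage-words
print(truncation_hits("manage", vocab))    # misses "managing"
print(truncation_hits("man", vocab))       # also picks up "mango", "manual"
```

In this sketch the results depend on where the truncation symbol is placed, whereas stemming groups words by a shared root regardless of what the searcher typed. Whether EV2 behaves this way is something only the vendor can confirm.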

Using a different example, consider the word cat, which can cause all kinds of problems for the searcher. In a db where the "?" truncates zero or one character, a search on cat? would return cat or cats. Where the asterisk returns zero-to-unlimited characters, search cat*, and the results would include cat, cats, cathode, catalysis, catastrophe, catch, catalogue, catatonic, cattle, etc.

I searched cat on EV2 (Quick Search, Compendex/Inspec combined) in the following ways, with the following results:

  1. cat - Autostemming on: 166108 records found in Compendex & Inspec for: ((cat) WN All fields), 1969-2005
  2. cat - Autostemming off: 164303 records found in Compendex & Inspec for: ((cat) WN All fields), 1969-2005
  3. cat? - Autostemming on: 9893 records found in Compendex & Inspec for: ((cat?) WN All fields), 1969-2005
  4. cat? - Autostemming off: 9893 records found in Compendex & Inspec for: ((cat?) WN All fields), 1969-2005
  5. cat* - Autostemming on: 818875 records found in Compendex & Inspec for: ((cat*) WN All fields), 1969-2005
  6. cat* - Autostemming off: 818875 records found in Compendex & Inspec for: ((cat*) WN All fields), 1969-2005
What is evident from the results is that use of either the truncation or wildcard symbol overrides the autostemming function.

Of note is that the use of the wildcard function as a single character replacement, rather than a zero-to-one character replacement, is not unique to EV2. CSA Cambridge Scientific Abstracts uses it the same way, as does Web of Science. However, Web of Science allows for all three options:

The asterisk (*) represents zero to multiple characters.
The question mark stands for one character. The dollar sign stands for one character or no characters.

The SilverPlatter WebSPIRS platform uses "*" for zero-to-unlimited truncation, and the "?" for zero-to-one character truncation. The OVID platform also allows for the three options, but with a different character set (dollar sign, question mark, hash mark).
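Because the same symbol means different things on different platforms, it can help to write the conventions down as lookup tables. The sketch below, in Python, translates each symbol to a regular-expression fragment following the Web of Science and WebSPIRS descriptions above. The table names, the `search` helper and the sample vocabulary are hypothetical, and no vendor's engine is implied to work this way.

```python
import re

# Symbol meanings as described in the text above (illustrative only).
WOS_WILDCARDS = {
    "*": ".*",  # zero to multiple characters
    "?": ".",   # exactly one character
    "$": ".?",  # one character or no characters
}
WEBSPIRS_WILDCARDS = {
    "*": ".*",  # zero-to-unlimited truncation
    "?": ".?",  # zero-to-one character truncation
}

def search(pattern: str, vocabulary: list[str], wildcards: dict[str, str]) -> list[str]:
    """Return the words that match `pattern` under the given symbol table."""
    parts = [wildcards[ch] if ch in wildcards else re.escape(ch) for ch in pattern]
    regex = re.compile("".join(parts))
    return [w for w in vocabulary if regex.fullmatch(w)]

vocab = ["cat", "cats", "cathode", "catalysis"]

print(search("cat?", vocab, WOS_WILDCARDS))       # ['cats']
print(search("cat$", vocab, WOS_WILDCARDS))       # ['cat', 'cats']
print(search("cat?", vocab, WEBSPIRS_WILDCARDS))  # ['cat', 'cats']
print(search("cat*", vocab, WOS_WILDCARDS))       # all four words
```

The same query, cat?, returns different sets depending only on which table is in force, which is exactly the trap for a searcher who moves between databases.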

Comments: Truncation and wildcard functionality are important options for searchers. In my experience, though, most students and researchers seldom use truncation, because generally they aren't thinking of plurals or variant spellings of words, or are not aware the option exists in the database they are searching. As such, I'd like to see a simplification of truncation/wildcard functionality in EV2, and by extension, in most if not all databases. (I know, that is truly wishful thinking!)

Options to consider for EV2:

  • Allow the wildcard symbol, ?, to work as a zero-to-one character function, or introduce a third symbol to do this, if it is considered important to retain single-character truncation;
  • Reconsider the autostemming function. How valuable is it to the user if the user does not know it is working, or does not know what its function is from the outset of the search? I don't believe the average user twigs to this option, even if it is already on;
  • Eliminate left-side, or prefix truncation. It would never occur to me that $catal would return catalysis, catalyses, catalytic, etc.;
  • Allow for the use of the same truncation/wildcard functions across all three EV2 search options, Easy, Quick and Expert.
Of course, this is just my opinion, I could be wrong. :-)

Despite the foregoing observations, I very much like the new changes to EV2, especially the faceted searching, which will expand to Quick Search and Expert Search sometime in the near future. I demonstrated faceted searching yesterday afternoon to 70+ graduates and faculty in Chemical and Materials Engineering on campus, and they were suitably impressed. I have more suggestions for improvements to the search function on EV2, but that can wait for another post sometime soon.

Comments

Randy,
Thanks again for this insightful post.

Your point on the wildcard is correct. Our search engine http://fastsearch.com/ does not allow zero-to-one character replacement. This is on our wish list for FAST.

The difference between truncation and autostemming is that truncation returns results that start with the same letters as the truncated term, up to the point where the truncation symbol is used, whereas autostemming retrieves variants of a word using the word root as the stemming basis.

In terms of your "cat" example, the results are consistent. As you indicated, "truncation or wildcard symbol overrides the autostemming function". We neglected to mention this in the Help text, and we will add it shortly to make this clearer to searchers.

Your point on simplifying truncation and wildcard operations is very interesting. I am open to suggestions from your readers and other interested parties.

As for the options that you are suggesting:
* We are working with FAST on the wildcard function. Unfortunately we don't have any timeline on this.
* The autostemming is working correctly. If you can provide an example where you see a problem, I would greatly appreciate it. For some searchers autostemming is important, hence we are now providing on/off customization of this feature.
* Left-side truncation is working properly too. When searching for *catal, you get results for catal, biocatal, photocatal, electrocatal. The $ sign is the stemming operator for our Expert Search, and when you search for $catal, it brings back catals, catalizing, catalysis, catalyses, catalytic, catalysed.
* Truncation/wildcard work the same way in all three interfaces. Here is an example with truncation:


1. Easy ( (*cat) ) 208828 1884-2005 Compendex, Inspec & NTIS
2. Quick ((*cat) WN All fields) 208828 1884-2005 Compendex, Inspec & NTIS
3. Expert *cat Relevance 208828 1884-2005 Compendex, Inspec & NTIS

Here is an example with the wildcard:


1. Easy ( (m?cro) ) 208361 1884-2005 Compendex, Inspec & NTIS
2. Quick ((m?cro) WN All fields) 208361 1884-2005 Compendex, Inspec & NTIS
3. Expert m?cro 208361 1884-2005 Compendex, Inspec & NTIS

Please let us know if you have any additional questions on these.

We do greatly appreciate getting this kind of feedback from our users. Please keep these suggestions and feedback coming in.

Rafael
