CHWP B.13 Tompa, "Experiences with the OED"

4. Support for full text search

A text database system must provide an effective query language -- users' retrieval requests must be easily expressible and efficiently supported. To this end, we have developed the Pat full text search system (Pat, 1990).

Pat can retrieve all occurrences of any word or phrase appearing anywhere in the OED in less than one second. A user may choose to combine results using Boolean expressions or proximity conditions. Furthermore, Pat's flexible field-defining facility provides a mechanism for restricting a search to one or more particular regions of text or for retrieving all regions of a particular type (e.g., all quotations) that contain some specified string.

For the OED, we have defined a control file for Pat which declares the character mappings to be in effect for retrieval purposes and the points in the text that are to be indexed (and thus potentially retrieved in response to a query for a word or phrase).

The CharMappings statement declares which punctuation marks and special symbols should be treated as if they were blanks (and thus word delimiters) and that retrieval should be case-insensitive (all upper-case letters treated as if they were lower-case). The WordStarters statement declares that any string starting with a printable character after a blank or hyphen, or any string starting with a hyphen, left angle bracket or ampersand, can be retrieved. Thus a search for "able as" in the OED instantly returns the following fifteen matches:
   as able as any cowboy on the range..to manage anythin..
   be able as days go by Always to look myself straight ..
   <T>Able as he is, he has adopted a tone and style..un..
   <T>Able as he proved himself, his task was one of no ..
   as able As he that hight <i>Irrefragable</i>. </T></Q..
   ne able as hee went along to have seene the Wood for ..
   ng able, as I noted before, to see them at that dista..
   ---able', as in <CF>countless</CF>, <CF>numberless</C..
   an-able as it should be), it sets a-worke thousands.<..
   so able as now. </T></Q><Q><D>1611</D> <A>Shaks.</A> ..
   so able as now. </T></Q><Q><D>1651</D> tr. <W>Life Fa..
   as able, as opportunity occurred, to secure the servi..
   ng able, as the phrase is, to take the law of him. </..
   ng able, as they say, to overpower and hinder its inc..
   As able as yourself and as nimble too, though I mayn'..
We can subsequently determine that fourteen of these are within quotations, and that the fifteenth includes
   'not to be --ed', 'un---able', as in  <CF>countless</CF>...
within the definition for the entry for -less.
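
The effect of these two statements can be sketched in Python. This is an illustration only: the syntax and contents of the actual Pat control file are not reproduced in this section, so the set of blank-like symbols (BLANK_LIKE), the treatment of ">" as a blank, the handling of the start of the text, and all function names are assumptions chosen to be consistent with the matches listed above.

```python
# Assumption: symbols treated as blanks. The real CharMappings list is not
# given in the text; ">" is included so that "<T>Able" yields a word start.
BLANK_LIKE = set(".,;:!?'\"()[]>")
STARTERS = set("-<&")  # characters that always begin an indexed string


def map_char(c):
    """Map one character for retrieval: blank-like symbols become blanks,
    upper-case letters become lower-case."""
    if c.isspace() or c in BLANK_LIKE:
        return " "
    return c.lower()


def index_points(text):
    """Return the offsets in text at which an indexed string may start."""
    points = []
    prev = " "  # assumption: the start of the text acts like a blank
    for i, raw in enumerate(text):
        c = map_char(raw)
        after_delimiter = c != " " and c.isprintable() and prev in " -"
        if c in STARTERS or after_delimiter:
            points.append(i)
        prev = c
    return points


def key_at(text, start, length):
    """Mapped form of the string beginning at offset start, with runs of
    blanks collapsed to a single blank, cut off after length characters."""
    out = []
    for raw in text[start:]:
        c = map_char(raw)
        if c == " " and out and out[-1] == " ":
            continue  # collapse consecutive blanks
        out.append(c)
        if len(out) == length:
            break
    return "".join(out)
```

Under these assumptions, the fragment "un---able', as in" has an index point at the "a" of "able" (because it follows a hyphen), and the mapped string beginning there starts with "able as", which is why that fragment appears among the matches.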

Unlike other text search systems, which index individual words in a text, Pat is based on the concept of semi-infinite strings (Gonnet, Baeza-Yates & Snider, 1991). Thus the query "one of" should not be interpreted as a search for all occurrences of this particular two-word phrase, but rather as a request to retrieve all occurrences of strings that begin with the character o (or O, since that is mapped to o), followed by n, then e, then one or more blanks, then o, then f. The 23,899 matches in the OED include not only

   ayed one of her jade tricks.</T></Q></QP></S6></S4><p..
   > <W>One of our Conquerors</W> I. xiv. 269 <T>A young..
    was one of the principal executors of the murder [of..
   sed..one of the ten Commandments.</T></Q></QP><QP><LB..
but also
   put (one) off <CF>with</CF>.  <LB>Obs.</LB> <LB>rare<..
   ding one offer only and this is a conditional offer t..
    but one Office. </T></Q><Q><D>1732</D> <A>Lediard</A..
   that one often feels..disinclined to get off. </T></Q..
(To search for the two-word phrase, one would specify "one of " to ensure that a blank or punctuation mark follows the f.) As a result, searches for prefixes of words or arbitrarily truncated phrases are as easily supported by Pat as are searches for complete words.
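
Prefix matching over semi-infinite strings can be illustrated with a small Python sketch. It is not Pat's implementation: a sorted list of string positions stands in for Pat's actual index structure, which this section does not describe, and the character mapping (BLANK_LIKE), the rule that strings start only after a blank, and the function names are simplifying assumptions.

```python
from bisect import bisect_left, bisect_right

# Assumption: a simplified character mapping (fold case, treat these symbols
# as blanks); the real OED control file is not shown in the text.
BLANK_LIKE = set(".,;:!?'\"()[]>")


def normalize(text):
    """Return (mapped, offsets): the mapped text with runs of blanks collapsed,
    and, for each mapped character, its offset in the original text."""
    mapped, offsets = [], []
    for i, raw in enumerate(text):
        c = " " if raw.isspace() or raw in BLANK_LIKE else raw.lower()
        if c == " " and mapped and mapped[-1] == " ":
            continue  # one or more blanks count as a single blank
        mapped.append(c)
        offsets.append(i)
    return "".join(mapped), offsets


def build_index(text):
    """Sort word-start positions by the semi-infinite string beginning there."""
    mapped, offsets = normalize(text)
    starts = [i for i, c in enumerate(mapped)
              if c != " " and (i == 0 or mapped[i - 1] == " ")]
    starts.sort(key=lambda i: mapped[i:])
    return mapped, offsets, starts


def search(index, query):
    """Return original-text offsets of all indexed strings beginning with query."""
    mapped, offsets, starts = index
    q, _ = normalize(query)
    # Only the first len(q) characters of each string matter. Building this
    # list costs linear time per query; it keeps the sketch short, whereas a
    # real index would compare against the text in place.
    keys = [mapped[i:i + len(q)] for i in starts]
    lo, hi = bisect_left(keys, q), bisect_right(keys, q)
    return sorted(offsets[i] for i in starts[lo:hi])


index = build_index("One of us saw one Office; one often feels it is one of a kind.")
print(search(index, "one of"))   # [0, 14, 26, 48]
print(search(index, "one of "))  # [0, 48]
```

Because all strings sharing a prefix are adjacent in sorted order, the same lookup serves complete words, word prefixes and truncated phrases alike. The trailing blank in the second query excludes "one Office" and "one often", as described above.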
