Siemens, Lemmatization and parsing: introduction

Introduction: Lemmatization and parsing

By its ideal definition, lemmatization is a process wherein the inflectional and variant forms of a word are reduced to their lemma: their base form, or dictionary look-up form. When one lemmatizes a text, one replaces each individual word in that text with its lemma; a text in English which has been lemmatized, then, would contain all forms of a verb represented by its infinitive, all forms of a noun by its nominative singular, and so forth.[1]
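The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration only: the `LEMMAS` table and the `lemmatize` function are hypothetical names, and a hand-written lookup table stands in for the full lemma dictionary and morphological rules that real lemmatization would need.

```python
# Hypothetical lookup table mapping inflected forms to lemmas.
# A real lemmatizer would need a full dictionary and morphological rules.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy",
    "colours": "colour",
}

def lemmatize(text):
    """Replace each whitespace-separated word with its lemma, if one is listed."""
    words = text.lower().split()
    return " ".join(LEMMAS.get(word, word) for word in words)

print(lemmatize("the boy's cars are different colours"))
```

Words not listed in the table pass through unchanged, which is one reason a sketch like this falls short of the "ideal definition": coverage depends entirely on the dictionary behind it.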

Those involved in the production of dictionaries and concordances have long realized the value of lemmatization. When generating word indexes, concordances, and dictionaries from a text, lemmatizing that text beforehand ensures that all forms of a particular word within it can be located by searching only for its lemma form.[2] Even those who are solely users of such resources realize gains from the process: every time someone looks up a headword (which is, typically, the lemma form of a word) in a standard dictionary, that person is reaping the benefits of lemmatization; so, too, when using many concordances and indexes. In electronic texts, which are themselves potential dictionaries, concordances, and indexes, lemmatization ensures the accuracy of search procedures, simple and complex alike. But the use of texts which have been lemmatized is not limited solely to such procedures. Because the lemma replaces variant forms, the overall number of individual word types in a lemmatized text is decreased; thus, such a text is an invaluable aid for semantic studies and for other studies whose analysis techniques involve repeating sequences of words or word pairs.
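Both effects mentioned here, finding every form of a word under one lemma and reducing the number of word types, can be seen in a small word index. The sketch below is illustrative only; `LEMMAS`, `build_index`, and the sample sentence are assumptions introduced for the example, not part of any existing tool.

```python
from collections import defaultdict

# Hypothetical lemma table for the sample sentence below.
LEMMAS = {"is": "be", "are": "be", "cars": "car"}

def build_index(words, lemmatized):
    """Map each word type (or lemma) to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        key = LEMMAS.get(word, word) if lemmatized else word
        index[key].append(position)
    return dict(index)

words = "the car is red and the cars are blue".split()
raw_index = build_index(words, lemmatized=False)
lemma_index = build_index(words, lemmatized=True)

# Fewer distinct entries once variant forms collapse into their lemmas.
print(len(raw_index), len(lemma_index))
# One lookup now finds both "car" and "cars".
print(raw_index["car"], lemma_index["car"])
```

In the unlemmatized index a search for "car" misses the plural; in the lemmatized index a single entry gathers both occurrences.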

The syntactic counterpart of lemmatization is parsing, an act wherein each word in a text is assigned a tag that reflects its part of speech and syntactic function. In a parsed text, all verb forms are marked as verbs, perhaps with varying degrees of specificity within that category (lexical, modal, primary; first person singular, first person plural, and so forth), all noun forms as nouns, and so forth; possibly, as is typical with lemmatized texts, each word is replaced by the tag which represents it, though this need not be the case.
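The two output styles described here, a tag attached to each word or a tag replacing it, can be sketched as follows. The tag names (`DET`, `VERB.modal`, and so on), the `TAGS` table, and the `parse` function are all hypothetical. A fixed lookup table also cannot tell a verb from a noun when one form serves as both, so this shows only the shape of the output, not how tags would actually be assigned.

```python
# Hypothetical tag table; the tag names are invented for illustration.
TAGS = {
    "the": "DET",
    "cars": "NOUN",
    "can": "VERB.modal",
    "are": "VERB.primary",
    "run": "VERB.lexical",
    "fast": "ADV",
}

def parse(text, replace=False):
    """Tag each word; either annotate it (word/TAG) or replace it with its tag."""
    output = []
    for word in text.lower().split():
        tag = TAGS.get(word, "UNK")  # "UNK" marks words missing from the table
        output.append(tag if replace else f"{word}/{tag}")
    return " ".join(output)

print(parse("The cars can run fast"))
print(parse("The cars can run fast", replace=True))
```

Keeping the word alongside its tag preserves the original text, whereas the replaced form keeps only the syntactic skeleton of the sentence.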

A parsed text is more difficult to place, in terms of its everyday applications, than one which has been lemmatized, in part because parsing and lemmatization have traditionally been treated as somewhat separate acts and in part because the daily activities of the typical scholar do not often involve conscious contact with a text in this manner. It is true that dictionaries, as well as some concordances and indexes, do parse their texts in that they assign a syntactic marker to the words they describe, but the majority of scholars would not typically search for an occurrence of, say, a lexical verb the way one might search for the word run, nor are standard resources oriented to allow such searches easily to be carried out. But, just as this last statement -- "nor are standard resources oriented to allow such searches easily to be carried out" -- is less true today than it was some 10 years ago (due to the recent generation of computerized dictionaries, which allow much flexibility in searching), so is it less true today that lemmatizing and parsing should be treated as isolated acts. Of course, any project requiring the analysis of morphological and syntactic structures will demand a parsed text. But work which involves, for example, stylistic analysis and authorship attribution often benefits considerably from a text which has been both lemmatized and parsed; so, too, can searches of large text corpora be enhanced if those corpora have been treated by both processes. In these instances and others, texts which are both parsed and lemmatized are of considerable benefit to the scholar interested in computer-assisted textual research.[3]
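A text that carries both kinds of annotation supports both kinds of search mentioned above: by word (through its lemma) and by syntactic category. The following sketch assumes a tiny hand-annotated corpus; the `Token` record, the tag names, and the `search` function are hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary look-up form
    tag: str    # its (hypothetical) part-of-speech tag

# Hand-annotated toy corpus: "she runs and they can run a long run"
CORPUS = [
    Token("she", "she", "PRON"),
    Token("runs", "run", "VERB.lexical"),
    Token("and", "and", "CONJ"),
    Token("they", "they", "PRON"),
    Token("can", "can", "VERB.modal"),
    Token("run", "run", "VERB.lexical"),
    Token("a", "a", "DET"),
    Token("long", "long", "ADJ"),
    Token("run", "run", "NOUN"),
]

def search(corpus, lemma=None, tag=None):
    """Return positions of tokens matching the given lemma and/or tag."""
    return [
        position
        for position, token in enumerate(corpus)
        if (lemma is None or token.lemma == lemma)
        and (tag is None or token.tag == tag)
    ]

print(search(CORPUS, lemma="run"))              # every form of "run"
print(search(CORPUS, tag="VERB.lexical"))       # every lexical verb
print(search(CORPUS, lemma="run", tag="NOUN"))  # "run" used as a noun only
```

The last query is one that neither annotation could answer alone: the lemma gathers the forms, and the tag separates the noun from the verb.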

Notes

[1] The words am, are, and is would appear as be, and the words car, cars, car's and cars' would appear as car; the phrase "the boy's cars are different colours" would appear in a lemmatized text as "the boy car be different colour". In languages other than English, this process would involve similar, though not exactly the same, principles of reduction.

[2] To do the same with an unlemmatized text would still involve lemmatization of a sort, although it would be a "mental lemmatization" in which the user would have to search for each variant form of the lemma each time he or she wished to locate all occurrences of a single word throughout the text (Devine 1987: 56).

[3] Moreover, studies connected with lemmatization and those connected with parsing are beginning to be considered more closely related than previously thought (see Meunier 1976: 208-9).