CHWP A.1 Siemens, "Lemmatization and parsing"

5. The Dictionary (DCT) file

Ultimately, the principles of lemmatization and the parsing grammar one adopts are reflected in the emendations one makes to the dictionary file; editing the dictionary is thus a stage central to these processes when using the TACT programs. In the past, when a word had no entry in the master dictionary, MakeDCT retrieved lemma and part-of-speech information for it from the Oxford Advanced Learner's Dictionary, an electronic version of which is deposited in the Oxford Text Archive. This option is not currently available, however, and those starting to use the preprocessing programs must build, from the texts being processed, the dictionary from which the computer will retrieve information as the text is parsed and lemmatized.

With the master dictionary blank, as will be the case when one begins using the program, the text-specific dictionary will appear as in Figure 5. It is made up of five fields separated by tabs (ASCII 09, represented in the figure by a visible marker). The first field contains the word as it appears in the text and the second its part-of-speech; if the word is not found in the master dictionary, ??? is placed in this second field. The third and fourth fields contain the lemma form of the word, and the fifth preserves its original, or raw, form. The appropriate part-of-speech and lemma form must be added manually for each word in this file with a text editor or a word processor that will handle ASCII text, including extended ASCII characters, without corruption.[19] Using the aforementioned lemmatization principles and parsing tagset, the resultant dictionary file would appear as in Figure 6.[20]
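The five-field layout described above can be illustrated with a short Python sketch. It is not part of the TACT programs: the names DctEntry, parse_dct_line and untagged, the sample lines, and the tag PREP are all hypothetical, chosen only to show how one might list the words that still carry the ??? marker before editing.

```python
from typing import List, NamedTuple

# Marker the article says is placed in the second field when a word
# is not found in the master dictionary.
UNKNOWN_POS = "???"


class DctEntry(NamedTuple):
    """One line of the dictionary file (field names are our own)."""
    word: str       # field 1: word as it appears in the text
    pos: str        # field 2: part-of-speech, or ??? if unknown
    lemma: str      # field 3: lemma form
    lemma_alt: str  # field 4: lemma form
    raw: str        # field 5: original, or raw, form


def parse_dct_line(line: str) -> DctEntry:
    """Split one line on tabs (ASCII 09) and insist on five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return DctEntry(*fields)


def untagged(entries: List[DctEntry]) -> List[DctEntry]:
    """Return the entries still awaiting a manual part-of-speech tag."""
    return [e for e in entries if e.pos == UNKNOWN_POS]


# Invented sample data, not taken from Figure 5 or Figure 6.
sample = "the\t???\tthe\tthe\tThe\nto\tPREP\tto\tto\tto\n"
entries = [parse_dct_line(line) for line in sample.splitlines()]
print([e.word for e in untagged(entries)])  # ['the']
```

When reading a real file, one would also need to choose an encoding that preserves extended ASCII characters; which code page applies depends on how the text was prepared and is not specified here.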

One cannot hope at this stage to take into account all lexical categories of a particular form; for example, to may be used as an infinitive marker as well as the preposition it is listed as in Figure 6. Errors in tagging are instead corrected manually as they arise, once the tagged text has been proofread; these emendations are then included in the master dictionary using the program SatDCT, and will be given as an option within the parsing tag in the future. Because the results of the manual edit are stored for future use, the master dictionary grows quickly and considerably, thus decreasing the amount of labour necessary to parse and lemmatize in each successive effort. For this reason, those wishing to parse and lemmatize a large text with an empty master dictionary will find it profitable to divide the text into smaller units and complete the process on each smaller text separately, one after another.
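The benefit of working through a large text in smaller units can be illustrated with a rough Python sketch. This does not reproduce SatDCT or its file format: the in-memory Master structure, the functions unknown_words and merge, the word lists, and the tags TAG and INF are all assumptions made for the illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical master dictionary: word -> list of (part-of-speech, lemma) options.
Master = Dict[str, List[Tuple[str, str]]]


def unknown_words(master: Master, words: Iterable[str]) -> List[str]:
    """Return, in first-seen order, the distinct words with no master entry."""
    seen: List[str] = []
    for w in words:
        if w not in master and w not in seen:
            seen.append(w)
    return seen


def merge(master: Master, emendations: Iterable[Tuple[str, str, str]]) -> None:
    """Add (word, pos, lemma) emendations, keeping earlier options."""
    for word, pos, lemma in emendations:
        options = master.setdefault(word, [])
        if (pos, lemma) not in options:
            options.append((pos, lemma))


master: Master = {}
chunks = [["to", "be", "or", "not"], ["to", "be", "that", "is"]]
workload = []
for chunk in chunks:
    todo = unknown_words(master, chunk)
    workload.append(len(todo))
    # Stand-in for the manual edit: placeholder tag, word used as its own lemma.
    merge(master, [(w, "TAG", w) for w in todo])

print(workload)  # [4, 2]

# A later correction adds a second option for the same form.
merge(master, [("to", "INF", "to")])
print(master["to"])  # [('TAG', 'to'), ('INF', 'to')]
```

The second unit requires manual attention for only two words because the first has already populated the dictionary, which is the effect the paragraph above describes.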


Notes

[19] It is crucial to understand that some wordprocessors, such as WordPerfect, convert pre-existing input text into a proprietary format, so that if you save an edited file as you normally would, rather than as an ASCII or DOS file, it may be unusable by other software. Furthermore, in the case of the tab-character, WordPerfect will convert it on entry into a series of space-characters and will not restore these to a tab even if you save the file in DOS format. The best advice is to avoid such a wordprocessor when editing text files and to use instead a text-editor; a good public-domain editor for DOS, Windows 3.1 and 95, is the Programmer's File Editor, available online via the URL http://www.lancs.ac.uk/people/cpaap/pfe/. Otherwise you need to work out a procedure for making sure that the file has not been changed by the wordprocessor. With WordPerfect, you will need to reinsert the tab, which is represented as [4,1] in its Typographical Symbols character set.
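One possible procedure for checking that a file has survived editing is sketched below in Python. It assumes only what the article states, namely that each dictionary line has five fields separated by tabs; the function name lines_with_bad_tabs and the sample lines are hypothetical.

```python
from typing import List


def lines_with_bad_tabs(text: str, expected_fields: int = 5) -> List[int]:
    """Return 1-based numbers of non-empty lines lacking the expected tabs."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Five fields should be separated by exactly four tab characters.
        if line and line.count("\t") != expected_fields - 1:
            bad.append(number)
    return bad


# Invented sample: the second line has had its tabs replaced by spaces.
good = "to\t???\tto\tto\tto\n"
damaged = "to    ???    to    to    to\n"
print(lines_with_bad_tabs(good + damaged))  # [2]
```

An empty result suggests that the tab structure is intact; it does not show that the other characters in the file are unchanged.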

[20] In this specific case, using a text which is from early modern English, variants in spellings and graphical forms had to be considered as well.