CHWP A.1 Siemens, "Lemmatization and parsing"

2. The TACT programs[*]

The TACT preprocessing programs reflect consideration of such matters, and incorporate the computer in a partially automated process in which the electronic text is tagged with lemma and part-of-speech/syntactic information simultaneously. Figure 1 outlines the sequence in which the preprocessing programs are used. PreProc prepares the initial text file (the input text) for use with the other programs. MakeDCT interacts with a master dictionary file containing lemma and part-of-speech forms. TagText applies this information to the text[8] and, after one manually edits and corrects the tagged text, SatDCT makes a dictionary file which is specific to the input text, and both updates existing entries in, and adds new entries to, the master dictionary. Though they do not form a fully automatic system, these programs automate those tasks which the computer can carry out accurately and, together, greatly streamline what can be a very time-consuming process.
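The flow of information between the dictionary stages may be easier to follow in a small sketch. The Python below is an illustration only, not the TACT code: the function names, the representation of a dictionary as a mapping from word form to a (lemma, part-of-speech) pair, and the "?" placeholder for unresolved words are all assumptions made for the example.

```python
def make_text_dictionary(distinct_words, master):
    """MakeDCT-like step (hypothetical): look up each distinct word in the master dictionary."""
    # Words absent from the master dictionary get None and await manual editing.
    return {word: master.get(word) for word in distinct_words}


def tag_text(words, text_dictionary):
    """TagText-like step (hypothetical): attach lemma and part of speech to each word."""
    tagged = []
    for word in words:
        entry = text_dictionary.get(word.lower())
        if entry is None:
            tagged.append((word, "?", "?"))  # left for the proofreader to resolve
        else:
            lemma, pos = entry
            tagged.append((word, lemma, pos))
    return tagged


def update_master(master, text_dictionary):
    """SatDCT-like step (hypothetical): update existing entries and add new ones."""
    merged = dict(master)
    for word, entry in text_dictionary.items():
        if entry is not None:
            merged[word] = entry
    return merged


# Invented sample data, not drawn from any real dictionary file.
master = {"words": ("word", "noun"), "is": ("be", "verb")}
words = ["Words", "is", "hard"]

text_dict = make_text_dictionary(sorted({w.lower() for w in words}), master)
text_dict["hard"] = ("hard", "adjective")  # stands in for the manual editing step

tagged = tag_text(words, text_dict)
new_master = update_master(master, text_dict)
```

In this sketch the only word unknown to the master dictionary, "hard", is supplied by hand before tagging and is then carried back into the master dictionary, so that it need not be entered again for a later text.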

The first preprocessing program, PreProc, works with a tagged text file, as seen in Figure 2,[9] and the information provided by the setup file (*.mks) to produce a set of four output files which are used by the other programs. One of these files is a copy of the input text, which will eventually be tagged with lemma and parsing information; in this file, all tags have been removed and all words which were separated by a continuation character, such as a hyphen, are joined. Another is a list of distinct words, which MakeDCT uses to create a dictionary specific to the input text; this dictionary, when properly prepared, is used to tag the text.
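The two outputs described above can be approximated in a few lines. The following Python is a minimal sketch under stated assumptions, not a reproduction of PreProc: it assumes that tags are enclosed in angle brackets, that the continuation character is a hyphen at the end of a line, and that a word is any run of letters. The sample text is invented for the example.

```python
import re

# Assumed tag syntax: anything between angle brackets.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove all (assumed angle-bracket) tags from the text."""
    return TAG_PATTERN.sub("", text)


def join_continuations(text, continuation="-"):
    """Rejoin words split across a line end by the continuation character."""
    # Note: a genuine hyphenated compound broken at a line end is joined too.
    pattern = re.compile(re.escape(continuation) + r"[ \t]*\n[ \t]*")
    return pattern.sub("", text)


def distinct_words(text):
    """Return the sorted list of distinct lower-cased word forms."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))


# Invented sample, not taken from any source text.
sample = "<page 1>The pro-\ngrams tag the text, and the text is then\nproof-\nread.\n"

clean = join_continuations(strip_tags(sample))
word_list = distinct_words(clean)
```

Here `clean` plays the part of the untagged copy of the input text and `word_list` that of the list of distinct words; the remaining output files mentioned above are not modelled.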

Manual interaction occurs in the process at two points: firstly, the text-specific dictionary is edited after using MakeDCT and, secondly, the text is proofread after using TagText. Proofreading, a stage required even with more automated systems, is necessary to ensure the accuracy of the text. Manually editing the text-specific dictionary file to assign proper lemma and part-of-speech/syntactic forms to the individual words allows the user full control over the representation of that information in the text. This method of editing offers much flexibility, but also requires that one adopt a methodology to guide lemmatization and an accurate grammatical tagset to ensure consistency in the final text.
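One way of supporting consistency during the manual editing stage is to check the edited dictionary against the chosen tagset before the text is tagged. The Python below is a hedged sketch of such a check and is not a feature of the TACT programs: the tagset, the dictionary layout (word form mapped to a lemma and part-of-speech pair, or to None when no entry has been assigned), and the function name are all hypothetical.

```python
# Hypothetical tagset; a real project would define its own.
ALLOWED_TAGS = {"noun", "verb", "adjective", "adverb"}


def find_problems(text_dictionary, allowed_tags=ALLOWED_TAGS):
    """List dictionary entries that are missing or use a tag outside the tagset."""
    problems = []
    for word, entry in sorted(text_dictionary.items()):
        if entry is None:
            problems.append((word, "no entry"))
            continue
        lemma, pos = entry
        if not lemma:
            problems.append((word, "missing lemma"))
        if pos not in allowed_tags:
            problems.append((word, "unknown tag: " + pos))
    return problems


# Invented sample entries for illustration.
edited = {
    "hard": ("hard", "adj"),   # abbreviation not in the assumed tagset
    "is": ("be", "verb"),
    "words": None,             # not yet assigned by the editor
}

problems = find_problems(edited)
```

A check of this kind does not replace proofreading; it only catches mechanical slips, such as an abbreviated tag, that would otherwise be propagated to every occurrence of the word.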


Notes

[*] Editorial note. Cf. notice on availability of TACT.

[8] The preprocessing programs do not, at this time, offer features such as contextual disambiguation; this must be done manually when one proofreads and edits the tagged text.

[9] All examples are from Cawdrey's A Table Alphabeticall (Cawdrey 1604).