CHWP A.1 Siemens, "Lemmatization and parsing"

Conclusion: Applying the principles

While this process is by no means fully automatic, the preprocessing programs automate several key and time-consuming parts of it. My own project, which has ultimately employed stylistic analysis techniques, is based on a text which conflates the four editions of Robert Cawdrey's A Table Alphabeticall (Siemens 1994). This required the lemmatization and parsing of some 17,000 words in a file of approximately 125 kilobytes, not including tags. A conservative estimate of the time spent working with the preprocessing programs, excluding the time used to determine the principles of lemmatization and a parsing grammar (and tagset), is sixty hours. Some of this time may be attributed to the fact that early modern English is not the fixed system that contemporary English is, so variants in spelling had to be entered into the dictionary manually.[21] Those working with other early writing systems may encounter a similar situation; others working in languages with considerable homographic ambiguity, such as Latin or Hebrew, may find that extra time is required for disambiguation.

Once a methodology to guide lemmatization and parsing has been decided upon, the text-specific dictionary edited, the tags applied to the text, the text proofread and corrected, and the master dictionary updated, the lemmatized and parsed text is ready for analysis; Figure 7 shows an example of this file, the output text. The original, or raw, form of the word is retained with the tag RAW and the results of parsing with the tag POS, while the lemma form of the original word appears without a tag. With the creation of a simple setup file, this file may be made into a textual database with MakeBase and then analyzed with UseBase and other TACT utility programs. Because UseBase automatically concords its textbases, one can locate all forms of a word under its lemma. As well, co-occurrences of lemma forms can be easily retrieved with UseBase, and collocations and word distribution patterns mapped. The utility Collgen, moreover, will track exact repetitions and generate a rule file whereby these repetitions can be located in the text.
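The idea of locating all forms of a word under its lemma can be illustrated outside TACT with a minimal Python sketch. It is not the UseBase implementation, and the token layout it reads is an assumption made for illustration only (one token per line, with RAW and POS tags followed by the untagged lemma); the sample words, the pattern `TOKEN`, and the function `forms_by_lemma` are likewise hypothetical.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) token layout, not the actual layout of Figure 7:
#   <RAW original-spelling><POS part-of-speech> lemma
TOKEN = re.compile(
    r"<RAW\s+(?P<raw>[^>]+)>\s*<POS\s+(?P<pos>[^>]+)>\s*(?P<lemma>\S+)"
)

def forms_by_lemma(text):
    """Map each lemma to the sorted list of raw spellings tagged with it."""
    index = defaultdict(set)
    for match in TOKEN.finditer(text):
        index[match.group("lemma")].add(match.group("raw"))
    return {lemma: sorted(forms) for lemma, forms in index.items()}

# Invented sample tokens in the assumed layout.
sample = """\
<RAW abandon><POS verb> abandon
<RAW abandonne><POS verb> abandon
<RAW abbreuiat><POS verb> abbreviate
"""

index = forms_by_lemma(sample)
print(index["abandon"])  # the variant spellings grouped under one lemma
```

Because the lemma is the untagged main text, grouping by it gathers every variant spelling in one place, which is the effect the concordance gives when the textbase is built on lemma forms.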

Those wishing to perform similar functions on the parsed text can do so by placing the lemma form within a tag and making the contents of the POS tag the main text, as in Figure 8; this is most easily accomplished with TagText, or with the macro language of a word processor. By revising the initial setup file, a textbase can then be created with MakeBase in the same way as for the file used to analyze the lemma forms of a text. Further alterations can be made with a text editor or word processor to take into account punctuation and other structural features of the text which are not captured by parsing and lemmatizing, as seen in Figure 9.
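For readers without TagText or a word-processor macro language to hand, the same rearrangement might be sketched with a regular-expression substitution in Python. This is an illustrative stand-in rather than the method used for Figure 8: the input layout (RAW and POS tags followed by the untagged lemma) and the tag name LEMMA are assumptions, as are the sample tokens.

```python
import re

# Assumed (hypothetical) input layout: <RAW spelling><POS part-of-speech> lemma
TOKEN = re.compile(r"<RAW\s+([^>]+)>\s*<POS\s+([^>]+)>\s*(\S+)")

def pos_as_main_text(text):
    """Wrap the lemma in a LEMMA tag and promote the POS value to main text."""
    # \1 = raw spelling, \2 = part of speech, \3 = lemma
    return TOKEN.sub(r"<RAW \1><LEMMA \3> \2", text)

# Invented sample tokens in the assumed layout.
sample = "<RAW abandonne><POS verb> abandon\n<RAW tables><POS noun> table\n"

converted = pos_as_main_text(sample)
print(converted)
```

After the substitution, the part-of-speech label occupies the position formerly held by the lemma, so a textbase built from the converted file would be concorded on grammatical categories rather than on lemma forms.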

Though TACT's analysis programs offer functions which will assist in most computer-based studies -- those of style and authorship, those involving semantic, morphological, and syntactic analysis, those relying on word indexes and concordances, and others -- files which have been parsed and lemmatized with the preprocessing programs can also be easily modified and exported for use with other packages and on other platforms. As well, the results of studies completed with the analysis programs may be converted into forms which will allow their manipulation and representation by a variety of database, spreadsheet, and other statistical analysis software applications.
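As one hedged illustration of such an export, the Python sketch below rewrites tagged tokens as comma-separated values that a spreadsheet or statistics package could read. The input layout (RAW and POS tags followed by the untagged lemma), the column names, and the function `to_csv` are all assumptions introduced for this example and are not part of TACT.

```python
import csv
import io
import re

# Assumed (hypothetical) input layout: <RAW spelling><POS part-of-speech> lemma
TOKEN = re.compile(r"<RAW\s+([^>]+)>\s*<POS\s+([^>]+)>\s*(\S+)")

def to_csv(text):
    """Return the tagged tokens as CSV text with one row per token."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["raw", "pos", "lemma"])  # hypothetical column names
    for raw, pos, lemma in TOKEN.findall(text):
        writer.writerow([raw, pos, lemma])
    return buffer.getvalue()

# Invented sample tokens in the assumed layout.
sample = "<RAW abandonne><POS verb> abandon\n<RAW tables><POS noun> table\n"

exported = to_csv(sample)
print(exported)
```

A flat table of this kind discards the surrounding structure of the text, so it suits frequency counts and similar tabulations better than analyses that depend on context.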



[21] The time was well spent, however, for the process had the positive effect of normalising the spelling within the text for other types of analysis.