CHWP B.12 Lancashire, "English Renaissance Knowledge Base"

4. Tagging Palsgrave and Cotgrave with SGML/TEI

Figure 1 lists 53 tags employed in my encoding. These fall into four groups: entities (special graphic figures, brevigraphs, contractions and one diacritic, the soft hyphen), bibliographical tags (these mark page, line and column breaks and tag non-text blocks on the page), lexical tags (these identify the functions and language of symbols, words, phrases and sentences in the text, mostly within the dictionary entries themselves), and structural tags (these mark the functional divisions of the books as a whole).[9]

I use SGML entities to tag the ambiguous accented letters and brevigraphs that occur in pre-modern printing because a given sign does not necessarily stand for the same set of letters each time. Such signs cannot be handled as extensions of the basic English alphabet. The e-macron, for example, might represent en or em or even ett (as in sēd, stē and lrē). For this reason contracted letters have no fixed place in a collating sequence. Yet neither are they diacritics (marks that are part of a word but do not affect its alphabetical sorting). COCOA-tagging programs such as TACT have difficulty representing them.
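The collation problem can be illustrated with a minimal Python sketch. It is not the encoding scheme itself: the entity name `&emacr;` and the function `candidate_readings` are hypothetical, and the three expansions are simply those named in the paragraph above.

```python
# Hypothetical entity name for e-macron; the real entity set is in Figure 1.
ENTITY = "&emacr;"

# The expansions the text names for e-macron.
EXPANSIONS = ("en", "em", "ett")


def candidate_readings(token):
    """Return every spelling produced by expanding the first entity in token."""
    if ENTITY not in token:
        return [token]
    return [token.replace(ENTITY, expansion, 1) for expansion in EXPANSIONS]


# One encoded form yields several possible spellings, and each would sort
# at a different place in an alphabetical list.
readings = candidate_readings("s&emacr;d")
```

Because the entity leaves the choice open, a sort key can be computed only after an editor has settled on one reading.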

Bibliographical tags separate extra-textual information from the body of the text while ensuring that the information appears accurately. By identifying all extra-textual blocks on the page, I can ensure that searches of the main body of the text will not pick up matches in the running titles, catchwords, marginalia, etc. Note that for convenience's sake I also normalize page number, signature and foliation in the COCOA-style <page.break> tag. I employ five SGML tag attributes in this section: three in the <page.break> tag and two in the <ornament> tag.
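A minimal Python sketch of how such tags keep searches out of extra-textual blocks follows. Only <page.break> is a tag name taken from the text; the names `runtitle`, `catchword` and `margin`, the function `body_text` and the sample line are invented for illustration.

```python
import re

# Hypothetical names for extra-textual blocks; the real tags are in Figure 1.
EXTRA_TEXTUAL = ("runtitle", "catchword", "margin")


def body_text(tagged):
    """Strip extra-textual blocks and page-break tags, leaving the main text."""
    for name in EXTRA_TEXTUAL:
        # Remove the block together with its content.
        tagged = re.sub(rf"<{name}>.*?</{name}>", "", tagged, flags=re.DOTALL)
    # <page.break> is an empty tag, so only the tag itself is removed.
    return re.sub(r"<page\.break[^>]*>", "", tagged)


# Invented sample: a running title, a page break, body text and a catchword.
sample = ("<runtitle>The table of Verbes</runtitle><page.break>"
          "I Lye at a siege byfore a towne<catchword>I Lye</catchword>")

raw_hits = sample.count("Lye")              # counts the catchword as well
body_hits = body_text(sample).count("Lye")  # counts the main text only
```

Searching the raw string counts the catchword as a second match; searching the output of `body_text` does not.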

The lexical tags need some formal explanation on three points.

First, with TEI/SGML tagging it is possible to associate a given language globally with a given tag. For instance, the head lemmas and sample quotations in Cotgrave's dictionary are always French, and the meaning and translation always English. There is no need to tag this aspect of font or "rendition" each time. The DTD can assign language to tags automatically. On the other hand, where foreign-language words are cited within Palsgrave's discussion of French grammar -- prose uniformly written in English -- they have to be marked as foreign each time.[10]
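The idea of a default language per tag, overridden only where necessary, might be sketched in Python as below. In SGML the DTD would carry the defaults; the dictionary here only imitates that behaviour. The tag names `lemma` and `quote`, the attribute name `lang` and the function `language_of` are assumptions; <m> is the tag for a translational equivalent mentioned in note 9.

```python
# Assumed defaults for Cotgrave: lemmas and quotations French, meanings English.
DEFAULT_LANG = {"lemma": "fr", "quote": "fr", "m": "en"}


def language_of(tag, attrs=None):
    """Return the language of an element: explicit attribute first, else default."""
    attrs = attrs or {}
    if "lang" in attrs:
        # An explicit attribute wins, e.g. when directionality changes.
        return attrs["lang"]
    # Tags without a default (such as running English prose) return None
    # and must be marked each time.
    return DEFAULT_LANG.get(tag)
```

For example, `language_of("lemma")` yields the French default, while `language_of("lemma", {"lang": "en"})` yields the explicit override.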

Second, the method for tagging a textual variant (which could be a correction by the editor) involves a triplet of linked tags, <var> (which encloses the variant of the immediately preceding text), <rdg> (which specifies the variant reading) and <wit> (which states the source or witness of the variant reading). A complication occurs when the variant covers more than one word in the text, or involves a deletion. Anchor tags must be placed before and after the words in the text for which the variant exists, and 'start' and 'end' attributes must be added to the <var> tag. Without these delimiting tags the match between variant and text-varied-from cannot be made. Again, COCOA-tagging software has difficulty with this problem, although it is endemic to critical editions.
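The role of the anchors can be shown with a small Python sketch. The token stream, the anchor identifiers `a1` and `a2`, and the function `varied_span` are all hypothetical; the sketch shows only why the 'start' and 'end' values are needed to recover the words a variant applies to.

```python
def varied_span(stream, start, end):
    """Return the words lying between the anchors named start and end."""
    words = []
    inside = False
    for item in stream:
        if isinstance(item, tuple):  # an anchor, stored as ("anchor", id)
            if item[1] == start:
                inside = True
            elif item[1] == end and inside:
                return words
        elif inside:
            words.append(item)
    # Without both anchors the variant cannot be matched to the text.
    raise ValueError("anchors not found")


# Invented token stream with anchors placed around a two-word span.
stream = ["I", ("anchor", "a1"), "Lye", "at", ("anchor", "a2"), "a", "siege"]
span = varied_span(stream, "a1", "a2")
```

When the two anchors stand side by side, the function returns an empty list.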

Third, contractions and brevigraphs are expanded between the <expan> and </expan> tags. The attribute type gives the kind of abbreviation, orig gives the entity name for that abbreviated form, while the text itself contains the expanded form. Inquiries on the text, then, need not take into account the various abbreviated forms of a word, although the editorial information about the original reading and the editor's choice of expansion are always available.[11]
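A minimal Python sketch of the two views that <expan> makes possible is given below. The attribute names type and orig come from the description above; the attribute values, the entity name `emacr`, the assumption that orig holds the bare entity name, and the two function names are illustrative only.

```python
import re

# Matches <expan type="..." orig="...">expanded text</expan>.
EXPAN = re.compile(r'<expan type="([^"]*)" orig="([^"]*)">(.*?)</expan>')


def expanded(text):
    """Searchable view: keep only the editor's expansion."""
    return EXPAN.sub(lambda m: m.group(3), text)


def original(text):
    """Diplomatic view: restore the abbreviated form as an entity reference."""
    return EXPAN.sub(lambda m: f"&{m.group(2)};", text)


# Invented sample: e-macron expanded by the editor as "en".
sample = 's<expan type="brevigraph" orig="emacr">en</expan>d'
```

A search runs against `expanded(sample)`, while `original(sample)` recovers what stood on the page.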

Facsimile and tagged forms of two pages in Palsgrave's Lesclarcissement, one from the table of substantives, another from the table of verbs, may be seen in Figure 2, Figure 3, Figure 4 and Figure 5. Tags on these pages have been printed in non-proportional Courier (text is in proportional Times) to show how much bulk encoding adds to a file (consider the effects of tagging for morphology and part-of-speech). Minimization techniques within TEI/SGML can reduce the number of tags displayed, but the invisible tags remain available, like formatting codes hidden until the user activates WordPerfect's Reveal Codes function. It is very important to have the tags available; it is a matter of choice whether to display them or not. Were words to be tagged for part-of-speech and lemmatized form, non-proportional Courier would swamp the page. Since software can use tags without displaying them, however, the ratio of tags to text is not a concern.

Palsgrave's dictionary, barely three generations younger than printing itself, evades the hierarchical structure anticipated by TEI/SGML. For instance, its table of contents follows the first of three books, which is a lengthy introduction. Its 'dictionaries' appear in successive tables and take several forms, schemas for which appear in Figure 6. An entry for substantives looks straightforward, but an entry for a verb places the lemma by which the entry is apparently alphabetized inside the opening sentence or phrase (generally in the first two words) and permits recursion (repetition of parts of the entry) at a minimum of three points in the regular sequences. The structure looks more like a maze than a hierarchy. Consider the entry beginning "I Lye at a siege byfore a towne". Two translations follow, connected by or, and succeeded by a note about the conjugation of what may be one or both of the verbs tenir and assieger.

In contrast, Cotgrave's dictionary has a conventional hierarchical structure more suitable to a TEI/SGML Document Type Definition. Within the dictionary proper, head-lemmas always follow alphabetical letter headings, and phrase-lemmas always come after a related head-lemma. Because head-lemma entries regularly overlap page and column boundaries, in TEI/SGML terms Cotgrave's book offers two overlapping hierarchies: letter / head-lemma / phrase, and part / page / column. The structure of the head-lemma -- see Figure 7, Figure 8 and Figure 9 for a facsimile page, its tagged form, and a general schema -- has repeated sequences of tagged fields, as well as multiple paths. Perhaps a rigidly consistent system of rules exists, but only a thorough computer-aided study of the sequence of tagged fields in all entries would show what those rules are. The "meaning" tag especially is a grab-bag of explanations of word-sense, synonyms, commentary on grammatical and even historical topics, and straight translation.
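A computer-aided study of field sequences could begin with something like the following Python sketch, which reduces each entry to its sequence of opening tags and tallies how often each sequence occurs. The tag names `lemma`, `quote` and `trans`, the three skeleton entries and the function `field_sequence` are invented; no results about the real dictionary are implied.

```python
import re
from collections import Counter


def field_sequence(entry):
    """Return the sequence of opening tag names in a tagged entry."""
    # Closing tags begin with "</" and so do not match this pattern.
    return tuple(re.findall(r"<([A-Za-z][\w.]*)[^>]*>", entry))


# Invented skeleton entries; element content is elided as "...".
entries = [
    "<lemma>...</lemma><m>...</m>",
    "<lemma>...</lemma><m>...</m><quote>...</quote><trans>...</trans>",
    "<lemma>...</lemma><m>...</m>",
]

# Tally how many entries share each field sequence.
patterns = Counter(field_sequence(entry) for entry in entries)
```

Sorting the resulting tallies by frequency would show which sequences are regular and which are exceptional.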

I found the TEI/SGML guidelines to be a useful yardstick against which to work out the structure of both dictionaries. Both books resist a fixed structure, although entries are remarkably regular. The "cross-reference" sequence in Cotgrave, the word "as" (or another directive word) followed by another head-lemma, appears in several places, as if it were a called procedure. Cotgrave's treatment of phrases as sub-entries under the head-lemmas resembles his use of a sample quotation and its translation after the "meaning" section in the main entry.



[9] Not all these tags and their attributes appear in the samples. The list of tag attributes is also incomplete (e.g., dialectal character of a translational equivalent tagged by <m>...</m>).

[10] One can also assign a language attribute to each head lemma in the event that directionality changes within the text (e.g., English-to-French, French-to-English).

[11] For most purposes, expansions may be represented simply within square brackets.