CHWP B.12 Lancashire, "English Renaissance Knowledge Base"

2. Database Structure

The data structure imposed on the five dictionaries depends on their function in the RKB. In this paper I will describe a tagging system for only two of the books in the RKB, the dictionaries by Palsgrave and Cotgrave. Both have now been fully entered in machine-readable form, and their tagging (especially that of Palsgrave) has been in progress for several years.

The RKB markup system derives from preliminary guidelines issued in 1990 by the Text Encoding Initiative (TEI). These guidelines implement a subset of Standard Generalized Markup Language (SGML), a standard published by the International Organization for Standardization (ISO) (Sperberg-McQueen & Burnard, 1990).[5] The TEI implementation of SGML in its draft P1 guidelines is now being revised by dozens of scholars in many fields. A final version will be published in 1992. No other markup method promises to meet scholarly needs; in fact no competing markup system exists. No guidelines have ever been published for the COCOA markup (named from the earliest Oxford mainframe concordance system) employed by Oxford Concordance Program,[6] TACT and other text-retrieval software. These programs lack a formalism to express the complexities of literary texts.

The TEI markup scheme is independent of any commercial or shareware text-retrieval system, and even of SGML text editors. That is a strength, not a weakness. Unlike procedural markup, such as we use in word-processing programs like WordPerfect or scholarly text-formatting systems like TeX, SGML tags do not directly call for a procedure to be followed (e.g., italicizing a title). TEI markup is descriptive. It normally describes the function of a piece of text rather than indicating what should be done with that text, although, within SGML, tags can be created to indicate what TEI calls "rendition", that is, the appearance of the text on the page. Unlike the non-procedural markup employed by popular software now in use for text retrieval and analysis, TEI tagging handles text in all languages. Like TACT markup but unlike WordCruncher markup,[7] TEI can tag different kinds of text that turn up randomly or intermittently, such as speeches, speech prefixes and stage directions. Unlike TACT markup but somewhat like WordCruncher markup, TEI expects that every text's structure have a "grammar" that can be parsed: it expects that a hierarchical structure will be assigned to a text, even if it be a simple two-level one that in effect recognizes only a lattice-like structure.
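The difference between descriptive and procedural markup can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the element names (speaker, speech, stage), the sample line, and the two "style" tables are hypothetical and are not taken from the TEI guidelines or from the RKB. The regular expression handles only flat, unnested elements and is in no sense an SGML parser. The point is simply that one descriptively tagged text can be handed to different local procedures without being retagged.

```python
import re

# A hypothetical descriptively tagged fragment: each tag says what the
# text is (speech prefix, speech, stage direction), not how it looks.
SAMPLE = ("<speaker>Ham.</speaker> <speech>To be, or not to be</speech> "
          "<stage>Exit.</stage>")

# Two made-up local "styles". Each maps a descriptive tag to whatever
# procedure a particular program wants to apply to that kind of text.
PRINT_STYLE = {
    "speaker": lambda s: s.upper(),    # set speech prefixes in capitals
    "speech": lambda s: s,             # leave speeches unchanged
    "stage": lambda s: "[" + s + "]",  # bracket stage directions
}
INDEX_STYLE = {
    "speaker": lambda s: "",           # an index of spoken words omits prefixes
    "speech": lambda s: s.lower(),     # normalize case for lookup
    "stage": lambda s: "",             # and omits stage directions
}

# Matches one flat element such as <speech>...</speech> (no nesting).
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>")


def render(text, style):
    """Apply the style's procedure to every tagged element in the text."""
    def apply(match):
        name, content = match.group(1), match.group(2)
        # Tags unknown to this style pass their content through untouched.
        return style.get(name, lambda s: s)(content)
    # Collapse the whitespace left behind by elements rendered as "".
    return " ".join(TAGGED.sub(apply, text).split())


print(render(SAMPLE, PRINT_STYLE))
print(render(SAMPLE, INDEX_STYLE))
```

With procedural markup the capitals and brackets would have been written into the text itself, and the second use (an index of spoken words only) would require stripping them out again by hand.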

All software has limitations, but an academic markup scheme should have none. It should be able to reflect the complexities of texts without being compromised by local implementations of those texts for specific retrieval, analysis or editorial programs. TEI produces what is called an interchange format, that is, a set of tagging guidelines suitable for passing electronic texts around from one local system to another.

What is the point of encoding a text with SGML when most local software cannot handle that protocol? How individuals answer this question will depend on what is most important to them: the usability of the electronic text by others (i.e. considerations of 'scholarly publication'), or personal convenience. In my view, texts tagged for scholarly reasons will often contain more information than any local software can process. Editing 'down' (discarding tags that a given program cannot use) hurts nothing, while editing 'up' (restoring information that was never recorded) is impossible. At Toronto a text in TEI format will be transformed automatically into a form suitable for processing with TACT, a local text-retrieval system.
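Editing 'down' can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Toronto conversion itself, whose details are not given here. The element names, the sample text, and the KEEP table that maps them to one-letter categories are hypothetical; the output imitates the general look of COCOA-style references (a category letter and a value inside angle brackets) and may not match what TACT actually requires.

```python
import re

# Hypothetical mapping from descriptive element names to one-letter,
# COCOA-style categories. Elements not listed here are assumed to be
# unknown to the target program.
KEEP = {"act": "A", "speaker": "S"}

# Matches one element such as <emph>...</emph>; the loop in edit_down
# takes care of elements nested inside other elements.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>")


def convert(match):
    """Turn a kept element into a reference; unwrap any other element."""
    name, content = match.group(1), match.group(2)
    if name in KEEP:
        return "<%s %s>" % (KEEP[name], content)
    # The tag is dropped but its text survives.
    return content


def edit_down(text):
    """Repeat the substitution until no descriptive tags remain."""
    previous = None
    while previous != text:
        previous = text
        text = ELEMENT.sub(convert, text)
    return text


SAMPLE = ("<act>3</act> <speaker>Ham.</speaker> "
          "<speech>To be, or <emph>not</emph> to be</speech>")

print(edit_down(SAMPLE))
```

The result keeps every word of the text but no trace of the speech or emph elements, which is why the reverse conversion cannot be automated: nothing in the simpler form records where those elements began and ended.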

If an electronic text belongs to one person alone and will never be used by others, then the markup chosen for it need only meet the specifications of the software he or she uses. Yet that circumstance almost never arises any more where scholarly editions are concerned. For 20 years, editors have used COCOA or WordCruncher syntax for tagging but have chosen their tag types and tag tokens privately (sometimes with indecipherable abbreviations), without reference to any general consensus about what would work best for scholars. TEI seeks that interdisciplinary consensus for a tagging syntax independent of the life-cycle of specific pieces of hardware and software.

[5] This publication has been sponsored by the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. The full-text databases published by Chadwyck-Healey, for instance, will employ TEI encoding. For an introduction to SGML, see Goldfarb, 1990.

[6] Micro-OCP is distributed by Oxford Electronic Publishing, Oxford University Press, Walton Street, Oxford OX2 6DP, UK; and in North America by OUP, 200 Madison Ave., New York, NY 10016, USA.

[7] TACT and WordCruncher are interactive text-retrieval programs.