2. Text data modelling
Conventionally a database is interpreted as a repository of data that, taken as a whole,
constitutes a model of (some aspects of) an enterprise. A text database, on the other hand, is a
model of one or more texts, which in turn model some aspects of reality (Figure 2). Thus, text databases are used not only for information retrieval
(e.g., "What types of monkeys are found in Brazil?"), but also for editorial work and
lexical analysis (e.g., "Which words are defined using 'of or pertaining to'?"). That is,
while some queries ask about reality, others ask about the text itself (Tompa & Raymond, 1991).
Furthermore, besides supporting retrieval, a text database must provide mechanisms
for update and revision, and for formal publication and other forms of dissemination.
We need to preserve text 'as written' and to transmit such text from process to process and
from machine to machine. Therefore, to indicate the significant units within a text (e.g., the textual
extent of an etymology), we have chosen to represent the data using text markup (Coombs,
Renear & DeRose, 1987). Three distinct forms of markup are possible: presentational,
procedural, and descriptive.
2.1. Presentational markup
This form of text representation, also known as "what you see is what you get" or
WYSIWYG, uses typography and layout to indicate textual sub-units.
Macco (mæ·ko). ? Obs. [? A variant spelling of MACAO.] A
gambling game; = MACAO. 1809 BYRON in Moore Life (1875) 143 When
macco (or whatever they spell it) was introduced. 1825 Sporting
Mag. XVI. 277 A rubber of whist, or a game of Macco. 1859
Ironically, through the adoption of standard printing conventions, this form of markup makes it
difficult to distinguish types of text units algorithmically. For example, within the citation to
Sporting Mag., where does the location information end and the text itself begin? Consider
the difficulty when the last piece of location information is the roman numeral I or the first
word in the text is the pronoun I. Furthermore, the string "MACAO" and the
string "BYRON" have similar form, but the former is a cross-reference to a dictionary
entry whereas the latter is the name of a cited author. The system would find it difficult to satisfy
a user who wished to retrieve all citations for Lord Bacon without accidentally retrieving cross-references to pork.
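The ambiguity can be made concrete with a minimal sketch. It assumes the printed entry has been flattened to plain text with small capitals rendered as upper case, as in the example in this section; the pattern SMALL_CAPS is a hypothetical heuristic introduced only for illustration, not part of any OED software.

```python
import re

# Plain-text rendering of the printed entry (small capitals flattened to
# upper case, as in the example in this section).
entry = (
    "Macco (mæ·ko). ? Obs. [? A variant spelling of MACAO.] A "
    "gambling game; = MACAO. 1809 BYRON in Moore Life (1875) 143 When "
    "macco (or whatever they spell it) was introduced. 1825 Sporting "
    "Mag. XVI. 277 A rubber of whist, or a game of Macco."
)

# Hypothetical heuristic: treat any run of two or more capitals as a "name".
SMALL_CAPS = re.compile(r"\b[A-Z]{2,}\b")

candidates = SMALL_CAPS.findall(entry)
print(candidates)  # ['MACAO', 'MACAO', 'BYRON', 'XVI']
```

The pattern returns the cross-reference, the author, and a roman numeral alike: nothing in the typography tells a program which of the matches is a cited author.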
2.2. Procedural markup
An alternative representation uses tags in the text to indicate font shifts and spacing:
interpreting the procedural markup converts the tagged text into a corresponding presentational
form. This form of tagging is used internally in most word processors as well as for typesetting
tapes and to control mainframe typesetting systems. The following example is adapted from the
keying conventions used for the OED.
+L +B Macco +R +N (m+23 +11 k+I o+R
). ?+I +0 Obs. +OB ? A variant spelling of +SC
Macao.+EB +0 A gambling game; +29 +0 +SC Macao. +PP
+S +B 1809 +SC Byron +R in Moore +I Life +R
(1875) 143 When macco (or whatever they spell it) was introduced. +0 +B
1825 +I Sporting Mag. +R XVI. 277 A rubber of whist, or a game
of Macco. +0 +B 1859 +SC Thackeray ...
Unfortunately, the difficulties of presentational markup remain, because the tags still record appearance rather than role. Furthermore, although each
typographically distinct field is marked at its start, the extent of the field now has to be deduced
from the starting point of the next field (e.g., the end of the date field for the first citation is
indicated by +SC whereas the end of the date field for the next citation is indicated
by +I). This places a potentially complicated pattern-matching burden on all programs
that must extract fields from the text.
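The pattern-matching burden can be illustrated with a short sketch that splits the keyed text into runs and reports which code happens to terminate each date. The meaning of the individual codes is not assumed here; the helper runs and the rule "a four-digit run after +B is a date" are hypothetical, chosen only to fit the fragment shown above.

```python
import re

# Citation portion of the keyed example shown above.
keyed = (
    "+S +B 1809 +SC Byron +R in Moore +I Life +R "
    "(1875) 143 When macco (or whatever they spell it) was introduced. +0 +B "
    "1825 +I Sporting Mag. +R XVI. 277 A rubber of whist, or a game "
    "of Macco."
)

# A code is "+" followed by letters or digits; everything else is text.
CODE = re.compile(r"\+([A-Z]+|\d+)")

def runs(text):
    """Yield (code, body, next_code) for each non-empty run after a code."""
    matches = list(CODE.finditer(text))
    for i, m in enumerate(matches):
        has_next = i + 1 < len(matches)
        end = matches[i + 1].start() if has_next else len(text)
        next_code = matches[i + 1].group(1) if has_next else None
        body = text[m.end():end].strip()
        if body:
            yield m.group(1), body, next_code

# Assumption for this fragment only: a four-digit run after +B is a date.
dates = [(body, next_code) for code, body, next_code in runs(keyed)
         if code == "B" and re.fullmatch(r"\d{4}", body)]
print(dates)  # [('1809', 'SC'), ('1825', 'I')]
```

The two dates are closed by different codes, so a program cannot look for a fixed end marker; it has to know every code that may follow a date.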
2.3. Descriptive markup
Just as for procedural markup, the third form of text markup uses tags to delimit units of text.
However, the name of each tag is chosen to indicate the role of each unit in the text rather
than indicating how it is to appear in print.
Notice that each field is delimited at both ends, and that the uses of cross-reference tags
(XR and XL) vs. author tags (A) distinguish the role
of "MACAO" from the role of "BYRON". This is the form of markup
chosen for the OED: a partial list of tags is given in Figure 3.
<ET>? A variant spelling of
<S6> <DEF>A gambling game; =
<A>Byron</A> in Moore
(1875) 143 <T>When macco (or whatever they spell it) was
<Q><D>1825</D> <W>Sporting Mag.</W>
XVI. 277 <T>A rubber
of whist, or a game of Macco.</T></Q>
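Because each field is delimited at both ends, extraction reduces to looking up a tag name. The following is a minimal sketch using the 1825 quotation and the author field from the example above; the helper fields is hypothetical and assumes that elements of the same name are never nested inside one another.

```python
import re

# The 1825 quotation and the author field, copied from the example above.
quotation = (
    "<Q><D>1825</D> <W>Sporting Mag.</W> XVI. 277 "
    "<T>A rubber of whist, or a game of Macco.</T></Q>"
)
author_field = "<A>Byron</A> in Moore"

def fields(tag, text):
    """Return the contents of every <tag>...</tag> element in text.

    Assumes elements with the same name are not nested in each other.
    """
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)

print(fields("D", quotation))     # ['1825']
print(fields("W", quotation))     # ['Sporting Mag.']
print(fields("T", quotation))     # ['A rubber of whist, or a game of Macco.']
print(fields("A", author_field))  # ['Byron']
print(fields("XR", author_field)) # []
```

A query for cited authors now asks only for A elements, so a cross-reference tagged XR cannot be returned by mistake, whatever its typographic form.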