CHWP A.9 | Tirvengadum, "Linguistic Fingerprints and Literary Fraud" |

It is not surprising that correlation between
all the books is high and significant for, as has already been pointed
out by Burrows, the shape of the English language makes it impossible for
certain word-types like *the* (*le, la, l'* and *les* in French)
and *of* (*de* in French) to slip towards the bottom of the list,
to be replaced at the top with words like *more* (*plus* in French;
Burrows 1989: 313). Burrows' observation certainly
applies to the French language as well.

However, the highest correlation occurs when
books written by the same author are compared. For example, *L'Immoraliste*
and *La Porte étroite* (both written by Gide) correlate the
highest with each other, at 0.9563. *Clair de femme* and *Au-delà
de cette limite votre ticket n'est plus valable* -- both published under
the Gary name -- have a high correlation, of 0.9542. High correlation also
occurs between these two Gary novels and *Gros-Câlin* (the first
Ajar novel), but not with *La Vie devant soi*. *La Vie devant soi*
correlates the highest with *Gros-Câlin* (0.8204), followed
closely by *L'Étranger* (with a correlation of 0.8204).
*La Vie devant soi* shows a lower correlation with all the other novels,
ranging between 0.66 (Engwall) and 0.82 (*Gros-Câlin*).
Its correlations with the two Gary novels are 0.78 and 0.80 respectively (while the correlation
between the two Gary novels, as noted above, is at 0.95). This suggests
that *La Vie devant soi* is quite different from the other Gary novels.
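The correlation figures discussed above are Pearson coefficients computed over the frequencies of the same word-types in two texts. A minimal sketch of that calculation follows; the function name `pearson` and the frequency values are hypothetical illustrations, not data or code from this study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical frequencies of five word-types in two texts
# (illustrative values only, NOT figures from the paper).
text_a = [31.2, 27.5, 18.9, 12.4, 9.8]
text_b = [29.7, 28.1, 17.2, 13.9, 8.5]
r = pearson(text_a, text_b)
print(round(r, 4))
```

Because the most frequent word-types rank similarly in any two texts of the same language, a coefficient computed this way tends to be high even for different authors, which is why the differences between coefficients matter more than their absolute size.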

When the list of sixty words is condensed to a context-free list (i.e. when all person markers, pronouns, verbs, etc., are removed from the list), the correlations shown in Table 5 are obtained.

We observe once more that the highest correlation
occurs between books written by the same author: *L'Immoraliste* and
*La Porte étroite* have a high correlation of 0.9682; *Clair
de femme* and *Au-delà de cette limite votre ticket n'est plus
valable* show a correlation of 0.9760. *Gros-Câlin* and *La
Vie devant soi*, however, show a correlation of 0.8739. In fact, there
is a stronger correlation between *La Vie devant soi* and *L'Étranger*,
at 0.87475, than between *La Vie devant soi* and the other Gary novels.

The Engwall corpus shows the lowest correlation
of all. This is probably due to the fact that this corpus comprises a selection
of passages from 25 novels, and is not confined to intradiegetic novels.
It is interesting to note, however, that it correlates the highest with *L'Étranger*
(0.6626), followed by *La Vie devant soi* (0.6152). The Pearson correlation
tests of relationship between the novels and the Engwall corpus show that
*La Vie devant soi* is different from the Gary novels as well as all
the other novels.

The next test that we will use in this paper to determine the degree of similarity between the novels and the Engwall corpus is the chi-squared test.

The null hypothesis in this case is that the high-frequency words should be similarly distributed among the four Gary novels. One should note that there are a few restrictions governing the use of this test: firstly, no expected value may be below 5 and, secondly, the test cannot be applied to relative frequencies, which are proportions. The test is therefore applied to absolute values only. Table 6 gives chi-squared values for each type in each novel and the Engwall corpus.
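The calculation behind such a table can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for Table 6: it assumes a contingency table with one row per word-type and one column per text, derives expected values from the row and column totals, and checks the restriction that no expected value falls below 5. The function name and the counts are hypothetical.

```python
def chi_squared_contributions(table):
    """Return (expected, contributions) for a table of absolute counts.

    table[i][j] is the absolute count of word-type i in text j
    (an assumed layout; the paper's own table may be organised differently).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("table contains no observations")
    # Expected count under the null hypothesis of identical distribution.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    if min(e for row in expected for e in row) < 5:
        raise ValueError("an expected value is below 5; the test is not applicable")
    # Each cell contributes (O - E)^2 / E to the chi-squared statistic.
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, contributions

# Hypothetical absolute counts for two word-types in two texts
# (illustrative values only, NOT figures from the paper).
counts = [[120, 90],
          [80, 110]]
expected, contrib = chi_squared_contributions(counts)
per_text = [sum(col) for col in zip(*contrib)]  # one total per text
total = sum(per_text)
print([round(v, 2) for v in per_text], round(total, 2))
```

Summing the contributions column by column gives one total per text, which is the kind of figure compared across novels in the discussion that follows.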

To interpret these results, one must note that any chi-squared value that is less than 3.84 is dismissed as attributable to chance. Any value that falls between 3.84 and 6.62 is somewhat significant, as it has only a 5% likelihood of having occurred by chance alone. Values that fall between 6.63 and 10.82 are significant, having a 1% likelihood of having occurred by chance alone. Values of 10.83 and above are very highly significant, having a likelihood of one in a thousand of occurring by chance alone. (All significant values are shown in bold letters in the table.)
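The thresholds quoted above coincide with the critical values of the chi-squared distribution with one degree of freedom at the 5%, 1% and 0.1% levels. The sketch below checks this, assuming one degree of freedom (the text does not state the degrees of freedom explicitly), and applies the same bands to a value. The function names and the band labels' wording are illustrative.

```python
from math import erfc, sqrt

def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared value with 1 degree of freedom.

    With 1 degree of freedom the statistic is the square of a standard
    normal variable, so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("a chi-squared value cannot be negative")
    return erfc(sqrt(chi2 / 2.0))

def significance_label(chi2):
    """Classify a value using the bands described in the text."""
    if chi2 < 3.84:
        return "not significant"
    if chi2 < 6.63:
        return "somewhat significant"
    if chi2 < 10.83:
        return "significant"
    return "very highly significant"

for threshold in (3.84, 6.63, 10.83):
    print(threshold, round(p_value_df1(threshold), 4), significance_label(threshold))
```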

*Ticket* has 32 significant χ^{2}
values at the 0.01% level, *Clair* has 30, *Câlin* 44,
*Vie* 49, *L'Étranger* 39, *L'Immoraliste* 35, *La Porte
étroite* 44, *Vipères* 43, Engwall 48 and *La Vie
devant soi* has 49 significant χ^{2}
values. Total chi-squared values range between 908.82 (*Ticket*) and
5159.77 (*La Vie devant soi*). When context-sensitive words are removed
from the list, the global chi-squared values are as follows: *Ticket*
245.54; *Clair* 355.63; *Câlin* 666.91; *Vie* 2359.85;
*L'Étranger* 388.64; *Vipères* 684.37; *L'Immoraliste* 525.1; *Porte* 1025.9; Engwall 798.22. Once more, *La Vie devant
soi* presents the highest χ^{2} value.

When the Engwall corpus is removed from
the group, thus allowing comparisons between the 8 novels alone, the following
χ^{2} values are obtained: *Ticket* 771.52; *Clair* 979.73; *Câlin*
1281.3; *Vie* 3867.26; *L'Étranger* 2628.26; *Vipères* 1698.44; *L'Immoraliste* 1349.88; *Porte* 2157.64. Once
more, *La Vie devant soi* presents the highest significant value.
When context-sensitive words are removed from the list, the χ^{2} results are as follows: *Clair* 267.78; *Étranger* 327.19;
*Ticket* 327.41; *L'Immoraliste* 459.99; *Câlin* 528.49;
*Vipères* 693.84; *Porte* 809.99; *Vie* 1675.6.
The highest significant value is again that of *La Vie devant soi*.

When we apply the "goodness of fit" test
to the data, the theoretical values can now be represented by the two Gary
novels *Clair* and *Ticket*. By applying the chi-squared formula
(O-E)^{2}/E, we get the results shown in Table
7.

In this table, *Ex. Val.* means the
expected value (i.e. the average value for the two Gary novels). Again,
the total chi-squared values are all very high. However, *Gros-Câlin*,
the first Ajar novel, corresponds the most with the two Gary novels, whereas
*La Vie devant soi* and *L'Étranger* correspond the least
with the two Gary novels. In fact, there is more similarity between the
two Gary novels and *L'Immoraliste*, *La Porte étroite*,
*Le Noeud de vipères* and the Engwall corpus than between
the two Gary novels and *La Vie devant soi*. We also observe similarities
between *L'Étranger* and *La Vie devant soi*.
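The goodness-of-fit calculation described above can be sketched as follows, taking the expected value for each word-type to be the average of its counts in the two Gary novels, as the *Ex. Val.* column does. The counts are hypothetical, and the sketch assumes that the texts are of comparable length so that raw counts can be compared directly; how the study handled differences in length is not stated here.

```python
def goodness_of_fit(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical absolute counts of three word-types
# (illustrative values only, NOT figures from the paper).
clair = [100, 60, 40]
ticket = [120, 40, 60]
# Expected value: the average of the two Gary novels for each word-type.
expected = [(a + b) / 2 for a, b in zip(clair, ticket)]
candidate = [121, 45, 60]  # counts in the novel being compared
stat = goodness_of_fit(candidate, expected)
print(round(stat, 2))
```

A smaller total indicates a closer fit to the two reference novels; a larger total indicates a greater departure from them.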

When this list is condensed to a context-free
list, we get the following chi-squared results: *Câlin* 414,
*Vie* 1084, *L'Étranger* 460, *Vipères* 500,
*L'Immoraliste* 580, *Porte* 940, Engwall 300. Once more, *Gros-Câlin*
is the most like the two Gary novels, whereas *La Vie devant soi* shows
the least similarity with these two novels.

Most of the chi-squared tests attempted
so far show that *La Vie devant soi* displays the least similarity
with the control group, be it represented by Engwall or the two Gary novels
*Ticket* and *Clair de femme*.

Using the chi-squared test on these pairs of synonyms, I shall test the degree to which the occurrence of these paired words differs from one novel to another. The chi-squared results are shown in Table 9.

Chi-squared values for each individual
word are indicated in the table. Those that are significant are shown in
bold letters. The last row shows total chi-squared values for each novel
and the Engwall corpus, ranging from 52.36 (*Clair*) to 746.42 (*La
Vie devant soi*), indicating that the difference between the frequency
of these synonyms is smaller in *Clair de femme* and larger in *La
Vie devant soi*.

When a chi-squared test on the four Gary
novels (using the contingency table) is done, the following results are
obtained. The least significant value is indicated by *Gros-Câlin*
(76.75), followed by *Ticket* (104.11) and *Clair* (121.61).
The most significant value is indicated by *La Vie devant soi* (255.34).
When the three novels -- *Clair*, *Ticket* and *Câlin* --
are compared, the following chi-squared values are obtained: 44.14
(*Clair*), 33.61 (*Ticket*) and 31.45 (*Câlin*). Total
chi-squared value for the three novels is 122.2, indicating once more that
the paired words do not occur at the same frequency in the three novels.

When the same test is done on *Clair*,
*Ticket* and *La Vie*, the following chi-squared values are obtained:
155.9 (*Clair*), 128.69 (*Ticket*) and 185.61 (*La Vie*).
The total chi-squared value for all three novels is 470.20, which means
that there is a greater difference between the frequency at which these
paired words occur in *La Vie* and the two Gary novels,
taken together, than between *Gros-Câlin* and the two Gary novels
taken together; in turn, this indicates that *La Vie devant soi* stands
apart from the other three Gary novels.

While *Gros-Câlin*, the first
Ajar novel, closely resembles the two Gary novels, *La Vie devant soi*
is so significantly different from the two Gary novels that it could have
been written by another author. It would appear that Gary did not feel
the need to change his style drastically in *Gros-Câlin*, his
first Ajar novel, feeling confident that nobody would make the connection
between him and Ajar. But, when critics saw similarities between *Gros-Câlin*
and the Gary novels, he became increasingly paranoid and wrote *La Vie
devant soi*, intent on proving that he was not Ajar. In so doing, he
consciously or unconsciously changed the genetic fingerprint of the Gary
style in that novel.

In this way, the findings in this paper question the notion that function words (and synonyms) constitute a genetic fingerprint of an author's style.