CHWP A.9 Tirvengadum, "Linguistic Fingerprints and Literary Fraud"

5. Pearson Correlation

The Pearson correlation coefficient is a precise measure of the way in which two variables correlate. Its value indicates both the direction (positive or negative) and the strength of the linear relationship between the two variables. The value +1 indicates a perfect positive correlation and the value -1 a perfect negative correlation, whereas a value of 0 indicates no linear correlation at all. Any value between 0 and +1, or between -1 and 0, shows some degree of correlation. A correlation of 0.75340, for example, means that approximately 56.76% of the variation in one variable is accounted for by the other, since the correlation coefficient is the square root of the proportion of variation explained in a linear regression. Table 4 shows the correlations among the novels and the Engwall corpus when the 60 most frequent words are compared to one another.
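The calculation described above can be sketched as follows. This is only a minimal illustration: the function name pearson and the two short frequency lists are hypothetical and are not taken from Table 4.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical frequencies of five frequent words in two texts (illustrative only)
text_a = [41.2, 33.5, 28.1, 19.4, 15.0]
text_b = [39.8, 35.1, 25.3, 21.0, 13.2]

r = pearson(text_a, text_b)
print("r =", round(r, 4), " r squared =", round(r * r, 4))

# The figure quoted in the text: squaring r = 0.75340 gives about 0.5676
print(round(0.75340 ** 2, 4))
```

Squaring the coefficient is what links a value such as 0.75340 to the figure of 56.76% quoted in the text.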

It is not surprising that correlation between all the books is high and significant for, as has already been pointed out by Burrows, the shape of the English language makes it impossible for certain word-types like the (le, la, l' and les in French) and of (de in French) to slip towards the bottom of the list, to be replaced at the top with words like more (plus in French; Burrows 1989: 313). Burrows' observation certainly applies to the French language as well.

However, the highest correlation occurs when books written by the same author are compared. For example, L'Immoraliste and La Porte étroite (both written by Gide) correlate the highest with each other, at 0.9563. Clair de femme and Au-delà de cette limite votre ticket n'est plus valable -- both published under the Gary name -- have a high correlation, of 0.9542. High correlation also occurs between these two Gary novels and Gros-Câlin (the first Ajar novel), but not with La Vie devant soi. La Vie devant soi correlates the highest with Gros-Câlin (0.8204), followed closely by L'Étranger (with a correlation of 0.8204). La Vie devant soi shows a lower correlation with all the other novels, ranging between 0.66 (Engwall) and 0.82 (Gros-Câlin). Its correlations with the two Gary novels are 0.78 and 0.80 respectively (while the correlation between the two Gary novels, as noted above, is 0.95). This suggests that La Vie devant soi is quite different from the other Gary novels.

When the list of sixty words is condensed to a context-free list (i.e. when all person markers, pronouns, verbs, etc., are removed from the list), the correlations shown in Table 5 are obtained.

We observe once more that the highest correlation occurs between books written by the same author: L'Immoraliste and La Porte étroite have a high correlation of 0.9682; Clair de femme and Au-delà de cette limite votre ticket n'est plus valable show a correlation of 0.9760. Gros-Câlin and La Vie devant soi, however, show a correlation of only 0.8739. In fact, there is a stronger correlation between La Vie devant soi and L'Étranger, at 0.87475, than between La Vie devant soi and the other Gary novels.

The Engwall corpus shows the lowest correlation of all. This is probably due to the fact that this corpus comprises a selection of passages from 25 novels, and is not confined to intradiegetic novels. It is nevertheless interesting to note that it correlates the highest with L'Étranger (0.6626), followed by La Vie devant soi (0.6152). The Pearson correlation tests of the relationship between the novels and the Engwall corpus show that La Vie devant soi is different from the Gary novels as well as from all the other novels.

The next test that we will use in this paper to determine the degree of similarity between the novels and the Engwall corpus is the chi-squared test.

6. Chi-squared Test (chi2)

The chi-squared test is a test of probability whose function is to establish whether a discrepancy of a given size is small enough to be dismissed as having occurred by chance alone. The basis lies in testing a "null hypothesis", in which the actual result is compared with the expected result. The null hypothesis is upheld when the observed values do not depart significantly from the expected values. The formula for this test is as follows: chi2 = sum of (O - E)^2 / E, where each term (O - E)^2 / E is the contribution of a single cell. The letter O signifies an observed value and the letter E signifies an expected value.

The null hypothesis in this case is that the high frequency words should be similarly distributed among the four Gary novels. One should note that there are a few restrictions governing the use of this test -- firstly, that no expected value may be below 5 and, secondly, that this test cannot be applied to relative frequencies, which constitute a proportion. The test is therefore applied to absolute values only. Table 6 gives chi-squared values for each type in each novel and the Engwall corpus.
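A minimal sketch of the calculation follows. It assumes that expected values are derived in the usual way from the row and column totals of a contingency table of absolute counts; the paper does not spell out this step, so that is an assumption, and the function name chi_squared_cells and the small table are hypothetical.

```python
def chi_squared_cells(table):
    """Return (expected, contributions) for a contingency table of absolute counts.

    table: list of rows (one per word type), each a list of counts (one per text).
    Assumption: E = row total * column total / grand total for every cell.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Restriction mentioned in the text: no expected value may be below 5
    for exp_row in expected:
        for e in exp_row:
            if e < 5:
                raise ValueError("expected value below 5; the test is not applicable")
    # Per-cell contribution (O - E)^2 / E
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, contributions

# Hypothetical absolute counts of two word types in two texts (illustrative only)
counts = [
    [120, 90],
    [80, 110],
]
expected, contributions = chi_squared_cells(counts)
per_text_totals = [sum(col) for col in zip(*contributions)]
total_chi2 = sum(per_text_totals)
print(expected)
print([round(t, 2) for t in per_text_totals], round(total_chi2, 2))
```

Summing the contributions down each column gives one total per text, which is the kind of per-novel total reported in the discussion of the tables.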

To interpret these results, one must note that any chi-squared value that is less than 3.84 is dismissed as being attributable to chance (these thresholds are the critical values for one degree of freedom). Any value that falls between 3.84 and 6.62 is somewhat significant, as it has at most a 5% likelihood of having occurred by chance alone. Values that fall between 6.63 and 10.82 are significant, having at most a 1% likelihood of having occurred by chance alone. Values of 10.83 or above are very highly significant, having a likelihood of at most one in a thousand of occurring by chance alone. (All significant values are shown in bold letters in the table.)
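The bands above can be expressed as a small lookup. This is only an illustrative sketch; the function name significance_band is hypothetical, and the thresholds are the ones quoted in the text, which apply to one degree of freedom.

```python
def significance_band(chi2):
    """Classify a chi-squared value using the thresholds quoted in the text."""
    if chi2 < 0:
        raise ValueError("a chi-squared value cannot be negative")
    if chi2 >= 10.83:
        return "very highly significant (0.1% level)"
    if chi2 >= 6.63:
        return "significant (1% level)"
    if chi2 >= 3.84:
        return "somewhat significant (5% level)"
    return "not significant"

# Hypothetical values, chosen only to show one example per band
for value in (2.10, 4.51, 7.90, 25.00):
    print(value, "->", significance_band(value))
```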

Ticket has 32 significant chi2 values at the 0.01% level, Clair has 30, Câlin 44, Vie 49, L'Étranger 39, L'Immoraliste 35, La Porte étroite 44, Vipères 43 and Engwall 48. Total chi-squared values range between 908.82 (Ticket) and 5159.77 (La Vie devant soi). When context-sensitive words are removed from the list, the global chi-squared values are as follows: Ticket 245.54; Clair 355.63; Câlin 666.91; Vie 2359.85; L'Étranger 388.64; Vipères 684.37; L'Immoraliste 525.1; Porte 1025.9; Engwall 798.22. Once more, La Vie devant soi presents the highest chi2 value.

When the Engwall corpus is removed from the group, thus allowing comparisons between the 8 novels alone, the following chi2 values are obtained: Ticket 771.52; Clair 979.73; Câlin 1281.3; Vie 3867.26; L'Étranger 2628.26; Vipères 1698.44; L'Immoraliste 1349.88; Porte 2157.64. Once more, La Vie devant soi presents the highest significant value. When context-sensitive words are removed from the list, the chi2 results are as follows: Clair 267.78; Étranger 327.19; Ticket 327.41; L'Immoraliste 459.99; Câlin 528.49; Vipères 693.84; Porte 809.99; Vie 1675.6. The highest significant value is again that of La Vie devant soi.

When we apply the "goodness of fit" test to the data, the theoretical values can now be represented by the two Gary novels Clair and Ticket. By applying the chi-squared formula (O - E)^2 / E, we get the results shown in Table 7.

In this table, Ex. Val. means the expected value (i.e. the average value for the two Gary novels). Again, the total chi-squared values are all very high. However, Gros-Câlin, the first Ajar novel, corresponds the most with the two Gary novels, whereas La Vie devant soi and L'Étranger correspond the least with them. In fact, there is more similarity between the two Gary novels and L'Immoraliste, La Porte étroite, Le Noeud de vipères and the Engwall corpus than between the two Gary novels and La Vie devant soi. We also observe similarities between L'Étranger and La Vie devant soi.
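The goodness-of-fit calculation can be sketched as follows, taking the expected value for each word as the average of its counts in two reference texts, as the definition of Ex. Val. indicates. The function name goodness_of_fit and all counts are hypothetical. The sketch also assumes that the counts are already comparable across texts, since the paper does not say how differences in text length were handled.

```python
def goodness_of_fit(observed, reference_a, reference_b):
    """Total chi-squared of `observed` against the mean of two reference count lists."""
    if not (len(observed) == len(reference_a) == len(reference_b)):
        raise ValueError("all three lists must cover the same word types")
    total = 0.0
    for o, a, b in zip(observed, reference_a, reference_b):
        expected = (a + b) / 2  # "Ex. Val.": average of the two reference texts
        if expected < 5:
            raise ValueError("expected value below 5; the test is not applicable")
        total += (o - expected) ** 2 / expected
    return total

# Hypothetical absolute counts for three word types (illustrative only)
reference_a = [100, 60, 40]   # stands in for one reference novel
reference_b = [120, 40, 40]   # stands in for the other reference novel
candidate = [99, 60, 44]      # the text being compared with the references

fit = goodness_of_fit(candidate, reference_a, reference_b)
print(round(fit, 2))
```

A lower total indicates a closer fit to the reference texts, which is how the totals for the individual novels are read in this section.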

When this list is condensed to a context-free list, we get the following chi-squared results: Câlin 414, Vie 1084, L'Étranger 460, Vipères 500, L'Immoraliste 580, Porte 940, Engwall 300. Once more, Gros-Câlin is the novel most like the two Gary novels, whereas La Vie devant soi shows the least similarity with them.

Most of the chi-squared tests attempted so far show that La Vie devant soi displays the least similarity with the control group, be it represented by Engwall or the two Gary novels Ticket and Clair de femme.

7. Synonyms

The second part of this study focuses on the use of synonyms in the corpus as a style discriminant. The groups of paired-words chosen for this purpose, as well as their frequency in the corpus, are indicated in Table 8.

Using the chi-squared test on these pairs of synonyms, I shall test the degree to which the occurrence of these paired words differs from one novel to another. The chi-squared results are shown in Table 9.

Chi-squared values for each individual word are indicated in the table. Those that are significant are shown in bold letters. The last row shows total chi-squared values for each novel and the Engwall corpus, ranging from 52.36 (Clair) to 746.42 (La Vie devant soi), indicating that the difference between the frequencies of these synonyms is smallest in Clair de femme and largest in La Vie devant soi.

When a chi-squared test is done on the four Gary novels (using the contingency table), the following results are obtained. The least significant value is indicated by Gros-Câlin (76.75), followed by Ticket (104.11) and Clair (121.61). The most significant value is indicated by La Vie devant soi (255.34). When the three novels -- Clair, Ticket and Câlin -- are compared, the following chi-squared values are obtained: 44.14 (Clair), 33.61 (Ticket) and 31.45 (Câlin). Total chi-squared value for the three novels is 122.2, indicating once more that the paired words do not occur at the same frequency in the three novels.

When the same test is done on Clair, Ticket and La Vie, the following chi-squared values are obtained: 155.9 (Clair), 128.69 (Ticket) and 185.61 (La Vie). The total chi-squared value for all three novels is 470.20. The frequencies of these paired words therefore differ more between La Vie and the two Gary novels, taken together, than between Gros-Câlin and the two Gary novels taken together; in turn, this indicates that La Vie devant soi stands apart from the other three Gary novels.

8. Conclusions

The statistical tests done in this paper point to the same conclusion: La Vie devant soi is significantly different from the other Gary novels, as well as the other novels in the test. They also suggest that high frequency words and pairs of synonyms, which are considered by many experts on style to constitute the unconscious elements of an author's style, can indeed be consciously manipulated by the author. The notion that function words (and synonyms) constitute a genetic fingerprint of an author's style is, therefore, disputed by the case of Romain Gary / Émile Ajar.

While Gros-Câlin, the first Ajar novel, closely resembles the two Gary novels, La Vie devant soi is so significantly different from the two Gary novels that it could have been written by another author. It would appear that Gary did not feel the need to change his style drastically in Gros-Câlin, his first Ajar novel, feeling confident that nobody would make the connection between him and Ajar. But, when critics saw similarities between Gros-Câlin and the Gary novels, he became increasingly paranoid and wrote La Vie devant soi, setting out to prove that he was not Ajar. In so doing, he consciously or unconsciously changed the genetic fingerprint of the Gary style in that novel.

In this way, the findings in this paper question the notion that function words (and synonyms) constitute a genetic fingerprint of an author's style.
