Statistics [02]: Shakespeare’s New Poem
In 1985, Shakespearean scholar Gary Taylor discovered a nine-stanza poem in a bound folio volume; the poem was attributed to Shakespeare (it is called the Taylor poem below). The newly discovered poem is small relative to Shakespeare’s total work: only 429 words. Can statistics tell us whether the poem was actually written by Shakespeare?
Here is the analysis given in “Did Shakespeare write a newly-discovered poem?” by Ronald Thisted and Bradley Efron (Biometrika, 1987).
Observation
Of the 429 words in the newly discovered poem, 258 are distinct. The analysis begins by ranking each of the 258 distinct words in the Taylor poem according to its rarity of usage in the Shakespearean canon. The results are shown in the following table, where $m_x$ denotes the number of distinct words in the Taylor poem that occurred exactly $x$ times in the 884647 total words of the Shakespearean canon. The table covers $x = 0, 1, \dots, 99$: each row label gives the tens of $x$ and each column the units. For example, 9 distinct words in the poem appeared zero times in the canon, 7 distinct words appeared once, and so on.
$x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | total |
---|---|---|---|---|---|---|---|---|---|---|---|
0+ | 9 | 7 | 5 | 4 | 4 | 2 | 4 | 0 | 2 | 3 | 40 |
10+ | 1 | 0 | 3 | 0 | 1 | 1 | 1 | 2 | 1 | 0 | 10 |
20+ | 2 | 2 | 1 | 5 | 3 | 1 | 0 | 2 | 2 | 3 | 21 |
30+ | 3 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 3 | 3 | 16 |
40+ | 1 | 2 | 0 | 0 | 2 | 1 | 1 | 2 | 1 | 1 | 11 |
50+ | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 7 |
60+ | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 4 |
70+ | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 4 |
80+ | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
90+ | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 3 |
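The counting step behind this table can be sketched in a few lines. This is a minimal illustration, assuming a lookup table of canon word frequencies is available; the function name `rarity_histogram` and the toy data are hypothetical and are not taken from the real canon.

```python
from collections import Counter

def rarity_histogram(poem_words, canon_counts):
    """Return {x: m_x}: how many distinct poem words occur exactly x times in the canon."""
    distinct = set(w.lower() for w in poem_words)
    # A word missing from the canon lookup counts as x = 0 (a "new" word).
    return Counter(canon_counts.get(w, 0) for w in distinct)

# Hypothetical toy data for illustration only.
canon_counts = {"love": 3, "time": 3, "rose": 1}
poem = ["Love", "and", "time", "and", "rose", "love"]

m = rarity_histogram(poem, canon_counts)
print(sorted(m.items()))  # [(0, 1), (1, 1), (3, 2)]
```

Here the poem has four distinct words: one is new to the toy canon, one appears there once, and two appear three times.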
Estimation
With the observed values $m_x$ in hand, the next step is to estimate the expected value of $m_x$ under Shakespearean authorship, denoted $\hat{\nu}_x$. Assuming that each word's occurrences follow a Poisson process, an empirical Bayes estimate gives the results shown in the following table.
$x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0+ | 6.97 | 4.21 | 3.33 | 2.84 | 2.53 | 2.43 | 2.16 | 2.01 | 1.87 | 1.76 |
10+ | 1.62 | 1.50 | 1.52 | 1.51 | 1.36 | 1.38 | 1.33 | 1.28 | 1.25 | 1.22 |
20+ | 1.18 | 1.16 | 1.13 | 1.11 | 1.09 | 1.06 | 1.04 | 1.02 | 1.00 | 0.98 |
30+ | 0.96 | 0.94 | 0.93 | 0.91 | 0.90 | 0.88 | 0.86 | 0.85 | 0.83 | 0.82 |
40+ | 0.80 | 0.79 | 0.77 | 0.76 | 0.75 | 0.74 | 0.73 | 0.72 | 0.70 | 0.69 |
50+ | 0.68 | 0.67 | 0.66 | 0.65 | 0.64 | 0.63 | 0.62 | 0.61 | 0.60 | 0.59 |
60+ | 0.58 | 0.57 | 0.56 | 0.55 | 0.54 | 0.53 | 0.52 | 0.51 | 0.50 | 0.50 |
70+ | 0.49 | 0.48 | 0.48 | 0.47 | 0.47 | 0.46 | 0.45 | 0.45 | 0.44 | 0.44 |
80+ | 0.43 | 0.42 | 0.42 | 0.41 | 0.41 | 0.40 | 0.39 | 0.39 | 0.38 | 0.38 |
90+ | 0.37 | 0.36 | 0.36 | 0.35 | 0.35 | 0.34 | 0.34 | 0.33 | 0.32 | 0.32 |
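To see roughly where these numbers come from, here is a sketch of the leading term of such an estimate. Under the Poisson assumption, a word seen $x$ times in the canon shows up in a short new text with probability roughly proportional to its rate, which gives $\hat{\nu}_x \approx (x+1)\, n_{x+1}\, t$, where $n_k$ is the number of distinct words occurring exactly $k$ times in the canon and $t = 429/884647$ is the relative length of the poem. The values of $n_k$ below are an assumption: they are not given in this post and are quoted from memory of Efron and Thisted's 1976 study of Shakespeare's vocabulary. Higher-order correction terms are ignored.

```python
# Distinct canon words occurring exactly k times, for k = 1..5.
# ASSUMPTION: quoted from Efron & Thisted (1976); not stated in this post.
canon_type_counts = {1: 14376, 2: 4343, 3: 2292, 4: 1463, 5: 1043}

t = 429 / 884647  # length of the poem relative to the canon

def nu_hat_leading(x):
    """Leading term (x + 1) * n_{x+1} * t of the expected count for category x."""
    return (x + 1) * canon_type_counts[x + 1] * t

estimates = [nu_hat_leading(x) for x in range(5)]
for x, value in enumerate(estimates):
    print(x, round(value, 2))
```

With these assumed counts, the printed values agree with the first five entries of the table to two decimals; for larger $x$ the ignored correction terms start to matter.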
To broaden the empirical base of the results, seven more Elizabethan poems are analyzed using the same method: three poems attributed to Ben Jonson, Christopher Marlowe and John Donne respectively, and four poems attributed to Shakespeare. Their count tables are omitted here.
Modeling
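The objective is to test whether the observed counts $m_x$ fit the predicted values $\hat{\nu}_x$ based on the assumption of Shakespearean authorship. The tests rely upon the following regression model:
$$\mu_x = \hat{\nu}_x \, e^{\beta_0} (x+1)^{\beta_1},$$
where for $x = 0, 1, \dots, 99$, the $m_x$ have independent Poisson distributions with means $\mu_x$.
The null hypothesis:
$$H_0: \beta_0 = \beta_1 = 0$$
corresponds to $\mu_x = \hat{\nu}_x$.
The model can also be written in the form:
$$\log\left(\mu_x / \hat{\nu}_x\right) = \beta_0 + \beta_1 \log(x+1),$$
where we can see that if $\beta_1 > 0$, the ratio $\mu_x / \hat{\nu}_x$ increases as $x$ increases; if $\beta_1 < 0$, it decreases as $x$ increases. Therefore, $\beta_1$ (the slope) is of particular interest.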
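The effect of the two parameters can be illustrated with a small sketch, assuming the model $\mu_x = \hat{\nu}_x e^{\beta_0} (x+1)^{\beta_1}$. The function name `model_means` and the slope value $-0.3$ are hypothetical choices for illustration.

```python
import math

def model_means(nu_hat, beta0, beta1):
    """mu_x = nu_hat[x] * exp(beta0) * (x + 1) ** beta1 for x = 0, 1, ..."""
    return [nu * math.exp(beta0) * (x + 1) ** beta1 for x, nu in enumerate(nu_hat)]

nu_hat = [6.97, 4.21, 3.33, 2.84, 2.53]  # first five tabulated expectations

null_means = model_means(nu_hat, 0.0, 0.0)  # null hypothesis: mu_x equals nu_hat_x
tilted = model_means(nu_hat, 0.0, -0.3)     # hypothetical negative slope
ratios = [mu / nu for mu, nu in zip(tilted, nu_hat)]
print([round(r, 3) for r in ratios])
```

With a negative slope the ratio $\mu_x / \hat{\nu}_x$ shrinks as $x$ grows, meaning that the author uses words that are common in the canon less often than Shakespeare would be expected to.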
Tests
Three different tests are performed: Test 1, total count; Test 2, new words; Test 3, slope.
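Test 1: Let $m_+ = \sum_{x=0}^{99} m_x$ be the total count over the categories $x = 0, 1, \dots, 99$; then $m_+ = 118$ for the Taylor poem. Similarly, let $\hat{\nu}_+ = \sum_{x=0}^{99} \hat{\nu}_x$ and $\mu_+ = \sum_{x=0}^{99} \mu_x$, so that $m_+$ has a Poisson distribution with mean $\mu_+$. Test 1 is just the usual test of the simple null hypothesis $\mu_+ = \hat{\nu}_+$.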
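As a rough sketch of Test 1 for the Taylor poem, the code below sums the tabulated expectations to get $\hat{\nu}_+$ and compares it with the total count $m_+ = 118$. Two common normal approximations for a Poisson count are shown, the Pearson statistic and the signed root of the deviance; which statistic the cited paper reports is not stated in this post, so neither should be expected to reproduce the tabulated value exactly.

```python
import math

# Expected counts nu_hat[x] for x = 0..99, copied row by row from the estimation table.
nu_hat = [
    6.97, 4.21, 3.33, 2.84, 2.53, 2.43, 2.16, 2.01, 1.87, 1.76,
    1.62, 1.50, 1.52, 1.51, 1.36, 1.38, 1.33, 1.28, 1.25, 1.22,
    1.18, 1.16, 1.13, 1.11, 1.09, 1.06, 1.04, 1.02, 1.00, 0.98,
    0.96, 0.94, 0.93, 0.91, 0.90, 0.88, 0.86, 0.85, 0.83, 0.82,
    0.80, 0.79, 0.77, 0.76, 0.75, 0.74, 0.73, 0.72, 0.70, 0.69,
    0.68, 0.67, 0.66, 0.65, 0.64, 0.63, 0.62, 0.61, 0.60, 0.59,
    0.58, 0.57, 0.56, 0.55, 0.54, 0.53, 0.52, 0.51, 0.50, 0.50,
    0.49, 0.48, 0.48, 0.47, 0.47, 0.46, 0.45, 0.45, 0.44, 0.44,
    0.43, 0.42, 0.42, 0.41, 0.41, 0.40, 0.39, 0.39, 0.38, 0.38,
    0.37, 0.36, 0.36, 0.35, 0.35, 0.34, 0.34, 0.33, 0.32, 0.32,
]

m_plus = 118            # total count for the Taylor poem
nu_plus = sum(nu_hat)   # expected total under Shakespearean authorship

# Pearson approximation: (observed - expected) / sqrt(expected).
pearson_z = (m_plus - nu_plus) / math.sqrt(nu_plus)

# Signed square root of the Poisson deviance.
deviance = 2 * (m_plus * math.log(m_plus / nu_plus) - (m_plus - nu_plus))
signed_root_deviance = math.copysign(math.sqrt(deviance), m_plus - nu_plus)

print(round(nu_plus, 2), round(pearson_z, 2), round(signed_root_deviance, 2))
```

The expected total comes out at about 95, and both approximations put the observed 118 a little more than two standard deviations above it.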
Test 2: The zero count $m_0$, considered conditional on the total count $m_+$, has a binomial distribution with index $m_+$ and parameter $\mu_0 / \mu_+$. Test 2 is the usual test of the simple null hypothesis $\mu_0 / \mu_+ = \hat{\nu}_0 / \hat{\nu}_+$.
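For the Taylor poem, the binomial set-up of Test 2 can be sketched as follows, using $m_0 = 9$, $m_+ = 118$, $\hat{\nu}_0 = 6.97$ and $\hat{\nu}_+ \approx 95.0$ (the sum of the tabulated expectations). The exact upper-tail probability is shown as one reasonable way to carry out the test; it is an illustrative choice, not necessarily the statistic used in the cited paper.

```python
import math

m_0, m_plus = 9, 118        # new words and total count for the Taylor poem
nu_0, nu_plus = 6.97, 95.0  # expected values under Shakespearean authorship

# Null value of the binomial parameter and the implied expected number of new words.
p_null = nu_0 / nu_plus
expected_new = m_plus * p_null

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of seeing at least m_0 new words if the null hypothesis holds.
upper_tail = sum(binom_pmf(k, m_plus, p_null) for k in range(m_0, m_plus + 1))

print(round(expected_new, 2), round(upper_tail, 3))
```

The expected number of new words, about 8.66, is the conditional expectation given the observed total of 118, which is why it differs from the unconditional $\hat{\nu}_0 = 6.97$.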
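Test 3: Test 3 is the usual test, using large-sample maximum likelihood approximations, of the null hypothesis $\beta_1 = 0$ based on the data $m_1, \dots, m_{99}$. This is equivalent to testing $\beta_1 = 0$ conditional on the total $m_+ - m_0 = \sum_{x=1}^{99} m_x$, in which case $(m_1, \dots, m_{99})$ has a multinomial distribution depending only upon the slope parameter $\beta_1$.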
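The slope estimate can be sketched with a few Newton steps on the conditional multinomial likelihood, assuming cell probabilities proportional to $\hat{\nu}_x (x+1)^{\beta_1}$ for $x = 1, \dots, 99$. This is a minimal sketch rather than the paper's own procedure. Note that the cells of the 30+ row in the observation table, as printed here, sum to 15 while the row total is given as 16, so the data below contain 108 rather than 109 words with $x \ge 1$ and the result need not match the reported slope exactly.

```python
import math

# Observed counts m_x for x = 0..99, cells as printed in the observation table.
m = [
    9, 7, 5, 4, 4, 2, 4, 0, 2, 3,
    1, 0, 3, 0, 1, 1, 1, 2, 1, 0,
    2, 2, 1, 5, 3, 1, 0, 2, 2, 3,
    3, 1, 1, 1, 2, 1, 0, 0, 3, 3,
    1, 2, 0, 0, 2, 1, 1, 2, 1, 1,
    0, 1, 1, 1, 1, 0, 0, 1, 0, 2,
    0, 1, 0, 0, 1, 1, 0, 0, 1, 0,
    0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
    0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
]
# Expected counts nu_hat[x] for x = 0..99, from the estimation table.
nu_hat = [
    6.97, 4.21, 3.33, 2.84, 2.53, 2.43, 2.16, 2.01, 1.87, 1.76,
    1.62, 1.50, 1.52, 1.51, 1.36, 1.38, 1.33, 1.28, 1.25, 1.22,
    1.18, 1.16, 1.13, 1.11, 1.09, 1.06, 1.04, 1.02, 1.00, 0.98,
    0.96, 0.94, 0.93, 0.91, 0.90, 0.88, 0.86, 0.85, 0.83, 0.82,
    0.80, 0.79, 0.77, 0.76, 0.75, 0.74, 0.73, 0.72, 0.70, 0.69,
    0.68, 0.67, 0.66, 0.65, 0.64, 0.63, 0.62, 0.61, 0.60, 0.59,
    0.58, 0.57, 0.56, 0.55, 0.54, 0.53, 0.52, 0.51, 0.50, 0.50,
    0.49, 0.48, 0.48, 0.47, 0.47, 0.46, 0.45, 0.45, 0.44, 0.44,
    0.43, 0.42, 0.42, 0.41, 0.41, 0.40, 0.39, 0.39, 0.38, 0.38,
    0.37, 0.36, 0.36, 0.35, 0.35, 0.34, 0.34, 0.33, 0.32, 0.32,
]

xs = range(1, 100)                           # the slope test ignores x = 0
logs = {x: math.log(x + 1) for x in xs}
n = sum(m[x] for x in xs)                    # words with x >= 1
observed = sum(m[x] * logs[x] for x in xs)   # sufficient statistic for the slope

def moments(beta):
    """Mean and variance of log(x + 1) when p_x is proportional to nu_hat[x] * (x + 1) ** beta."""
    w = [nu_hat[x] * (x + 1) ** beta for x in xs]
    total = sum(w)
    mean = sum(wi * logs[x] for wi, x in zip(w, xs)) / total
    second = sum(wi * logs[x] ** 2 for wi, x in zip(w, xs)) / total
    return mean, second - mean ** 2

beta = 0.0
for _ in range(25):
    mean, var = moments(beta)
    beta += (observed - n * mean) / (n * var)  # Newton step: score / information

_, var = moments(beta)
se = 1 / math.sqrt(n * var)  # large-sample standard error
z = beta / se
print(round(beta, 3), round(se, 3), round(z, 2))
```

The log-likelihood is concave in $\beta_1$, so the iteration settles quickly; the estimate comes out small and negative, well within two standard errors of zero.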
Results
Test 1
Poem | Total Count | Expectation | $z$ |
---|---|---|---|
1. JON | 95 | 88.8 | 0.67 |
2. MAR | 134 | 106.5 | 2.57 |
3. DON | 107 | 105.1 | 0.20 |
4. CYM | 95 | 69.9 | 2.86 |
5. PUC | 53 | 50.5 | 0.37 |
6. PHO | 105 | 76.1 | 3.13 |
7. SON | 109 | 96.7 | 1.24 |
8. TAY | 118 | 95.0 | 2.29 |
In the original table, asterisks mark significant deviations from the null hypothesis.
Test 2
Poem | New Words | Expectation | $z$ |
---|---|---|---|
1. JON | 8 | 7.14 | 0.37 |
2. MAR | 10 | 10.12 | 0.01 |
3. DON | 17 | 8.06 | 2.90 |
4. CYM | 7 | 7.13 | 0.00 |
5. PUC | 1 | 3.98 | -1.64 |
6. PHO | 14 | 7.89 | 2.08 |
7. SON | 7 | 8.21 | -0.39 |
8. TAY | 9 | 8.66 | 0.16 |
Test 3
Poem | Estimated Slope | Estimated Standard Error | $z$ |
---|---|---|---|
1. JON | 0.229 | 0.11 | 2.08 |
2. MAR | -0.323 | 0.08 | -4.04 |
3. DON | -0.138 | 0.09 | -1.53 |
4. CYM | -0.047 | 0.10 | -0.47 |
5. PUC | -0.050 | 0.12 | -0.42 |
6. PHO | -0.127 | 0.09 | -1.41 |
7. SON | -0.034 | 0.09 | -0.38 |
8. TAY | -0.075 | 0.09 | -0.83 |
Conclusion
Test 1 is the least reliable for discriminating between Shakespearean and non-Shakespearean authorship. Test 2 seems only moderately useful for discerning Shakespearean authorship. Test 3 seems to be promising as a discriminator between Shakespearean versus non-Shakespearean authorship.
On the basis of the results, the Taylor poem appears consistent with the hypothesis of Shakespearean authorship. In particular it passes the slope test, which is the best discriminator among the three. It fails the total count test, but less dramatically than do two of the four Shakespearean poems. Overall it seems fair to say that the Taylor poem fits Shakespearean usage about as well as do the four Shakespeare poems.
This example gives an intuitive feel for what statistics is about: collecting and analyzing data. Handled with appropriate methods, the data can yield interesting and reliable results.