Statistics [02]: Shakespear’s New Poem

7 minute read

Published: January 02, 2021

In 1985, Shakespearean scholar Gary Taylor discovered a nine-stanza poem in a bound folio volume that was attributed to Shakespeare (called the Taylor poem). The size of the newly discovered poem is small relative to the size of Shakespeare’s total work, only 429 total words. Can we prove that the poem was actually written by Shakespeare or not?

Here is the analysis given in “Did Shakespeare write a newly-discovered poem?”

Observation

Of the 429 worlds in the newly discovered poem, 258 are distinct. Therefore, the analysis begins by ranking each of the 258 distinct words in the Taylor poem according to its rarity of usage in the Shakespearean canon. The results are shown in the following table, where the number (denoted as $m_x$ ) denotes the number of distinct words in the Taylor poem which occurred exactly $x$ times in the 884647 total words of the Shakespearean canon. For example, 9 distinct words in the poem appeared zero times in the canon, 7 distinct words in the poem appeared 1 times in the canon, etc.

$x$	0	1	2	3	4	5	6	7	8	9	total
0+	9	7	5	4	4	2	4	0	2	3	40
10+	1	0	3	0	1	1	1	2	1	0	10
20+	2	2	1	5	3	1	0	2	2	3	21
30+	3	1	1	1	2	1	0	0	3	3	16
40+	1	2	0	0	2	1	1	2	1	1	11
50+	0	1	1	1	1	0	0	1	0	2	7
60+	0	1	0	0	1	1	0	0	1	0	4
70+	0	0	1	0	0	1	0	0	1	1	4
80+	0	0	1	1	0	0	0	0	0	0	2
90+	0	0	0	1	0	1	1	0	0	0	3

Estimation

Now that we have the real value of $m_x$ . The next step is to estimate the expected value of $m_x$ assuming Shakespearean authorship, denoted as $\hat{\nu}_x$ . Assuming Poisson process, the results of an empirical Bayes estimate are shown in the following table.

$x$	0	1	2	3	4	5	6	7	8	9
0+	6.97	4.21	3.33	2.84	2.53	2.43	2.16	2.01	1.87	1.76
10+	1.62	1.50	1.52	1.51	1.36	1.38	1.33	1.28	1.25	1.22
20+	1.18	1.16	1.13	1.11	1.09	1.06	1.04	1.02	1.00	0.98
30+	0.96	0.94	0.93	0.91	0.90	0.88	0.86	0.85	0.83	0.82
40+	0.80	0.79	0.77	0.76	0.75	0.74	0.73	0.72	0.70	0.69
50+	0.68	0.67	0.66	0.65	0.64	0.63	0.62	0.61	0.60	0.59
60+	0.58	0.57	0.56	0.55	0.54	0.53	0.52	0.51	0.50	0.50
70+	0.49	0.48	0.48	0.47	0.47	0.46	0.45	0.45	0.44	0.44
80+	0.43	0.42	0.42	0.41	0.41	0.40	0.39	0.39	0.38	0.38
90+	0.37	0.36	0.36	0.35	0.35	0.34	0.34	0.33	0.32	0.32

To broaden the empirical base of the results, seven more Elizabethan poems are analyzed using the same method, with three poems attributed to Ben Jonson, Christopher Marlowe and John Donne respectively, and the other four poems attributed to Shakespear. The results are omitted here.

Modeling

The oabjective is to test whether the observed counts $m_x$ fit the predicted value $\hat{\nu}_x$ based on the assumption of Shakespearean authorship. The tests rely upon the following regression model:

$\log\mu_x = \log\hat{\nu}_x + \beta_0 + \beta_1\log(x + 1)$

where for $x = 0, 1, ..., 99$ , $m_x$ have independent Poisson distribution with means $\mu_x$ .

The null hypothesis:

$H_0: \beta_0 = \beta_1 = 0$

corresponds to $\mu_x = \hat{\nu}_x$ .

The model can also be written in the form:

$\mu_x/\hat{\nu}_x = e^{\beta_0}(x + 1)^{\beta_1}$

where we can see that if $\beta_1 < 0$ , ${\mu_x}/{\hat{\nu}_x}$ increases as $x \to 0$ ; if $\beta_1 > 0$ , ${\mu_x}/{\hat{\nu}_x}$ decreases as $x \to 0$ . Therefore, $\beta_1$ (slope) is of particular interest.

Tests

Three different tests are performed: Test 1, total account; Test 2, new words; Test 3, slope.

Test 1: Let $m_+$ be the total account of categories $0-99$ , then $m_+ = 118$ for Taylor poem. Similarly, let $\mu_+ = \sum\mu_x$ , so that $m_+$ has a Poisson distribution of mean $\mu_+$ . Test 1 is just the usual test of the simple null hypothesis $H_1: \mu_+ = \hat{\nu}_+$ .

Test 2: The zero count $m_{0}$ , considered conditional on the total count $m_+$ , has a binomial distribution of index $\mu_+$ and parameter $\pi_0 = \mu_0/\mu_+$ . Test 2 is the usual test of the simple null hypothesis $H_2: \pi_0 = \hat{\nu}_0/\hat{\nu}_+$ .

Test 3: Test 3 is the usual test, using large-sample maximum likelihood approximations, of the null hypothesis $H_3: \beta_1 = 0$ based on the data $(m_1, m_2, ..., m_{99})$ . This is equivalent to testing $H_3$ on $(m_{0}, m_+)$ , in which case $(m_1, m_2, ..., m_{99})$ has a multinomial distribution depending only upon the slope parameter $\beta_1$ .

Results

Test 1

Poem	Total Count $m_+$	Expectation $\hat{\nu}_+$	$z$
1. JON	95	88.8	0.67
2. MAR	134	106.5	$\text{***}$ 2.57
3. DON	107	105.1	0.20
4. CYM	95	69.9	$\text{***}$ 2.86
5. PUC	53	50.5	0.37
6. PHO	105	76.1	$\text{****}$ 3.13
7. JON	109	96.7	1.24
8. JON	118	95.0	$\text{**}$ 2.29

Asterisks indicate deviations from null hypothesis.

$\text{*}1.5\leq|z|<2.0, \text{**}2.0\leq|z|<2.5, \text{***}2.5\leq|z|<3.0, \text{****}3.0\leq|z|$ .

Test 2

Poem	New Words $m_{0}$	Expectation ${\mu}_+(\hat{\nu}_0/\hat{\nu}_+)$	$z$
1. JON	8	7.14	0.37
2. MAR	10	10.12	0.01
3. DON	17	8.06	$\text{***}$ 2.90
4. CYM	7	7.13	0.00
5. PUC	1	3.98	$\text{*}$ -1.64
6. PHO	14	7.89	$\text{**}$ 2.08
7. JON	7	8.21	-0.39
8. JON	9	8.66	0.16

Test 3

Poem	Estmated Slope $\hat{\beta}_{1}$	Estimated Standard Error $\hat{\sigma})$	$z=\hat{\beta}_1/\hat{\sigma}$
1. JON	0.229	0.11	$\text{**}$ 2.08
2. MAR	-0.323	0.08	$\text{****}$ -4.04
3. DON	-0.138	0.09	$\text{*}$ -1.53
4. CYM	-0.047	0.10	-0.47
5. PUC	-0.050	0.12	-0.42
6. PHO	-0.127	0.09	-1.41
7. JON	-0.034	0.09	-0.38
8. JON	-0.075	0.09	-0.83

Conclusion

Test 1 is the least reliable for discriminating between Shakespearean and non-Shakespearean authorship. Test 2 seems only moderately useful for discerning Shakespearean authorship. Test 3 seems to be promising as a discriminator between Shakespearean versus non-Shakespearean authorship.

On the basis of the results, the Taylor poem appears consistent with the hypothesis of Shakespearean authorship. In particular it passes the slope test, which is the best discriminator among the three. It fails the total count test, but less dramatically than do two of the four Shakespearean poems. Overall it seems fair to say that the Taylor poem fits Shakespearean usage about as well as do the four Shakespeare poems.

From this interesting example, we can have an intuitive feeling that statistics is really about collecting and analyzing data. When properly handled using appropriate methods, some interesting and reliable results can be obtained.

Twitter Facebook LinkedIn

Chao Huang