Statistics [19]: Python [03] - Chi-Squared Test

2 minute read

Published: January 19, 2021

Python realization of $\chi^2$ tests, including one-way $\chi^2$ test and $\chi^2$ independencae test.

One-Way Chi-Square Test

scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0)

The chi-square test tests the null hypothesis that the categorical data has the given frequencies. It return the $\chi^2$ statistic and the $p$ value.

Data

Let’s look at the Example 1 in this post.

$k$	0	1	2	3	4	5	6	7	8	9	$\geq 10$
$n_k$	57	203	383	525	532	408	273	139	45	27	16
$n\cdot \hat{p}_k$	54	211	407	525	508	394	254	140	68	29	17

The purpose is to confirm whether the number of particles follows the Poisson distribution.

import numpy as np
import pandas as pd
data = [[0,1,2,3,4,5,6,7,8,9,10],[57,203,383,525,532,408,273,139,45,27,16],[54,211,407,525,508,394,254,140,68,29,17]]
dataDF = pd.DataFrame(np.transpose(data),columns = ['k','nk','nk_target'])  

Plot

import matplotlib.pyplot as plt
plt.scatter(dataDF['k'],dataDF['nk'])
plt.scatter(dataDF['k'],dataDF['nk_target'])
plt.legend(['nk','nk_target'])
plt.show()

drawing

Chi-Squared Test

# one parameter needs to be estimated, hence, ddof = 1
chisq, p = ss.chisquare(dataDF['nk'],dataDF['nk_target'],ddof=1)
chisq, p

Result:

(12.921106842358597, 0.16620871010382665)

The $p$ value suggests that the differences are not significant. Therefore, the number of particles follows the Poisson distribution.

Chi-Square Independent Test

scipy.stats.chi2_contingency(observed, correction=True, lambda_=None)

This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed. The expected frequencies are computed based on the marginal sums under the assumption of independence. The number of degrees of freedom is dof = observed.size - sum(observed.shape) + observed.ndim - 1.

Data

Test the independence of income with respect to the number of kids in a family.

kids \ income	0-1	1-2	2-3	3	SUM
0	2161	3577	2184	1636	9558
1	2755	5081	2222	1052	11110
2	936	1753	640	306	3635
3	225	419	96	38	778
4	39	98	31	14	182
SUM	6116	10928	5173	3046	25263

data = np.array([[0,1,2,3,4],[2161,2755,936,225,39],[3577,5081,1753,419,98],[2184,2222,640,96,31],[1636,1052,306,38,14]])
dataDF = pd.DataFrame(np.transpose(data),columns = ['Num_Kids','L','M','H','HH'])  

Plot

plt.plot(dataDF['Num_Kids'],dataDF['L'])
plt.plot(dataDF['Num_Kids'],dataDF['M'])
plt.plot(dataDF['Num_Kids'],dataDF['H'])
plt.plot(dataDF['Num_Kids'],dataDF['HH'])
plt.legend(['L','M','H','HH'])
plt.show()

drawing

Chi-Squared Test

chisq, p, dof, expected = ss.chi2_contingency(dataDF)
chisq, p

Result:

(568.5662976004844, 5.4281303038400445e-114)

The $p$ value suggests that the differences is very significant. Therefore, the number of kids in a family is strongly related with the income.

Twitter Facebook LinkedIn

Chao Huang

Statistics [19]: Python [03] - Chi-Squared Test

One-Way Chi-Square Test

Data

Plot

Chi-Squared Test

Chi-Square Independent Test

Data

Plot

Chi-Squared Test

Table of Contents

Comments