Data Modeling [01]: Patsy and Statsmodels
patsy is a Python package for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. It is used in many projects to provide a high-level interface to the statistical code, including:

- statsmodels: Estimation of statistical models, statistical tests, and statistical data exploration
- HDDM: Hierarchical Bayesian parameter estimation of Drift Diffusion Models (DDM)
Statsmodels Overview
The main statsmodels API is split into several namespaces:

statsmodels.api: general models and methods, including
- Regression
- Imputation
- Generalized Estimating Equations
- Generalized Linear Models
- Discrete and Count Models
- Multivariate Models
- …

statsmodels.tsa.api: time-series models and methods, including
- Statistics and Tests
- Univariate Time-Series Analysis
- Multivariate Time Series Models
- Exponential Smoothing
- Filters and Decompositions
- Markov Regime Switching Models
- Forecasting
- …

statsmodels.formula.api: a convenience interface for specifying models with formula strings and DataFrames.
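As a quick orientation, the three entry points are typically imported as follows (a minimal sketch; the model names in the comments are only a few common examples, not an exhaustive list):

```python
import statsmodels.api as sm            # e.g. sm.OLS, sm.GLM, sm.Logit
import statsmodels.tsa.api as tsa       # e.g. tsa.ARIMA, tsa.VAR
import statsmodels.formula.api as smf   # e.g. smf.ols('y ~ x1 + x2', data=df)
```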
Model Specification: Arrays
In its simplest form, statsmodels accepts plain arrays as inputs. Below is the signature for Ordinary Least Squares:
statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
where endog and exog are the dependent and independent variables, respectively.
Here is a simple example.
```python
import numpy as np
import statsmodels.api as sm

# Simulate two predictors and a response with known coefficients
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = sm.add_constant(np.column_stack([X1, X2]))  # prepend a column of ones for the intercept
beta = [1, .5, .2]
y = np.dot(X, beta) + np.random.random(100)     # add uniform noise

# Fit ordinary least squares and inspect the results
results = sm.OLS(y, X).fit()
results.summary()
Out[133]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.966
Model:                            OLS   Adj. R-squared:                  0.966
Method:                 Least Squares   F-statistic:                     1398.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           3.05e-72
Time:                        12:51:38   Log-Likelihood:                -13.431
No. Observations:                 100   AIC:                             32.86
Df Residuals:                      97   BIC:                             40.68
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4422      0.071     20.228      0.000       1.301       1.584
x1             0.5076      0.010     52.574      0.000       0.488       0.527
x2             0.2328      0.095      2.439      0.017       0.043       0.422
==============================================================================
Omnibus:                       14.426   Durbin-Watson:                   2.092
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                4.610
Skew:                          -0.164   Prob(JB):                       0.0998
Kurtosis:                       2.000   Cond. No.                         22.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
```
We can see that the parameters have been given generic names (x1, x2, and so on), because plain arrays carry no variable labels.
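If you want the summary to keep meaningful names while staying with the array-style API, one option (not shown in the original example, but available because statsmodels understands pandas objects) is to wrap the design matrix in a DataFrame with labelled columns:

```python
import pandas as pd

# Reusing X and y from the example above; the column labels flow through to the results
X_named = pd.DataFrame(X, columns=['const', 'X1', 'X2'])
results_named = sm.OLS(y, X_named).fit()
print(results_named.params.index.tolist())  # ['const', 'X1', 'X2']
```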
Model Specification: Formulas and DataFrames
statsmodels also supports patsy formula strings (use the API in statsmodels.formula.api instead of statsmodels.api). Let's repeat the example above:
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate two predictors and a response, stored in a DataFrame
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = np.column_stack([X1, X2])
beta = [.5, .2]
y = np.dot(X, beta) + np.random.random(100)
data = pd.DataFrame(X, columns=['col1', 'col2'])
data['y'] = y

# Fit OLS from a formula string; the intercept is added automatically
results = smf.ols('y ~ col1 + col2', data=data).fit()
results.summary()
Out[145]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                     1247.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           6.27e-70
Time:                        13:02:40   Log-Likelihood:                -17.237
No. Observations:                 100   AIC:                             40.47
Df Residuals:                      97   BIC:                             48.29
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4290      0.077      5.565      0.000       0.276       0.582
col1           0.4984      0.010     49.785      0.000       0.479       0.518
col2           0.3410      0.097      3.529      0.001       0.149       0.533
==============================================================================
Omnibus:                       35.840   Durbin-Watson:                   2.154
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.139
Skew:                          -0.003   Prob(JB):                       0.0464
Kurtosis:                       1.786   Cond. No.                         22.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
```
Note that when a formula is used, an intercept is added by default.
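If the automatic intercept is not wanted, the formula language lets you suppress it; a small sketch, continuing with the data frame built above:

```python
# 'y ~ col1 + col2 - 1' (equivalently '0 + col1 + col2') drops the intercept term
results_no_int = smf.ols('y ~ col1 + col2 - 1', data=data).fit()
print(results_no_int.params.index.tolist())  # ['col1', 'col2']
```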
Model Specification: Patsy and DataFrames
Alternatively, we can build the design matrices with Patsy ourselves and pass them to statsmodels.api directly. Let's repeat the example once more:
```python
import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

# Simulate the same data as before
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = np.column_stack([X1, X2])
beta = [.5, .2]
y = np.dot(X, beta) + np.random.random(100)
data = pd.DataFrame(X, columns=['col1', 'col2'])
data['y'] = y

# Build the design matrices with patsy, then pass them to the array-based API
y, X = patsy.dmatrices('y ~ col1 + col2', data, return_type='matrix')
results = sm.OLS(y, X).fit()
results.summary()
Out[151]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.968
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1485.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           1.78e-73
Time:                        13:17:41   Log-Likelihood:                -9.9079
No. Observations:                 100   AIC:                             25.82
Df Residuals:                      97   BIC:                             33.63
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3495      0.081      4.321      0.000       0.189       0.510
col1           0.5108      0.009     54.402      0.000       0.492       0.529
col2           0.4076      0.100      4.074      0.000       0.209       0.606
==============================================================================
Omnibus:                        8.622   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.013   Jarque-Bera (JB):                3.501
Skew:                          -0.129   Prob(JB):                        0.174
Kurtosis:                       2.120   Cond. No.                         26.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
```
This means that even when a given statsmodels function does not accept formulas directly, we can still use patsy's formula language to produce the design matrices and pass them in as arrays.
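As a small illustration of that workflow (continuing with the data frame from above; the log transform here is an arbitrary choice for the sketch), the matrices patsy returns behave like plain arrays and can be fed to any array-based routine, for example numpy's least-squares solver:

```python
import numpy as np
import patsy

# Formulas can embed transformations; the result is still an ordinary design matrix
y_dm, X_dm = patsy.dmatrices('y ~ np.log(col1 + 1) + col2', data)
coef, _, _, _ = np.linalg.lstsq(np.asarray(X_dm), np.asarray(y_dm), rcond=None)
print(X_dm.design_info.column_names)  # ['Intercept', 'np.log(col1 + 1)', 'col2']
print(coef.ravel())
```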
More Examples
- t Test & F Test (see the sketch below)
- Chi-Squared Test
- Simple Linear Regression Model
- Multiple Regression Model
- Factor and Principal Component Analysis
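As a teaser for the first of these topics, a fitted statsmodels result already exposes t- and F-tests via R-style restriction strings (a quick sketch, reusing results from the last example; the tested restrictions are arbitrary):

```python
# t-test of H0: the coefficient on col1 equals 0.5,
# and an F-test that both slopes are jointly zero
print(results.t_test('col1 = 0.5'))
print(results.f_test('col1 = 0, col2 = 0'))
```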