Data Modeling [01]: Patsy and Statsmodels


patsy is a Python package for describing statistical models (especially linear models, or models with a linear component) and building design matrices. It is used in many projects to provide a high-level interface to statistical code, including the two below (a minimal sketch of patsy itself follows the list):

  • statsmodels: Estimation of statistical models, statistical tests, and statistical data exploration
  • HDDM: Hierarchical Bayesian parameter estimation of Drift Diffusion Models (DDM)
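
To make this concrete, here is a minimal sketch of patsy on its own (the toy DataFrame and its column names are invented purely for illustration): a formula string plus a DataFrame is enough to produce ready-to-use design matrices.

import pandas as pd
import patsy

# hypothetical toy data, invented for this illustration
df = pd.DataFrame({'y': [1.2, 2.3, 2.9, 4.1],
                   'x': [0.1, 0.5, 0.9, 1.3],
                   'group': ['a', 'b', 'a', 'b']})

# dmatrices returns the response and the design matrix;
# an intercept column is added and 'group' is dummy-coded automatically
y, X = patsy.dmatrices('y ~ x + group', df)
print(X)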

Statsmodels Overview

The main statsmodels API is split across several modules (the conventional import aliases are shown after the list):

  • statsmodels.api: General models and methods, including
    • Regression
    • Imputation
    • Generalized Estimating Equations
    • Generalized Linear Models
    • Discrete and Count Models
    • Multivariate Models
  • statsmodels.tsa.api: Time-series models and methods, including
    • Statistics and Tests
    • Univariate Time-Series Analysis
    • Multivariate Time Series Models
    • Exponential Smoothing
    • Filters and Decompositions
    • Markov Regime Switching Models
    • Forecasting
  • statsmodels.formula.api: A convenience interface for specifying models using formula strings and DataFrames.
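
These modules are conventionally imported under short aliases; the aliases below are just the common convention from the statsmodels documentation, not a requirement:

# conventional aliases used throughout the statsmodels documentation
import statsmodels.api as sm            # general models, e.g. sm.OLS, sm.GLM
import statsmodels.tsa.api as tsa       # time-series models, e.g. tsa.ARIMA
import statsmodels.formula.api as smf   # formula interface, e.g. smf.ols, smf.glm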

Model Specification: Arrays

In its simplest form, statsmodels accepts plain arrays as inputs. Below is the signature for Ordinary Least Squares (OLS):

statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)

where endog and exog are the dependent and independent variables, respectively.

Here is a simple example. Note that sm.add_constant is used to append an intercept column, because sm.OLS does not add one automatically.

import numpy as np
import statsmodels.api as sm

# simulate two predictors and a noisy linear response
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = sm.add_constant(np.column_stack([X1, X2]))  # prepend an intercept column
beta = [1, .5, .2]
y = np.dot(X, beta) + np.random.random(100)

results = sm.OLS(y, X).fit()
results.summary()
Out[133]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.966
Model:                            OLS   Adj. R-squared:                  0.966
Method:                 Least Squares   F-statistic:                     1398.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           3.05e-72
Time:                        12:51:38   Log-Likelihood:                -13.431
No. Observations:                 100   AIC:                             32.86
Df Residuals:                      97   BIC:                             40.68
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4422      0.071     20.228      0.000       1.301       1.584
x1             0.5076      0.010     52.574      0.000       0.488       0.527
x2             0.2328      0.095      2.439      0.017       0.043       0.422
==============================================================================
Omnibus:                       14.426   Durbin-Watson:                   2.092
Prob(Omnibus):                  0.001   Jarque-Bera (JB):                4.610
Skew:                          -0.164   Prob(JB):                       0.0998
Kurtosis:                       2.000   Cond. No.                         22.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Notice that the parameters have been given generic names (const, x1, x2, and so on), because plain NumPy arrays carry no column names.
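
Beyond the summary table, the fitted results object exposes the estimates programmatically, which is usually more convenient than reading numbers off the printout. A small sketch, continuing from the fit above:

# point estimates and fit statistics as plain NumPy objects
# (no names attached, since the model was built from arrays)
print(results.params)      # fitted coefficients, in column order: const, x1, x2
print(results.bse)         # standard errors
print(results.rsquared)    # R-squared
print(results.conf_int())  # 95% confidence intervals by default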


Model Specification: Formulas and DataFrames

statsmodels also supports patsy formula strings directly; to use them, import statsmodels.formula.api instead of statsmodels.api. Let's repeat the example above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate the same kind of data, but keep it in a labeled DataFrame
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = np.column_stack([X1, X2])
beta = [.5, .2]
y = np.dot(X, beta) + np.random.random(100)
data = pd.DataFrame(X, columns=['col1', 'col2'])
data['y'] = y

# the formula refers to the DataFrame columns by name
results = smf.ols('y ~ col1 + col2', data=data).fit()
results.summary()
Out[145]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                     1247.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           6.27e-70
Time:                        13:02:40   Log-Likelihood:                -17.237
No. Observations:                 100   AIC:                             40.47
Df Residuals:                      97   BIC:                             48.29
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4290      0.077      5.565      0.000       0.276       0.582
col1           0.4984      0.010     49.785      0.000       0.479       0.518
col2           0.3410      0.097      3.529      0.001       0.149       0.533
==============================================================================
Omnibus:                       35.840   Durbin-Watson:                   2.154
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.139
Skew:                          -0.003   Prob(JB):                       0.0464
Kurtosis:                       1.786   Cond. No.                         22.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Note that when using formulas, an intercept is added by default (hence the Intercept row in the table above).
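
If the intercept is not wanted, patsy's formula language can suppress it by appending - 1 (or + 0) to the formula; a quick sketch using the same data:

# '- 1' removes the automatic intercept from the design matrix
results_no_const = smf.ols('y ~ col1 + col2 - 1', data=data).fit()
print(results_no_const.params)   # only col1 and col2, no Intercept term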


Model Specification: Patsy and DataFrames

Alternatively, we can call patsy ourselves to build the design matrices and pass them to statsmodels.api directly. Let's repeat the example once more:

import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

# same simulated data as before, stored in a DataFrame
X1 = np.linspace(0, 10, 100).reshape(100, 1)
X2 = np.random.random([100, 1])
X = np.column_stack([X1, X2])
beta = [.5, .2]
y = np.dot(X, beta) + np.random.random(100)
data = pd.DataFrame(X, columns=['col1', 'col2'])
data['y'] = y

# let patsy build the response and design matrices from the formula
y, X = patsy.dmatrices('y ~ col1 + col2', data, return_type='matrix')

results = sm.OLS(y, X).fit()
results.summary()
Out[151]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.968
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1485.
Date:                Sat, 19 Mar 2022   Prob (F-statistic):           1.78e-73
Time:                        13:17:41   Log-Likelihood:                -9.9079
No. Observations:                 100   AIC:                             25.82
Df Residuals:                      97   BIC:                             33.63
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3495      0.081      4.321      0.000       0.189       0.510
col1           0.5108      0.009     54.402      0.000       0.492       0.529
col2           0.4076      0.100      4.074      0.000       0.209       0.606
==============================================================================
Omnibus:                        8.622   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.013   Jarque-Bera (JB):                3.501
Skew:                          -0.129   Prob(JB):                        0.174
Kurtosis:                       2.120   Cond. No.                         26.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

This means that even when a given statsmodels function (or any other modeling code) does not support formula strings, we can still use patsy's formula language to produce the design matrices and pass them in as arrays.
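
A convenient variant of this pattern (sketched below with the same data DataFrame) is to ask patsy for pandas DataFrames instead of design matrices, so the column labels stay attached to whatever model consumes them:

# return_type='dataframe' keeps column names on both outputs
y, X = patsy.dmatrices('y ~ col1 + col2', data, return_type='dataframe')
print(X.columns.tolist())    # ['Intercept', 'col1', 'col2']

results = sm.OLS(y, X).fit()
print(results.params)        # a pandas Series indexed by Intercept, col1, col2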



