Statistics [27]: Unitary Regression Model


Basics of the univariate (simple) regression model, especially linear regression.


Review: Correlation

Correlation tells us whether two variables are linearly dependent. If the two variables $X$ and $Y$ are independent, then $\mathrm{Corr}(X, Y) = 0$; however, if $\mathrm{Corr}(X, Y) = 0$, it doesn't necessarily mean that $X$ and $Y$ are independent; there might still be nonlinear relationships between the two variables.

For example, let $X \sim N(0, 1)$ and $Y = X^2$; then $X$ and $Y$ are obviously not independent. However,

$$\mathrm{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0.$$

It doesn't mean that $X$ and $Y$ are independent; it only says that $X$ and $Y$ are linearly independent (uncorrelated).

$|\mathrm{Corr}(X, Y)| = 1$ if and only if $X$ and $Y$ are linearly dependent almost everywhere. That is,

$$P(Y = aX + b) = 1 \quad \text{for some constants } a \neq 0 \text{ and } b.$$
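Going back to the $Y = X^2$ example above, here is a quick numerical check (a minimal sketch of my own, not from the original post):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # X ~ N(0, 1)
y = x**2                           # Y = X^2 is completely determined by X

# the sample correlation is close to 0 even though Y depends on X
print(np.corrcoef(x, y)[0, 1])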


Least Square

Suppose the independent variable is $x$ and the dependent variable is $y$. The objective of linear fitting is to find a linear relationship between $x$ and $y$ such that $\hat{y} = b_0 + b_1 x$. The residual is defined as the difference between the actual value and the fitted value, that is, $e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$.

Least squares chooses $b_0$ and $b_1$ to minimize the sum of squared residuals, that is,

$$Q(b_0, b_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2.$$

Differentiating with respect to $b_0$ and $b_1$ respectively, we have

$$\frac{\partial Q}{\partial b_0} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i) = 0, \qquad \frac{\partial Q}{\partial b_1} = -2 \sum_{i=1}^n x_i (y_i - b_0 - b_1 x_i) = 0.$$

Thus, with $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$ and $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$,

$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}.$$
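To make the formulas concrete, here is a minimal NumPy sketch (the toy data and variable names are mine) that computes $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$ directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

S_xx = np.sum((x - x.mean())**2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = S_xy / S_xx                 # slope
b0 = y.mean() - b1 * x.mean()    # intercept
print(b0, b1)

# cross-check against numpy's built-in least squares fit
print(np.polyfit(x, y, 1))       # returns [b1, b0]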


Analysis of Variance

Decompose the deviation of each observation from the mean, $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$, and sum the squares:

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2 + 2 \sum_{i=1}^n (\hat{y}_i - \bar{y})(y_i - \hat{y}_i),$$

where the cross term vanishes by the normal equations:

$$\sum_{i=1}^n (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = \sum_{i=1}^n (b_0 + b_1 x_i - \bar{y})\, e_i = 0.$$

Therefore,

$$\mathrm{TSS} = \mathrm{RegSS} + \mathrm{RSS},$$

where

  • $\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares (TSS)
  • $\mathrm{RegSS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ is the regression sum of squares (RegSS)
  • $\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares (RSS)

A good fit has a big RegSS and a small RSS. If TSS = RegSS, the fit is perfect. The goodness of fit is measured by the coefficient of determination $R^2$,

$$R^2 = \frac{\mathrm{RegSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.$$
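The decomposition can be checked numerically; a small sketch on made-up data (mine, not the post's):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

TSS = np.sum((y - y.mean())**2)         # total sum of squares
RegSS = np.sum((y_hat - y.mean())**2)   # regression sum of squares
RSS = np.sum((y - y_hat)**2)            # residual sum of squares

print(np.isclose(TSS, RegSS + RSS))     # True: TSS = RegSS + RSS
print(RegSS / TSS)                      # coefficient of determination R^2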


Simple Linear Regression Model

The linear regression model is usually expressed as

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$

where $\varepsilon$ is the error term.

The least squares fit is an estimate of this linear regression model, where

$$\hat{\beta}_1 = b_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = b_0 = \bar{y} - b_1 \bar{x},$$

with

$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right), \qquad \hat{\beta}_0 \sim N\left(\beta_0, \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right).$$

To prove this, write $\hat{\beta}_1$ as a linear combination of the independent normal variables $y_i$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} = \sum_{i=1}^n \frac{x_i - \bar{x}}{S_{xx}}\, y_i.$$

Hence,

$$E[\hat{\beta}_1] = \sum_{i=1}^n \frac{x_i - \bar{x}}{S_{xx}} (\beta_0 + \beta_1 x_i) = \beta_1, \qquad \mathrm{Var}(\hat{\beta}_1) = \sum_{i=1}^n \frac{(x_i - \bar{x})^2}{S_{xx}^2}\, \sigma^2 = \frac{\sigma^2}{S_{xx}}.$$

With $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0$,

$$E[\hat{\beta}_0] = (\beta_0 + \beta_1 \bar{x}) - \beta_1 \bar{x} = \beta_0, \qquad \mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2}{n} + \bar{x}^2\, \frac{\sigma^2}{S_{xx}} = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right).$$

Therefore, both estimates are unbiased and normally distributed, as stated above.
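A quick Monte Carlo sketch (my own illustration; the true parameters are arbitrary) that checks the sampling variance of $\hat{\beta}_1$ against $\sigma^2 / S_{xx}$:

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 10, 50)
S_xx = np.sum((x - x.mean())**2)

# refit the model on many simulated samples
b1_samples = []
for _ in range(5000):
    y = beta0 + beta1 * x + sigma * rng.standard_normal(x.size)
    b1_samples.append(np.sum((x - x.mean()) * (y - y.mean())) / S_xx)

print(np.var(b1_samples))   # empirical variance of the beta_1 estimates
print(sigma**2 / S_xx)      # theoretical variance sigma^2 / S_xx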


Properties

Least squares estimation has the following properties:

  1. $\dfrac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-2}$, so $\hat{\sigma}^2 = \dfrac{\mathrm{RSS}}{n-2}$ is an unbiased estimate of $\sigma^2$ (the degrees of freedom are $n$ minus the number of parameters; proved in the next post).
  2. $\bar{y}$, $\hat{\beta}_1$, and $\mathrm{RSS}$ are mutually independent.

Significance Test

In practice, the linear regression model is only a hypothesis. Therefore, we need to test the significance of this hypothesis, which is

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0.$$

If $H_0$ is rejected, we can conclude that $x$ and $y$ have a linear relationship; otherwise, it means that there is no significant linear relationship between $x$ and $y$.

F-Test

When $H_0$ holds, $\hat{\beta}_1 \sim N(0, \sigma^2 / S_{xx})$, so

$$\frac{\mathrm{RegSS}}{\sigma^2} = \frac{\hat{\beta}_1^2 S_{xx}}{\sigma^2} \sim \chi^2_1.$$

It can be proved that $\dfrac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-2}$, independently of $\mathrm{RegSS}$, which will be shown in the next post.

As the ratio of two independent $\chi^2$ variables scaled by their degrees of freedom, we have

$$F = \frac{\mathrm{RegSS}}{\mathrm{RSS}/(n-2)} \sim F(1, n-2),$$

and $H_0$ is rejected at level $\alpha$ when $F > F_{1-\alpha}(1, n-2)$.
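A minimal sketch of the F-test on toy data (my own, using scipy.stats; not from the post):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

RegSS = np.sum((y_hat - y.mean())**2)
RSS = np.sum((y - y_hat)**2)

F = RegSS / (RSS / (n - 2))   # F ~ F(1, n-2) under H0
p = stats.f.sf(F, 1, n - 2)   # p-value: P(F(1, n-2) > F)
print(F, p)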

t-Test

Since

$$\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0, 1), \qquad \frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} = \frac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-2},$$

and the two are independent,

$$\frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma}/\sqrt{S_{xx}}} \sim t(n-2).$$

When $H_0: \beta_1 = 0$ holds, we have

$$t = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{S_{xx}}} \sim t(n-2).$$

When $H_0': \beta_0 = 0$ holds, we have

$$t = \frac{\hat{\beta}_0}{\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} \sim t(n-2).$$

Hence, the confidence intervals of $\beta_1$ and $\beta_0$ with significance level $\alpha$ are respectively

$$\hat{\beta}_1 \pm t_{1-\alpha/2}(n-2)\, \frac{\hat{\sigma}}{\sqrt{S_{xx}}}, \qquad \hat{\beta}_0 \pm t_{1-\alpha/2}(n-2)\, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}.$$
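A sketch of these confidence intervals on the same toy data (again my own illustration):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
S_xx = np.sum((x - x.mean())**2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()
RSS = np.sum((y - (b0 + b1 * x))**2)
sigma_hat = np.sqrt(RSS / (n - 2))          # hat sigma^2 = RSS / (n-2)

se_b1 = sigma_hat / np.sqrt(S_xx)
se_b0 = sigma_hat * np.sqrt(1/n + x.mean()**2 / S_xx)

t_crit = stats.t.ppf(1 - 0.05/2, n - 2)     # alpha = 0.05
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # CI for beta_1
print(b0 - t_crit * se_b0, b0 + t_crit * se_b0)   # CI for beta_0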

Example

Relationship between the area and price of houses.

| Num | Area (m^2) | Price (×10,000) |
| --- | --- | --- |
| 1 | 55 | 100 |
| 2 | 76 | 130 |
| 3 | 65 | 100 |
| 4 | 156 | 255 |
| 5 | 55 | 82 |
| 6 | 76 | 105 |
| 7 | 89 | 125 |
| 8 | 226 | 360 |
| 9 | 134 | 190 |
| 10 | 156 | 270 |
| 11 | 114 | 180 |
| 12 | 76 | 142 |
| 13 | 164 | 370 |
| 14 | 55 | 115 |
| 15 | 90 | 200 |
| 16 | 81 | 155 |
| 17 | 215 | 516 |
| 18 | 76 | 160 |
| 19 | 66 | 138 |
| 20 | 76 | 170 |

Linear regression using statsmodels.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

data = np.array([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                 [55,76,65,156,55,76,89,226,134,156,114,76,164,55,90,81,215,76,66,76],
                 [100,130,100,255,82,105,125,360,190,270,180,142,370,115,200,155,516,160,138,170]])
dataDF = pd.DataFrame(np.transpose(data),columns = ['Num','Area','Price'])  

Y = dataDF['Price']
X = dataDF['Area']
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()

results.summary()

Results:

  OLS Regression Results                            
==============================================================================
Dep. Variable:                  Price   R-squared:                       0.857
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     107.9
Date:                Tue, 08 Mar 2022   Prob (F-statistic):           4.94e-09
Time:                        12:13:11   Log-Likelihood:                -102.61
No. Observations:                  20   AIC:                             209.2
Df Residuals:                      18   BIC:                             211.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.8246     22.046     -0.582      0.568     -59.142      33.493
Area           1.9607      0.189     10.390      0.000       1.564       2.357
==============================================================================
Omnibus:                        2.958   Durbin-Watson:                   0.845
Prob(Omnibus):                  0.228   Jarque-Bera (JB):                1.423
Skew:                           0.613   Prob(JB):                        0.491
Kurtosis:                       3.455   Cond. No.                         267.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Plot the data and the fitted line:

x = np.sort(dataDF['Area'].values)
y = results.params['const'] + results.params['Area'] * x   # fitted line
plt.scatter(dataDF['Area'], dataDF['Price'])
plt.plot(x, y, 'r')
plt.title('Price ~ b_0 + b_1*Area')
plt.show()

(Figure: scatter plot of Price vs. Area with the fitted regression line.)

Linear regression using sklearn.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([55,76,65,156,55,76,89,226,134,156,114,76,164,55,90,81,215,76,66,76]).reshape((-1,1))
y = np.array([100,130,100,255,82,105,125,360,190,270,180,142,370,115,200,155,516,160,138,170])
model = LinearRegression(fit_intercept = True).fit(x,y)

Results:

model.coef_
# array([1.96072963])
model.intercept_
# -12.82464801609072
model.score(x,y)
# 0.8570804873379527

Prediction

Suppose $y = \beta_0 + \beta_1 x + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, and the regression function $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is obtained from the samples $(x_1, y_1), \dots, (x_n, y_n)$. Now fix $x = x_0$ such that $y_0 = \beta_0 + \beta_1 x_0 + \varepsilon_0$; find the predicted value $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ of $y_0$ along with its confidence interval.

It is easy to verify that $\hat{y}_0$ is an unbiased estimation with

$$E[\hat{y}_0] = \beta_0 + \beta_1 x_0, \qquad \mathrm{Var}(\hat{y}_0) = \sigma^2 \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right).$$

Consider the distribution of $y_0 - \hat{y}_0$. Firstly,

$$E[y_0 - \hat{y}_0] = 0.$$

Hence, since $y_0$ is independent of the sample (and thus of $\hat{y}_0$),

$$\mathrm{Var}(y_0 - \hat{y}_0) = \sigma^2 + \sigma^2 \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right).$$

Therefore,

$$y_0 - \hat{y}_0 \sim N\left(0,\ \sigma^2 \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\right).$$

Along with $\dfrac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}$, independent of $y_0 - \hat{y}_0$, we have

$$\frac{y_0 - \hat{y}_0}{\hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \sim t(n-2).$$

Denote

$$\delta(x_0) = t_{1-\alpha/2}(n-2)\, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

then the confidence interval of $y_0$ with significance level $\alpha$ is

$$\left(\hat{y}_0 - \delta(x_0),\ \hat{y}_0 + \delta(x_0)\right).$$
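A sketch computing this interval at a new point $x_0$ (the toy data and the choice $x_0 = 3.5$ are mine):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
S_xx = np.sum((x - x.mean())**2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
b0 = y.mean() - b1 * x.mean()
sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))

x0 = 3.5
y0_hat = b0 + b1 * x0
delta = stats.t.ppf(0.975, n - 2) * sigma_hat * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / S_xx)

# 95% prediction interval for y_0
print(y0_hat - delta, y0_hat + delta)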

Nonlinear Least Square

Suppose the joint density function of $(X, Y)$ is $f(x, y)$ and $E[Y^2]$ exists. Let $m(x) = E[Y \mid X = x]$; then for any function $g$,

$$E\left[(Y - m(X))^2\right] \leq E\left[(Y - g(X))^2\right].$$

Proof. Consider the case of continuous variables,

$$E\left[(Y - g(X))^2\right] = E\left[(Y - m(X))^2\right] + E\left[(m(X) - g(X))^2\right] + 2\, E\left[(Y - m(X))(m(X) - g(X))\right].$$

Hence, conditioning on $X$, the cross term vanishes:

$$E\left[(Y - m(X))(m(X) - g(X))\right] = E\Big[(m(X) - g(X))\, E\left[Y - m(X) \mid X\right]\Big] = 0,$$

so that

$$E\left[(Y - g(X))^2\right] = E\left[(Y - m(X))^2\right] + E\left[(m(X) - g(X))^2\right] \geq E\left[(Y - m(X))^2\right].$$

This completes the proof.

The conditional expectation $E[Y \mid X]$ is a random variable depending on $X$, and it is the variable closest to $Y$, in the sense of least squares, among all the variables that can be expressed as a function of $X$.
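A numerical illustration of this fact (my own sketch): for $Y = X^2 + $ noise, the conditional mean $m(X) = X^2$ beats the best linear predictor in mean squared error:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x**2 + 0.1 * rng.standard_normal(100_000)   # E[Y | X] = X^2

# best linear predictor, fitted by least squares
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

print(np.mean((y - (b0 + b1 * x))**2))   # MSE of the linear fit
print(np.mean((y - x**2)**2))            # MSE of m(X) = E[Y|X], about 0.01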

