Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
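Note that no random seed is set, so the simulated y (and the numerical output shown below) will differ slightly from run to run.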
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     884.9
Date:                Fri, 11 Sep 2020   Prob (F-statistic):           1.14e-40
Time:                        06:24:40   Log-Likelihood:              -0.080263
No. Observations:                  50   AIC:                             8.161
Df Residuals:                      46   BIC:                             15.81
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0021      0.086     58.081      0.000       4.829       5.175
x1             0.4995      0.013     37.607      0.000       0.473       0.526
x2             0.4943      0.052      9.466      0.000       0.389       0.599
x3            -0.0200      0.001    -17.150      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.687   Durbin-Watson:                   1.755
Prob(Omnibus):                  0.430   Jarque-Bera (JB):                1.341
Skew:                          -0.206   Prob(JB):                        0.511
Kurtosis:                       2.312   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
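The estimated coefficients are close to the true values used to generate the data, beta = [5., 0.5, 0.5, -0.02].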
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50212574 4.98049192 5.41996224 5.79360049 6.0841916 6.28707001
6.41088624 6.47618561 6.51203272 6.5512362 6.62495835 6.75759527
6.96276865 7.24108775 7.58004933 7.95609185 8.33846662 8.69428967
8.99394453 9.21594813 9.35048038 9.40099762 9.3836647 9.32469921
9.25606396 9.21021558 9.214772 9.28797491 9.43569268 9.65045622
9.91268612 10.19390951 10.46143804 10.68373852 10.8356137 10.90234214
10.88209687 10.78624558 10.63748292 10.46610185 10.30501659 10.1843549
10.12650817 10.1424526 10.22994581 10.37389529 10.54883802 10.72312402
10.8641163 10.94355306]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.92976322 10.78534702 10.5296871 10.20695402 9.87529179 9.5925822
9.40227327 9.32274176 9.34279382 9.42440566]
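Besides point predictions, the fitted results object can also produce interval estimates through its get_prediction method. A minimal sketch (its output is not shown here, since this cell was not part of the original run):

pred = olsres.get_prediction(Xnew)
# point prediction, standard error, confidence and prediction intervals
print(pred.summary_frame(alpha=0.05))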
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

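(The resulting figure shows the noisy observations, the true curve, and the OLS prediction extended past the original sample range.)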
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I() function to indicate the identity transform, i.e., we do not want any expansion magic from using **2.
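As an illustration (not part of the original notebook), the design matrix that patsy, the formula engine used by statsmodels, builds for this formula can be inspected directly; the I() call keeps (x1 - 5)**2 as a single numeric column instead of letting the formula language interpret the operators:

from patsy import dmatrix

design = dmatrix("x1 + np.sin(x1) + I((x1 - 5)**2)", {"x1": x1})
# expected column names: Intercept, x1, np.sin(x1), I((x1 - 5) ** 2)
print(design.design_info.column_names)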
[9]:
res.params
[9]:
Intercept 5.002138
x1 0.499515
np.sin(x1) 0.494250
I((x1 - 5) ** 2) -0.020000
dtype: float64
Now we only have to pass the single variable x1 and we get the transformed right-hand-side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.929763
1 10.785347
2 10.529687
3 10.206954
4 9.875292
5 9.592582
6 9.402273
7 9.322742
8 9.342794
9 9.424406
dtype: float64
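The predict method also accepts a pandas DataFrame holding the same variable; a sketch of the equivalent call:

import pandas as pd

# same out-of-sample prediction, passing a DataFrame instead of a dict
res.predict(pd.DataFrame({"x1": x1n}))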