Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
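
Note that the draws from np.random.normal are not seeded, so rerunning the notebook will give slightly different numbers than the outputs shown below. A minimal variant for reproducible results, assuming you are happy to fix an arbitrary seed (12345 here is not part of the original cell):

np.random.seed(12345)  # hypothetical addition: makes the draws below repeatable
y = y_true + sig * np.random.normal(size=nsample)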

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     884.9
Date:                Fri, 11 Sep 2020   Prob (F-statistic):           1.14e-40
Time:                        06:24:40   Log-Likelihood:              -0.080263
No. Observations:                  50   AIC:                             8.161
Df Residuals:                      46   BIC:                             15.81
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0021      0.086     58.081      0.000       4.829       5.175
x1             0.4995      0.013     37.607      0.000       0.473       0.526
x2             0.4943      0.052      9.466      0.000       0.389       0.599
x3            -0.0200      0.001    -17.150      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        1.687   Durbin-Watson:                   1.755
Prob(Omnibus):                  0.430   Jarque-Bera (JB):                1.341
Skew:                          -0.206   Prob(JB):                        0.511
Kurtosis:                       2.312   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
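
The headline numbers in the summary are also available as attributes on the results object; for example (standard statsmodels results API):

print(olsres.params)    # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)       # their standard errors
print(olsres.rsquared)  # R-squared of the fit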

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50212574  4.98049192  5.41996224  5.79360049  6.0841916   6.28707001
  6.41088624  6.47618561  6.51203272  6.5512362   6.62495835  6.75759527
  6.96276865  7.24108775  7.58004933  7.95609185  8.33846662  8.69428967
  8.99394453  9.21594813  9.35048038  9.40099762  9.3836647   9.32469921
  9.25606396  9.21021558  9.214772    9.28797491  9.43569268  9.65045622
  9.91268612 10.19390951 10.46143804 10.68373852 10.8356137  10.90234214
 10.88209687 10.78624558 10.63748292 10.46610185 10.30501659 10.1843549
 10.12650817 10.1424526  10.22994581 10.37389529 10.54883802 10.72312402
 10.8641163  10.94355306]
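
For in-sample data, predict(X) simply reproduces the fitted values already stored on the results object, so the call above is mostly a template for new data. A quick check (not in the original notebook):

np.allclose(ypred, olsres.fittedvalues)  # True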

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.92976322 10.78534702 10.5296871  10.20695402  9.87529179  9.5925822
  9.40227327  9.32274176  9.34279382  9.42440566]
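
The point predictions above carry no measure of uncertainty. If standard errors and intervals are needed, get_prediction returns them alongside the mean prediction; a short sketch using its summary_frame output:

pred = olsres.get_prediction(Xnew)
# columns: mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper
print(pred.summary_frame(alpha=0.05))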

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: scatter of the data with the true curve and the OLS prediction extended over the out-of-sample range]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we do not want the formula language's expansion magic applied to **2.
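
To see why the escape is needed: in the formula language, ** denotes repeated interaction, and a term interacted with itself collapses back to itself, so without I() a squared term silently drops out. A hypothetical check (not part of the original notebook):

demo = ols("y ~ x1**2", data=data).fit()
print(demo.params)  # only Intercept and x1: the square was expanded away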

[9]:
res.params
[9]:
Intercept           5.002138
x1                  0.499515
np.sin(x1)          0.494250
I((x1 - 5) ** 2)   -0.020000
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.929763
1    10.785347
2    10.529687
3    10.206954
4     9.875292
5     9.592582
6     9.402273
7     9.322742
8     9.342794
9     9.424406
dtype: float64
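
If interval estimates are wanted here as well, the formula-based results object supports get_prediction too; passing the same exog dict should go through the formula machinery just as predict does above (a sketch, assuming that shared handling):

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))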