Linear Regression

(image: "Mad Men", flowingdata.com)

Topics

  • Linear and polynomial equations
  • Loss functions
  • Gradient descent
  • Evaluation metrics
  • Appendix: Regularized Regression Models

Where are we?

(image: one of many machine-learning cheatsheets, sas.com)

Linear Equation

From $x$ (a feature), predict $y$ (outcome or result) assuming a "linear relationship".

$$y=Wx+b$$

(image: plot of a linear equation)

Polynomial Equations

Features can also be powers of $x$: a degree-$m$ feature is $x^m$

(image: polynomial fits, sklearn)

Objective

Given $y = Wx+b$

Find $W$ and $b$ so that the predicted $y$ is as accurate as possible

Loss function: measures "how accurate"

Loss Functions

Also known as a cost function or objective function

Example: Mean Square Error $$L(W, b) = MSE(W, b) = \frac{1}{N}\sum_{i=1}^N{\big(y_i - (Wx_i + b)\big)^2}$$

Many more examples: http://scikit-learn.org/stable/modules/model_evaluation.html
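As a quick illustration, here is a minimal NumPy sketch of MSE for a guessed $W$ and $b$ (the toy data values are made up):

In [ ]:
import numpy as np

# Toy data roughly following y = 2x + 1 (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

def mse(W, b):
    """Mean squared error of the line Wx + b on the toy data."""
    return np.mean((y - (W * x + b)) ** 2)

print(mse(2.0, 1.0))  # near-correct parameters -> small loss
print(mse(0.0, 0.0))  # poor parameters -> large loss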

Objective (of Training)

Find $W^*$ and $b^*$ to minimize the loss function:

$$\underset{W^*, b^*}{\arg \min}\; L(W, b)$$

$$\underset{W^*, b^*}{\arg \min}\; \frac{1}{N}\sum_{i=1}^N{\big(y_i - (Wx_i + b)\big)^2}$$

$N$: number of samples

Gradient Descent

  1. Initialize parameters ($W$ and $b$) to random values
  2. Compute the gradients of the loss function: $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$
  3. Update rule ($\epsilon$ = learning rate) $$W := W -\epsilon \frac{\partial L}{\partial W}$$ $$b := b -\epsilon \frac{\partial L}{\partial b}$$
  4. Repeat 2 and 3 until converging on $W^*$ and $b^*$ (as sketched below)
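A minimal NumPy sketch of these four steps for the MSE loss (the toy data, learning rate, and step count are arbitrary choices for illustration):

In [ ]:
import numpy as np

# Toy data roughly following y = 2x + 1 (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

W, b = np.random.rand(), np.random.rand()  # 1. random initialization
epsilon = 0.05                             # learning rate

for step in range(1000):                   # 4. repeat steps 2 and 3
    err = y - (W * x + b)
    dW = -2 * np.mean(err * x)             # 2. gradient of MSE w.r.t. W
    db = -2 * np.mean(err)                 #    ... and w.r.t. b
    W -= epsilon * dW                      # 3. update rule
    b -= epsilon * db

print(W, b)  # should approach W* ~ 2, b* ~ 1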

Linear Equation as Dot Product

$y = Wx + b$

Let $x_0 = 1$, then:

$y = Wx + bx_0$

Linear Equation as Dot Product

$y = Wx + bx_0 = bx_0 + Wx$

$y = \left[ \begin{array}{cc} b & W \end{array} \right] \left[ \begin{array}{c} x_0 \\ x \end{array} \right] = \left[ \begin{array}{cc} b & W \end{array} \right] \left[ \begin{array}{c} 1 \\ x \end{array} \right] = \left[ \begin{array}{c} b \\ W \end{array} \right]^T \left[ \begin{array}{c} 1 \\ x \end{array} \right] = \theta^TX$

where $\theta = \left[ \begin{array}{c} b \\ W \end{array} \right]$ and $X = \left[ \begin{array}{c} 1 \\ x \end{array} \right]$

Polynomial Equation as Dot Product

$y = W_2x^2 + W_1x + b = b+ W_1x + W_2x^2$

$y = \left[ \begin{array}{ccc} b & W_1 & W_2 \end{array} \right] \left[ \begin{array}{c} 1 \\ x \\ x^2 \end{array} \right] = \left[ \begin{array}{c} b \\ W_1 \\ W_2 \end{array} \right]^T \left[ \begin{array}{c} 1 \\ x \\ x^2\end{array} \right] = \theta^TX$

Loss Function

For the $i^{th}$ sample, the model predicts $\hat{y}_i = \theta^TX_i$

Loss function computes for all N samples: $$L(\theta) = MSE(\theta) = \frac{1}{N}\sum_{i=1}^N{\big(y_i - \theta^TX_i\big)^2}$$

Why Dot Product?

In [ ]:
import numpy as np

# 25 features, 10000 samples
X = np.random.rand(10000, 25)
W = np.random.rand(1, 25)

y1 = np.zeros((10000, 1))
In [ ]:
%%time
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        y1[i] = y1[i] + W[0, j] * X[i, j]
In [ ]:
%%time
y2 = np.dot(X, W.T)
In [ ]:
# ensure the two operations are the same
np.testing.assert_allclose(y1, y2)

Libraries

http://scikit-learn.org/stable/modules/linear_model.html#linear-model

  • LinearRegression (Least Squares)
  • PolynomialFeatures
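A minimal sketch of fitting a polynomial with these two pieces together (the degree and toy data are arbitrary choices):

In [ ]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data following y = x^2 (made up for illustration)
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])

# Expand x into [1, x, x^2] so a *linear* model can fit a parabola
X_poly = PolynomialFeatures(degree=2).fit_transform(x)

model = LinearRegression()
model.fit(X_poly, y)
print(model.coef_, model.intercept_)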

Evaluation Metrics

http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

  • Mean squared error (MSE): the standard measure, but dominated by large errors and large target values
  • Mean squared log error (MSLE): use when targets span several orders of magnitude (e.g. exponential growth)
  • Mean absolute error (MAE): less sensitive to large errors than MSE / MSLE (which square them)
  • $R^2$ score: convenient for scoring because higher is better; the best possible score is 1.0, and it can be negative for badly fit models
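A quick sketch computing each of these metrics on made-up predictions (all values are arbitrary):

In [ ]:
from sklearn.metrics import (mean_squared_error, mean_squared_log_error,
                             mean_absolute_error, r2_score)

y_true = [3.0, 5.0, 2.5, 7.0]  # made-up targets
y_pred = [2.5, 5.0, 4.0, 8.0]  # made-up predictions

print('MSE :', mean_squared_error(y_true, y_pred))
print('MSLE:', mean_squared_log_error(y_true, y_pred))
print('MAE :', mean_absolute_error(y_true, y_pred))
print('R^2 :', r2_score(y_true, y_pred))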

Appendix: Regularized Regression Models

Here's a quick summary of other regression models that may be of interest.

  • Ridge Regression: applies L2 regularization
  • Lasso: applies L1 regularization
  • Elastic Net: applies L1 and L2 regularization

L1 vs. L2 Regularization

L1 Regularization:

  • Penalty is the sum of the absolute values of the weights
  • Not differentiable at zero; less computationally efficient
  • Produces sparse outputs, because the penalty can drive weights to exactly zero

L2 Regularization:

  • Penalty is the sum of the squares of the weights
  • Differentiable; computationally efficient
  • Produces non-sparse outputs
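A small sketch of the sparsity difference, fitting sklearn's Ridge and Lasso on made-up data where only the first two of five features matter (the alpha values are arbitrary):

In [ ]:
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
# Target depends only on the first two features (made-up coefficients)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.01 * rng.randn(100)

print(Ridge(alpha=1.0).fit(X_demo, y_demo).coef_)  # all shrunk, none exactly zero
print(Lasso(alpha=0.1).fit(X_demo, y_demo).coef_)  # irrelevant weights driven to zero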

Recap: Linear Regression

Objective:

$$\underset{w} {\arg \min} {\left\Vert Xw - y \right\Vert_2}^2 $$

When $X$ has multiple features, least squares assumes they are independent.

When features are correlated instead, least squares becomes sensitive to noise: the same noise can appear in related features and be magnified in the coefficients. A small demonstration follows the note below.

Note:

This syntax ${\left\Vert Xw - y \right\Vert_2}^2$ means the residual sum of squares: $\sum_{i=1}^n {(X_iw - y_i)}^2$
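A small sketch of that sensitivity on made-up data: duplicating a feature with a tiny perturbation lets least squares fit noise with huge opposite-signed coefficients, while Ridge (introduced below) keeps them tame.

In [ ]:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x_demo = rng.rand(50)
# Second column is a near-exact copy of the first
X_demo = np.stack([x_demo, x_demo + 1e-6 * rng.randn(50)], axis=1)
y_demo = x_demo + 0.001 * rng.randn(50)  # target really depends on one feature

print(LinearRegression().fit(X_demo, y_demo).coef_)  # coefficients can blow up
print(Ridge(alpha=0.1).fit(X_demo, y_demo).coef_)    # regularization keeps them small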

In [ ]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

# Our fake data
X = [[0, 0], [0, 0], [1, 1]]
y = [0, .1, 1]

# Let's plot our fake data in 3D
# We need 3 columns: x0, x1, y
data = np.array(X)
X0 = data[:, 0]
X1 = data[:, 1]

fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
ax.scatter(X0, X1, y)
ax.set(xlabel='X0', ylabel='X1', zlabel='y')
fig.show()
In [ ]:
from sklearn.linear_model import LinearRegression

# Our baseline model
model = LinearRegression()

model.fit(X, y)
print('Coefficients', model.coef_, 'Intercept', model.intercept_)

X_test = [[1, 1]] # needs to be 2D array
model.predict(X_test)
In [ ]:
def plot_model(model, X, ax, label, color):
    """Plots a model on the given axes
    Args:
        model: the model to plot
        X: the input data
        ax: the matplotlib axes
        label: the label for the plot
        color: the color of the plot
    """
    data = np.array(X)
    X0 = data[:, 0]
    X1 = data[:, 1]

    x_test = np.stack((np.linspace(min(X0), max(X0)),
                       np.linspace(min(X1), max(X1))),
                      axis=1)
    preds = model.predict(x_test)
    ax.plot(x_test[:, 0], x_test[:, 1], preds, label=label, color=color)
In [ ]:
# Plot the model against samples
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))

# plot the samples
ax.scatter(X0, X1, y, label='data')

# plot the model
plot_model(model, X, ax, 'linear regression', 'red')

ax.set(xlabel='X0', ylabel='X1', zlabel='y')
ax.legend()
fig.show()

Ridge Regression

Linear Regression + L2 Regularization = Ridge Regression

Penalizes large weights by adding the sum of squares of the parameters to the loss.

Objective:

$$\underset{w} {\arg \min} {\left\Vert Xw - y \right\Vert_2}^2 + \alpha {\left\Vert w \right\Vert_2}^2 $$

$\alpha$ = regularization hyperparameter. Increase it when you suspect the input features are more strongly correlated (linearly dependent).

Notes:

  • This syntax ${\left\Vert w \right\Vert_2}^2$ means $\sum_{j=1}^k {w_j}^2$, where $k$ is the number of parameters
In [ ]:
# http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
from sklearn.linear_model import Ridge

alpha = .5
ridge = Ridge(alpha=alpha)

ridge.fit(X, y)
print('Coefficients', ridge.coef_, 'Intercept', ridge.intercept_)

X_test = [[1, 1]] # needs to be 2D array
ridge.predict(X_test)
In [ ]:
# Plot the models against samples
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))

# plot the samples
ax.scatter(X0, X1, y, label='data')

# plot the models
plot_model(model, X, ax, 'linear regression', 'red')
plot_model(ridge, X, ax, 'ridge (alpha %.2f)' % alpha, 'green')

ax.set(xlabel='X0', ylabel='X1', zlabel='y')
ax.legend()
fig.show()
In [ ]:
# Same as above, but using cross-validated search
# on a selection of alpha values (like GridSearchCV)
from sklearn.linear_model import RidgeCV

alphas = [1e3, 1e2, 0.1, 1.0, 10]

ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X, y)

print('Coefficients', ridge_cv.coef_, 'Intercept', ridge_cv.intercept_, 'Best alpha', ridge_cv.alpha_)

X_test = [[1, 1]] # needs to be 2D array
ridge_cv.predict(X_test)
In [ ]:
# Plot the models against samples
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))

# plot the samples
ax.scatter(X0, X1, y, label='data')

# plot the models
plot_model(model, X, ax, 'linear regression', 'red')
plot_model(ridge, X, ax, 'ridge (alpha %.2f)' % alpha, 'green')
plot_model(ridge_cv, X, ax, 'ridge CV (alpha %.2f)' % ridge_cv.alpha_, 'black')

ax.set(xlabel='X0', ylabel='X1', zlabel='y')
ax.legend()
fig.show()

Ridge CV came up with a model that fits better (closer to the unregularized baseline).

An alpha of 0.5 likely puts too much penalty on the X0 and X1 coefficients.

Lasso Regression

Linear Regression + L1 Regularization = Lasso

Penalizes large weights by adding the sum of the absolute values of the parameters to the loss.

  • Tends to result in fewer nonzero coefficients (i.e. a sparser model)
  • Less efficient to compute (the L1 term is not differentiable at zero)

Objective:

$$\underset{w} {\arg \min} \frac{1}{2n} {\left\Vert Xw - y \right\Vert_2}^2 + \alpha {\left\Vert w \right\Vert_1} $$

$\alpha$ = regularization hyperparameter. Increase it to drive more coefficients to exactly zero (a sparser model).

Notes:

  • This syntax ${\left\Vert w \right\Vert_1}$ means $\sum_{j=1}^k |w_j|$, where $k$ is the number of parameters
In [ ]:
# http://scikit-learn.org/stable/modules/linear_model.html#lasso
from sklearn.linear_model import Lasso

alpha = .5
lasso = Lasso(alpha=alpha)

lasso.fit(X, y)
print('Coefficients', lasso.coef_, 'Intercept', lasso.intercept_)

X_test = [[1, 1]] # needs to be 2D array
lasso.predict(X_test)
In [ ]:
# Plot the models against samples
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))

# plot the samples
ax.scatter(X0, X1, y, label='data')

# plot the models
plot_model(model, X, ax, 'linear regression', 'red')
plot_model(ridge, X, ax, 'ridge (alpha %.2f)' % alpha, 'green')
plot_model(ridge_cv, X, ax, 'ridge CV (alpha %.2f)' % ridge_cv.alpha_, 'black')
plot_model(lasso, X, ax, 'lasso (alpha %.2f)' % alpha, 'orange')

ax.set(xlabel='X0', ylabel='X1', zlabel='y')
ax.legend()
fig.show()
In [ ]:
# Same as above, but using cross-validated search
# on a selection of alpha values (like GridSearchCV)
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=alphas)
lasso_cv.fit(X, y)

print('Coefficients', lasso_cv.coef_, 'Intercept', lasso_cv.intercept_, 'Best alpha', lasso_cv.alpha_)

X_test = [[1, 1]] # needs to be 2D array
lasso_cv.predict(X_test)
In [ ]:
# Plot the models against samples
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))

# plot the samples
ax.scatter(X0, X1, y, label='data')

# plot the models
plot_model(model, X, ax, 'linear regression', 'red')
plot_model(ridge, X, ax, 'ridge (alpha %.2f)' % alpha, 'green')
plot_model(ridge_cv, X, ax, 'ridge CV (alpha %.2f)' % ridge_cv.alpha_, 'black')
plot_model(lasso, X, ax, 'lasso (alpha %.2f)' % alpha, 'orange')
plot_model(lasso_cv, X, ax, 'lasso CV (alpha %.2f)' % lasso_cv.alpha_, 'purple')

ax.set(xlabel='X0', ylabel='X1', zlabel='y')
ax.legend()
fig.show()

Elastic Net

Linear Regression + L1 Reg + L2 Reg = ElasticNet

Tries to combine the benefits of both Ridge and Lasso (stability plus a sparser model).

Objective:

$$\underset{w} {\arg \min} \frac{1}{2n} {\left\Vert Xw - y \right\Vert_2}^2 + \alpha\beta {\left\Vert w \right\Vert_1} + \frac{\alpha(1-\beta)}{2} {\left\Vert w \right\Vert_2}^2 $$

$\alpha$, $\beta$ = regularization hyperparameters ($\beta$ corresponds to the l1_ratio parameter in scikit-learn)

Optional Exercise: Elastic Net

Try comparing ElasticNet and ElasticNetCV in a similar way as the above examples.

Note that there are two hyperparameters to tune.
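One possible starting point, as a sketch (the alpha and l1_ratio grids below are arbitrary choices; cv=3 matches the tiny 3-sample toy dataset):

In [ ]:
# http://scikit-learn.org/stable/modules/linear_model.html#elastic-net
from sklearn.linear_model import ElasticNet, ElasticNetCV

enet = ElasticNet(alpha=0.5, l1_ratio=0.5)  # fixed hyperparameters
enet.fit(X, y)
print('Coefficients', enet.coef_, 'Intercept', enet.intercept_)

# Cross-validated search over both hyperparameters
enet_cv = ElasticNetCV(alphas=[0.1, 0.5, 1.0], l1_ratio=[0.2, 0.5, 0.8], cv=3)
enet_cv.fit(X, y)
print('Coefficients', enet_cv.coef_, 'Intercept', enet_cv.intercept_,
      'Best alpha', enet_cv.alpha_, 'Best l1_ratio', enet_cv.l1_ratio_)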