Regression with Linear Models (optional)

Linear Regression

Hypothesis space $\mathcal{H}$: linear functions for continuous-valued inputs.

Univariate linear regression

Goal: fit a straight line that minimizes the error (empirical loss).

The weights $w$ are the coefficients to be learned, defined as a vector $\vec w = (w_0, w_1)$.

$$h_{\vec w}(x) = w_1 x + w_0$$

When the derivative has a closed form

1. Calculating the squared loss $L_2$

To calculate the empirical loss we use the squared loss $L_2$, summed over all training examples.

Gauss showed that, assuming normally distributed noise, minimizing the squared loss yields the most likely (maximum-likelihood) values of the weights.

$$\text{Loss}(h_{\vec w}) = \sum_{j=1}^N L_2\!\left(y_j, h_{\vec w}(x_j)\right) = \sum_{j=1}^N \left(y_j - h_{\vec w}(x_j)\right)^2 = \sum_{j=1}^N \left(y_j - (w_1 x_j + w_0)\right)^2$$
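
A minimal sketch in plain Python (function and variable names are made up) of this sum over the training examples:

```python
def squared_loss(w0, w1, xs, ys):
    """Empirical L2 loss of h_w(x) = w1*x + w0, summed over all training examples."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))
```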

2. Calculating the partial derivatives of the loss to find the minimum

We want to find the minimal loss. We can do so by calculating the partial derivatives of the loss and finding the weights for which those derivatives equal 0.

$$\frac{\partial}{\partial w_0} \text{Loss}(h_{\vec w}) = \frac{\partial}{\partial w_0} \sum_{j=1}^N \left(y_j - (w_1 x_j + w_0)\right)^2 = 0$$
$$\frac{\partial}{\partial w_1} \text{Loss}(h_{\vec w}) = \frac{\partial}{\partial w_1} \sum_{j=1}^N \left(y_j - (w_1 x_j + w_0)\right)^2 = 0$$

Which results in:

$$w_1 = \frac{N\left(\sum x_j y_j\right) - \left(\sum x_j\right)\left(\sum y_j\right)}{N\left(\sum x_j^2\right) - \left(\sum x_j\right)^2}$$
$$w_0 = \frac{\sum y_j - w_1\left(\sum x_j\right)}{N}$$
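
A minimal sketch in plain Python (hypothetical function name, made-up example data) that evaluates these closed-form equations directly:

```python
def fit_univariate(xs, ys):
    """Closed-form least-squares fit of y ≈ w1*x + w0."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    w1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    w0 = (sum_y - w1 * sum_x) / n
    return w0, w1

# e.g. house size in square feet vs. price (made-up numbers)
w0, w1 = fit_univariate([1000, 1500, 2000], [200, 290, 410])
```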

Example

Each of the $n$ points: size in square feet vs. price of the house.

weight space = space defined by all possible settings of the weights.

For univariate linear regression, the weight space is two-dimensional, so we can graph the loss as a function in a 3D plot.

The loss function is convex → this holds for every linear regression problem with a squared loss function, which means there are no local minima.

When the derivative does not have a closed form

Often when we want to fit linear data we just use these equations and that's the end of the story.

But in other cases the equations that define the minimum loss have no closed-form solution.

For example, an infinite sum would generally not be considered closed-form.

Then we can't solve the problem with calculus alone → it becomes a general optimization search problem in a continuous weight space → local search with a hill climbing algorithm.

Hill climbing algorithm

💡
We need the derivative of the loss to subtract it from the current weight value.

Convergence to a solution is guaranteed, but it can be very slow.

Update each weight using the rule below, where $\alpha$ = step size / learning rate.

Each weight $w_i$, $i \in \{0, 1\}$, must be updated individually:

$$w_i \leftarrow w_i - \alpha \cdot \frac{\partial}{\partial w_i} \operatorname{Loss}(\mathbf{w})$$

One training example

Calculating the derivative:

Because $\frac{\partial}{\partial x} x^2 = 2x$ and $\frac{\partial}{\partial x} x = 1$:

$$\begin{aligned}\frac{\partial}{\partial w_i} \operatorname{Loss}(\mathbf{w}) &= \frac{\partial}{\partial w_i}\left(y - h_{\mathbf{w}}(x)\right)^2 \\ &= 2\left(y - h_{\mathbf{w}}(x)\right) \cdot \frac{\partial}{\partial w_i}\left(y - h_{\mathbf{w}}(x)\right) \\ &= 2\left(y - h_{\mathbf{w}}(x)\right) \cdot \frac{\partial}{\partial w_i}\left(y - (w_1 x + w_0)\right)\end{aligned}$$

Now for the individual weights:

$$\frac{\partial}{\partial w_0} \operatorname{Loss}(\mathbf{w}) = -2\left(y - h_{\mathbf{w}}(x)\right)$$
$$\frac{\partial}{\partial w_1} \operatorname{Loss}(\mathbf{w}) = -2\left(y - h_{\mathbf{w}}(x)\right) \cdot x$$

Therefore the weights must be updated in each step of the algorithm with the following rules (the constant factor 2 is absorbed into the learning rate $\alpha$):

$$w_0 \leftarrow w_0 + \alpha\left(y - h_{\mathbf{w}}(x)\right)$$
$$w_1 \leftarrow w_1 + \alpha\left(y - h_{\mathbf{w}}(x)\right) \cdot x$$
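
A minimal sketch of this single-example update in plain Python (function name and default learning rate are assumptions):

```python
def sgd_step(w0, w1, x, y, alpha=0.01):
    """One gradient-descent update on a single training example (x, y)."""
    error = y - (w1 * x + w0)   # y - h_w(x)
    w0 += alpha * error
    w1 += alpha * error * x
    return w0, w1
```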

Multiple training examples

We want to iteratively lower the loss with respect to all examples. There are two variations of training with multiple examples:

  1. batch gradient descent

    The derivative of a sum = the sum of the derivatives of its components.

    Gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(x_j, y_j)$ (see the sketch after this list).

    $$w_0 \leftarrow w_0 + \alpha \sum_j \left(y_j - h_{\mathbf{w}}(x_j)\right)$$
    $$w_1 \leftarrow w_1 + \alpha \sum_j \left(y_j - h_{\mathbf{w}}(x_j)\right) \cdot x_j$$
  2. stochastic gradient descent

    We only consider a single randomly chosen example at a time, using the update equations for a single training example.

    More efficient and usable online, with data constantly coming in. However, it does not guarantee convergence: it can oscillate around the minimum without settling down.
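
A minimal sketch of the batch variant in plain Python (names and defaults are assumptions); the stochastic variant would instead apply the single-example update from above to one randomly chosen $(x_j, y_j)$ per step:

```python
def batch_gradient_descent(xs, ys, alpha=0.0001, epochs=1000):
    """Batch gradient descent for univariate linear regression y ≈ w1*x + w0."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        # per-example errors y_j - h_w(x_j) over the whole training set
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errors)
        w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1
```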

Multivariate linear regression

The input $\vec x_j$ from our example $j$ is an $n$-element vector.

Hypothesis space not limited to $2$ dimensions / weights:

$$h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_i w_i x_{j,i}$$

In this formula the term $w_0$ (the intercept) stands out, so we define a dummy input attribute $x_{j,0} = 1$ to make the calculations easier.

Then $h$ is simply the dot product of the weights and the input vector (or the matrix product of the transpose of the weights and the input vector):

$$h_{\mathbf{w}}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^{\top} \mathbf{x}_j = \sum_i w_i x_{j,i}$$
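
A small sketch with NumPy (made-up numbers): after prepending the dummy attribute $x_{j,0} = 1$, the hypothesis is a single dot product:

```python
import numpy as np

w = np.array([30.0, 0.2, 5.0])        # w_0 (intercept), w_1, w_2
x_j = np.array([1.0, 1500.0, 3.0])    # dummy x_{j,0} = 1, then the real attributes
h = np.dot(w, x_j)                    # h_w(x_j) = w . x_j
```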

We now want to minimize the squared-error loss $L_2$ over the examples.

Gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(\mathbf{x}_j, y_j)$ (= batch gradient descent).

$$w_i \leftarrow w_i + \alpha \sum_j \left(y_j - h_{\mathbf{w}}(\mathbf{x}_j)\right) \cdot x_{j,i}$$
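
A minimal sketch of this update with NumPy (the function name, the defaults, and the assumption that `X` already contains the column of ones are mine):

```python
import numpy as np

def batch_gd_multivariate(X, y, alpha=1e-7, epochs=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (N, n+1) input matrix whose first column is all ones (dummy attribute x_{j,0}).
    y : (N,) vector of target values.
    alpha must be small enough for the scale of the data, otherwise the updates diverge.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = y - X @ w            # y_j - h_w(x_j) for every example j
        w += alpha * (X.T @ errors)   # w_i += alpha * sum_j errors_j * x_{j,i}
    return w
```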