Hypothesis space $H$: linear functions for continuous-valued inputs.
Univariate linear regression
Goal: fit a straight line so that the error (empirical loss) is minimized.
The weights $w$ are the coefficients to be learned, defined as a vector $w = (w_0, w_1)$.
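Written out, the hypothesis is the familiar straight line:

$$
h_w(x) = w_1 x + w_0
$$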
When the derivative has a closed form
1. Calculating the squared loss $L_2$
To calculate the empirical loss we use the squared loss $L_2$, summed over all training examples. Gauss showed that when the data are generated with normally distributed noise, minimizing the squared loss gives the most probable values of the weights.
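Spelled out for $N$ training examples $(x_j, y_j)$:

$$
\mathrm{Loss}(h_w) = \sum_{j=1}^{N} L_2\big(y_j, h_w(x_j)\big) = \sum_{j=1}^{N} \big(y_j - h_w(x_j)\big)^2 = \sum_{j=1}^{N} \big(y_j - (w_1 x_j + w_0)\big)^2
$$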
2. Calculating the partial derivatives of the loss to find the minimum
We want to find the minimal loss. We can do so by calculating the partial derivatives of the loss with respect to $w_0$ and $w_1$ and finding the weights for which these derivatives are 0.
Which results in:
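For reference, the standard closed-form solution of these two equations is:

$$
w_1 = \frac{N \sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N \sum_j x_j^2 - \big(\sum_j x_j\big)^2}
\qquad
w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}
$$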
Example
Each of the $N$ points: size in square feet vs. price of the house.
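A minimal sketch of this closed-form fit in Python; the sizes and prices are made-up illustrative numbers, not data from these notes:

```python
import numpy as np

# Hypothetical data: size in square feet vs. price of the house (illustrative values only)
sizes = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])
prices = np.array([245_000.0, 312_000.0, 279_000.0, 308_000.0, 405_000.0])

N = len(sizes)

# Closed-form (least-squares) solution for univariate linear regression
w1 = (N * np.sum(sizes * prices) - np.sum(sizes) * np.sum(prices)) / (
    N * np.sum(sizes**2) - np.sum(sizes) ** 2
)
w0 = (np.sum(prices) - w1 * np.sum(sizes)) / N

def h(x):
    # The fitted straight line
    return w1 * x + w0

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
print(f"Predicted price for 2000 sq ft: {h(2000):.2f}")
```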
Weight space = the space defined by all possible settings of the weights.
For univariate linear regression, the weight space is two-dimensional, so we can graph the loss as a function of $w_0$ and $w_1$ in a 3D plot.
The loss function is convex → this is true for every linear regression problem with a squared loss function, which means there are no local minima.
When the derivative does not have a closed form
Often, when we want to fit linear data, we just use these equations and that's the end of the story.
But other times the derivatives leading to this equation do not have a closed form.
For example, an infinite sum would generally not be considered closed-form.
Then we can't solve the problem with calculus → it becomes a general optimization search problem in a continuous weight space → local search with a hill-climbing algorithm (gradient descent).
Hill climbing algorithm
Convergence to the solution is guaranteed (for a small enough step size), but it can be very slow.
Update each weight with the rule below (where $\alpha$ = step size / learning rate). The weights must be updated individually for $i \in \{0, 1\}$.
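Written out, this is the usual gradient-descent step:

$$
w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} \mathrm{Loss}(w)
$$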
One training example
Calculating the derivative of the single-example loss uses the chain rule (because $\frac{\partial}{\partial x} x^2 = 2x$ and $\frac{\partial}{\partial x} x = 1$). Working it out for the individual weights gives the update that must be applied to each weight in every step of the algorithm:
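For a single example $(x, y)$ (the standard derivation):

$$
\frac{\partial}{\partial w_i} \mathrm{Loss}(w)
= \frac{\partial}{\partial w_i} \big(y - h_w(x)\big)^2
= 2\big(y - h_w(x)\big) \cdot \frac{\partial}{\partial w_i} \big(y - (w_1 x + w_0)\big)
$$

For the individual weights this gives:

$$
\frac{\partial}{\partial w_0} \mathrm{Loss}(w) = -2\big(y - h_w(x)\big)
\qquad
\frac{\partial}{\partial w_1} \mathrm{Loss}(w) = -2\big(y - h_w(x)\big)\,x
$$

so (folding the constant 2 into $\alpha$) the weights are updated in each step with:

$$
w_0 \leftarrow w_0 + \alpha \big(y - h_w(x)\big)
\qquad
w_1 \leftarrow w_1 + \alpha \big(y - h_w(x)\big)\,x
$$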
Multiple training examples
We want to iteratively lower the loss with respect to all examples. There are two variations of training with multiple examples:
batch gradient descent
The derivative of a sum = the sum of the derivatives of its components. Gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(x_j, y_j)$.
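Summing the single-example updates over all $N$ examples gives the batch rule:

$$
w_0 \leftarrow w_0 + \alpha \sum_{j=1}^{N} \big(y_j - h_w(x_j)\big)
\qquad
w_1 \leftarrow w_1 + \alpha \sum_{j=1}^{N} \big(y_j - h_w(x_j)\big)\, x_j
$$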
stochastic gradient descent
We only consider a single randomly chosen example at a time, using the equation for a single training example.
More efficient, and can be done online with data constantly coming in - however, it does not guarantee convergence (with a fixed learning rate). It can oscillate around the minimum without settling down.
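Both variants are easy to sketch in plain Python; the data points, learning rate, step counts, and function names below are made-up illustrative choices, not part of the notes:

```python
import random

# Hypothetical data: (size in 1000 sq ft, price in 100k) - illustrative values only
data = [(1.4, 2.45), (1.6, 3.12), (1.7, 2.79), (1.875, 3.08), (2.35, 4.05)]

def h(w0, w1, x):
    # Univariate hypothesis: a straight line
    return w1 * x + w0

def batch_gradient_descent(data, alpha=0.01, steps=10_000):
    """Each step uses the error summed over ALL examples."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errors = [y - h(w0, w1, x) for x, y in data]
        w0 += alpha * sum(errors)
        w1 += alpha * sum(e * x for e, (x, _) in zip(errors, data))
    return w0, w1

def stochastic_gradient_descent(data, alpha=0.01, steps=10_000):
    """Each step uses ONE randomly chosen example; may keep oscillating."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        error = y - h(w0, w1, x)
        w0 += alpha * error
        w1 += alpha * error * x
    return w0, w1

print(batch_gradient_descent(data))
print(stochastic_gradient_descent(data))
```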
Multivariate linear regression
The input $x_j$ from our example $j$ is an $n$-element vector.
The hypothesis space is not limited to 2 dimensions / weights:
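The multivariate hypothesis is then a weighted sum of all the input attributes:

$$
h_w(x_j) = w_0 + w_1 x_{j,1} + \dots + w_n x_{j,n} = w_0 + \sum_{i=1}^{n} w_i x_{j,i}
$$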
In this formula the term $w_0$ (the intercept) stands out - we define a dummy input attribute $x_{j,0} = 1$ to make the calculations easier.
Then $h$ is simply the dot product of the weights and the input vector (or the matrix product of the transpose of the weight vector and the input vector):
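With the dummy attribute $x_{j,0} = 1$:

$$
h_w(x_j) = w \cdot x_j = w^{\top} x_j = \sum_{i=0}^{n} w_i\, x_{j,i}
$$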
We now want to minimize the squared-error loss $L_2$ over the examples.
Gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(x_j, y_j)$ (= batch gradient descent).
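For the multivariate case the batch update generalizes to:

$$
w_i \leftarrow w_i + \alpha \sum_{j} x_{j,i}\, \big(y_j - h_w(x_j)\big)
$$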