2. Calculating partial derivatives of the loss to find the minimum
We want to find the minimal loss. We can do so by calculating the partial derivatives of the loss with respect to the weights and finding the weights for which these derivatives are 0.
Each of the $n$ points: size in square feet vs. price of house.
weight space = the space defined by all possible settings of the weights.
For univariate linear regression, the weight space is two-dimensional, so we can graph the loss as a function in a 3D plot.
The loss function is convex → this is true for every linear regression problem with a squared loss function, which means there are no local minima.
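A hedged sketch of the closed-form solution these partial derivatives lead to (the standard least-squares result for $N$ examples $(x_j, y_j)$ and the hypothesis $h_w(x) = w_1 x + w_0$; the notation here is assumed, not from the notes): setting $\frac{\partial}{\partial w_0} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0$ and $\frac{\partial}{\partial w_1} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0$ gives

$w_1 = \dfrac{N\sum_j x_j y_j - \left(\sum_j x_j\right)\left(\sum_j y_j\right)}{N\sum_j x_j^2 - \left(\sum_j x_j\right)^2}, \qquad w_0 = \dfrac{\sum_j y_j - w_1 \sum_j x_j}{N}$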
When the derivative does not have a closed form
Often when we want to fit linear data we just use these equations and that's the end of the story.
But for other models, setting the derivatives to zero does not yield a closed-form solution.
For example, an infinite sum would generally not be considered closed-form.
Then we can't solve the problem with calculus → it becomes a general optimization search problem in a continuous weight space → local search with a hill-climbing algorithm.
Hill climbing algorithm (gradient descent)
💡
We need the derivative of the loss so we can subtract it from the current weight value.
Convergence to a solution is guaranteed, but it is very slow.
Update each weight with $w_i \leftarrow w_i - \alpha\,\frac{\partial}{\partial w_i}\,Loss(\mathbf{w})$ (where $\alpha$ = step size / learning rate).
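Worked out for a single training example $(x, y)$ with squared loss (a standard chain-rule step; the hypothesis $h_w(x) = w_1 x + w_0$ is assumed):

$\frac{\partial}{\partial w_0}\,(y - h_w(x))^2 = -2\,(y - h_w(x)), \qquad \frac{\partial}{\partial w_1}\,(y - h_w(x))^2 = -2\,(y - h_w(x))\,x$

so, folding the constant 2 into $\alpha$:

$w_0 \leftarrow w_0 + \alpha\,(y - h_w(x)), \qquad w_1 \leftarrow w_1 + \alpha\,(y - h_w(x))\,x$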
We want to iteratively lower the loss with respect to all examples. There are two variations of training with multiple examples:
batch gradient descent
The derivative of a sum is the sum of the derivatives of its components.
Gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(x_j, y_j)$, as in the sketch below.
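Concretely, for univariate regression the batch updates would presumably be (standard form, with the factor of 2 again folded into $\alpha$):

$w_0 \leftarrow w_0 + \alpha \sum_j \big(y_j - h_w(x_j)\big), \qquad w_1 \leftarrow w_1 + \alpha \sum_j \big(y_j - h_w(x_j)\big)\,x_j$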
stochastic gradient descent
We only consider a single randomly chosen example at a time, using the update equations for a single training example.
More efficient, and it can be done online with data constantly coming in; however, it does not guarantee convergence: it can oscillate around the minimum without settling down.
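A minimal Python sketch of the two variants for the univariate case, assuming training data in lists `xs` and `ys` and the hypothesis $h_w(x) = w_0 + w_1 x$ (the function names and default values are my own, not from the notes):

```python
import random

def batch_gd_epoch(w0, w1, xs, ys, alpha=0.001):
    """Batch gradient descent: one update using the sum over ALL examples."""
    errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]  # y_j - h_w(x_j)
    w0 += alpha * sum(errors)
    w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1

def sgd_step(w0, w1, xs, ys, alpha=0.001):
    """Stochastic gradient descent: one update from a single randomly chosen example."""
    j = random.randrange(len(xs))
    error = ys[j] - (w0 + w1 * xs[j])
    w0 += alpha * error
    w1 += alpha * error * xs[j]
    return w0, w1
```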
Multivariate linear regression
The input $\mathbf{x}_j$ from our example $j$ is an $n$-element vector.
Hypothesis space not limited to 2 dimensions / weights:
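The hypothesis presumably meant here is the standard multivariate one (with the usual convention of a dummy attribute $x_{j,0} = 1$; this reconstruction is mine):

$h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \dots + w_n x_{j,n} = \sum_{i=0}^{n} w_i\, x_{j,i} = \mathbf{w} \cdot \mathbf{x}_j$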
We now want to minimize the squared-error loss $L_2$ over the examples.
As before, gradient descent will reach the unique minimum of the loss function by updating each weight individually with the sum over all training examples $(\mathbf{x}_j, y_j)$ ( = batch gradient descent), as in the sketch below.
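A rough NumPy sketch of this batch update in the multivariate case (the data layout, with a leading column of ones standing in for the $w_0$ term, and all names are my assumptions):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.001, epochs=1000):
    """Multivariate batch gradient descent for linear regression with squared loss.

    X: (N, n+1) matrix whose first column is all ones (the dummy attribute x_{j,0} = 1)
    y: (N,) vector of target values
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = y - X @ w              # y_j - h_w(x_j) for every example
        w += alpha * (X.T @ errors)     # w_i += alpha * sum_j (y_j - h_w(x_j)) * x_{j,i}
    return w
```

With a sufficiently small $\alpha$, this should approach the same unique minimum of the convex loss.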