
Linear Classification

💡
This is not part of the curriculum - just here to show that perceptrons are actually linear classifiers.

Hard threshold

A decision boundary is a line (or a surface, in higher dimensions) that separates two classes. The line is then called a linear separator.

Using the convention of a dummy input $x_0 = 1$, we can write the classification hypothesis as:

$$g(\mathbf{x}) = h_{\mathbf{w}}(\mathbf{x}) = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} \geq 0 \\ 0, & \text{otherwise} \end{cases}$$

Alternatively, one could see it as a threshold function in which $\mathbf{w} \cdot \mathbf{x}$ is compared against a threshold $t$; with the dummy input, $t$ is absorbed into $w_0$, so the comparison is against 0.
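To make this concrete, here is a minimal sketch of the hard-threshold hypothesis in Python; the weight vector and example point are made up purely for illustration.

```python
import numpy as np

def h_hard(w, x):
    """Hard-threshold hypothesis: 1 if w . x >= 0, else 0.
    x is assumed to already contain the dummy input x_0 = 1."""
    return 1 if np.dot(w, x) >= 0 else 0

# Made-up weights: decision boundary x_1 + x_2 = 1.5 (w_0 absorbs the threshold).
w = np.array([-1.5, 1.0, 1.0])
x = np.array([1.0, 2.0, 0.0])   # [x_0 = 1, x_1, x_2]
print(h_hard(w, x))             # 1, because -1.5 + 2.0 + 0.0 >= 0
```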

We want to choose the weights to minimize the loss.

Previously, in linear regression, we did this both in closed form (by setting the gradient to zero and solving for the weights) and by gradient descent in weight space (a hill-climbing algorithm).

Now we cannot do either of those things, because the gradient is zero almost everywhere in weight space, except at the points where $\mathbf{w} \cdot \mathbf{x} = 0$, and at those points the gradient is undefined.
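As a small illustration of why gradient descent breaks down here, the sketch below (reusing the made-up weights from above) perturbs each weight slightly and shows that the hard-threshold output does not change at all, so any finite-difference gradient comes out as zero.

```python
import numpy as np

def h_hard(w, x):
    # Same hard-threshold hypothesis as in the sketch above.
    return 1 if np.dot(w, x) >= 0 else 0

w = np.array([-1.5, 1.0, 1.0])
x = np.array([1.0, 2.0, 0.0])
eps = 1e-4
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    # The output is piecewise constant, so small weight changes leave it untouched.
    print(h_hard(w_plus, x) - h_hard(w_minus, x))   # 0 for every weight
```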

Perceptron learning rule

A weight update rule that converges to a solution - a linear separator that classifies the data perfectly - whenever the data are linearly separable, though possibly very slowly.

$$w_i \leftarrow w_i + \alpha \left(y - h_{\mathbf{w}}(\mathbf{x})\right) \cdot x_i$$

For multiple examples the rule is applied as in stochastic gradient descent: we consider a single randomly chosen example at a time and apply the single-example update to each weight.

This is more efficient and can be done online, with data constantly coming in - however, it does not guarantee convergence: it can oscillate around the minimum without settling down.

How exactly does it work?

We have our current output $h_{\mathbf{w}}(\mathbf{x})$ and the example's true label $y \in \{0, 1\}$ for a given $\mathbf{x}$.

If $y = h_{\mathbf{w}}(\mathbf{x})$, the prediction is correct and the weights do not change. If $y = 1$ but $h_{\mathbf{w}}(\mathbf{x}) = 0$, then $\mathbf{w} \cdot \mathbf{x}$ needs to become larger, so each $w_i$ is increased when $x_i$ is positive and decreased when $x_i$ is negative. If $y = 0$ but $h_{\mathbf{w}}(\mathbf{x}) = 1$, the opposite happens.
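A minimal sketch of the rule in action, assuming a made-up, linearly separable toy data set (logical AND) and an arbitrary learning rate; one randomly chosen example is used per update, as described above, and on this tiny set the rule typically ends up reproducing the labels.

```python
import numpy as np

def perceptron_update(w, x, y, alpha=0.1):
    """One perceptron update for a single example (x includes the dummy input x_0 = 1)."""
    h = 1 if np.dot(w, x) >= 0 else 0
    return w + alpha * (y - h) * x

# Made-up toy data set: logical AND of x_1 and x_2, with x_0 = 1 prepended.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(1000):
    i = rng.integers(len(X))              # single randomly chosen example
    w = perceptron_update(w, X[i], y[i])

print([1 if np.dot(w, x) >= 0 else 0 for x in X])   # predictions for the four training points
```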

Soft threshold

The previous hypothesis $h_{\mathbf{w}}(\mathbf{x})$ is not differentiable; it is a discontinuous function of its inputs and weights, which makes learning very unpredictable.

It also always announces a completely confident prediction of 1 or 0, even for examples that are very close to the boundary - we often want more granularity in predictions.

We therefore now want to use the sigmoid function $g$. It converges more slowly than the hard threshold but behaves much more predictably.

$$g(\mathbf{x}) = h_{\mathbf{w}}(\mathbf{x}) = \operatorname{Logistic}(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$$

Notice that the output, being a number between 0 and 1, can be interpreted as a probability of belonging to the class labeled 1.
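A minimal sketch of the soft-threshold hypothesis, reusing the made-up weights from earlier; the two example points are chosen only to show how the output becomes more confident further from the decision boundary.

```python
import numpy as np

def h_logistic(w, x):
    """Soft-threshold hypothesis: Logistic(w . x), a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([-1.5, 1.0, 1.0])                    # same made-up weights as above
print(h_logistic(w, np.array([1.0, 2.0, 0.0])))   # ~0.62: close to the boundary, uncertain
print(h_logistic(w, np.array([1.0, 5.0, 5.0])))   # ~1.00: far from the boundary, confident
```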

Logistic Regression

The process of fitting the weights of this model to minimize loss on a data set is called logistic regression.

There is no easy closed-form solution to find the optimal value of $\mathbf{w}$ → search for the minimal loss by gradient descent:

We can use the squared-error $\operatorname{Loss}$ function again, because our hypothesis values are no longer restricted to $\{0, 1\}$ (and the hypothesis is now differentiable).

$$\operatorname{Loss}(\mathbf{w}) = \left(y - h_{\mathbf{w}}(\mathbf{x})\right)^{2}$$

One training example

Because of the chain rule for $g(f(x))$:

$$\frac{\partial\, g(f(x))}{\partial x} = g'(f(x)) \cdot \frac{\partial f(x)}{\partial x}$$

$$\begin{aligned}
\frac{\partial}{\partial w_{i}} \operatorname{Loss}(\mathbf{w}) &= \frac{\partial}{\partial w_{i}}\left(y-h_{\mathbf{w}}(\mathbf{x})\right)^{2} \\
&= 2\left(y-h_{\mathbf{w}}(\mathbf{x})\right) \cdot \frac{\partial}{\partial w_{i}}\left(y-h_{\mathbf{w}}(\mathbf{x})\right) \\
&= -2\left(y-h_{\mathbf{w}}(\mathbf{x})\right) \cdot g^{\prime}(\mathbf{w} \cdot \mathbf{x}) \cdot \frac{\partial}{\partial w_{i}}\, \mathbf{w} \cdot \mathbf{x} \\
&= -2\left(y-h_{\mathbf{w}}(\mathbf{x})\right) \cdot g^{\prime}(\mathbf{w} \cdot \mathbf{x}) \cdot x_{i}
\end{aligned}$$

The derivative $g'$ of the logistic function satisfies

$$g'(z) = g(z)\,(1 - g(z))$$

so we have

$$g'(\mathbf{w} \cdot \mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x})\,(1 - g(\mathbf{w} \cdot \mathbf{x})) = h_{\mathbf{w}}(\mathbf{x})\left(1 - h_{\mathbf{w}}(\mathbf{x})\right)$$

so the weight update for minimizing the loss is

$$w_i \leftarrow w_i + \alpha\left(y - h_{\mathbf{w}}(\mathbf{x})\right) \cdot h_{\mathbf{w}}(\mathbf{x})\left(1 - h_{\mathbf{w}}(\mathbf{x})\right) \cdot x_i$$
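Putting the derived update rule into a minimal sketch: the toy data set, learning rate, and number of iterations below are made up, and the constant factor 2 from the derivation is folded into $\alpha$.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_update(w, x, y, alpha=0.5):
    """One stochastic gradient step of the derived rule (factor 2 folded into alpha)."""
    h = logistic(np.dot(w, x))
    return w + alpha * (y - h) * h * (1.0 - h) * x

# Made-up toy data set (x_0 = 1 prepended), labelled by whether x_1 + x_2 >= 1.5.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 2, 0], [1, 0, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(20000):
    i = rng.integers(len(X))              # single randomly chosen example, as before
    w = logistic_update(w, X[i], y[i])

print(np.round(logistic(X @ w), 2))       # predicted probabilities for the training points
```

Note how many iterations even this tiny set takes: the $h_{\mathbf{w}}(\mathbf{x})(1 - h_{\mathbf{w}}(\mathbf{x}))$ factor shrinks the steps as the predictions saturate, which matches the slower convergence mentioned above.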