Neural Networks

📎 Linear Classification

Mathematical Model of Neuron

McCulloch and Pitts (1943)

The inputs stay the same, only the weights are changed.

Each unit has a dummy input $a_0 = 1$ with the weight $w_{0,j}$ .

The unit $j$ 's output activation $a_j$ is

$$a_j = g(in_j) = g\biggl( \sum_{i=0}^n w_{i,j}a_i \biggr)$$

Where $g$ is an activation function :

perceptron hard threshold $g(x)=\left\{\begin{array}{ll}1 & \text { if } x \geq 0 \\0 & \text { otherwise }\end{array}\right.$

sigmoid perceptron logistic function $g(x) = \frac{1}{1+e^{-x}}$
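
As a minimal sketch of the unit model above, the following Python code computes $a_j = g(in_j)$ for both activation functions. The function names and the example weights are assumptions made for illustration; they are not part of the original notes.

```python
import math

def hard_threshold(x):
    """Perceptron activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def logistic(x):
    """Sigmoid perceptron activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g):
    """Compute a_j = g(sum_i w_ij * a_i).

    weights[0] is the bias weight w_0j. The dummy input a_0 = 1 is
    prepended here, so `inputs` holds only a_1..a_n.
    """
    activations = [1.0] + list(inputs)
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)

# Hypothetical weights: a bias of -1.5 makes the unit behave like a logical AND.
and_weights = [-1.5, 1.0, 1.0]
print(unit_output(and_weights, [1, 1], hard_threshold))  # 1
print(unit_output(and_weights, [1, 0], hard_threshold))  # 0
print(unit_output([0.0, 0.0, 0.0], [1, 1], logistic))    # 0.5
```

Changing only the weights changes the function the unit computes, while the inputs and the activation function stay the same.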

Networks Structures

A link from $u_i$ to $u_j$ serves to propagate the activation $a_i$ .

Each link has a numeric weight $w_{i,j}$ from the matrix $\mathbf w$ .

All nodes in adjacent layers are connected to each other. If we don't want a connection, we set its weight to 0.

feed-forward neural network (focus of this chapter)
Connections only in one direction: Upstream nodes → downstream nodes.
Directed acyclic graph DAG.
Has no loops or internal states. Units are arranged in layers.

recurrent neural network RNN
enables internal states (through feedback loops, as in an RS flip-flop) → has short-term memory
can become unstable, oscillate, or behave chaotically
- Hopfield Networks
  Only input and output nodes
  bidirectional connections and symmetric weights, that is, $w_{i,j}=w_{j,i}$
  sign activation function (for output $a_i$ ):
  $$g(x) = \operatorname{sign}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
  that enables having a (holographic associative) memory → see RS-FlipFlop
  - training on examples possible
  - stimulus or partial data to get "closest" training example
  - $N$ units can reliably store up to about $0.138 \cdot N$ training examples
- Boltzmann Machines
  Hidden nodes, symmetric weights
  use stochastic activation function → output 1 has a specific probability
  state transitions very similar to simulated annealing
- Long Short-Term Memory (LSTM)
  building units to achieve a memory for long(er) time in RNNs
  record value in a cell using a write gate, a keep gate, and a read gate
  well-suited to deal with time series having time lags of unknown size and duration between events
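
The associative-memory behaviour of a Hopfield network described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are set with the Hebbian outer-product rule (the notes do not say which training rule is used), units are updated asynchronously, and all function names are hypothetical.

```python
def sign(x):
    """Sign activation as defined above: 1 if x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1

def train_hopfield(patterns):
    """Build symmetric weights (w[i][j] == w[j][i]) from +1/-1 patterns.

    Assumption: Hebbian outer-product rule, no self-connections.
    """
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=5):
    """Update units one at a time until the state has (hopefully) settled."""
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        for i in range(n):
            state[i] = sign(sum(w[i][j] * state[j] for j in range(n)))
    return state

stored = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hopfield([stored])

# Stimulus: the stored pattern with two units flipped.
noisy = list(stored)
noisy[0] = -1
noisy[3] = 1
print(recall(w, noisy) == stored)  # True
```

Starting from the partial or corrupted stimulus, the network settles into the closest stored example.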

Single-layer neural networks

Single-layer feed-forward neural networks (Perceptrons)

No hidden layers.

Linear classifiers (whether hard or soft) can represent linear decision boundaries (linear classification or regression) in the input space.

The XOR function is not linearly separable, so a perceptron cannot represent it.

Perceptron Learning

The network structure stays the same; we only want to learn the right weights.

Each output unit has its own weights, so we can train the weights for each scalar output separately, in isolation from the other outputs.

We want to learn $f(\mathbf x)$

$h_{\mathbf{w}}(\mathbf{x}) = g(in)$ for unit $u_i$

$\text{Err}=\left(y-h_{\mathbf{w}}(\mathbf{x})\right)$

$\operatorname{Loss}(\mathbf{w})={\left(y-h_{\mathbf{w}}(\mathbf{x})\right)}^{2}$

$\Delta=\operatorname{Err}_{i} ~\cdot ~g^{\prime}(i n_{i})$

We want the minimal loss, through gradient search on $\mathbf w$ :

We want to find the partial derivative of the loss function for each weight $w_i$ :

$$\frac{\partial }{\partial w_{i}}\text{Loss}(\mathbf w)=-2 \cdot \underbrace{\underbrace{(y-h_{\mathbf{w}}(\mathbf{x}))}_{=\text{Err}} \cdot g^{\prime}(in_i)}_{=\Delta} \cdot x_{i}$$

$$w_i \leftarrow w_i + \alpha \cdot (y-h_{\mathbf{w}}(\mathbf{x})) \cdot g^{\prime}(in_i) \cdot x_{i}$$

Perceptron learning rule (Hard threshold)
The hard threshold is not differentiable at 0, and its derivative is 0 everywhere else, so the gradient cannot be used; the factor $g'$ is simply dropped from the update.
For each example $(\mathbf{x}, y)$ :
$$w_i \leftarrow w_i + \alpha \cdot (y-h_{\mathbf{w}}(\mathbf{x})) \cdot x_i$$
- How exactly does it work?
  We have our current output $h_{\vec w}(\vec x)$ and the target output $y \in \{0,1\}$ of the example for a given $\vec x$ .
  - $y = h_{\vec w}(\vec x)$
    Our weights are correct. $w_i$ stays unchanged.
  - $y=1$ and $h_{\vec w}(\vec x) = 0$
    Our weights are incorrect. We want to make $h_{\vec w}(\vec x) = 1$
    If $x_i$ is positive $\rarr w_i$ gets increased
    If $x_i$ is negative $\rarr w_i$ gets decreased
  - $y=0$ and $h_{\vec w}(\vec x) = 1$
    Our weights are incorrect. We want to make $h_{\vec w}(\vec x) = 0$
    If $x_i$ is positive $\rarr w_i$ gets decreased
    If $x_i$ is negative $\rarr w_i$ gets increased
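
The case analysis above can be tried out with a small Python sketch of the perceptron learning rule. The function names, the learning rate $\alpha = 1$, the zero initial weights, and the number of epochs are assumptions chosen for illustration.

```python
def hard_threshold(x):
    return 1 if x >= 0 else 0

def predict(w, x):
    # w[0] is the bias weight; the dummy input 1 is prepended to x.
    return hard_threshold(sum(wi * xi for wi, xi in zip(w, [1] + list(x))))

def train_perceptron(examples, alpha=1.0, epochs=50):
    """Apply w_i <- w_i + alpha * (y - h_w(x)) * x_i for each example."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(w, x)      # 0 if correct, +1 or -1 otherwise
            xs = [1] + list(x)
            for i in range(len(w)):
                w[i] += alpha * err * xs[i]
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and = train_perceptron(AND)
print([predict(w_and, x) for x, _ in AND])  # [0, 0, 0, 1]

# XOR is not linearly separable, so no weight vector reproduces all targets.
w_xor = train_perceptron(XOR)
print([predict(w_xor, x) for x, _ in XOR] == [y for _, y in XOR])  # False
```

For AND the weights stop changing after a few epochs because every example is classified correctly; for XOR the updates never settle.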

Gradient descent rule (Soft threshold)
For each example $(\mathbf{x}, y)$ : the logistic function has the derivative $g'(in) = g(in)(1-g(in)) = h_{\mathbf{w}}(\mathbf{x})(1-h_{\mathbf{w}}(\mathbf{x}))$ , so the general update rule with $g'$ and the following rule are equivalent.
$$w_{i} \leftarrow w_{i}+\alpha \cdot \underbrace{ \left(y-h_{\mathbf{w}}(\mathbf{x})\right) \cdot h_{\mathbf{w}}(\mathbf{x})\left(1-h_{\mathbf{w}}(\mathbf{x})\right) }_{=\Delta} \cdot x_{i}$$
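
A minimal sketch of this update for a single logistic unit, trained here on the OR function. The names, the learning rate, the zero initial weights, and the epoch count are assumptions for illustration only.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def h(w, x):
    # w[0] is the bias weight; the dummy input 1 is prepended to x.
    return logistic(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

def train_sigmoid_unit(examples, alpha=1.0, epochs=5000):
    """Apply w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i per example."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, y in examples:
            out = h(w, x)
            delta = (y - out) * out * (1.0 - out)   # Err * g'(in)
            for i, xi in enumerate([1.0] + list(x)):
                w[i] += alpha * delta * xi
    return w

OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_sigmoid_unit(OR)
print([round(h(w, x), 2) for x, _ in OR])
```

Unlike the hard-threshold rule, the weights keep changing slightly even for correctly classified examples, because the soft output never reaches exactly 0 or 1.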

Multilayer neural networks

Multilayer feed-forward neural networks

Multilayer neural networks with enough layers and hidden units can represent any function.

Example
$\mathbf{h}_{\mathbf{w}}(\mathbf{x}) = (a_5, a_6)$
Two inputs, two hidden units in hidden layer and two in output. Not shown: dummy inputs and their weights.
$\mathbf{x} = (x_1, x_2) = (a_1, a_2)$
$$\begin{aligned}a_{5} &=g\left(w_{0,5}+w_{3,5} a_{3}+w_{4,5} a_{4}\right) \\&=g\left(w_{0,5}+w_{3,5}\, g\left(w_{0,3}+w_{1,3} a_{1}+w_{2,3} a_{2}\right)+w_{4,5}\, g\left(w_{0,4}+w_{1,4} a_{1}+w_{2,4} a_{2}\right)\right) \\&=g\left(w_{0,5}+w_{3,5}\, g\left(w_{0,3}+w_{1,3} x_{1}+w_{2,3} x_{2}\right)+w_{4,5}\, g\left(w_{0,4}+w_{1,4} x_{1}+w_{2,4} x_{2}\right)\right)\end{aligned}$$
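
The forward pass of this example network can be sketched as follows. The weight values are hypothetical and the dictionary layout keyed by $(i, j)$ is just one convenient way to mirror the notation $w_{i,j}$; node 0 is the dummy input.

```python
import math

def g(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x1, x2):
    """Return (a_5, a_6) for the 2-2-2 network; w maps (i, j) to w_ij."""
    a = {0: 1.0, 1: x1, 2: x2}            # a_0 = 1 is the dummy input
    layers = [(3, (0, 1, 2)), (4, (0, 1, 2)),   # hidden units
              (5, (0, 3, 4)), (6, (0, 3, 4))]   # output units
    for j, sources in layers:
        a[j] = g(sum(w[(i, j)] * a[i] for i in sources))
    return a[5], a[6]

# Hypothetical weights, chosen only for illustration.
w = {
    (0, 3): 0.1, (1, 3): 0.5, (2, 3): -0.4,
    (0, 4): -0.2, (1, 4): 0.3, (2, 4): 0.8,
    (0, 5): 0.05, (3, 5): 1.2, (4, 5): -0.7,
    (0, 6): -0.3, (3, 6): 0.6, (4, 6): 0.9,
}
a5, a6 = forward(w, 1.0, 0.0)

# The same value for a_5 from the fully nested expression.
nested = g(w[(0, 5)]
           + w[(3, 5)] * g(w[(0, 3)] + w[(1, 3)] * 1.0 + w[(2, 3)] * 0.0)
           + w[(4, 5)] * g(w[(0, 4)] + w[(1, 4)] * 1.0 + w[(2, 4)] * 0.0))
print(abs(a5 - nested) < 1e-12)  # True
```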

Able to represent any continuous function with arbitrary accuracy
The expression above shows that the output is a nested non-linear function of the inputs.
With the sigmoid function as $g$ and a hidden layer, each output unit computes a soft-thresholded linear combination of several sigmoid functions.
For example, by adding two opposite-facing soft threshold functions and thresholding the result, we can obtain a "ridge" function.
Combining two such ridges at right angles to each other (i.e., combining the outputs from four hidden units), we obtain a "bump".
With more hidden units, we can produce more bumps of different sizes in more places.
In fact, with a single, sufficiently large hidden layer, it is possible to represent any continuous function of the inputs with arbitrary accuracy. With two hidden layers, even discontinuous functions can be represented.
In summary (the layer count excludes the input layer):
1 layer (no hidden layer): all linearly separable functions
2 layers (one hidden layer): all continuous functions, to any desired accuracy
3 layers (two hidden layers): all functions, including discontinuous ones
Unfortunately, for any particular network structure, it is harder to characterize exactly which functions can be represented and which ones cannot.
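
The ridge and bump construction can be made concrete with a short Python sketch. The steepness, the half-width, and the output bias of 1.5 are hypothetical values picked so that the shapes are easy to see; they are not taken from the notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(x, steepness=10.0, half_width=1.0):
    """Sum of two opposite-facing soft thresholds (two hidden units).

    Close to 1 for |x| < half_width and close to 0 outside.
    """
    rising = sigmoid(steepness * (x + half_width))
    falling = sigmoid(-steepness * (x - half_width))
    return rising + falling - 1.0

def bump(x, y, steepness=10.0):
    """Two ridges at right angles (four hidden units), soft-thresholded.

    The sum of both ridges is about 2 only where both are active, so a
    threshold at 1.5 keeps just the region around the origin.
    """
    return sigmoid(steepness * (ridge(x) + ridge(y) - 1.5))

for point in [(0.0, 0.0), (0.0, 3.0), (3.0, 3.0)]:
    print(point, round(bump(*point), 3))
```

Only the point inside both ridges yields an output near 1; points on a single ridge or outside both stay near 0.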

Multi-Layer Perceptron Learning

💡 The index $k$ ranges over nodes in the output layer. The index $j$ ranges over nodes in the hidden layer.

We want to learn the function

input $\mathbf x =(x_1, x_2, \dots, x_n)$

output $\mathbf{h}_{\mathbf{w}}(\mathbf{x}) = (a_k,a_{k+1}, \dots a_n)$

We have a training set with examples $(\mathbf x, \mathbf y)$ .

$\mathbf x =(x_1, x_2, \dots, x_n)$

$\mathbf{y} = (y_1, y_2, \dots, y_n)$

We want to minimize the loss, which is the sum of the squared errors over all output nodes.

$$\frac{\partial}{\partial w} \operatorname{Loss}(\mathbf{w})= \frac{\partial}{\partial w}\left|\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf{x})\right|^{2}=\frac{\partial}{\partial w} \sum_{k}\left(y_{k}-a_{k}\right)^{2} =\sum_{k} \frac{\partial}{\partial w}\left(y_{k}-a_{k}\right)^{2}$$

Problem: dependency

We cannot learn the weights the same way we did for the single-layer network.

The multilayer network has a hidden layer, and the output functions are nested functions that contain the nodes behind them.

The error at the hidden nodes is not directly known, because the training data provides no target values for them. We therefore want to be able to push the errors back from the output layer.

output nodes

Each term in the final summation is computed as if the other outputs did not exist.

Same rule as in the single-layer perceptron.

$\text{Err}_k =$ the $k$ th component of the vector $(\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))$

$\Delta_k = \text{Err}_k \cdot g^{\prime}(in_k)$

$$w_{j,k} \leftarrow w_{j,k}+\alpha \cdot \Delta_k \cdot a_{j}$$

$$w_{j,k} \leftarrow w_{j,k}+\alpha \cdot \underbrace{ \text{Err}_k \cdot g^{\prime}(in_k)}_{=\Delta_k} \cdot a_{j}$$

$$w_{j,k} \leftarrow w_{j,k}+\alpha \cdot \underbrace{ (\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))_k \cdot g' \biggl( \sum_{i} w_{i,k}a_i \biggr)}_{=\Delta_k} \cdot a_{j}$$

Where $w_{j,k}$ stands for the weight between the nodes $j$ and $k$ .

The idea is that the hidden node $j$ is "responsible" for some fraction of the error $\Delta_k$ in each of the output nodes to which it connects.

Thus, the $\Delta_k$ values are divided according to the strength of the connection between the hidden node and the output node and are propagated back to provide the $\Delta_j$ values for the hidden layer.

hidden nodes: Back-propagation

We back propagate the errors from the output to the hidden layers.

$$\Delta_{j}=g'(in_{j}) \sum_{k} w_{j, k} \cdot\Delta_{k}$$

$$\Delta_{j}=\underbrace{ g' \biggl( \sum_{i=0}^n w_{i,j}a_i \biggr) }_{=g'(in_j)} \cdot\sum_{k} w_{j, k} \cdot \underbrace{ (\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))_k \cdot g' \biggl( \sum_{i} w_{i,k}a_i \biggr)}_{=\Delta_k}$$

$$w_{i,j} \leftarrow w_{i,j} + \alpha \cdot \Delta_j\cdot a_i$$

The back propagation algorithm:
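
A minimal Python sketch of one back-propagation step for a network with a single hidden layer and logistic units, following the $\Delta_k$ and $\Delta_j$ rules above. The function names, the network size (2 inputs, 3 hidden units, 1 output), the random initial weights, and the learning rate are assumptions for illustration, not the original algorithm listing.

```python
import math
import random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hid, w_out, x):
    """Return the activations of the input, hidden and output layers."""
    a_in = [1.0] + list(x)                                    # dummy a_0 = 1
    a_hid = [1.0] + [g(sum(w * a for w, a in zip(ws, a_in))) for ws in w_hid]
    a_out = [g(sum(w * a for w, a in zip(ws, a_hid))) for ws in w_out]
    return a_in, a_hid, a_out

def backprop_step(w_hid, w_out, x, y, alpha):
    """Update both weight layers in place for one example (x, y)."""
    a_in, a_hid, a_out = forward(w_hid, w_out, x)
    # Output nodes: Delta_k = Err_k * g'(in_k), with g' = a * (1 - a).
    delta_out = [(yk - ak) * ak * (1.0 - ak) for yk, ak in zip(y, a_out)]
    # Hidden nodes: Delta_j = g'(in_j) * sum_k w_jk * Delta_k.
    # a_hid[0] is the dummy unit, so hidden unit j sits at index j + 1.
    delta_hid = [
        a_hid[j + 1] * (1.0 - a_hid[j + 1])
        * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(w_out)))
        for j in range(len(w_hid))
    ]
    for k, ws in enumerate(w_out):            # w_jk <- w_jk + alpha * Delta_k * a_j
        for j in range(len(ws)):
            ws[j] += alpha * delta_out[k] * a_hid[j]
    for j, ws in enumerate(w_hid):            # w_ij <- w_ij + alpha * Delta_j * a_i
        for i in range(len(ws)):
            ws[i] += alpha * delta_hid[j] * a_in[i]

def loss(w_hid, w_out, examples):
    """Sum of squared errors over all examples and output nodes."""
    return sum((yk - ak) ** 2
               for x, y in examples
               for yk, ak in zip(y, forward(w_hid, w_out, x)[2]))

random.seed(0)
n_in, n_hid, n_out = 2, 3, 1
w_hid = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]
XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]

print("loss before:", loss(w_hid, w_out, XOR))
for _ in range(5000):
    for x, y in XOR:
        backprop_step(w_hid, w_out, x, y, alpha=0.5)
print("loss after: ", loss(w_hid, w_out, XOR))
```

Whether XOR is actually learned depends on the initial weights and the learning rate, so the printed values are not stated here. The hidden deltas are computed before any weight is changed, so they use the same weights as the forward pass.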

The math behind back propagation
$$\frac{\partial}{\partial w} \operatorname{Loss}(\mathbf{w})=\frac{\partial}{\partial w}\left|\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf{x})\right|^{2}=\frac{\partial}{\partial w} \sum_{k}\left(y_{k}-a_{k}\right)^{2}=\sum_{k} \frac{\partial}{\partial w}\left(y_{k}-a_{k}\right)^{2} =\sum_{k} \frac{\partial}{\partial w} \text{Loss}_k$$
For the $k$th output node
$$\begin{aligned}\frac{\partial \operatorname{Loss}_{k}}{\partial w_{j, k}} &=-2\left(y_{k}-a_{k}\right) \frac{\partial a_{k}}{\partial w_{j, k}}=-2\left(y_{k}-a_{k}\right) \frac{\partial g\left(in_{k}\right)}{\partial w_{j, k}} \\&=-2\left(y_{k}-a_{k}\right) g^{\prime}\left(in_{k}\right) \frac{\partial in_{k}}{\partial w_{j, k}}=-2\left(y_{k}-a_{k}\right) g^{\prime}\left(in_{k}\right) \frac{\partial}{\partial w_{j, k}}\left(\sum_{j} w_{j, k} a_{j}\right) \\&=-2\left(y_{k}-a_{k}\right) g^{\prime}(in_{k})\, a_{j}\\&=-2\, a_{j} \Delta_{k}\end{aligned}$$
With $\Delta_k$ defined as before. The constant factor 2 is absorbed into the learning rate $\alpha$ in the update rule.
For the $j$th hidden node
$$\begin{aligned}\frac{\partial \operatorname{Loss}_{k}}{\partial w_{i, j}} &=-2\left(y_{k}-a_{k}\right) \frac{\partial a_{k}}{\partial w_{i, j}}=-2\left(y_{k}-a_{k}\right) \frac{\partial g\left(in_{k}\right)}{\partial w_{i, j}} \\&=-2\left(y_{k}-a_{k}\right) g^{\prime}\left(in_{k}\right) \frac{\partial in_{k}}{\partial w_{i, j}}=-2 \Delta_{k} \frac{\partial}{\partial w_{i, j}}\left(\sum_{j} w_{j, k} a_{j}\right) \\&=-2 \Delta_{k} w_{j, k} \frac{\partial a_{j}}{\partial w_{i, j}}=-2 \Delta_{k} w_{j, k} \frac{\partial g\left(in_{j}\right)}{\partial w_{i, j}} \\&=-2 \Delta_{k} w_{j, k}\, g^{\prime}(in_{j}) \frac{\partial in_{j}}{\partial w_{i, j}} \\&=-2 \Delta_{k} w_{j, k}\, g^{\prime}(in_{j}) \frac{\partial}{\partial w_{i, j}}\left(\sum_{i} w_{i, j} a_{i}\right) \\&=-2 \Delta_{k} w_{j, k}\, g^{\prime}(in_{j})\, a_{i}\end{aligned}$$
Summing over all output nodes $k$ gives $\frac{\partial \operatorname{Loss}}{\partial w_{i,j}} = -2\, a_i \cdot g'(in_j) \sum_k w_{j,k} \Delta_k = -2\, a_i \cdot \Delta_j$ , with $\Delta_j$ defined as before. The constant factor 2 is again absorbed into the learning rate $\alpha$ .
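
The derivation can be checked numerically. The sketch below uses the smallest possible chain (one input $a_i$, one hidden node $j$, one output node $k$, no bias weights) and compares the gradients expressed through $\Delta_k$ and $\Delta_j$, with the constant factor 2 kept explicit, against central finite differences. The chain network, the function names, and all numeric values are hypothetical.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(w_ij, w_jk, a_i, y):
    """Squared error of the chain a_i -> a_j -> a_k."""
    a_j = g(w_ij * a_i)
    a_k = g(w_jk * a_j)
    return (y - a_k) ** 2

def analytic_gradients(w_ij, w_jk, a_i, y):
    """Gradients written in terms of the Delta values from the derivation."""
    a_j = g(w_ij * a_i)
    a_k = g(w_jk * a_j)
    delta_k = (y - a_k) * a_k * (1.0 - a_k)        # Err_k * g'(in_k)
    delta_j = a_j * (1.0 - a_j) * w_jk * delta_k   # g'(in_j) * w_jk * Delta_k
    return -2.0 * a_i * delta_j, -2.0 * a_j * delta_k

def numeric_gradients(w_ij, w_jk, a_i, y, eps=1e-6):
    """Central finite differences for both weights."""
    d_ij = (loss(w_ij + eps, w_jk, a_i, y) - loss(w_ij - eps, w_jk, a_i, y)) / (2 * eps)
    d_jk = (loss(w_ij, w_jk + eps, a_i, y) - loss(w_ij, w_jk - eps, a_i, y)) / (2 * eps)
    return d_ij, d_jk

# Hypothetical values: w_ij, w_jk, a_i, y
args = (0.4, -0.7, 1.5, 1.0)
print(analytic_gradients(*args))
print(numeric_gradients(*args))
```

Both printed pairs should agree up to the finite-difference error, which supports the signs and indices used in the derivation.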

Learning neural network structures

What is the ideal network structure for each problem?

The usual approach is to try several and keep the best. → "cross-validation"

Big networks will memorize all examples and form a large lookup table, but won't generalize well to unseen inputs. (It has been observed that very large networks do generalize well as long as the weights are kept small.)

Networks are subject to overfitting when there are too many parameters in the model.

If we want to consider networks that are not fully connected, then we need to find some effective search method through the very large space of possible connection topologies.

optimal brain damage algorithm

Begins with a fully connected network and removes connections (or units) from it.

The network is then retrained, and if its performance has not decreased then the process is repeated.

tiling

growing a larger network from a smaller one.

Applications of Neural Networks

Prediction of stock developments

Prediction of air pollution (e.g., ozone concentration)

(better than physical/chemical, statistical models)

Prediction of electric power prices

Desulfurization in steel making

Pattern recognition (e.g., handwriting)

Face recognition

Speech recognition

Handwritten Digit Recognition (OCR)

Different learning strategy performances:

3-nearest-neighbor classification: 2.4% error

2 layer MLP (400–300–10 units): 1.6% error

LeNet (1989;1995+): 0.9% error, using specialized neural net family

current/recent best: ≈ 0.23% error, multi-column convolutional deep NNs

Deep Learning

Hierarchical structure learning: NNs with many layers

Feature extraction and learning (un)supervised

Problem: requires lots of data

Push for hardware (GPUs, dedicated chips): “Data-Driven” vs. “Model-Driven” computing

Advantages and Disadvantages of Neural Networks

Pros

Less need for determining relevant input factors

Inherent parallelism

Easier to develop than statistical methods

Capability to learn and improve

Useful for complex pattern recognition tasks, "unstructured" and "difficult" input (e.g., images)

Fault tolerance

Cons

Choice of parameters (layers, units) requires skill

Requires sufficient training material

Resulting hypotheses cannot be understood easily

Knowledge implicit (subsymbolic representation)

Verification, behavior prediction difficult (if not impossible)

Challenges

Interpretability

Explainability

Trustworthiness

Robustness (adversarial examples)

Combining symbolic and subsymbolic approaches