Neural Networks

Mathematical Model of Neuron

McCulloch and Pitts (1943)

The inputs stay the same, only the weights are changed.

Each unit has a dummy input a0=1a_0 = 1 with the weight w0,jw_{0,j} .

The unit jj 's output activation aja_j is

aj=g(inj)=g(∑i=0nwi,jai)a_j = g(in_j) = g\biggl( \sum_{i=0}^n w_{i,j}a_i \biggl)aj​=g(inj​)=g(i=0∑n​wi,j​ai​)

Where gg is an activation function :

perceptron hard threshhold g(x)={1 if x00 otherwise  g(x)=\left\{\begin{array}{ll}1 & \text { if } x \geq 0 \\0 & \text { otherwise }\end{array}\right. 

sigmoid perceptron logistic function g(x)=11+exg(x) = \frac{1}{1+e^{-x}}

Networks Structures

A link from uiu_i to uju_j serves to propagate the activation aia_i .

Each link has a numeric weight wi,jw_{i,j} from the matrix w\mathbf w .

All nodes between each layer are connected. If we dont want a connection we set a weight to 0.

  1. feed-forward neural network(Focus of this chapter)

    Connections only in one direction: Upstream nodes → downstream nodes.

    Directed acyclic graph DAG.

    Has no loops or internal states. Units are arranged in layers.

  1. recurrent neural network RNN

    enable internal states (through flip flops, like the RS-Flip-Flop) → have short term memory

    can become instable, oscillate, become chaotic

    • Hopfield Networks

      Only input and ouput nodes

      birectional connections and symmetric weights, that means wi,j=wj,iw_{i,j}=w_{j,i}

      sign activation function (for output aia_i ):

      g(x)=sign(x)={1xifx≥0−1xotherwiseg(x) = sign(x) = \begin{cases} 1x \quad if \quad x\geq0 \\ -1x \quad otherwise \end{cases}g(x)=sign(x)={1xifx≥0−1xotherwise​

      that enables having a (holographic associative) memory → see RS-FlipFlop

      • training on examples possible
      • stimulus or partial data to get "closest" training example
      • NN units can store 1.4N\sim 1.4*N training examples reliably
    • Boltzmann Machines

      Hidden nodes, symmetric weights

      use stochastic activation function → output 1 has a specific probability

      state transitions very similar to simulated annealing

    • Long Short-Term Memory (LSTM)

      building units to achieve a memory for long(er) time in RNNs

      record value in a cell using a write gate, a keep gate, and a read gate

      well-suited to deal with time series having time lags of unknown size and duration between events

Single-layer neural networks

Single-layer feed-forward neural networks (Perceptrons)

No hidden layers.

Linear classifiers (whether hard or soft) can represent linear decision boundaries (linear classification or regression) in the input space.

XOR gates can't be seperated, not representable.

Perceptron Learning

We train the weights for each scalar output seperately.

The network stays the same, we want to learn the right weights. We learn the weights for each output in isolation.


We want to learn f(x)f(\mathbf x)

hw(x)=g(in)h_{\mathbf{w}}(\mathbf{x}) = g(in) for unit uiu_i

Err=(yhw(x))\text{Err}=\left(y-h_{\mathbf{w}}(\mathbf{x})\right)

Loss(w)=(yhw(x))2 \operatorname{Loss}(\mathbf{w})={\left(y-h_{\mathbf{w}}(\mathbf{x})\right)}^{2} 

Δ=Errig(ini) \Delta=\operatorname{Err}_{i} ~\cdot ~g^{\prime}(i n_{i}) 

We want the minimal loss, through gradient search on w\mathbf w :

We want to find the partial derivative of the loss function for each weight wiw_i :

∂∂wiLoss(w)=−2⋅(y−hw(x))undefined=Err⋅g′(ini)undefined=Δ⋅xi\frac{\partial }{\partial w_{i}}\text {Loss}(\mathbf w)=-2 \cdot \underbrace{\underbrace{(y-h_{\mathbf{w}}(\mathbf{x}))}_{=\text{Err}} \cdot g^{\prime}(i n_i)}_{=\Delta} \cdot x_{i}∂wi​∂​Loss(w)=−2⋅=Δ=Err(y−hw​(x))​​⋅g′(ini​)​​⋅xi​
wi←wi+α⋅(y−hw(x))⋅g′(ini)⋅xiw_i\larr w_i + \alpha \cdot (y-h_{\mathbf{w}}(\mathbf{x})) \cdot g^{\prime}(i n_i) \cdot x_{i}wi​←wi​+α⋅(y−hw​(x))⋅g′(ini​)⋅xi​

  1. Perceptron learning rule (Hard threshold)

    The hard thershold can't be derived.

    For each example (x,y)(\mathbf{x}, y) :

    wi←wi+α(y−hw(x))⋅xiw_i \larr w_i + \alpha(y-h_{\mathbf{w}}(\mathbf{x})) \cdot x_iwi​←wi​+α(y−hw​(x))⋅xi​
    • How does exactly it work?

      We have our currect output hwh_{\vec w} and the output of the examples y{0,1}y \in \{0,1\} with a given x\vec x .

      • y=hw(x)y = h_{\vec w}(\vec x)

        Our weights are correct. wiw_i stays unchanged.

      • y=1y=1 and hw(x)=0h_{\vec w}(\vec x) = 0

        Our weights are incorrect. We want to make hw(x)=1h_{\vec w}(\vec x) = 1

        If xix_i is positive wi\rarr w_i gets increased

        If xix_i is negative wi\rarr w_i gets decreased

      • y=0y=0 and hw(x)=1h_{\vec w}(\vec x) = 1

        Our weights are incorrect. We want to make hw(x)=0h_{\vec w}(\vec x) = 0

        If xix_i is positive wi\rarr w_i gets decreased

        If xix_i is negative wi\rarr w_i gets increased

  1. Gradient decend rule (Soft threshold)

    For each example (x,y)(\mathbf{x}, y) : (Both are equivalent)

    wi←wi+α⋅(y−hw(x))⋅hw(x)(1−hw(x))undefined=Δ⋅xiw_{i} \leftarrow w_{i}+\alpha \cdot \underbrace{ \left(y-h_{\mathbf{w}}(\mathbf{x})\right) \cdot h_{\mathbf{w}}(\mathbf{x})\left(1-h_{\mathbf{w}}(\mathbf{x})\right) }_{=\Delta} \cdot x_{i}wi​←wi​+α⋅=Δ(y−hw​(x))⋅hw​(x)(1−hw​(x))​​⋅xi​

Multilayer neural networks

Multilayer feed-forward neural networks

Any functionality can be obtained by multilayer neural networks with an arbitrary depth.

Multi-Layer Perceptron Learning

💡
The index kk ranges over nodes in the output layer. The index jj ranges over nodes in the hidden layer.

We want to learn the function

inputx=(x1,x2,,xn)\mathbf x =(x_1, x_2, \dots, x_n)

outputhw(x)=(ak,ak+1,an) \mathbf{h}_{\mathbf{w}}(\mathbf{x}) = (a_k,a_{k+1}, \dots a_n) 

We have a training set with examples (x,y)(\mathbf x, \mathbf y) .

x=(x1,x2,,xn)\mathbf x =(x_1, x_2, \dots, x_n)

y=(y1,y2,,yn)\mathbf{y} = (y_1, y_2, \dots, y_n)

We want to minimize the loss (= the sum of gradient losses for all output nodes).

∂∂wkLoss⁡(w)=∂∂wk∣y−hw(x)∣2=∂∂wk∑k(yk−ak)2=∑k∂∂wk(yk−ak)2\frac{\partial}{\partial w_k} \operatorname{Loss}(\mathbf{w})= \textcolor{grey}{ \frac{\partial}{\partial w_k}\left|\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf{x})\right|^{2}=\frac{\partial}{\partial w_k} \sum_{k}\left(y_{k}-a_{k}\right)^{2} } =\sum_{k} \frac{\partial}{\partial w_k}\left(y_{k}-a_{k}\right)^{2}∂wk​∂​Loss(w)=∂wk​∂​∣y−hw​(x)∣2=∂wk​∂​k∑​(yk​−ak​)2=k∑​∂wk​∂​(yk​−ak​)2

Problem: dependency

We can not learn the weights the same way we did for the single layer network.

The multilayer network has a hidden layer and the output functions are nested functions that contain the nodes behind them.

The error for the hidden layers is not clear - we want to be able to push errors back.

output nodes

Each term in the final summation is computed as if the other outputs did not exist.

Same rule as in the single-layer perceptron.

Errk=\text{Err}_k = the kk th component of the vector (yhw(x))(\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))

Δk=Errkg(inj)\Delta_k = \text{Err}_k \cdot g^{\prime}(in_j)

wj,k←wj,k+α⋅Δk⋅ajw_{j,k} \leftarrow w_{j,k}+\alpha \cdot \Delta_k \cdot a_{j}wj,k​←wj,k​+α⋅Δk​⋅aj​
wj,k←wj,k+α⋅Errk⋅g′(inj)undefined=Δk⋅ajw_{j,k} \leftarrow w_{j,k}+\alpha \cdot \underbrace{ \text{Err}_k \cdot g^{\prime}(in_j)}_{=\Delta_k} \cdot a_{j}wj,k​←wj,k​+α⋅=Δk​Errk​⋅g′(inj​)​​⋅aj​
wj,k←wj,k+α⋅(y−hw(x))k⋅g′(∑iwi,kai)undefined=Δk⋅ajw_{j,k} \leftarrow w_{j,k}+\alpha \cdot \underbrace{ (\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))_k \cdot g' \bigl( \sum_{i} w_{i,k}a_i \bigl)}_{=\Delta_k} \cdot a_{j}wj,k​←wj,k​+α⋅=Δk​(y−hw​(x))k​⋅g′(i∑​wi,k​ai​)​​⋅aj​

Wherewj,kw_{j,k}stands for the weight between the nodesjjandkk.

The idea is that the hidden nodejjis "responsible" for some fraction of the errorΔk\Delta_kin each of the output nodes to which it connects.

Thus, theΔk\Delta_kvalues are divided according to thestrengthof the connection between the hidden node and the output node and are propagated back to provide theΔj\Delta_jvalues for the hidden layer.

hidden nodes: Back-propagation

We back propagate the errors from the output to the hidden layers.

Δj=g′(inj)∑kwj,k⋅Δk\Delta_{j}=g'(i n_{j}) \sum_{k} w_{j, k} \cdot\Delta_{k}Δj​=g′(inj​)k∑​wj,k​⋅Δk​
Δj=g′(∑i=0nwi,jai)undefined=g′(inj)⋅∑kwj,k⋅(y−hw(x))k⋅g′(∑iwi,kai)undefined=Δk\Delta_{j}=\underbrace{ g' \bigl( \sum_{i=0}^n w_{i,j}a_i \bigl) }_{=g'(in_j)} \cdot\sum_{k} w_{j, k} \cdot \underbrace{ (\mathbf{y}-\mathbf{h}_{\mathbf{w}}(\mathbf x))_k \cdot g' \bigl( \sum_{i} w_{i,k}a_i \bigl)}_{=\Delta_k}Δj​==g′(inj​)g′(i=0∑n​wi,j​ai​)​​⋅k∑​wj,k​⋅=Δk​(y−hw​(x))k​⋅g′(i∑​wi,k​ai​)​​
wi,j←wi,j+α⋅Δj⋅aiw_{i,j} \larr w_{i,j} + \alpha \cdot \Delta_j\cdot a_iwi,j​←wi,j​+α⋅Δj​⋅ai​

The back propagation algorithm:

Learning neural network structures

What is the ideal network structure for each problem?

The usual approach is to try several and keep the best. → "cross-validation"

If we want to consider networks that are not fully connected, then we need to find some effective search method through the very large space of possible connection topologies.

optimal brain damage algorithm

Begins with a fully connected network and removes connections (or units) from it.

The network is then retrained, and if its performance has not decreased then the process is repeated.

tiling

growing a larger network from a smaller one.

Applications of Neural Networks

Prediction of stock developments

Prediction of air pollution (e.g., ozone concentration)

(better than physical/chemical, statistical models)

Prediction of electric power prizes

Desulfurization in steel making

Pattern recognition (e.g., handwriting)

Face recognition

Speech recognition

Handwritten Digit Recognition (OCR)

Different learning strategy performances:

Deep Learning

Hierarchical structure learning: NNs with many layers

Feature extraction and learning (un)supervised

Problem: requires lots of data

Push for hardware (GPUs, designated chips): “Data-Driven” vs. “Model-Driven” computing

Advantages and Disadvantages of Neural Networks

Pros

  • Less need for determining relevant input factors
  • Inherent parallelism
  • Easier to develop than statistical methods
  • Capability to learn and improve
  • Useful for complex pattern recognition tasks, "unstructured" and "difficult" input (e.g., images)
  • Fault tolerance

Cons

  • Choice of parameters (layers, units) requires skill
  • Requires sufficient training material
  • Resulting hypotheses cannot be understood easily
  • Knowledge implicit (subsymbolic representation)
  • Verification, behavior prediction difficult (if not impossible)

Challenges