Neural Networks
Mathematical Model of Neuron
McCulloch and Pitts (1943)
The inputs stay the same, only the weights are changed.
Each unit $j$ has a dummy input $a_0 = 1$ with the weight $w_{0,j}$.
The unit $j$'s output activation is
$$a_j = g(in_j) = g\Big(\sum_{i=0}^{n} w_{i,j}\, a_i\Big)$$
where $g$ is an activation function:
perceptron: hard threshold, $g(in) = 1$ if $in \geq 0$, else $0$
sigmoid perceptron: logistic function, $g(in) = \dfrac{1}{1 + e^{-in}}$
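Both activation functions can be written directly in code; a minimal sketch:

```python
import math

def hard_threshold(z):
    # Perceptron activation: 1 if the weighted input is >= 0, else 0.
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Logistic function: a smooth, differentiable soft threshold.
    return 1.0 / (1.0 + math.exp(-z))
```

The sigmoid approaches the hard threshold as its input is scaled up, but unlike the hard threshold it has a usable derivative everywhere.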
Network Structures
A link from unit $i$ to unit $j$ serves to propagate the activation $a_i$.
Each link has a numeric weight $w_{i,j}$ from the weight matrix $W$.
All nodes of adjacent layers are connected. If we don't want a connection, we set its weight to 0.
- feed-forward neural network (focus of this chapter)
Connections only in one direction: upstream nodes → downstream nodes.
Directed acyclic graph (DAG).
Has no loops or internal state. Units are arranged in layers.
- recurrent neural network (RNN)
Enables internal state (feedback loops act like flip-flops, e.g. the RS flip-flop) → short-term memory.
Can become unstable, oscillate, or turn chaotic.
Hopfield Networks
Only input and output nodes (no hidden nodes).
Bidirectional connections and symmetric weights, that means $w_{i,j} = w_{j,i}$.
Sign activation function (outputs $\pm 1$):
$$a_i = \mathrm{sign}\Big(\sum_j w_{i,j}\, a_j\Big)$$
This enables a (holographic associative) memory → cf. the RS flip-flop.
- training on examples is possible
- a stimulus or partial data retrieves the "closest" stored training example
- a network with $N$ units can reliably store about $0.138\,N$ training examples
Boltzmann Machines
Hidden nodes, symmetric weights
use stochastic activation function → output 1 has a specific probability
state transitions very similar to simulated annealing
Long Short-Term Memory (LSTM)
building units to achieve a memory for long(er) time in RNNs
record value in a cell using a write gate, a keep gate, and a read gate
well-suited to deal with time series having time lags of unknown size and duration between events
Single-layer neural networks
Single-layer feed-forward neural networks (Perceptrons)
No hidden layers.
Linear classifiers (whether hard or soft) can represent linear decision boundaries (linear classification or regression) in the input space.
XOR gates can't be seperated, not representable.
Perceptron Learning
We train the weights for each scalar output seperately.
The network stays the same, we want to learn the right weights. We learn the weights for each output in isolation.
We want to learn
for unit
We want the minimal loss, through gradient search on :
We want to find the partial derivative of the loss function for each weight :
- Perceptron learning rule (Hard threshold)
The hard thershold can't be derived.
For each example :
How does exactly it work?
We have our currect output and the output of the examples with a given .
-
Our weights are correct. stays unchanged.
-
and
Our weights are incorrect. We want to make
If is positive gets increased
If is negative gets decreased
-
and
Our weights are incorrect. We want to make
If is positive gets decreased
If is negative gets increased
-
- Gradient decend rule (Soft threshold)
For each example : (Both are equivalent)
Multilayer neural networks
Multilayer feed-forward neural networks
Any functionality can be obtained by multilayer neural networks with an arbitrary depth.
Example
Able to represent any continuous function with arbitrary accuracy
Above we see a nested non-linear function as the solution to the output.
With the sigmoid function as and a hidden layer, each output unit computes a soft-thresholded linear combination of several sigmoid functions.
For example, by adding two opposite-facing soft threshold functions and thresholding the result, we can obtain a "ridge" function.
1 layer (no hidden layer): all linearly seperable functions
2 layers: all continuous function, can be as precise as one wants
3 layers: all functions
Combining two such ridges at right angles to each other (i.e., combining the outputs from four hidden units), we obtain a "bump".
With more hidden units, we can produce more bumps of different sizes in more places.
In fact, with a single, sufficiently large hidden layer, it is possible to represent any continuous function of the inputs with arbitrary accuracy. With two layers, even discontinuous functions can be represented.
Unfortunately, for any particular network structure, it is harder to characterize exactly which functions can be represented and which ones cannot.
Multi-Layer Perceptron Learning
We want to learn the function
input
output
We have a training set with examples .
We want to minimize the loss (= the sum of gradient losses for all output nodes).
Problem: dependency
We can not learn the weights the same way we did for the single layer network.
The multilayer network has a hidden layer and the output functions are nested functions that contain the nodes behind them.
The error for the hidden layers is not clear - we want to be able to push errors back.
output nodes
Each term in the final summation is computed as if the other outputs did not exist.
Same rule as in the single-layer perceptron.
the th component of the vector
Wherestands for the weight between the nodesand.
The idea is that the hidden nodeis "responsible" for some fraction of the errorin each of the output nodes to which it connects.
Thus, thevalues are divided according to thestrengthof the connection between the hidden node and the output node and are propagated back to provide thevalues for the hidden layer.
hidden nodes: Back-propagation
We back propagate the errors from the output to the hidden layers.
The back propagation algorithm:
The math behind back propagation
For theth output nodes
With defined as before.
For theth hidden nodes
With defined as before.
Learning neural network structures
What is the ideal network structure for each problem?
The usual approach is to try several and keep the best. → "cross-validation"
- big networks will memotize all examples and form a large lookup table but wont generalize well to unseen inputs. (It has been observed that very large networks do generalize well as long as the weights are kept small.)
- Networks are subject to overfitting when there are too many parameters in the model
If we want to consider networks that are not fully connected, then we need to find some effective search method through the very large space of possible connection topologies.
optimal brain damage algorithm
Begins with a fully connected network and removes connections (or units) from it.
The network is then retrained, and if its performance has not decreased then the process is repeated.
tiling
growing a larger network from a smaller one.
Applications of Neural Networks
Prediction of stock developments
Prediction of air pollution (e.g., ozone concentration)
(better than physical/chemical, statistical models)
Prediction of electric power prizes
Desulfurization in steel making
Pattern recognition (e.g., handwriting)
Face recognition
Speech recognition
Handwritten Digit Recognition (OCR)
Different learning strategy performances:
- 3-nearest-neighbor classification: 2.4% error
- 2 layer MLP (400–300–10 units): 1.6% error
- LeNet (1989;1995+): 0.9% error, using specialized neural net family
- current/recent best: ≈ 0.23% error, multi-col. convoluted deep NNs
Deep Learning
Hierarchical structure learning: NNs with many layers
Feature extraction and learning (un)supervised
Problem: requires lots of data
Push for hardware (GPUs, designated chips): “Data-Driven” vs. “Model-Driven” computing
Advantages and Disadvantages of Neural Networks
Pros
- Less need for determining relevant input factors
- Inherent parallelism
- Easier to develop than statistical methods
- Capability to learn and improve
- Useful for complex pattern recognition tasks, "unstructured" and "difficult" input (e.g., images)
- Fault tolerance
Cons
- Choice of parameters (layers, units) requires skill
- Requires sufficient training material
- Resulting hypotheses cannot be understood easily
- Knowledge implicit (subsymbolic representation)
- Verification, behavior prediction difficult (if not impossible)
Challenges
- Interpretability
- Explainability
- Trustworthiness
- Robustness (adversarial examples)
- Combining symbolic and subsymbolic approaches