Learning
Learning Agent
Goal: making agents improve themselves by studying their own experiences
Learning instead of programming because designers
- can't foresee all the situations the agent might be in
- can't foresee all the environment changes
- don't know the solution
Utility Agent → Learning Agent
Learning Agent = performance element + learning element
Each component of the utility agent can be learned:
- a direct mapping from conditions (world states) to actions
- a means to infer relevant world properties from the percept sequence
- prior knowledge about the world (how it evolves, what actions do)
- a utility function: the desirability of world states and actions
- goals whose achievement maximizes the agent's utility
Example: Taxi
Performance element: knowledge about driving
Critic: reactions of customers / other drivers / pedestrians / bikers / police
Learning element: creates rules, e.g. do not turn the steering wheel too quickly, brake softly, etc.
Problem generator: try the brakes on a slippery road, back up using only the side (wing) mirror
Designing a learning element
Design is defined by:
- the type of the performance element
- the functional component: a model that describes what actions do (this is what we learn)
- the representation: a mathematical model for the functional component
- the available feedback
Components
Performance element
Everything between the sensors (input) and actuators (output), abstracted away.
Takes in percepts and chooses actions.
Critic
Feedback on performance (based on a fixed performance standard).
Learning element LE
Changes the performance element for improvement.
Changes the "knowledge components" of the agent (problem generator and performance element), independently of the agent's architecture.
Problem generator
Suggests actions that lead to new experiences to learn from, based on learning goals. (A sketch of all four components follows.)
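A minimal sketch of how the four components could interact, in Python. All names (the class, the callables passed in) are hypothetical illustrations, not part of the lecture material:

```python
class LearningAgent:
    """Wires together the four components described above."""

    def __init__(self, performance_element, critic, learning_element, problem_generator):
        self.performance = performance_element   # percepts -> actions
        self.critic = critic                     # judges outcomes against a fixed standard
        self.learner = learning_element          # improves the performance element
        self.explorer = problem_generator        # proposes exploratory actions

    def step(self, percept):
        feedback = self.critic(percept)            # e.g. reactions of customers / police
        self.learner(self.performance, feedback)   # e.g. refine driving rules
        exploratory = self.explorer()              # sometimes suggests a new experience
        return exploratory if exploratory is not None else self.performance(percept)
```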
_________________
Learning
Types of Representation (of states)
Different levels of granularity to represent states of the "world":
- Atomic: states are opaque labels (e.g. "image"), similar to a black box
- Factored: a vector of attribute values; continuous, numerical, or discrete (focus of this lecture)
- Structured: e.g. a Bayesian network (for a decision-theoretic agent)
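As a hypothetical illustration of the three levels (toy values, not from the lecture):

```python
# The same "world" state at three levels of granularity (toy values).
atomic = "state_42"                                   # opaque label, black box
factored = {"speed": 13.2, "lane": 1, "rain": True}   # vector of attribute values
structured = {("Rain", "WetRoad"): 0.9,               # objects and relations, e.g.
              ("WetRoad", "LongBraking"): 0.8}        # edges of a Bayesian network
```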
Types of learning
Analytical / deductive learning: going from a general rule to a more specific and efficient rule.
Inductive learning: discovering a general rule from examples.
- Ignores prior knowledge
- Assumes a deterministic, observable "environment"
- Assumes examples are given (selection of examples for training set is challenging)
- In its simplest form: learning a function from examples
- Assumes agent wants to learn
Types of feedback ("Learning Modes")
There are three main feedback types, each defining a learning type (semi-supervised learning mixes two of them):
unsupervised learning
The agent learns patterns in the input, without feedback or labeled examples to learn from.
E.g. clustering: finding useful clusters in input examples; concept formation.
reinforcement learning
The agent learns from rewards / punishments and decides which of its actions were most responsible for them.
supervised learning
The agent learns a function from example input-output pairs.
semi-supervised learning
A back and forth between supervised and unsupervised learning: the agent is given a few labeled examples and must infer labels for the unlabeled examples of a large collection.
Alternatively: a mix of labeled and unlabeled examples where not all given labels are correct; there is systematic noise in the data that the agent must detect and handle without supervision.
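A sketch of what the data looks like in each mode (toy values, hypothetical):

```python
# Shape of the feedback in each learning mode (toy values).
unlabeled = [[1.0, 2.0], [1.1, 2.1], [8.0, 9.0]]      # unsupervised: inputs only
labeled = [([1.0, 2.0], "A"), ([8.0, 9.0], "B")]      # supervised: input-output pairs
episode = [("s0", "a0", -1.0), ("s1", "a1", 10.0)]    # reinforcement: reward per step
semi = labeled + [(x, None) for x in unlabeled]       # semi-supervised: few labels
```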
Supervised Learning
The aim is to find a simple hypothesis that is approximately consistent with the training examples.
Training set (of data points)
Example input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N),
where each y_j was generated by an unknown function y = f(x).
Task of agent
Discover a function h (the hypothesis) that approximates the true function f.
Learning: search through the hypothesis space H for a hypothesis that performs well even on examples beyond the training set.
We say a problem is realizable if the hypothesis space contains the true function, i.e. f ∈ H. (We don't always know whether this holds because the true function is unknown.)
Variations
- If the function is stochastic, we have to learn a conditional probability distribution P(Y | x).
- If the output is one of a finite / discrete set of values, the problem is called classification (with only two values: boolean / binary classification).
- If the output is a number, the problem is called regression.
Test set
Measures the accuracy of the hypothesis.
If the hypothesis performs well on it, one says it generalizes well.
If the hypothesis agrees with all the examples, it is called a consistent hypothesis.
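These two notions translate directly into code; a minimal sketch (the function names are my own):

```python
def consistent(h, examples):
    """h is a consistent hypothesis if it agrees with every (x, y) example."""
    return all(h(x) == y for x, y in examples)

def accuracy(h, test_set):
    """Generalization estimate: fraction of held-out examples h gets right."""
    return sum(h(x) == y for x, y in test_set) / len(test_set)
```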
Choosing hypotheses
Curve fitting
We're searching for a hypothesis function h that fits this example data set.
- No consistent straight line for this data set.
- 6 data points → a 5th-degree polynomial gives an exact fit but does not generalize well.
Ockham's Razor: "Maximize simplicity under consistency"
How do we choose among all the consistent hypotheses in the hypothesis space? → We choose the simplest one.
h is consistent if it agrees with f on all examples.
Tradeoff between complexity and simplicity
precision (fitting the training data) vs. generalization (finding patterns)
complex hypotheses fit the data well vs. simpler hypotheses might generalize better
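The tradeoff can be reproduced with NumPy's polynomial fitting; a sketch with made-up data points:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9, 5.1])   # roughly linear, with noise

line = np.polyfit(x, y, deg=1)    # simple hypothesis: not exactly consistent
exact = np.polyfit(x, y, deg=5)   # degree-5 polynomial: fits all 6 points exactly

x_new = 6.0                           # a point beyond the training set
print(np.polyval(line, x_new))        # close to 6: extrapolates sensibly
print(np.polyval(exact, x_new))       # typically far off: poor generalization
```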
Choosing hypothesis spaces
Why not choose the space of all possible Turing machine programs? Then we would cover every possible computable function.
We want our hypothesis to be useful and fast to compute.
Fitting a straight line to data is an easy computation, but fitting Turing machines is intractable (undecidable in general).
Tradeoff between expressiveness and complexity
expressiveness of the hypothesis space (quantity) vs. finding a good hypothesis in it (quality)
Generalization and overfitting
The DTL (decision-tree learning) algorithm will use any pattern it finds in the examples, even if the pattern does not actually correlate with the target.
Overfitting
Occurs when a hypothesis is fitted too closely to the training set; it then does not generalize well.
Likelihood of overfitting ↑ with the size of the hypothesis space and the number of inputs / attributes (the hypothesis may consider irrelevant attributes, e.g. the color of dice).
Likelihood of overfitting ↓ with the number of examples.
To lower the likelihood of overfitting:
- Ockham's Razor ("Maximize simplicity under consistency"): choose the simplest consistent hypothesis, as simpler hypotheses might generalize better
- Choose hypothesis space with lower expressiveness
- remove unnecessary attributes from the input
- use more examples
- Cross-validation: use part of the data for training, part for testing (sketched below)
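A sketch of the simplest (holdout) variant, assuming a hypothetical `learn(train)` function that returns a hypothesis and the `accuracy` helper from above:

```python
import random

def holdout(examples, learn, ratio=0.8):
    """Estimate generalization by testing on data the learner never saw."""
    data = examples[:]
    random.shuffle(data)
    cut = int(len(data) * ratio)
    train, test = data[:cut], data[cut:]
    return accuracy(learn(train), test)   # test accuracy, not training fit
```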
Decision tree pruning
Combats overfitting: after the tree has been generated, eliminate nodes that are not clearly relevant.
How do we detect a node that is testing an irrelevant attribute?
Imagine being at a node with positive and negative examples, if the attribute is irrelevant we would expect it to split the examples into subsets that have roughly the same proportion of positive examples as the whole set.
Splitting after sufficient information gain
Now we know what amount of gain is insufficient, but how much information gain is sufficient for splitting on a particular attribute? → statistical significance test (sketched below).
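A sketch of the chi-squared statistic behind such a significance test. `splits` holds the positive/negative counts per attribute value; the pruning threshold would come from a chi-squared table with len(splits) - 1 degrees of freedom:

```python
def chi_squared_deviation(splits):
    """splits: list of (p_k, n_k) positive/negative counts per attribute value."""
    p = sum(pk for pk, nk in splits)   # positives at this node
    n = sum(nk for pk, nk in splits)   # negatives at this node
    dev = 0.0
    for pk, nk in splits:
        # Counts an irrelevant attribute would produce on average:
        expected_p = p * (pk + nk) / (p + n)
        expected_n = n * (pk + nk) / (p + n)
        if expected_p:
            dev += (pk - expected_p) ** 2 / expected_p
        if expected_n:
            dev += (nk - expected_n) ** 2 / expected_n
    return dev   # small deviation -> attribute looks irrelevant -> prune
```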
Measuring Learning Performance
How do we know that h ≈ f? (Hume's Problem of Induction)
Learning curve = % correct on the test set as a function of the training-set size.
Learning curve depends on realizability of the target function.
Non-realizability can be due to
- missing attributes
- too restrictive hypothesis space (e.g., linear functions)
- redundant expressiveness (e.g., loads of irrelevant attributes)
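A sketch of how a learning curve could be computed, reusing the hypothetical `learn` and `accuracy` helpers from above:

```python
def learning_curve(examples, learn, sizes):
    """Test accuracy as a function of training-set size (fixed held-out test set)."""
    train, test = examples[:80], examples[80:]   # assumes well over 80 examples
    return [(m, accuracy(learn(train[:m]), test)) for m in sizes]
```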