With rewards / punishments - the agent decides which actions were most responsible for them.
supervised learning
Input-output pairs from which the agent learns a function.
semi-supervised learning
In between supervised and unsupervised learning - the agent is given a few labeled examples and must use them to infer labels for the unlabeled
examples of a large collection.
Can also mean a mix of labeled and unlabeled examples where not all the given labels are correct - there is systematic noise in the data that the agent must
detect and handle without supervision.
Supervised Learning
The aim is to find a simple hypothesis that is approximately consistent with the training examples.
Training set (of data points)
Example input-output pairs:
(x1, y1), (x2, y2), ..., (xn, yn)
where each yi was generated by an unknown function f, i.e. yi = f(xi).
Task of agent
Discover a hypothesis function h that approximates the true function f.
Learning: search through the hypothesis space H for a hypothesis that performs well even on examples beyond the training set.
We say a problem is realizable if f ∈ H. (We don't always know whether it is, because the true function is unknown.)
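A minimal sketch of this setup in Python (the training set, the three candidate hypotheses, and the error helper are illustrative, not from the notes):

# Minimal sketch: supervised learning as search through a hypothesis space H.
# The training set and the candidate hypotheses are illustrative toy choices.
training_set = [(0, 1), (1, 3), (2, 5), (3, 7)]  # (x, y) pairs from an unknown f

hypothesis_space = [            # a tiny H with three candidate functions h(x)
    lambda x: x + 1,
    lambda x: 2 * x + 1,        # matches all the data, so this problem is realizable
    lambda x: x * x,
]

def error(h, examples):
    """Number of examples on which h disagrees with the observed y."""
    return sum(h(x) != y for x, y in examples)

# "Learning": pick the hypothesis that performs best on the training set.
best_h = min(hypothesis_space, key=lambda h: error(h, training_set))
print(error(best_h, training_set))  # 0 -> best_h is a consistent hypothesis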
Variations
If the function is stochastic, we have to learn a conditional probability distribution P(Y∣x).
If the output is one of a finite / discrete set of values, the problem is called classification (boolean / binary classification when there are only two values).
If the output is a number, the problem is called regression.
Test set
Measures the accuracy of the hypothesis.
If the hypothesis performs well on it, one says it generalizes well.
If the hypothesis agrees with all the data, it is called a consistent hypothesis.
Choosing hypotheses
Curve fitting
We're searching for a hypothesis function in this example.
There is no consistent straight line for this data set.
6 data points → a 5th-degree polynomial fits them exactly, but does not generalize well.
Ockham's Razor: "Maximize simplicity under consistency"
How do we choose among all the consistent hypotheses in the hypothesis space? → We choose the simplest one.
h is consistent if it agrees with f on all examples.
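A small sketch of the curve-fitting example using NumPy's polyfit (the six data points are made up for illustration; a degree-5 polynomial fits 6 points exactly):

import numpy as np

# Six made-up data points (x, y): a roughly linear trend with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 1.1, 1.9, 3.2, 3.8, 5.1])

line = np.polyfit(x, y, deg=1)    # simple hypothesis: not exactly consistent
poly5 = np.polyfit(x, y, deg=5)   # passes through all 6 points exactly (consistent)

# Compare predictions at a point beyond the training data.
x_new = 7.0
print(np.polyval(line, x_new))    # stays close to the linear trend
print(np.polyval(poly5, x_new))   # the exact-fit polynomial can be far off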
Tradeoff between complexity and simplicity
precision, fitting training data
vs.
generalizing, finding patterns
complex hypotheses - fit the data well
vs.
simpler hypotheses - might generalize better
Choosing hypothesis spaces
Why not choose the space of all possible Turing machine programs? Then we would cover every computable function.
We want our hypothesis
h(x)
to be useful and fast to compute.
Fitting a straight line to data is an easy computation, but fitting Turing machines is intractable (in general it is not even decidable whether a given Turing machine agrees with an example).
Tradeoff between expressiveness and complexity
expressiveness of the hypothesis space (quantity)
vs.
finding a good hypothesis in it (quality)
Generalization and overfitting
The decision tree learning (DTL) algorithm will use any pattern it finds in the examples, even patterns that are not actually correlated with the output.
Overfitting occurs when a hypothesis is aligned too closely to the training set - then it does not generalize well.
↑ likelihood of overfitting if ↑ hypothesis space, ↑ number of inputs / attributes (the hypothesis may consider irrelevant attributes, e.g. the color of dice)
↓ likelihood of overfitting if ↑ number of examples
To lower the likelihood of overfitting:
Ockham's Razor: "Maximize simplicity under consistency"
We choose the simplest consistent hypothesis - simpler hypotheses might generalize better
Choose a hypothesis space with lower expressiveness
Remove unnecessary attributes from the input
Use more examples
Cross-Validation: use part of the data for training, part for testing
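A rough sketch of k-fold cross-validation in Python (learn is a placeholder for any learning algorithm; examples are (x, y) pairs):

def k_fold_cross_validation(examples, learn, k=5):
    """Estimate generalization accuracy: repeatedly train on k-1 folds of the
    examples, test on the held-out fold, and average the scores."""
    folds = [examples[i::k] for i in range(k)]          # split data into k folds
    scores = []
    for i in range(k):
        test_set = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learn(train_set)                            # placeholder learning step
        correct = sum(h(x) == y for x, y in test_set)
        scores.append(correct / len(test_set))
    return sum(scores) / k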
Decision tree pruning
Combats overfitting by eliminating nodes that are not clearly relevant after the tree has been generated.
How do we detect a node that is testing an irrelevant attribute?
Imagine being at a node whose attribute splits the examples into subsets, where subset k contains pk positive and nk negative examples. If the attribute is irrelevant, we would expect each subset to have roughly the same proportion of positive examples as the whole set.
Now we know what an insufficient gain looks like, but how large an information gain is sufficient to split on a particular attribute? → statistical significance test.
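Assuming a chi-squared test on the split counts (a common choice for this), a sketch in Python; the counts and the 5% significance level are illustrative:

# Sketch of a chi-squared significance test for decision-tree pruning.
from scipy.stats import chi2

def deviation(p, n, subsets):
    """Total deviation of the observed (pk, nk) counts in each child subset
    from the counts expected if the attribute were irrelevant."""
    total = 0.0
    for pk, nk in subsets:
        expected_p = p * (pk + nk) / (p + n)
        expected_n = n * (pk + nk) / (p + n)
        total += (pk - expected_p) ** 2 / expected_p
        total += (nk - expected_n) ** 2 / expected_n
    return total

# Example: 8 positive and 4 negative examples, split by a 3-valued attribute.
p, n = 8, 4
subsets = [(3, 1), (3, 1), (2, 2)]
delta = deviation(p, n, subsets)
threshold = chi2.ppf(0.95, df=len(subsets) - 1)  # 5% significance level
print(delta > threshold)  # False -> gain not significant, prune this split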
Measuring Learning Performance
How do we know that h ≈ f? (Hume's Problem of Induction)
Learning curve
= % correct on the test set when using h, typically plotted against the number of training examples.
The learning curve depends on the realizability of the target function.
Non-realizability can be due to
missing attributes
too restrictive hypothesis space (e.g., linear functions)
redundant expressiveness (e.g., loads of irrelevant attributes)
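A sketch of how such a learning curve could be computed (learn, examples, and test_set are placeholders; the learner is trained on growing prefixes of the examples):

def learning_curve(examples, learn, test_set, sizes):
    """% correct on a fixed test set as the number of training examples grows."""
    points = []
    for m in sizes:
        h = learn(examples[:m])                  # train on the first m examples
        correct = sum(h(x) == y for x, y in test_set)
        points.append((m, 100.0 * correct / len(test_set)))
    return points

# If the problem is realizable (f ∈ H), the curve can approach 100%;
# if not, it typically plateaus below 100%.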