Learning

Learning Agent

Goal: making agents improve themselves by studying their own experiences

Learning instead of programming, because designers

  1. can't foresee all the situations the agent might be in
  2. can't foresee all changes in the environment
  3. don't know the solution themselves

Utility Agent → Learning Agent

Learning Agent = performance element + learning element

Each component from the utility agent can be learned

  • actions \mapsto the conditions that cause them
  • percept sequence \mapsto relevant world properties
  • prior knowledge about the world
  • utility function: world states, actions \mapsto their desirability
  • goals that maximize the agent's utility

Designing a learning element

Design is defined by:

  • type of the performance element
  • functional component: a model that describes what the actions do (this is what we learn)
  • representation: mathematical model of the functional component
  • available feedback


Components

Performance element

Everything between the sensors (input) and the actuators (output); internal details are abstracted away.

Takes in percepts and chooses actions.

Critic

Feedback on performance (based on a fixed performance standard).

Learning element LE

Changes the performance element so that it makes better decisions.

Modifies the "knowledge components" of the agent (problem generator and performance element), independently of the agent's architecture.

Problem generator

Suggests actions that lead to new, informative experiences to learn from, based on the learning goals.
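The interaction of these four components can be sketched in code. Everything below (the class name, the two-action toy world, the running-average update rule) is invented for illustration; the critic scores each action against a fixed standard, and the learning element uses that feedback to improve the performance element, while the problem generator occasionally suggests exploratory actions:

```python
import random

class LearningAgent:
    """Toy sketch of the learning-agent architecture (hypothetical design)."""

    def __init__(self, actions, explore_rate=0.1, seed=0):
        self.estimates = {a: 0.0 for a in actions}  # learned knowledge component
        self.counts = {a: 0 for a in actions}
        self.explore_rate = explore_rate
        self.rng = random.Random(seed)

    def performance_element(self, percept):
        # chooses the action currently believed to be best
        return max(self.estimates, key=self.estimates.get)

    def problem_generator(self):
        # suggests an action purely to gain new experience
        return self.rng.choice(list(self.estimates))

    def learning_element(self, action, feedback):
        # improves the performance element using the critic's feedback
        self.counts[action] += 1
        n = self.counts[action]
        self.estimates[action] += (feedback - self.estimates[action]) / n

    def step(self, percept, critic):
        if self.rng.random() < self.explore_rate:
            action = self.problem_generator()
        else:
            action = self.performance_element(percept)
        self.learning_element(action, critic(action))
        return action

agent = LearningAgent(["left", "right"])
critic = lambda action: 1.0 if action == "right" else 0.0  # fixed performance standard
for _ in range(200):
    agent.step(percept=None, critic=critic)
```

After enough steps the performance element settles on the action the critic rewards, even though the agent started with no knowledge of the standard.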

_________________

Learning

Types of Representation (of states)

Different levels of granularity to represent states of the "world":

  • Atomic: states with labels like "image" - similar to a black box
  • Factored: a vector of attribute values; values can be continuous, numerical or discrete (focus of this lecture)
  • Structured: e.g. a Bayesian network (for a decision-theoretic agent)

Types of learning

analytical / deductive learning : going from general rule to more specific and efficient rule

Inductive Learning : discovering the general rule from examples

  • Ignores prior knowledge
  • Assumes a deterministic, observable "environment"
  • Assumes examples are given (selection of examples for training set is challenging)
  • In its simplest form: learning a function from examples
  • Assumes agent wants to learn

Types of feedback ("Learning Modes")

There are three main feedback types, giving three learning types (plus semi-supervised mixtures):

unsupervised learning

agent learns patterns in input, without feedback or labeled examples to learn from.

e.g. clustering - finding useful clusters in input examples; concept formation

reinforcement learning

Rewards / punishments serve as feedback - the agent decides which of its actions were most responsible for them.

supervised learning

The agent learns a function from given input-output pairs.

semi-supervised learning

A back and forth between supervised and unsupervised learning - the agent is given a few labeled examples and must infer labels for the remaining unlabeled examples of a large collection.

More generally: a mix of labeled and unlabeled examples where not all given labels are correct - there is systematic noise in the data that the agent must detect and handle without supervision.

Supervised Learning

The aim is to find a simple hypothesis that is approximately consistent with the training examples.

Training set (of data points)

Example input-output pairs: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

where each $y_i$ was generated by an unknown function $f$ with $f(x_i) = y_i$.

Task of agent

Discover a hypothesis function $h$ that approximates the true function $f$.

Learning: search through the hypothesis space $\mathcal{H}$ for a hypothesis that performs well even on examples beyond the training set.

We say a problem is realizable if $f \in \mathcal{H}$. (We don't always know, because the true function is unknown.)

Variations

  • If the function is stochastic we have to learn a conditional probability distribution $P(Y \mid x)$.
  • If the output is from a finite / discrete set of values the problem is called classification (boolean / binary classification when there are two values).
  • If the output is a number the problem is called regression .

Test set

Measures the accuracy of the hypothesis.

If the hypothesis performs well on the test set, one says it generalizes well.

If the hypothesis agrees with all the data then it is called a consistent hypothesis .
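Consistency is easy to state in code. A minimal sketch (function name and example data invented for illustration):

```python
def consistent(h, examples):
    """A hypothesis h is consistent if it agrees with f on all examples,
    i.e. h(x) == y for every training pair (x, y)."""
    return all(h(x) == y for x, y in examples)

examples = [(0, 0), (1, 2), (2, 4)]               # generated by f(x) = 2x
assert consistent(lambda x: 2 * x, examples)      # agrees on every example
assert not consistent(lambda x: x + 1, examples)  # fails already on (0, 0)
```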

Choosing hypotheses

Curve fitting

We are searching for a hypothesis function that fits the given data points.

Ockham's Razor: "Maximize simplicity under consistency"

How do we choose among all the consistent hypotheses in the hypothesis space? → We choose the simplest one.

$h$ is consistent if it agrees with $f$ on all examples.

Tradeoff between complexity and simplicity

precision - fitting the training data vs. generalization - finding patterns

complex hypotheses fit the data better vs. simpler hypotheses may generalize better
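The trade-off can be made concrete with curve fitting. In this sketch (data and helper names are invented), a degree-2 polynomial interpolates three noisy samples of f(x) = x exactly, while a least-squares line fits them only approximately but predicts a held-out point better:

```python
def lagrange(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Least-squares straight line through the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

xs, ys = [0.0, 1.0, 2.0], [0.0, 1.2, 1.8]   # noisy samples of f(x) = x
line = fit_line(xs, ys)

# The interpolant is consistent (zero training error) ...
assert all(abs(lagrange(xs, ys, x) - y) < 1e-9 for x, y in zip(xs, ys))
# ... but on a held-out point x = 3 (true y = 3) the simpler line is closer:
assert abs(line(3) - 3) < abs(lagrange(xs, ys, 3) - 3)
```

Here the complex hypothesis wins on the training set, the simple one on unseen data.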

Choosing hypothesis spaces

Why not choose the space of all Turing machines? Then we would cover every computable function.

We want our hypothesis h(x) to be useful and fast to compute.

Fitting a straight line to data is an easy computation, but fitting Turing machines is intractable (one cannot even decide whether a given Turing machine halts).

Tradeoff between expressiveness and complexity

expressiveness of the hypothesis space (quantity) vs. finding a good hypothesis in it (quality)

Generalization and overfitting

The decision-tree learning (DTL) algorithm will exploit any pattern it finds in the examples, even patterns that are mere coincidences of the training data.

Overfitting

occurs when a hypothesis is fitted too closely to the training set - then it does not generalize well.

\uparrow likelihood of overfitting if \uparrow size of hypothesis space, \uparrow number of inputs / attributes (the hypothesis may consider irrelevant attributes, e.g. the color of dice)

\downarrow likelihood of overfitting if \uparrow number of examples.


To lower the likelihood of overfitting:

  • Ockham's Razor: "Maximize simplicity under consistency"

    We choose the simplest consistent hypothesis - simpler hypotheses may generalize better.

  • choose a hypothesis space with lower expressiveness
  • remove unnecessary attributes from the input
  • use more examples
  • Cross-Validation: use part of the data for training, part for testing
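A minimal holdout version of the train/test idea, with a 1-nearest-neighbour hypothesis and a deterministic split (all names and data invented for illustration):

```python
def nn_predict(train, x):
    """1-nearest-neighbour hypothesis: label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

data = [(x, x >= 5) for x in range(12)]   # true f: "is x at least 5?"
test = data[::3]                          # hold out every third example
train = [ex for ex in data if ex not in test]

correct = sum(nn_predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(accuracy)   # → 1.0
```

Because the held-out examples never influence the hypothesis, the accuracy on them estimates generalization rather than memorization.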


Decision tree pruning

Combats overfitting: after the tree has been generated, eliminate nodes that are not clearly relevant.

Splitting after sufficient information gain

But what amount of information gain is sufficient for splitting on a particular attribute? → a statistical significance test (e.g. the \chi^2 test).
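A sketch of the information-gain computation itself (example data invented): an attribute that fully determines the class gets the entire entropy of the set as gain, while an irrelevant attribute gets zero gain and is therefore a candidate for pruning.

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a boolean sample with pos/neg counts."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

def info_gain(examples, attr):
    """Expected entropy reduction from splitting on attr.
    examples: list of (attribute_dict, boolean_label) pairs."""
    pos = sum(1 for _, y in examples if y)
    base = entropy(pos, len(examples) - pos)
    remainder = 0.0
    for v in {attrs[attr] for attrs, _ in examples}:
        subset = [(a, y) for a, y in examples if a[attr] == v]
        sp = sum(1 for _, y in subset if y)
        remainder += len(subset) / len(examples) * entropy(sp, len(subset) - sp)
    return base - remainder

examples = [({"a": 1, "b": 0}, True), ({"a": 1, "b": 1}, True),
            ({"a": 0, "b": 0}, False), ({"a": 0, "b": 1}, False)]
assert abs(info_gain(examples, "a") - 1.0) < 1e-9   # "a" determines the class
assert abs(info_gain(examples, "b")) < 1e-9         # "b" is irrelevant
```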

Measuring Learning Performance

How do we know that h ≈ f ? (Hume's Problem of Induction)

Learning curve = % correct on test set when using h
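A learning curve can be computed by training on progressively larger prefixes of the data and scoring each resulting hypothesis on a fixed test set. A sketch with an invented 1-nearest-neighbour learner and synthetic data:

```python
def nn_predict(train, x):
    # 1-nearest-neighbour hypothesis h learned from the training set
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

pool = [(x, x >= 10) for x in range(0, 20, 2)]   # training examples (even x)
test = [(x, x >= 10) for x in range(1, 20, 2)]   # fixed held-out test set (odd x)

curve = []
for n in (1, 3, 5, 10):                          # growing training-set sizes
    train = pool[:n]
    accuracy = sum(nn_predict(train, x) == y for x, y in test) / len(test)
    curve.append(accuracy)
print(curve)   # → [0.5, 0.5, 0.5, 1.0]
```

Accuracy climbs toward 1.0 here because the target function is realizable by the 1-NN hypothesis once enough examples are seen.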

Learning curve depends on realizability of the target function.

Non-realizability can be due to

  • missing attributes
  • too restrictive hypothesis space (e.g., only linear functions)
  • redundant expressiveness (e.g., lots of irrelevant attributes)