All exam questions so far + answers
Agents
Environments that agents can be situated in
fully observable vs. partly observable
whether sensors can detect all relevant properties
single-agent vs. multi-agent
single agent, or multiple with cooperation, competition
deterministic vs. stochastic
Deterministic: the next state is fully determined by the current state and the performed action. Stochastic: there are multiple possible successor states that cannot be foreseen exactly - but we know their probabilities.
episodic vs. sequential
Episodic: the choice of action only depends on the current episode - percept history divided into independent episodes.
Sequential: the current decision can affect all future decisions - the agent has to store the entire history (accessing memories lowers performance)
static vs. dynamic vs. semi-dynamic
Static: the world does not change during the reasoning / processing time of the agent (it waits for the agent's response)
Semi-dynamic: the environment is static, but the performance score decreases with processing time.
discrete vs. continuous
property values of the world
known vs. unknown
state of knowledge about the "laws of physics" of the environment
Example
A game of poker is in a discrete environment.
Name the 4 components of an agent's task description
Task description of agents (PEAS), e.g. an autonomous car:
Performance measure: safety, destination, profits, legality, comfort, ...
Environment: streets/freeways, traffic, pedestrians, weather, ...
Actuators: steering, accelerator, brake, horn, speaker/display, ...
Sensors: video, accelerometers, gauges, engine sensors, keyboard, GPS, ...
Rational agent
Chooses the action that maximizes the expected value of the performance measure for any given percept sequence.
Decisions based on evidence:
- percept history
- built-in knowledge of the environment (if the environment is known) - e.g. fundamental laws such as physics
Agent types
Simple reflex agents
- No memory (percept sequences)
- no internal states
- Possible non-termination
- Only suitable for very specific tasks
Model-based reflex agent
- memory
- internal states of the world
- model of environment and changes in it
- can reason about unobservable parts
- can deal with uncertainty
- has implicit goals
Goal-based agents
- explicit goals
- explicitly model: the world, goals, actions and their effects
- Search and planning of action sequences
- more flexible, better maintainability
Utility-based agents
- evaluate goals with utility function
- use expected utility for decisions
- resolve conflicting goals - goals are weighted differently, the aim is to optimize a set of values
goal based agent vs. utility based agent
The goal based agent
- can not deal with conflicting goals because it does not have a utility function.
- Does not consider expected utility when planning - just goals, actions, effects.
Uninformed search algorithms
Formal description of a search problem
Real World → Search problem → Search Algorithm
We define states, actions, solutions (= selecting a state space) through abstraction.
Then we turn states into nodes for the search algorithm.
Solution
A solution is a sequence of actions leading from the initial state to a goal state.
- Initial state
- Successor function
- Goal test
- path cost (additive)
All algorithms
Breadth-first search BFS
Frontier as queue.
Space and time complexity are both too high (exponential in the depth of the solution).
Here the goal test is done at generation time.
Pseudocode
function BREADTH-FIRST-SEARCH(problem) returns a solution, or failure
  node ← a node with STATE = problem.INITIAL-STATE, PATH-COST = 0
  if problem.GOAL-TEST(node.STATE) then return SOLUTION(node)
  frontier ← a FIFO queue with node as the only element
  explored ← an empty set
  loop do
    if EMPTY?(frontier) then return failure
    node ← POP(frontier)
    add node.STATE to explored
    for each action in problem.ACTIONS(node.STATE) do
      child ← CHILD-NODE(problem, node, action)
      if child.STATE is not in explored or frontier then
        if problem.GOAL-TEST(child.STATE) then return SOLUTION(child)
        frontier ← INSERT(child, frontier)
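As a complement to the pseudocode, a minimal Python sketch of the same graph-search BFS with the goal test at generation time (the graph representation and the successors / is_goal callables are illustrative assumptions, not from the lecture):

from collections import deque

def breadth_first_search(start, is_goal, successors):
    # Returns a list of states from start to a goal state, or None on failure.
    if is_goal(start):
        return [start]
    frontier = deque([start])          # FIFO queue
    parents = {start: None}            # also serves as the explored set
    while frontier:
        state = frontier.popleft()
        for child in successors(state):
            if child in parents:       # already explored or already in the frontier
                continue
            parents[child] = state
            if is_goal(child):         # goal test at generation time
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return list(reversed(path))
            frontier.append(child)
    return None

# Example usage on a tiny graph:
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(breadth_first_search('A', lambda s: s == 'D', lambda s: graph[s]))   # ['A', 'B', 'D']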
completeness: yes, if b is finite
time complexity: O(b^d)
in depth
If the solution is at depth d and the maximum branching factor is b:
Goal test at generation time: O(b^d)
Goal test at expansion time: O(b^(d+1))
space complexity: O(b^d)
solution optimality: yes, if step costs are identical
Uniform-cost search UCS
Frontier as priority queue ordered by path cost.
Here the goal test is done at expansion time.
Pseudocode
function UNIFORM-COST-SEARCH(problem) returns a solution, or failure
  node ← a node with STATE = problem.INITIAL-STATE, PATH-COST = 0
  frontier ← a priority queue ordered by PATH-COST, with node as the only element
  explored ← an empty set
  loop do
    if EMPTY?(frontier) then return failure
    node ← POP(frontier)
    if problem.GOAL-TEST(node.STATE) then return SOLUTION(node)
    add node.STATE to explored
    for each action in problem.ACTIONS(node.STATE) do
      child ← CHILD-NODE(problem, node, action)
      if child.STATE is not in explored or frontier then
        frontier ← INSERT(child, frontier)
      else if child.STATE is in frontier with higher PATH-COST then
        replace that frontier node with child
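A minimal Python sketch of UCS with a priority queue (heapq) and the goal test at expansion time; the successors callable returning (step_cost, child) pairs is an illustrative assumption:

import heapq

def uniform_cost_search(start, is_goal, successors):
    # Returns (path_cost, path) for the cheapest path to a goal, or None.
    frontier = [(0, start, [start])]            # priority queue ordered by path cost
    best_cost = {start: 0}
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if cost > best_cost.get(state, float('inf')):
            continue                            # stale entry - a cheaper path was found later
        if is_goal(state):                      # goal test at expansion time
            return cost, path
        for step_cost, child in successors(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(child, float('inf')):
                best_cost[child] = new_cost     # a cheaper path to child was found
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None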
completeness: yes, if every step cost ≥ ε > 0 and b is finite
time complexity: O(b^(1 + ⌊C*/ε⌋))
in depth
If all step costs are ≥ ε
and C* is the cost of the optimal solution,
then the depth of the optimal solution is at most ⌊C*/ε⌋ and therefore the complexity is O(b^(1 + ⌊C*/ε⌋)).
The extra +1 in the exponent comes from goal-testing at expansion, not generation.
space complexity: O(b^(1 + ⌊C*/ε⌋))
solution optimality: yes, if goal-tested at expansion.
Depth-first search DFS
Frontier as stack.
Much faster than BFS if solutions are dense.
Backtracking variant: store only one successor node at a time - generate the child node by modifying the current state description directly (and undo the change when backtracking).
completeness: No - only complete with an explored set and in finite state spaces.
time complexity: O(b^m), where m is the maximum depth - terrible results if m ≫ d.
space complexity: O(b·m) (linear in the depth)
solution optimality: No
Depth-limited search DLS
DFS limited with a depth limit ℓ - returns a subtree/subgraph.
Pseudocode
function DEPTH-LIMITED-SEARCH(problem, limit) returns a solution, or failure/cutoff
  return RECURSIVE-DLS(MAKE-NODE(problem.INITIAL-STATE), problem, limit)

function RECURSIVE-DLS(node, problem, limit) returns a solution, or failure/cutoff
  if problem.GOAL-TEST(node.STATE) then return SOLUTION(node)
  else if limit = 0 then return cutoff
  else
    cutoff occurred? ← false
    for each action in problem.ACTIONS(node.STATE) do
      child ← CHILD-NODE(problem, node, action)
      result ← RECURSIVE-DLS(child, problem, limit − 1)
      if result = cutoff then cutoff occurred? ← true
      else if result != failure then return result
    if cutoff occurred? then return cutoff else return failure
completeness: No - incomplete if the shallowest solution lies deeper than the limit (ℓ < d)
time complexity: O(b^ℓ)
space complexity: O(b·ℓ)
solution optimality No
Iterative deepening search IDS
Iteratively increasing the depth limit ℓ in DLS to gain completeness.
Pseudocode
function ITERATIVE-DEEPENING-SEARCH(problem) returns a solution, or failure
  for depth = 0 to ∞ do
    result ← DEPTH-LIMITED-SEARCH(problem, depth)
    if result ≠ cutoff then return result
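A minimal Python sketch of DLS and IDS (recursive, tree-search style without an explored set; the successors / is_goal callables are illustrative assumptions):

from itertools import count

def depth_limited_search(state, is_goal, successors, limit):
    # Returns a path list, the string 'cutoff', or None (failure).
    if is_goal(state):
        return [state]
    if limit == 0:
        return 'cutoff'
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, is_goal, successors, limit - 1)
        if result == 'cutoff':
            cutoff_occurred = True
        elif result is not None:
            return [state] + result
    return 'cutoff' if cutoff_occurred else None

def iterative_deepening_search(start, is_goal, successors):
    # Run DLS with depth limits 0, 1, 2, ... until a solution is found.
    for depth in count():
        result = depth_limited_search(start, is_goal, successors, depth)
        if result != 'cutoff':
            return result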
completeness: yes, if b is finite
time complexity: O(b^d)
in depth
Only a small constant factor more inefficient than BFS, therefore also O(b^d).
space complexity: O(b·d)
in depth
The total number of nodes generated in the worst case is: (d)·b + (d−1)·b² + ⋯ + (1)·b^d = O(b^d)
Most of the nodes are in the bottom level, so it does not matter much that the upper levels are generated multiple times. In IDS, the nodes on the bottom level (depth d) are generated once, those on the next-to-bottom level are generated twice, and so on, up to the children of the root, which are generated d times.
Therefore asymptotically the same as DFS.
solution optimality: yes, if step costs are identical
Overview
under the conditions:
- complete if b is finite
- complete if step costs ≥ ε for some positive ε
- optimal if step costs are identical
Does BFS expand as many nodes as DFS?
No - they have different frontier expansion strategies
but in some cases this can be true (as it depends on the graph).
But BFS expands nodes level by level, while DFS follows one branch as deep as possible before backtracking.
Is IDS optimal for limited tree depth?
No - it is only optimal with with identical step costs.
Does BFS expand at least as often as it has nodes?
BFS can goal check:
- At expansion time - then it expands at most as many nodes as it has
- At generation time (more efficient - it expands fewer nodes than it has)
Is BFS a special case of A* search?
If we set the heuristic to h(n) = 0 for every node n and all step costs are identical, then A* acts as BFS:
f(n) = g(n), i.e. it chooses the node with the lowest depth in the tree first.
Informed Search
consistency: Admissible heuristics have a problem with graph search. Which one is it? How can this problem be solved?
Consistency implies that the f-value is non-decreasing on every path.
For every node n and its successor n': h(n) ≤ c(n, a, n') + h(n')
h consistent ⇒ A* graph search is optimal.
If f decreases along the optimal path to the goal, the node on that path can be discarded by graph search (its state was already added to the explored set via a worse path).
(In tree search this is not a problem, since every path is represented by its own node, so nothing gets discarded.)
Solutions to the problem:
- require consistency of h (f non-decreasing on every path)
- additional book-keeping (e.g. re-open a state when a cheaper path to it is found)
Given admissible heuristics h1, ..., hm. Which one is the best?
The heuristic that dominates the others:
For two admissible heuristics, we say h1 dominates h2 if: h1(n) ≥ h2(n) for all nodes n.
Then h1 is better for search.
admissibility
Optimistic, never overestimating the true cost, never negative: 0 ≤ h(n) ≤ h*(n).
For goals: h(goal) = 0 (follows from the above)
Prove: If the heuristic h is consistent and n' is a successor (neighbour) of n, then f(n') ≥ f(n).
f(n') = g(n') + h(n') = g(n) + c(n, a, n') + h(n') ≥ g(n) + h(n) = f(n) (because of consistency)
Prove: h(n) = max(h1(n), h2(n)) is consistent if the heuristics h1 and h2 are consistent.
We know that h1 and h2 are consistent and therefore:
h1(n) ≤ c(n, a, n') + h1(n') ≤ c(n, a, n') + h(n')
h2(n) ≤ c(n, a, n') + h2(n') ≤ c(n, a, n') + h(n')
This means that: h(n) = max(h1(n), h2(n)) ≤ c(n, a, n') + h(n')
Which proves the consistency of h.
Local Search
local beam search and its advantages / disadvantages
Idea
- keep k states instead of just 1 to start searching from
- Not running k independent searches in parallel: searches that find good states recruit other searches to join them
- choose the top k of all their successors
Problem
often, all k states end up on the same local hill
Solution
choose successors randomly, biased towards good ones
Is local beam search a modification of the genetic algorithm with cross-over?
yes - the genetic algorithm = stochastic local beam search + successors generated from pairs of states
hill climbing algorithm - how do we deal with its weaknesses?
Advantage: (with sideways moves allowed) able to escape shoulders
Disadvantage: getting stuck in a loop at flat local maxima
Solution: Random-restart hill climbing = restart from a random initial state after getting stuck
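A minimal sketch of random-restart hill climbing in Python; the value, random_state and neighbors callables as well as the toy objective are illustrative placeholders:

import random

def hill_climbing(value, random_state, neighbors):
    # Steepest-ascent hill climbing from one random starting state.
    current = random_state()
    while True:
        best = max(neighbors(current), key=value, default=None)
        if best is None or value(best) <= value(current):
            return current                      # local maximum (or shoulder) reached
        current = best

def random_restart_hill_climbing(value, random_state, neighbors, restarts=20):
    # Run hill climbing several times and keep the best local maximum found.
    return max((hill_climbing(value, random_state, neighbors)
                for _ in range(restarts)), key=value)

# Example: maximize f(x) = -(x - 3)^2 over the integers
f = lambda x: -(x - 3) ** 2
print(random_restart_hill_climbing(f, lambda: random.randint(-10, 10), lambda x: [x - 1, x + 1]))   # 3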
Decision Trees
Why is entropy relevant for decision trees? How is it defined?
We use entropy to determine the importance of attributes.
A lower (remaining) entropy is better because it means we have to ask fewer questions (less testing = a shallower tree) to reach a decision.
Information gain measured with entropy
Entropy measures the uncertainty of the outcome of a random variable, in [shannon] or [bit].
Entropy of a random variable V with values v_k, each with probability P(v_k), is defined as:
H(V) = − Σ_k P(v_k) · log₂ P(v_k)
Entropy is at its maximum when all outcomes are equally likely.
Entropy of a boolean random variable
Boolean random variable with parameter q: the variable is true with probability q and false with probability 1 − q.
Same formula as above: B(q) = −(q · log₂ q + (1 − q) · log₂(1 − q))
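A small Python sketch of the two formulas (the function names are illustrative):

from math import log2

def entropy(probs):
    # H(V) = -sum over all outcomes of P(v_k) * log2(P(v_k)); 0 * log 0 is treated as 0.
    return -sum(p * log2(p) for p in probs if p > 0)

def boolean_entropy(q):
    # B(q): entropy of a boolean variable that is true with probability q.
    return entropy([q, 1 - q])

print(entropy([0.5, 0.5]))      # 1.0 shannon/bit - the maximum for two outcomes
print(boolean_entropy(0.99))    # ~0.08 - an almost certain outcome carries little uncertainty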
Learning
overfitting, ways to avoid it
Overfitting
occurs when a hypothesis is fitted too closely to the training set - then it does not generalize well.
To lower the likelihood of overfitting:
- Ockham's Razor: "Maximize simplicity under consistency"
We choose the simplest consistent hypothesis - simpler hypotheses might generalize better
- Choose hypothesis-space with lower expressiveness
- remove unnecessary attributes from the input
- use more examples
- Cross-Validation : Use part of the data for training, part for testing
What is inductive learning?
analytical / deductive learning : going from general rule to more specific and efficient rule
Inductive Learning : discovering the general rule from examples
- Ignores prior knowledge
- Assumes a deterministic, observable "environment"
- Assumes examples are given (selection of examples for training set is challenging)
- In its simplest form: learning a function from examples
- Assumes agent wants to learn
What is a learning curve? In which 2 cases is it not optimal?
Learning curve = % correct on the test set as a function of the number of training examples used.
Learning curve depends on realizability of the target function.
Non-realizability can be due to
- missing attributes
- too restrictive hypothesis space (e.g., linear function)
- redundant expressiveness (e.g., loads of irrelevant attributes)
Learning agents
Goal: making agents improve themselves by studying their own experiences
Learning instead of programming because designers
- can't foresee all the situations the agent might be in
- can't foresee all the environment changes
- don't know the solution
Utility Agent → Learning Agent
Learning Agent = performance element + learning element
Each component from the utility agent can be learned
- conditions → actions (which conditions cause which actions)
- percept sequence → relevant world properties
- prior knowledge about the world
- Utility function: world states and actions → their desirability
- Goals that maximize the agent's utility
Example: Taxi
Performance element: knowledge about driving
Critic: reactions of customers / other drivers / pedestrians / bikers / police
Learning element: create rules - e.g. not to turn the steering wheel too quickly, to brake softly, etc.
Problem generator: try the brakes on a slippery road, push back using the side (wing) mirror only
Learning agent components
Performance element
Everything in between sensors (input) and actuators (output) abstracted away.
Takes in percepts and chooses actions.
Critic
Feedback on performance (based on a fixed performance standard).
Learning element LE
Change "knowledge components" of agent (problem generator and performance element - independent of agents architecture).
Problem generator
suggest actions for new experiences to learn from - based on learning goals
Designing a learning element
Design is defined by:
- type of the performance element
- functional component: the model that describes what the actions do (this is what is learned)
- representation: the mathematical model of the functional component
- available feedback
Examples:
Neural Networks
construct a neural network:
Activation function:
Alternative activation functions
Activation function:
neuron unit
Mathematical Model of Neuron
McCulloch and Pitts (1943)
The inputs stay the same, only the weights are changed.
Each unit j has a dummy input a₀ = 1 with the weight w₀,j.
The unit j's output activation is: a_j = g(in_j) = g(Σ_i w_{i,j} · a_i)
where g is an activation function.
perceptron: hard threshold
sigmoid perceptron: logistic function
activation functions
Determine a threshold that makes the neuron fire if it is reached.
We use:
- step function / hard limiter
- linear function / threshold function
- sigmoid function
Advantages and disadvantages of neural networks / learning by observation
Advantages and Disadvantages of Neural Networks
Pros
- Less need for determining relevant input factors
- Inherent parallelism
- Easier to develop than statistical methods
- Capability to learn and improve
- Quite good for complex pattern recognition tasks
- Usable for "unstructured" and "difficult" input (e.g., images)
- Fault tolerance
Cons
- Choice of parameters (layers, units for network) requires skill
- Requires sufficient training material
- Resulting hypotheses cannot be understood easily
- Knowledge implicit (subsymbolic representation)
- Verification and behavior prediction is difficult (if not impossible)
Challenges
- Interpretability
- Explainability
- Trustworthiness
- Robustness (adversarial examples)
- Combining symbolic and subsymbolic approaches
Perceptron learning rule
We train the weights for each scalar output separately.
The network structure stays the same, we want to learn the right weights. We learn the weights for each output in isolation.
We want to learn weights w such that h_w(x) ≈ y
for unit j: h_w(x) = g(Σ_i w_i · x_i)
We want the minimal loss, through gradient search on the weight space: Loss(w) = (y − h_w(x))²
We want to find the partial derivative of the loss function for each weight w_i: ∂Loss(w) / ∂w_i
- Perceptron learning rule (Hard threshold)
The hard threshold can't be differentiated, so the gradient is not defined there.
For each example (x, y): w_i ← w_i + α · (y − h_w(x)) · x_i
How exactly does it work?
We have our current output h_w(x) and the true output y of the example with a given x.
- y = h_w(x)
Our weights are correct. w stays unchanged.
- y = 1 and h_w(x) = 0
Our weights are incorrect. We want to make w · x larger:
If x_i is positive, w_i gets increased
If x_i is negative, w_i gets decreased
- y = 0 and h_w(x) = 1
Our weights are incorrect. We want to make w · x smaller:
If x_i is positive, w_i gets decreased
If x_i is negative, w_i gets increased
- Gradient descent rule (Soft threshold)
For each example (x, y): w_i ← w_i + α · (y − h_w(x)) · h_w(x) · (1 − h_w(x)) · x_i
(Both rules have the same form: adjust each weight in proportion to the error (y − h_w(x)) times the input x_i.)
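A small Python sketch of the perceptron learning rule with a hard threshold, learning the logical AND function; the data, learning rate and number of passes are illustrative choices:

def threshold(z):
    return 1 if z >= 0 else 0

# Examples (x, y) with a dummy input x0 = 1 for the bias weight.
examples = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
alpha = 0.1

for _ in range(20):                                    # a few passes over the training data
    for x, y in examples:
        h = threshold(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):                        # w_i <- w_i + alpha * (y - h) * x_i
            w[i] += alpha * (y - h) * x[i]

print(w, [threshold(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in examples])   # predictions: [0, 0, 0, 1]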
What is backpropagation? How does it work for inner nodes?
We want to learn the function h_w(x)
input: x
output: y
We have a training set with examples (x, y).
We want to minimize the loss (= the sum of the losses over all output nodes).
Problem: dependency
We cannot learn the weights the same way we did for the single-layer network.
The multilayer network has a hidden layer, and the output functions are nested functions that contain the nodes behind them.
The error for the hidden layers is not directly known - we want to be able to push the errors back.
output nodes
Δ_k = Err_k · g'(in_k), with Err_k = y_k − a_k, and weight update w_{j,k} ← w_{j,k} + α · a_j · Δ_k
Each term in the final summation is computed as if the other outputs did not exist.
Same rule as in the single-layer perceptron.
Δ_k is the k-th component of the error vector y − h_w.
Here w_{j,k} stands for the weight between the nodes j and k.
The idea is that the hidden node j is "responsible" for some fraction of the error Δ_k in each of the output nodes to which it connects.
Thus, the Δ_k values are divided according to the strength of the connection between the hidden node and the output node and are propagated back to provide the Δ_j values for the hidden layer.
hidden nodes: Back-propagation
We back-propagate the errors from the output to the hidden layers.
The back-propagation rule: Δ_j = g'(in_j) · Σ_k w_{j,k} · Δ_k
The math behind back-propagation
For the k-th output node: w_{j,k} ← w_{j,k} + α · a_j · Δ_k
With Δ_k defined as before.
For the j-th hidden node: w_{i,j} ← w_{i,j} + α · a_i · Δ_j
With Δ_j defined as before.
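A compact Python sketch of these update rules for a 2-input / 2-hidden-unit / 1-output network with sigmoid units; the weight layout, the training data (the OR function) and the learning rate are illustrative assumptions:

from math import exp
import random

def g(z):                                   # sigmoid activation function
    return 1.0 / (1.0 + exp(-z))

random.seed(0)
W_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]   # hidden weights, index 0 = bias
W_o = [random.uniform(-1, 1) for _ in range(3)]                       # output weights from (bias, h1, h2)
alpha = 0.5

def forward(x):
    a_in = [1.0] + list(x)                                             # dummy input a0 = 1
    a_hid = [1.0] + [g(sum(w * a for w, a in zip(row, a_in))) for row in W_h]
    return a_in, a_hid, g(sum(w * a for w, a in zip(W_o, a_hid)))

def backprop_step(x, y):
    a_in, a_hid, a_out = forward(x)
    d_out = (y - a_out) * a_out * (1 - a_out)                          # Δ_k = Err_k · g'(in_k)
    d_hid = [a_hid[j + 1] * (1 - a_hid[j + 1]) * W_o[j + 1] * d_out    # Δ_j = g'(in_j) · Σ_k w_jk · Δ_k
             for j in range(2)]
    for j in range(3):
        W_o[j] += alpha * a_hid[j] * d_out                             # w_jk ← w_jk + α · a_j · Δ_k
    for j in range(2):
        for i in range(3):
            W_h[j][i] += alpha * a_in[i] * d_hid[j]                    # w_ij ← w_ij + α · a_i · Δ_j

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]            # the OR function
for _ in range(2000):
    for x, y in data:
        backprop_step(x, y)
print([round(forward(x)[2], 2) for x, _ in data])                      # should approach [0, 1, 1, 1]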
What is deep learning, when does it work well, where are its shortcomings?
Deep Learning
Deep learning: Hierarchical structure learning - Artificial neural network with many layers.
Feature extraction and learning (un)supervised
Problem: requires lots of data
Which functions can be represented with single and multi-layer perceptrons?
One hidden layer: all continuous functions, with arbitrary precision
Two hidden layers: all functions (including discontinuous ones)
Single-layer feed-forward neural networks (Perceptrons)
No hidden layers: Always represents a linear separator in the input space - therefore can not represent the XOR function.
Multilayer feed-forward neural networks
With a single, sufficiently large hidden layer (2 layers in total), it is possible to represent any continuous function of the inputs with arbitrary accuracy.
in depth
The output of such a network is a nested non-linear function of the inputs.
With the sigmoid function as the activation function g and one hidden layer, each output unit computes a soft-thresholded linear combination of several sigmoid functions.
For example, by adding two opposite-facing soft threshold functions and thresholding the result, we can obtain a "ridge" function.
Combining two such ridges at right angles to each other (i.e., combining the outputs from four hidden units), we obtain a "bump".
With more hidden units, we can produce more bumps of different sizes in more places.
With two hidden layers, even discontinuous functions can be represented.
CSP
Cryptographic puzzle as a CSP
  E G G
+ O I L
-------
M A Y O
Defined as a constraint satisfaction problem:
Constraints:
Each letter stands for a different digit: Alldiff(E, G, O, I, L, M, A, Y), domains {0, ..., 9}.
Column constraints with carry variables X1, X2, X3 ∈ {0, 1}:
G + L = O + 10·X1
X1 + G + I = Y + 10·X2
X2 + E + O = A + 10·X3
X3 = M
The variables X1, X2, X3 stand for the digit that gets carried over into the next column by the addition. Therefore X1 stands for the carry into the 10¹ column, X2 for the 10² column and X3 for the 10³ column.
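A brute-force Python check of the puzzle (just a sketch, not the CSP algorithm from the lecture); whether leading digits may be zero is an assumption here:

from itertools import permutations

letters = "EGOILMAY"                              # the 8 distinct letters of the puzzle
for digits in permutations(range(10), len(letters)):
    val = dict(zip(letters, digits))
    if val["E"] == 0 or val["O"] == 0 or val["M"] == 0:
        continue                                  # assume leading digits must not be zero
    egg = 100 * val["E"] + 10 * val["G"] + val["G"]
    oil = 100 * val["O"] + 10 * val["I"] + val["L"]
    mayo = 1000 * val["M"] + 100 * val["A"] + 10 * val["Y"] + val["O"]
    if egg + oil == mayo:
        print(val)                                # prints every satisfying assignment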
Explain Backtracking Search for CSPs - why isn't it efficient?
CSPs as standard search problems
If we do not consider commutativity we have n! · d^n leaves, otherwise we have d^n leaves.
The order of the assignments has no effect on the outcome - so we choose a single variable to assign in each tree level.
Backtracking search algorithm
Uninformed search algorithm (not effective for large trees).
Depth-first search:
- chooses values for one unassigned variable at a time - does not matter which one because of commutativity.
- tries all values in the domain of that variable
- backtracks when a variable has no legal values left to assign.
Because plain backtracking is inefficient, we use heuristics / general-purpose methods for backtracking search.
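A bare-bones Python sketch of the plain backtracking search described above (no MRV or other heuristics); the CSP encoding as domains plus binary constraints is an illustrative assumption:

def backtracking_search(domains, constraints, assignment=None):
    # domains: {variable: list of values}; constraints: list of ((var_a, var_b), predicate).
    assignment = dict(assignment or {})
    if len(assignment) == len(domains):
        return assignment                                     # every variable assigned
    var = next(v for v in domains if v not in assignment)     # pick any unassigned variable
    for value in domains[var]:
        assignment[var] = value
        consistent = all(pred(assignment[a], assignment[b])
                         for (a, b), pred in constraints
                         if a in assignment and b in assignment)
        if consistent:
            result = backtracking_search(domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]                                   # undo and try the next value (backtrack)
    return None

# Example: color a triangle graph with 3 colors so that adjacent nodes differ.
domains = {v: ['red', 'green', 'blue'] for v in 'ABC'}
constraints = [((a, b), lambda x, y: x != y) for a, b in [('A', 'B'), ('B', 'C'), ('A', 'C')]]
print(backtracking_search(domains, constraints))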
backtracking heuristics for choosing variables and values
Variable ordering: fail first
- Minimum remaining values (MRV): choose the variable with the fewest legal values left
- degree heuristic: choose the variable involved in the largest number of constraints on the remaining variables (used as a tie breaker)
Value ordering: fail last
- least-constraining-value - a value that rules out the fewest choices for the neighbouring variables in the constraint graph.
Does a CSP with n variables and d values in its domain have d^n possible assignments?
Yes - this is because we consider the commutativity of the assignment operations and pick a single variable to assign at each tree level.
Classical Planning
STRIPS vs. ADL
STRIPS
- only positive literals in states
- closed-world-assumption: unmentioned literals are false
- Effect P ∧ ¬Q means add P and delete Q
- only ground atoms in goals
- Effects are conjunctions
- No support for equality or types
ADL
- positive and negative literals in states
- open-world-assumption: unmentioned literals are unknown
- Effect P ∧ ¬Q means add P and ¬Q and delete ¬P and Q
- goals may contain quantifiers, e.g. ∃x Goal(x)
- goals allow conjunction and disjunction
- conditional effects are allowed: when P: E means E is only an effect if P is satisfied
- equality and types built in
formalize a STRIPS action
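A typical textbook-style example of a STRIPS action schema (the Fly action from the air-cargo domain; an illustration, not necessarily the action asked for in the exam):

Action(Fly(p, from, to),
  PRECOND: At(p, from) ∧ Plane(p) ∧ Airport(from) ∧ Airport(to),
  EFFECT: ¬At(p, from) ∧ At(p, to))

The effect adds At(p, to) and deletes At(p, from); everything not mentioned is assumed to stay unchanged (closed-world assumption).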
Planning: What 3 problems occur in reasoning about actions?
Reasoning about the results of actions
frame problem ... what are the things that stay unchanged after an action?
Describing them explicitly would require a lot of axioms.
ramification problem ... what are the implicit effects?
e.g.: the passengers of a car move with it
qualification problem ... what are the preconditions for actions?
deals with a correct conceptualisation of things (there is no complete answer).
e.g.: finding the right formalization of a human conversation
Classical Planning with state space search
explain POP algorithm
Partial-order Planning
Order is not specified: two actions can be placed into a plan without specifying which one comes first.
Search in space of partial-order-plans.
Components of a plan
- actions set
not actions in the world but actions on plans, e.g. adding an ordering constraint or a causal link to the plan, ...
- ordering constraints set
A ≺ B (A before B); must be free of contradictions/cycles
- causal links set
A →p B: executing action A achieves precondition p for action B.
Means p must remain true from the time of action A to the time of action B,
else (if an action with effect ¬p could come in between) there is a conflict.
- open preconditions set
Minimizing open preconditions (= preconditions not achieved by an action in the plan)
Algorithm
Defines execution order of plan by searching in the space of partial-order plans.
Returns a consistent plan: without cycles in the ordering constraints and without conflicts in the causal links.
- Initiating the empty plan
only contains the actions Start and Finish:
Start: no preconditions, its effect contains all literals of the initial state.
Finish: its preconditions are the goal literals (initially these are exactly the open preconditions), its effect has no literals.
ordering constraint: Start ≺ Finish
no causal links
- Searching
2.1 successor function: repeatedly choosing the next possible state.
Chooses an open precondition p of an action B and generates a successor plan (subtree) for every consistent way of choosing an action A that achieves p.
2.2 enforce consistency by defining new causal links A →p B in the plan:
If A is new, also add the ordering constraints Start ≺ A and A ≺ Finish.
2.3 resolve conflicts
If an action C conflicts with a causal link A →p B, add the ordering constraint B ≺ C or C ≺ A.
2.4 goal test
Because only consistent plans are generated - just check whether any preconditions are still open.
If the goal is not reached, add successor states.
Does POP only produce consistent plans?
Yes - because only consistent plans are generated, the goal test is simply checking whether any preconditions are still open. If the goal is not reached, it adds successor states.
all types of state space search algorithms
Totally ordered plan searches
consider only action sequences in a strictly linear order.
The state space is finite without function symbols - any graph search algorithm can be used.
Variants:
- progression planning: searching forward from the initial state to the goal
- regression planning: searching backward from the goal to the initial state
(possible because of the declarative representation of PDDL and STRIPS)
Partial-order Planning
Order is not specified: two actions can be placed into a plan without specifying which one comes first.
Search in space of partial-order-plans.
consistent plans (in partial order planning) and solutions
solution to a declared planning problem = a plan such that:
- Any action sequence that, when executed in the initial state, results in a state that satisfies the goal, is a solution.
- Every action sequence that maintains the partial order is a solution.
(In the POP algorithm the goal is therefore only defined by not having open preconditions.)
Consistent actions
Chosen actions should not undo the progress made.
Consistent plans (in partial order planning)
Must be free of contradictions / cycles in the ordering constraint set: not both A ≺ B and B ≺ A.
Must be free of conflicts in the causal links set: precondition p must remain true between A and B if A →p B.
order constraint A ≺ B - when does it lead to problems?
Action A must be executed before action B.
Must be free of contradictions/cycles: A ≺ B together with B ≺ A is a contradiction.
Utility
risk aversion: What would the risk-averse agent prefer - the lottery L, or being handed the expected monetary value of L with certainty?
As we gain more money, its utility does not rise proportionally.
The utility of getting your first million dollars is very high, but the utility of an additional million is smaller.
For the risk-averse agent:
the utility of being faced with that lottery is lower than the utility of being handed the expected monetary value of the lottery with absolute certainty: U(L) < U(S_EMV(L))
expected utility, maximum estimated utility
Expected Utility (average)
EU(a | e) = Σ_{s'} P(Result(a) = s' | a, e) · U(s')
It is implicit that s' can follow from the current state.
Sum over all outcome states s' of: the probability of state s' occurring after action a, times its utility.
Principle of maximum expected utility MEU
A rational agent should choose the action that maximizes the agent's expected utility: action = argmax_a EU(a | e)
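A tiny Python sketch of the MEU principle; the outcome model (action → list of (probability, state) pairs) and the utility numbers are made up for illustration:

outcomes = {
    'take_umbrella': [(0.3, 'dry_but_carrying'), (0.7, 'dry')],
    'no_umbrella':   [(0.3, 'wet'), (0.7, 'dry')],
}
utility = {'dry': 10, 'dry_but_carrying': 8, 'wet': -20}

def expected_utility(action):
    # EU(a) = sum over outcome states of P(s' | a) * U(s')
    return sum(p * utility[s] for p, s in outcomes[action])

best = max(outcomes, key=expected_utility)
print(best, {a: round(expected_utility(a), 1) for a in outcomes})
# take_umbrella (EU 9.4) beats no_umbrella (EU 1.0)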
The axioms of utility theory
Constraints on rational preferences of an agent
Notation:
A ≻ B: the agent prefers state A over state B
A ∼ B: the agent is indifferent between state A and state B
A ≿ B: one of the above
Constraints
- Orderability
Exactly one of (A ≻ B), (B ≻ A), (A ∼ B) holds. The agent must have a preference (or be indifferent).
- Transitivity
(A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
- Continuity
A ≻ B ≻ C ⇒ there exists a p such that [p, A; 1 − p, C] ∼ B
- Substitutability
If agent is indifferent to and then agent is indifferent to complex lotteries with same probabilities.
- Monotonicity
Agent prefers a higher probability of the state that it prefers.
- Decomposability
Compound lotteries can be reduced to simpler ones.
If an agent violates these axioms it will exhibit irrational behaviour.
Example: intransitive preferences
Agent can be induced to give away all its money:
The agent has intransitive preferences A ≻ B ≻ C ≻ A and currently owns C.
- We offer B in exchange for C + 1 cent (agent accepts, since it prefers B over C)
- We offer A in exchange for B + 1 cent (agent accepts)
- We offer C in exchange for A + 1 cent (agent accepts)
We repeat all over again - the agent ends up where it started, but 3 cents poorer.
Making decisions
Components of a decision network
= influence diagrams.
General framework for rational decisions. Return the action with highest utility.
Decision networks are an extension of Bayesian networks.
Represents:
- current state
- possible actions
- resulting state from actions
- utility of each state
Node types:
Chance nodes (ovals): random variables, uncertainty - as in a Bayesian network
Decision nodes (rectangles): the decision maker has a choice of action
Utility nodes (diamonds): the utility function
VPI
value of information - Can it be calculated without additional information?
Value of Information
Example: Oil Company
There are n indistinguishable blocks.
Exactly 1 block has oil worth C dollars.
The others are worthless.
Each block costs C/n dollars.
How much is the information whether block 3 has oil or not worth to the company?
- With probability 1/n the block has oil: the company will buy it and then profit C − C/n dollars.
- With probability (n − 1)/n the block has no oil: the company will buy a different block, because the probability of finding oil in one of the other blocks changes from 1/n to 1/(n − 1).
Average profit in that case: C/(n − 1) − C/n dollars.
We then calculate the expected / average profit with the information:
(1/n) · (C − C/n) + ((n − 1)/n) · (C/(n − 1) − C/n) = C/n
Which is equal to the price we would pay for a block if we did not have this information - so the information is worth C/n to the company.
Example: Oil Company (simplified)
There are n boxes.
Opening a box costs C/n dollars.
1 box contains C dollars, the others are worthless.
How much is the information whether box 3 contains the prize or not worth to us?
- With probability 1/n it does:
Then we will pay the price C/n to open it and take the money inside: profit C − C/n.
- With probability (n − 1)/n it does not:
Then we will open another box - the probability of finding the prize changes from 1/n to 1/(n − 1) and on average we make C/(n − 1) − C/n dollars.
We then calculate the average profit with the information:
(1/n) · (C − C/n) + ((n − 1)/n) · (C/(n − 1) − C/n) = C/n
Therefore this information has a value of C/n.
This is what it would cost us to find it out ourselves, and we would be willing to pay someone up to this amount to figure it out for us.
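A quick numeric check of the calculation above in Python (n and C are arbitrary example values):

n, C = 4, 1_000_000
price = C / n

# Expected profit without the information: buy any block at its fair price.
eu_without = C / n - price                     # = 0

# Expected profit with perfect information about block 3.
profit_if_oil = C - price                      # buy block 3
profit_if_dry = C / (n - 1) - price            # buy one of the remaining blocks instead
eu_with = (1 / n) * profit_if_oil + ((n - 1) / n) * profit_if_dry

print(eu_with - eu_without, C / n)             # both print 250000.0, i.e. VPI = C/n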
Value of perfect information VPI(= expected value of information)
Let's say the exact evidence (= perfect information) of the random variable E_j is currently unknown; e is the evidence we already have.
We define:
α = best action before learning E_j, under all actions a: α = argmax_a EU(a | e)
α_{e_jk} = best action after learning E_j = e_jk: α_{e_jk} = argmax_a EU(a | e, E_j = e_jk)
The value of learning the exact evidence E_j is what it would cost us to discover it ourselves, obtained by averaging over all possible values e_jk of E_j:
VPI_e(E_j) = ( Σ_k P(E_j = e_jk | e) · EU(α_{e_jk} | e, E_j = e_jk) ) − EU(α | e)
VPI is non-negative
In the worst case, one can just ignore the received information.
Important: this is about the expected value, not the actual value.
Additional information can lead to plans that turn out to be worse than the original plan.
Example: a medical test that gives a false positive result may lead to unnecessary surgery; but that does not mean that the test shouldn’t be done.
VPI is non-additive
The VPI can get higher or lower as new information gets acquired, since combined pieces of information can have different effects (they may be complementary or redundant).
VPI is order independent:
VPI_e(E_j, E_k) = VPI_e(E_j) + VPI_{e, E_j}(E_k) = VPI_e(E_k) + VPI_{e, E_k}(E_j)