# Monte Carlo methods

This is part 2 in the reinforcement learning for economists notes. For part 1 see Reinforcement Learning Intro. For a list of all entries in the series go here

## Monte Carlo methods

As defined in Reinforcement Learning Intro *GPI* encompasses the core ideas for all the RL algorithms in this book. One family of such algorithms is Monte Carlo (MC) methods. One distinguishing feature of these methods is that they can be implemented in a completely **model-free** way. This means that to apply them we need to know nothing about the dynamics of the model; we simply need to be able to observe sequences of states, actions, and rewards.

### Prediciton methods

**Monte Carlo prediciton** is a method for learning the state-value function for a given policy. The algorithm is given by:

Notice that the MC prediction algorithm does not update any estimates of the value function until *after* the entire episode is realized. In this sense we would say that the method does not **bootstrap**, meaning it does not use approximations of the value in other states to update the value in a particular state (value iteration is bootstrapping).

### Control methods

**Monte Carlo control** is an iterative algorithm where each iteration contains a Monte Carlo prediction step followed by a policy improvment step. The improvement step is simply done by making the policy greedy with the predicted value function.

Some technical conditions must be satisfied to ensure convergence of this naive algorithm:

- episodes have exploring starts (meaning that they don’t attempt to be greedy or optimal at the start)
- There are an infinite number of episodes

These are ok theoretically, but prohibative computationally.

**Monte Carlo ES** is an algorithm that does away with the assumption that you need an infinite number of episodes. It proceeds as

Notice that we don’t need the assumption of an infinite number of episodes because we don’t do a full policy evaluation each step.

To do away with the exploring starts assumption we need to define a few more terms.

- A policy $\pi$ is said to be
**$\varepsilon$-soft**if $\pi(a|s) > \frac{\varepsilon}{| A(s) |}$ for all $a \in A$ and $s \in S$. - An
**$\varepsilon$-greedy**is a policy where all non-greedy actions are chosen with probability $\frac{\varepsilon}{| A(s) |}$, while the greedy policy is chosen with probability $1 - \varepsilon + \frac{\varepsilon}{| A(s) |}$. **On-policy**methods evaluate and improve the policy that is used to make decisions.**Off-policy**methods use one policy to make decisions while trying to evaluate and improve another policy.- A
**first-visit**MC method uses only the*return*(not reward) after the*first*occurance of a state (or state-action pair) when doing evaluation. An**every-visit**method uses the*return*after every occurance of a state (or state-action pair) when doing evaluation.

An on-policy first-visit MC control algorithm for $\varepsilon$-soft policies is given below:

### Importance sampling

This section will include more prediction and control methods, but is an important enough concept that it deserves its own section.