This is part 3 in the reinforcement learning for economists notes. For part 1 see Reinforcement Learning Intro. For a list of all entries in the series go here
Temporal Difference methods
We continue our study of applying generalized policy iteration (GPI) to the RL problem by turning to temporal difference (TD) methods.
One step TD (TD(0))
Let’s begin our exploration of TD methods by considering the problem of evaluating or predicting the state-value function $V(s)$. The simplest TD algorithm will update $V(s)$ according to the following rule:
$$V(s) \leftarrow V(s) + \alpha \left[G - V(s) \right],$$
where $G$ is the return from state $s$ and $\alpha$ is a step-size (learning rate) parameter. The term in brackets is the difference between the realized return from state $s$ ($G$) and the current estimate of that return ($V(s)$), and is called the temporal difference.
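A minimal sketch of this update rule in Python may help fix ideas. The function name `update_value`, the dictionary representation of $V$, the state labels, and the numbers are all illustrative assumptions rather than part of these notes. The target `G` is simply passed in, so the sketch takes no stand on how it is obtained.

```python
def update_value(V, s, G, alpha=0.1):
    """Apply V(s) <- V(s) + alpha * (G - V(s)) in place.

    V     : dict mapping states to current value estimates (assumed representation)
    s     : the state whose estimate is being updated
    G     : the target (the return from state s)
    alpha : step size in (0, 1]

    Returns the bracketed difference G - V(s), computed before the update.
    """
    error = G - V[s]
    V[s] += alpha * error
    return error


# Hypothetical two-state example with all estimates initialized to zero.
V = {"a": 0.0, "b": 0.0}

# First update: the estimate moves halfway from 0.0 toward the target 1.0.
err = update_value(V, "a", G=1.0, alpha=0.5)   # err == 1.0, V["a"] == 0.5

# Second update with the same target: the remaining gap is halved again.
err2 = update_value(V, "a", G=1.0, alpha=0.5)  # err2 == 0.5, V["a"] == 0.75
```

With a constant step size, each update closes a fraction $\alpha$ of the gap between the current estimate and the target, so the bracketed difference shrinks when the same target is seen repeatedly. States that are not visited, such as `"b"` here, are left unchanged.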