Temporal Difference methods
We continue our study of applying GPI to the RL problem by looking now at temporal difference (TD) methods.
One-step TD (TD(0))
Let’s begin our exploration of TD methods by considering the problem of evaluating or predicting the state-value function $V(s)$. The simplest TD algorithm will update $V(s)$ according to the following rule:
$$V(s) \leftarrow V(s) + \alpha \left[R + \gamma V(s') - V(s) \right],$$
where $R$ is the reward received on transitioning from $s$ to the next state $s'$, $\gamma$ is the discount factor, and $\alpha$ is a step-size parameter. The term in the brackets is the difference between the TD target $R + \gamma V(s')$, a bootstrapped estimate of the return from $s$, and the current estimate $V(s)$; it is called the temporal-difference (TD) error. Unlike Monte Carlo methods, which must wait until the end of an episode to observe the full return $G$, TD(0) updates $V(s)$ after a single step by substituting its current estimate of the next state's value.
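As a concrete illustration, here is a minimal sketch of tabular TD(0) prediction on the classic random-walk chain (a standard example environment, not one defined in this text): $n$ non-terminal states in a row, each step moving left or right with equal probability, with reward 1 for exiting on the right and 0 otherwise, so the true value of state $i$ is $(i+1)/(n+1)$. The environment setup and function name are illustrative assumptions.

```python
import random

def td0_random_walk(n_states=5, episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a random-walk chain (illustrative example).

    States 0..n_states-1 sit between two terminal ends; each step moves
    left or right with equal probability. Exiting on the right gives
    reward 1, every other transition gives reward 0, so under gamma=1
    the true value of state i is (i + 1) / (n_states + 1).
    """
    rng = random.Random(seed)
    V = [0.0] * n_states               # value estimates; terminal values are 0
    for _ in range(episodes):
        s = n_states // 2              # start each episode in the middle state
        while True:
            s_next = s + rng.choice([-1, 1])
            if s_next == n_states:     # exited right: reward 1, terminal
                r, v_next, done = 1.0, 0.0, True
            elif s_next < 0:           # exited left: reward 0, terminal
                r, v_next, done = 0.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            # TD(0) update: move V(s) toward the TD target R + gamma * V(s')
            V[s] += alpha * (r + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V
```

Note that each update uses only the one-step reward and the current estimate of the next state's value, so learning proceeds online within an episode rather than waiting for the episode to terminate.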