Planning by Dynamic Programming

The planning problem consists of two subproblems:

  1. The prediction problem. Compute the state-value function of a given policy.
  2. The control problem. Find the optimal policy.

Prediction: Iterative Policy Evaluation

The goal is to compute the state-value function. Since the environment is known, we can compute it exactly through the Bellman Expectation Equation. Recall that in an MDP, the Bellman Expectation Equation is expressed as $$v_\pi=R^\pi + \gamma P^\pi v_\pi$$ where

  • $R_s^\pi=\sum_a \pi(a|s)R_s^a$.
  • $P_{ss'}^\pi=\sum_a \pi(a|s)P_{ss'}^a$.

Solution. Apply the update $v_\pi^{k+1}=R^\pi + \gamma P^\pi v_\pi^{k}$ iteratively, starting from an arbitrary $v_\pi^{0}$. Note that this is a synchronous update (all states are updated simultaneously).
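
As a minimal NumPy sketch (assuming arrays P[s, a, s'] for transitions, R[s, a] for expected rewards, pi[s, a] for the policy, and a tolerance-based stopping rule), the update can be written as follows.

    import numpy as np

    def iterative_policy_evaluation(P, R, pi, gamma, tol=1e-8):
        """Synchronous iterative policy evaluation (sketch).
        P[s, a, s2]: transition probabilities, R[s, a]: expected rewards,
        pi[s, a]: action probabilities of the policy, gamma: discount factor."""
        R_pi = np.einsum('sa,sa->s', pi, R)     # R^pi as defined above
        P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi as defined above
        v = np.zeros(P.shape[0])                # arbitrary initialization v^0
        while True:
            v_new = R_pi + gamma * P_pi @ v     # synchronous Bellman expectation backup
            if np.max(np.abs(v_new - v)) < tol:
                return v_new
            v = v_new

Since the model is known, the same $v_\pi$ could also be obtained by solving the linear system directly, e.g. with np.linalg.solve(np.eye(len(v)) - gamma * P_pi, R_pi).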

Correctness. Iterative policy evaluation converges to the true $v_\pi$. Furthermore, it converges at a linear rate.

Proof. Given any two value vectors $u,v\in\mathbb{R}^{|S|}$, measure their distance with the infinity norm

$$\|u-v\|_{\infty}=\max_{s\in S}|u(s)-v(s)|$$

Define the Bellman Expectation operator $T^\pi(v)=R^\pi + \gamma P^\pi v$. Now note that,

\[\begin{align}\|T^\pi(u)-T^\pi(v)\|_{\infty} &= \|R^\pi + \gamma P^\pi u - R^\pi - \gamma P^\pi v\|_{\infty} \\ &= \|\gamma P^\pi u - \gamma P^\pi v\|_{\infty} \\ &= \gamma\|P^\pi(u - v)\|_{\infty} \\ &\leq \gamma\|P^\pi\|_{\infty}\|u - v\|_{\infty} \quad\text{in fact}\quad \|A\|_p\doteq\sup_{x\neq 0}\frac{\|Ax\|_p}{\|x\|_p}\geq \frac{\|Ax\|_p}{\|x\|_p} \quad\text{therefore}\quad \|Ax\|_p\leq\|A\|_p\|x\|_p \\ &\leq\gamma\|u - v\|_{\infty} \quad\text{in fact}\quad \|P^\pi\|_\infty=\sup_{\|x\|_\infty=1}\|P^\pi x\|_\infty=\sup_{\|x\|_\infty=1}\max_i\sum_j P_{ij}^\pi x_j \leq \max_i\sum_j P_{ij}^\pi=1\end{align}\]

Therefore, the Bellman Expectation operator is a $\gamma$-contraction mapping.

Now, the contraction mapping theorem (Banach fixed-point theorem) guarantees that iterating a $\gamma$-contraction converges to its unique fixed point, at a linear rate determined by $\gamma$. Since $v_\pi$ satisfies the Bellman Expectation Equation, i.e. $T^\pi(v_\pi)=v_\pi$, this fixed point is exactly $v_\pi$. QED
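
For completeness, the linear rate follows in one line from the contraction property, using $T^\pi(v_\pi)=v_\pi$:

$$\|v_\pi^{k}-v_\pi\|_\infty=\|T^\pi(v_\pi^{k-1})-T^\pi(v_\pi)\|_\infty\leq\gamma\|v_\pi^{k-1}-v_\pi\|_\infty\leq\dots\leq\gamma^{k}\|v_\pi^{0}-v_\pi\|_\infty$$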

Control: Policy Iteration

The goal is to compute an optimal deterministic policy.
Solution. Alternate between the following two stages:

  • Stage 1. Given the current policy $\pi$, compute its state-value function $v_\pi$ using iterative policy evaluation.
  • Stage 2. Act greedily with respect to $v_\pi$ to obtain a deterministic policy $\pi'$, i.e. search in the space of deterministic policies:

$$\pi'(s)=\arg\max_a q_\pi(s,a) \quad\text{ where }\quad q_\pi(s,a)=R_s^a+\gamma\sum_{s'}P_{ss'}^a v_\pi(s')$$
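
A minimal sketch of the two stages, reusing the array conventions of the policy evaluation sketch above (P[s, a, s'], R[s, a]); stage 1 here uses an exact linear solve in place of iterative policy evaluation, which is a simplification.

    import numpy as np

    def policy_iteration(P, R, gamma):
        """Sketch of policy iteration. P[s, a, s2]: transitions, R[s, a]: rewards."""
        n_states, n_actions = R.shape
        policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
        while True:
            # Stage 1: evaluate the current deterministic policy exactly
            R_pi = R[np.arange(n_states), policy]
            P_pi = P[np.arange(n_states), policy]
            v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
            # Stage 2: greedy improvement, pi'(s) = argmax_a q_pi(s, a)
            q = R + gamma * np.einsum('sat,t->sa', P, v)
            new_policy = np.argmax(q, axis=1)
            if np.array_equal(new_policy, policy):    # policy stable: stop
                return policy, v
            policy = new_policy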
Correctness. At every iteration, stage 2 improves the policy. The algorithm converges to an optimal policy.

Proof. First, note that stage 2 searches over deterministic policies, so the current policy $\pi$ is deterministic and satisfies the following relation:

$$\begin{align}v_\pi(s) &= \sum_a \pi(a|s)q_\pi(s,a) \\ &= q_\pi(s,\pi(s)) \quad\text{ as the policy is deterministic}\end{align}$$

Now consider the policy chosen in stage 2 (i.e. the policy $\pi'$ that, in state $s$, chooses the action $a'$ maximizing $q_\pi(s,a)$).

$$q_\pi(s,\pi'(s))=q_\pi(s,a')=\max_a q_\pi(s,a) \geq q_\pi(s,\pi(s))=v_\pi(s)$$

Therefore,

$$\begin{align}v_\pi(s) &\leq q_\pi(s,\pi'(s)) \\ &= E_{\pi'}\{R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s\} \quad\text{current action using }\pi'\text{, while others following }\pi \\ &\leq E_{\pi'}\{R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1}))|S_t=s\} \quad\text{ as }\quad v_\pi(S_{t+1})\leq q_\pi(S_{t+1},\pi'(S_{t+1})) \\ &= E_{\pi'}\{R_{t+1}+\gamma R_{t+2}+\gamma^2 v_\pi(S_{t+2})|S_t=s\} \\ &= \dots \\ &\leq E_{\pi'}\{R_{t+1}+\gamma R_{t+2}+\dots|S_t=s\}=v_{\pi'}(s)\end{align}$$

And this is valid for every $s$. QED [first part]


Once the algorithm stops, we have that $v_\pi(s)=v_{\pi'}(s)$ for all $s$. Now, since $v_\pi(s)=q_\pi(s,\pi(s))\leq q_\pi(s,\pi'(s)) \leq v_{\pi'}(s)$, we have that

$$v_\pi(s) = q_\pi(s,\pi(s)) = q_\pi(s,\pi'(s))$$

But note that $q_\pi(s,\pi'(s)) = \max_a q_\pi(s,a)$. Therefore,

$$v_\pi(s) = \max_a q_\pi(s,a)$$

which is exactly the Bellman Optimality Equation. Consequently, $v_\pi$ is the optimal state-value function and $\pi$ is an optimal policy. QED [second part]

Control: Modified Policy Iteration

The idea is to speed up policy iteration, in particular its policy evaluation step. There are two possibilities, both of which only approximately converge to $v_\pi$ (both are sketched below):

  1. Stopping early, once the value function changes by less than $\epsilon$ ($\epsilon$-convergence).
  2. Stopping after $k$ sweeps of iterative policy evaluation.
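
Both stopping rules amount to a small change in the evaluation loop; a rough sketch, taking the already-computed $R^\pi$ and $P^\pi$ as inputs (the parameters k and eps are illustrative defaults):

    import numpy as np

    def truncated_policy_evaluation(P_pi, R_pi, v, gamma, k=3, eps=1e-3):
        """Run at most k synchronous sweeps, stopping earlier if the update drops below eps."""
        for _ in range(k):
            v_new = R_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eps:   # possibility 1: early epsilon-convergence
                return v_new
            v = v_new
        return v                                  # possibility 2: stop after k sweeps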

Control: Value Iteration

The main drawback of policy iteration is that it generates a sequence of policies and runs a full policy evaluation for each of them. How can we be more efficient?
Let's imagine that we had access to the optimal policy. We know that its state-value function is the solution of the Bellman Optimality Equation. Can we apply that equation iteratively?

Solution. For every $s$, apply $v^{k+1}(s)=\max_a \big(R_s^a + \gamma\sum_{s'}P_{ss'}^a v^{k}(s')\big)$, starting from an arbitrary $v^{0}$. Note that this is a synchronous update (all states are updated simultaneously).
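
A minimal sketch of synchronous value iteration, under the same array conventions as above (the stopping threshold is an illustrative choice):

    import numpy as np

    def value_iteration(P, R, gamma, tol=1e-8):
        """P[s, a, s2]: transition probabilities, R[s, a]: expected rewards."""
        v = np.zeros(P.shape[0])                           # arbitrary initialization v^0
        while True:
            q = R + gamma * np.einsum('sat,t->sa', P, v)   # q^k(s, a)
            v_new = np.max(q, axis=1)                      # synchronous Bellman optimality backup
            if np.max(np.abs(v_new - v)) < tol:
                # recover a greedy policy from the (near-)optimal value function
                greedy = np.argmax(R + gamma * np.einsum('sat,t->sa', P, v_new), axis=1)
                return v_new, greedy
            v = v_new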

Correctness. The algorithm converges to the optimal state-value function.
Proof. Show that the Bellman Optimality operator is a $\gamma$-contraction; the rest of the proof is then identical to the one used for iterative policy evaluation.
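
The missing step, written for the Bellman Optimality operator $T^*(v)(s)=\max_a\big(R_s^a+\gamma\sum_{s'}P_{ss'}^a v(s')\big)$, uses the elementary inequality $|\max_a f(a)-\max_a g(a)|\leq\max_a|f(a)-g(a)|$:

$$|T^*(u)(s)-T^*(v)(s)|\leq\max_a\gamma\Big|\sum_{s'}P_{ss'}^a\big(u(s')-v(s')\big)\Big|\leq\gamma\|u-v\|_\infty\quad\Rightarrow\quad\|T^*(u)-T^*(v)\|_\infty\leq\gamma\|u-v\|_\infty$$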

Important Remark. Value iteration is equivalent to modified policy iteration with $k=1$. Therefore, it is much more efficient than policy iteration! Note also that the intermediate value functions produced by the algorithm may not correspond to the value function of any policy.

Overview of Dynamic Programming Algorithms

Problem      Bellman Equation                           Algorithm                      Time Complexity (per iteration)
Prediction   Expectation                                Iterative Policy Evaluation    $O(mn^2)$
Control      Expectation + Greedy Policy Improvement    Policy Iteration               $O(mn^2)$
Control      Optimality                                 Value Iteration                $O(mn^2)$

$m$ is the number of actions, $n$ is the number of states.
Problem. These solutions are not scalable in the number of states!

Practical Tips for Dynamic Programming Algorithms

A synchronous update requires visiting and updating every state at each iteration, which is inefficient. You can use asynchronous updates instead, as long as all states continue to be selected.

In-Place Dynamic Programming

In synchronous dynamic programming, you keep two copies of the state-value function, and one iteration is given by the following code.

Code.

  • for all s in S
  • $v_{new}(s) = \max_a R_s^a + \gamma\sum_{s'}P_{ss'}^av_{old}(s')$
  • end for
  • $v_{old} = v_{new}$

But note that the Bellman Optimality operator is a contraction mapping, and convergence is guaranteed independently of the starting point. Therefore, we can use in-place operations.

Code.

  • for all s in S
  • $v(s) = \max_a R_s^a + \gamma\sum_{s'}P_{ss'}^av(s')$
  • end for
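
In Python-like terms (a sketch, assuming NumPy arrays P[s, a, s'] and R[s, a] as before), the only difference between the two variants is which value function the backup reads from:

    import numpy as np

    def sweep_synchronous(P, R, v_old, gamma):
        """Two-array sweep: every backup reads the values of the previous iteration."""
        v_new = np.empty_like(v_old)
        for s in range(P.shape[0]):
            v_new[s] = np.max(R[s] + gamma * P[s] @ v_old)
        return v_new                                  # caller then sets v_old = v_new

    def sweep_in_place(P, R, v, gamma):
        """In-place sweep: later states already see the fresh values of earlier states."""
        for s in range(P.shape[0]):
            v[s] = np.max(R[s] + gamma * P[s] @ v)
        return v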

Prioritized Sweeping

Note that the for loop of in-place DP always updates the states in the same order. We can improve on this by choosing the order of the sweep: update first the states whose values would change the most. This can be implemented with a priority queue (e.g. a heap) keyed by the Bellman error

$$\Big|\max_a\big(R_s^a + \gamma\sum_{s'}P_{ss'}^av(s')\big)-v(s)\Big|$$
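
One possible implementation, as a rough sketch (a full version would also refresh the priorities of the predecessors of the updated state, which is omitted here):

    import heapq
    import numpy as np

    def bellman_error(P, R, v, gamma, s):
        """|max_a (R_s^a + gamma * sum_s' P_ss'^a v(s')) - v(s)|"""
        return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

    def prioritized_sweeping(P, R, gamma, n_backups=10000):
        n_states = P.shape[0]
        v = np.zeros(n_states)
        # max-priority queue keyed by the Bellman error (negated, since heapq is a min-heap)
        heap = [(-bellman_error(P, R, v, gamma, s), s) for s in range(n_states)]
        heapq.heapify(heap)
        for _ in range(n_backups):
            _, s = heapq.heappop(heap)                 # state with the (possibly stale) largest error
            v[s] = np.max(R[s] + gamma * P[s] @ v)     # in-place Bellman optimality backup
            heapq.heappush(heap, (-bellman_error(P, R, v, gamma, s), s))
        return v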

Real-time Dynamic Programming

Update only the states that are relevant to the agent, $\tilde{S}\subseteq S$ (e.g. the states the agent actually visits).

Open Problem

Dynamic programming requires knowledge of $R$ and $P$. How can we get rid of this requirement? See the next lecture (model-free RL).