
Solving Basic Problems


Now that we have set up the general framework of problems of interest, let's think about how to solve them. We begin with a standard example:

Given a directed tree with non-negative node weights, find the path from the root to a leaf that maximizes the sum of weights along the path.


We could brute-force this problem by looking at every path, but there is a far better approach. Consider the following pseudocode:

Input: a directed tree T with non-negative node weights
for node in reverse_topological_sort(T):
    if node.isLeaf:
        node.path = [node]
    else:
        max_child = max(node.children, key=weight)
        node.weight += max_child.weight
        node.path = [node] + max_child.path
Output: root.path
This algorithm is an example of backwards recursion. The idea is that any particular node can be treated as the root of its own subtree: we find the optimal path from that node onwards and fold the maximum sum into the node's weight. We can then recurse backwards until we reach the true root. Remarkably, this algorithm touches any particular node's weight at most twice (one update and one read), whereas the brute-force method may visit a particular node an arbitrarily large number of times.
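
To make the pseudocode concrete, here is a minimal self-contained Python sketch. The Node class, the post-order implementation of reverse_topological_sort, and the example tree are illustrative scaffolding I've introduced, not part of the original algorithm:

class Node:
    def __init__(self, name, weight, children=()):
        self.name = name
        self.weight = weight            # non-negative node weight
        self.children = list(children)
        self.path = []                  # filled in by the DP pass

def reverse_topological_sort(root):
    # Post-order traversal: every node appears after all of its children.
    order = []
    def visit(node):
        for child in node.children:
            visit(child)
        order.append(node)
    visit(root)
    return order

def best_path(root):
    for node in reverse_topological_sort(root):
        if node.children:
            max_child = max(node.children, key=lambda c: c.weight)
            node.weight += max_child.weight   # weight now holds the best path sum
            node.path = [node] + max_child.path
        else:                                 # leaf
            node.path = [node]
    return root.path

# Example: root(1) -> a(5) and root(1) -> b(2) -> c(10).
# The best path sum is 1 + 2 + 10 = 13.
a = Node("a", 5)
c = Node("c", 10)
b = Node("b", 2, [c])
root = Node("root", 1, [a, b])
print([n.name for n in best_path(root)], root.weight)   # ['root', 'b', 'c'] 13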

This is a classic algorithm that appears in most algorithms textbooks, but it can be generalized to solve the general basic problem.

The Dynamic Programming Algorithm

In the example above, we saw how solving the path problem backwards avoided the computational cost of the brute-force approach. Let's generalize our intuition for why that process worked.

Principle of Optimality: let \(\pi^*=\{\mu_0,\ldots,\mu_{n-1}\}\) be an optimal policy for the basic problem, and let \(x_i\) be a state that appears with non-zero probability at time \(i\) in the Markov chain induced by the policy. Consider the subproblem of maximizing the "reward-to-go" from time \(i\) to time \(n\):$$E\left[g_n(x_n)+\sum_{k=i}^{n-1}g_k(x_k,\mu_k(x_k),w_k)\right]$$The truncated policy \(\{\mu_i,\ldots,\mu_{n-1}\}\) is optimal for this subproblem.


Suppose not: then there exists a policy \(\{\eta_i,\ldots,\eta_{n-1}\}\) achieving a strictly greater expected reward-to-go from \(x_i\). Consider the spliced policy that follows \(\pi^*\) up to time \(i\), and from time \(i\) onwards follows \(\{\eta_i,\ldots,\eta_{n-1}\}\) if the state at time \(i\) is \(x_i\), and \(\{\mu_i,\ldots,\mu_{n-1}\}\) otherwise. By linearity of expectation, and because \(x_i\) occurs with non-zero probability, the spliced policy achieves a strictly greater total expected reward than \(\pi^*\) on the whole basic problem, contradicting the optimality of \(\pi^*\).
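
To see exactly where linearity of expectation enters, split the total expected reward of a policy \(\pi\) by conditioning on the state at time \(i\) (the names \(J\) and \(J_i\) are introduced here just for this sketch):$$J(\pi)=E\left[\sum_{k=0}^{i-1}g_k(x_k,\mu_k(x_k),w_k)\right]+\sum_{x}\Pr[x_i=x]\cdot J_i(\pi\mid x)$$where \(J_i(\pi\mid x)\) is the expected reward-to-go from state \(x\) at time \(i\). The splice leaves the first term and every \(x\neq x_i\) summand unchanged, and strictly increases the \(x=x_i\) summand, so the total strictly increases.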

This result is useful because it allows us to define the solution to the basic problem in general:

For each time \(k\in[n+1]\), define a maximal-reward function \(V_k:S_k\to\mathbb{R}_{\geq 0}\) by \(V_n(x)=g_n(x)\) and:$$V_k(x)=\max_{a\in C_k(x)}E\left[g_k(x,a,P_k(\cdot\mid x,a))+V_{k+1}(P_k(\cdot\mid x,a))\right]$$(To ensure the definition above is well-defined, we need to extend \(V_k\) to probability distributions. We do so as follows: \(V_k(P)=\sum_y V_k(y)\cdot P(y)\).) Then an optimal policy is given by:$$\mu_k(x)=\operatorname{argmax}_{a\in C_k(x)}E\left[g_k(x,a,P_k(\cdot\mid x,a))+V_{k+1}(P_k(\cdot\mid x,a))\right]$$


Applying the Principle of Optimality as the inductive step allows one to prove this result by backwards induction on \(k\) (the base case is pretty trivial). I'll leave the details as an exercise for the reader.

Notice that \(V_k\) is the direct analog of weight in the example, and \(\mu_k\) the direct analog of path. We often call the set of \(V_k\)s the Bellman surface of the basic problem.
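
To make the general algorithm concrete, here is a hedged Python sketch of backwards induction over a finite problem. The tabular encoding and all the names (states, actions, P, r, g_n) are assumptions introduced for illustration; in particular, r[k][x][a] stands for the expected stage reward \(E[g_k(x,a,w)]\):

def backward_induction(n, states, actions, P, r, g_n):
    # states[k]    : iterable of valid states at time k (k = 0, ..., n)
    # actions[k][x]: valid actions C_k(x)
    # P[k][x][a]   : dict mapping next state y -> probability P_k(y | x, a)
    # r[k][x][a]   : expected stage reward E[g_k(x, a, w)]
    # g_n[x]       : terminal reward
    V = [dict() for _ in range(n + 1)]   # V[k][x]: the Bellman surface
    mu = [dict() for _ in range(n)]      # mu[k][x]: the optimal policy
    for x in states[n]:
        V[n][x] = g_n[x]                 # base case: V_n = g_n
    for k in range(n - 1, -1, -1):       # backwards recursion over time
        for x in states[k]:
            best_a, best_val = None, float("-inf")
            for a in actions[k][x]:
                # E[g_k + V_{k+1}] under the transition kernel P_k(. | x, a)
                val = r[k][x][a] + sum(p * V[k + 1][y]
                                       for y, p in P[k][x][a].items())
                if val > best_val:
                    best_a, best_val = a, val
            V[k][x], mu[k][x] = best_val, best_a
    return V, mu

The three nested loops (time, state, action) make the \(n\cdot m\cdot c\) count in the next paragraph visible directly in the code.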

To see the computational complexity of the dynamic programming algorithm, note that we have to compute every \(V_k(x)\), and then perform \(n\) more expected-value computations to generate the optimal policy. Assuming there are \(n\) time steps, no more than \(m\) valid states at any time, and no more than \(c\) valid actions at any state, computing all the \(V_k\) values requires at most \(n\cdot m\cdot c\) expected-value computations.

Thus, an upper bound on the computational complexity of the algorithm is \(nmc+n\) expected-value computations. This is linear in the size of the state space, the size of the action space, and the time horizon.
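
For concreteness (numbers chosen purely for illustration): with \(n=30\) time steps, \(m=1000\) states, and \(c=10\) actions, that is at most \(30\cdot 1000\cdot 10+30=300{,}030\) expected values, whereas brute-force enumeration would consider up to \(c^n=10^{30}\) action sequences.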

The beauty of this algorithm is that most additional complexities (time delays, partial information, etc.) can easily be dealt with by expanding the state space. However, in practice this can lead to enormous state spaces which aren't great to work with. We'll talk about how to deal with that in the next post.
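
As a taste of the state-expansion trick (a standard construction, sketched here rather than worked out): if the transition at time \(k\) depends on the action chosen one step earlier, so that the next state is drawn from \(P_k(\cdot\mid x_k,a_{k-1})\), then the pair \(\tilde{x}_k=(x_k,a_{k-1})\) is a perfectly good state, and the problem in \(\tilde{x}_k\) is again a basic problem, at the cost of a state space \(|A|\) times larger.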
