\ie\cf
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}  \newcommand{\R} % real numbers  \newcommand{\N}} % natural numbers  \newcommand{\Z} % integers  \newcommand{\F} % a field  \newcommand{\Q} % the rationals  \newcommand{\C}{\mathbb{C}} % the complexes  \newcommand{\poly}}  \newcommand{\polylog}}  \newcommand{\loglog}}}  \newcommand{\zo}{\{0,1\}}  \newcommand{\suchthat}  \newcommand{\pr}[1]{\Pr\left[#1\right]}  \newcommand{\deffont}{\em}  \newcommand{\getsr}{\mathbin{\stackrel{\mbox{\tiny R}}{\gets}}}  \newcommand{\Exp}{\mathop{\mathrm E}\displaylimits} % expectation  \newcommand{\Var}{\mathop{\mathrm Var}\displaylimits} % variance  \newcommand{\xor}{\oplus}  \newcommand{\GF}{\mathrm{GF}}  \newcommand{\eps}{\varepsilon}  \notag
\no
Chapter 1 |
| Previous

Mathematical Issues

100%

In this post, we take a slight detour into making rigorous some of the handwaving earlier on we did to establish the basic problem. For those of you who don't care about definitional considerations, you may want to skip this post.

What really is a basic problem?

We can summarize our setup of the basic problem into the following steps:

Of course, if we are at the final state, we skip the last step and instead add the terminal cost \(g_n(x_n)\) to finish the process. This all seems fine, and it is, until we want to do anything interesting with the basic problem.

Minimizing Costs, rigorously

In our Dynamic Programming Algorithm, we define our "cost-to-go" \(V_k\) by evaluting the expected value of \(g_k\). This, of course, assumes \(g_k\) can be interpreted as a random variable. Let's try to do that.

In order to do so without much effort, we make some basic assumptions about basic problems (haha). In particular, we'd like to assume that \(P_k(\cdot\mid x_k,u_k)\) for any \(x_k,u_k\) are distributions over at most countable sets, and for any policy, the cost \(g_k\) at any step is finite.

In this case, we can define a new term \(\tilde{g_k}\) as follows:$$\tilde{g_k}(x_k,\mu_k(x_k))=E_{w_k\sim P_k(\cdot\mid x_k,\mu_k(x_k))}\Set{g_k(x_k,\mu_k(x_k),w_k)\mid x_k,\mu_k(x_k)}$$In other words, \(\tilde{g_k}\) is a random variable dependent on \(x_k,\mu_k(x_k)\). The expected value is well defined because of the countability of the state space of \(P_k\) and finiteness of the cost. Thus, we can let the total cost be $$V_\pi(x_0)=E_{x_1,\ldots,x_n}\Set{g_n(x_n)+\sum\limits_{k=0}^{n-1}\tilde{g_k}(x_k,\mu_k(x_k))}$$This cost can be interpreted as a random variable by letting the state space be the cartesian product of \(S_k\) for each \(k\).

Show that with our new definition of the total cost, the Dynamic Programming Algorithm generates an optimal policy.

I've been a tad bit sloppy with how I use \(g_k\) as I both use it as \(g^*_k(x_k,u_k,x_{k+1})\) (the cost is dependent on the state, the action, and the resulting state) and as \(\hat{g_k}(x_k,u_k,w_k)\) (the cost is dependent on the state, the action, and the noise term). We can express the latter in terms of the former:$$\hat{g_k}(x_k,u_k,w_k)=g^*_k(x_k,u_k,f(x_k,u_k,w_k))$$So why use both? The answer is that the literature isn't consistent on defining \(g_k\). Bertsekas insists on the \(\hat{g_k}\) variation since it allows for slightly more general formulations, but others use the \(g^*_k\) variation since it avoids some of the random variable complexity in this post.
Top