Chapter 1

The Basic Problem


Our goal is to set up a framework for problems of interest in the dynamic programming model, in the hope of developing a general solution within that framework. For that, we start with the notion of a discrete-time dynamic system.

Discrete-time Dynamic Systems

For each \(k\in[n]=\{0,\ldots,n-1\}\), where \(n\in\mathbb{N}\), we define the following collections of non-empty sets indexed by \([n]\):

  • \(S_k\): the set of possible states for the system to be in at time \(k\). If \(k>0\), it depends on \(x_{k-1}\), the state at time \(k-1\).
  • \(C_k\): the set of possible actions for the player to take at time \(k\). It is dependent on state \(x_k\).
  • \(P_k\): the probability distribution governing the noise on the outcome of an action. As such, it is dependent on \(x_k\), the state, and \(u_k\), the action, at time \(k\). We write this as $$w_k\sim P_k(\cdot\mid x_k,u_k)$$

Let \(f_k\) be a function \(S_k\times C_k\times P_k(\cdot\mid x_k,u_k)\to S_{k+1}\) that represents the transition of states from time \(k\) to time \(k+1\). In other words, $$x_{k+1}=f_k(x_k,u_k,w_k).$$(It should be noted that the definition of \(f_k\) is a slight abuse of notation, since \(C_k\) depends on the choice of element of \(S_k\) and \(P_k\) depends on both; this can be rectified, though it gets messy.)

We call the data represented by \(S_k,C_k,P_k,f_k\) a Discrete-time Dynamic System, or DDS for short.
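
To make the definition concrete, here is a minimal Python sketch of how one might encode a DDS, assuming the state sets are finite and the noise is given as a sampler; the names (`DDS`, `sample_noise`, `step`) are illustrative choices, not part of the definition above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DDS:
    """A minimal encoding of a discrete-time dynamic system (illustrative names)."""
    n: int                                                  # horizon length
    states: List[list]                                      # states[k] = S_k
    actions: Callable[[int, object], list]                  # C_k, given k and the state x_k
    sample_noise: Callable[[int, object, object], object]   # draws w_k ~ P_k(. | x_k, u_k)
    f: Callable[[int, object, object, object], object]      # x_{k+1} = f_k(x_k, u_k, w_k)

    def step(self, k, x_k, u_k):
        """One transition of the system: draw the noise w_k and return x_{k+1}."""
        w_k = self.sample_noise(k, x_k, u_k)
        return self.f(k, x_k, u_k, w_k)
```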

The idea behind a DDS is that the state update of a dynamic system should be solely dependent on the current state, the action taken, and a noise parameter. In this sense, it is Markovian. Now, we'd like to consider possible collections of actions on this system.

Let \(\mu_k:S_k\to C_k\) be a function that determines the action taken by an individual at time \(k\). A policy is the data of \(\mu_k\) for each \(k\in[n]\).
Observe that the transition probability \(\Pr(x_{k+1}=b\mid x_k=a)\) on a DDS with policy \(\{\mu_0,\ldots,\mu_{n-1}\}\) is $$\Pr(x_{k+1}=b\mid x_k=a)=P_k(b\mid x_k=a,u_k=\mu_k(a)),$$where, abusing notation once more, \(P_k(b\mid x_k,u_k)\) denotes the probability that \(f_k(x_k,u_k,w_k)=b\) when \(w_k\sim P_k(\cdot\mid x_k,u_k)\). This necessarily implies that imposing a policy on a DDS converts the DDS into an (inhomogeneous) Markov chain. In other words, we are taking the decisions out of the DDS. Given \(S_k=\{y_1,\ldots,y_m\}\), the transition matrix of this Markov chain is $$T_k(\mu_k)=\begin{bmatrix}P_k(y_1\mid y_1,\mu_k(y_1))&\ldots&P_k(y_1\mid y_m,\mu_k(y_m))\\\vdots&\ddots&\vdots\\P_k(y_m\mid y_1,\mu_k(y_1))&\ldots&P_k(y_m\mid y_m,\mu_k(y_m))\end{bmatrix}$$
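
Here is a sketch of building \(T_k(\mu_k)\) in Python, assuming \(S_k\) is finite and that the next-state probabilities \(P_k(y_i\mid y_j,u)\) are available as a function; `trans_prob` and `transition_matrix` are hypothetical names.

```python
def transition_matrix(k, states_k, mu_k, trans_prob):
    """Build T_k(mu_k): entry (i, j) is the probability of moving to y_i
    given the chain is at y_j and plays the action mu_k(y_j)."""
    m = len(states_k)
    T = [[0.0] * m for _ in range(m)]
    for j, y_from in enumerate(states_k):
        u = mu_k(y_from)                       # action prescribed by the policy
        for i, y_to in enumerate(states_k):
            T[i][j] = trans_prob(k, y_to, y_from, u)
    return T                                   # each column sums to 1
```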

The Basic Problem

To evaluate which policy is best, we need a cost (or reward) function on the DDS. By convention, we consider additive cost functions, i.e., the total cost is the sum of the costs accrued throughout the process and a final cost. Given a policy \(\pi=\{\mu_0,\ldots,\mu_{n-1}\}\), we define two different types of costs (or rewards):

  • a transition cost \(g_k(x_k,u_k,x_{k+1})\), accrued when the system moves from state \(x_k\) to state \(x_{k+1}\) under action \(u_k\), for each \(k\in[n]\)
  • a terminal cost \(g_n(x_n)\), accrued at the final state \(x_n\)

We let the total cost of a DDS under a policy be the sum of these two types of costs. We could just as easily set the problem up with a reward structure instead of a cost one, but there is no difference in the modeling power of the two setups. Now, we can define our problem by asking for the policy that minimizes the total cost. This is close to the basic problem setup (known as the finite horizon problem), but there is one issue: as designed, the total cost is a random variable, since it depends on the noise, so "minimize the cost" is not yet well-defined. To rectify that, we consider the following modification.

Given a policy \(\pi=\{\mu_0,\ldots,\mu_{n-1}\}\) on a DDS as defined above, we define the expected cost of the policy to be the following:$$V(\pi):=\mathbb{E}\left[g_n(x_n)+\sum\limits_{k=0}^{n-1}g_k(x_k,\mu_k(x_k),x_{k+1})\right]$$where \(x_k\) is the random variable representing the \(k\)th state of the inhomogeneous Markov chain generated by the DDS and the policy.
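
Since \(V(\pi)\) is an expectation over runs of this Markov chain, it can be estimated by simulation. Here is a sketch reusing the hypothetical `DDS` class from earlier; `pi` is the list \(\mu_0,\ldots,\mu_{n-1}\), `g` the list of transition cost functions, and `g_n` the terminal cost (all illustrative names).

```python
def estimate_cost(dds, pi, g, g_n, x0, num_runs=10_000):
    """Monte Carlo estimate of V(pi, x0): average the accrued cost over
    many simulated runs of the chain induced by the policy pi."""
    total = 0.0
    for _ in range(num_runs):
        x, run_cost = x0, 0.0
        for k in range(dds.n):
            u = pi[k](x)                       # u_k = mu_k(x_k)
            x_next = dds.step(k, x, u)         # x_{k+1} = f_k(x_k, u_k, w_k)
            run_cost += g[k](x, u, x_next)     # transition cost
            x = x_next
        run_cost += g_n(x)                     # terminal cost g_n(x_n)
        total += run_cost
    return total / num_runs
```
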
We call a basic problem the following data:
  • A DDS \(S_k,C_k,P_k,f_k\) for \(k\in[n]\)
  • A collection of transition cost functions \(g_k\)
  • A terminal cost function \(g_n\)
The basic problem is as follows: find the policy \(\pi_*\) such that for all policies \(\pi\), \(V(\pi_*)\leq V(\pi)\).

An interesting question (for pedantic mathematicians like me) to ask about this setup is whether the basic problem even has a solution. The answer is yes, since it is possible to show that the set of achievable expected costs is a closed subset of the reals that is bounded below (by \(0\), when the costs are non-negative), so the infimum is attained. That being said, to avoid these sorts of questions, we'll often require that \(S_k\) and \(C_k\) are finite for all \(k\), which also means that the set of all policies is finite. With this assumption, we are free to use min instead of inf.
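
Under that finiteness assumption, the basic problem can, at least in principle, be solved by brute force: enumerate every policy, compute its expected cost exactly by a backward recursion over the chain, and keep the minimizer. A sketch with hypothetical names (`states` is the list \(S_0,\ldots,S_n\), and `trans_prob` gives the next-state probabilities as before):

```python
from itertools import product

def policy_cost(states, trans_prob, g, g_n, pi, x0):
    """Exact V(pi, x0) by backward recursion:
    J_n(x) = g_n(x),
    J_k(x) = sum_{x'} P_k(x' | x, mu_k(x)) * (g_k(x, mu_k(x), x') + J_{k+1}(x'))."""
    n = len(pi)
    J = {x: g_n(x) for x in states[n]}
    for k in reversed(range(n)):
        J = {x: sum(trans_prob(k, x2, x, pi[k][x]) * (g[k](x, pi[k][x], x2) + J[x2])
                    for x2 in states[k + 1])
             for x in states[k]}
    return J[x0]

def brute_force(states, actions, trans_prob, g, g_n, x0):
    """Enumerate every policy (one action per (stage, state) pair) and keep the cheapest."""
    n = len(states) - 1
    stage_maps = [list(product(*(actions(k, x) for x in states[k]))) for k in range(n)]
    best_pi, best_cost = None, float("inf")
    for choice in product(*stage_maps):
        pi = [dict(zip(states[k], choice[k])) for k in range(n)]   # mu_k as a dict S_k -> C_k
        cost = policy_cost(states, trans_prob, g, g_n, pi, x0)
        if cost < best_cost:
            best_pi, best_cost = pi, cost
    return best_pi, best_cost
```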

I have been somewhat sloppy with my definition of the basic problem, since I've left out a few nuances. The first (and less problematic) nuance is that of starting position. In defining \(V(\pi)\), implicit in the expected value is the knowledge of the starting state \(x_0\). Thus, it is often meaningful to denote \(V(\pi)\) as \(V(\pi,x_0)\) to emphasize that the starting position is \(x_0\).

The somewhat larger sin is my fuzzy handling of open- vs. closed-loop basic problems. What we have defined above is called a closed-loop minimization basic problem, since the action \(u_k\) can depend on \(x_k\) (and similarly for the policy). In an open-loop minimization basic problem, the actions \(u_0,\ldots,u_{n-1}\) must all be chosen at time \(0\), so that no feedback is involved in the policy choice.

It should be immediately obvious that the minimal cost of the closed-loop problem is at most that of the open-loop problem, since any open-loop policy is also a closed-loop policy. Because of that, we'll mostly deal with closed-loop problems, and I'll be very explicit when we aren't.

Consider the following (silly) one-turn DDS. We can either choose to gamble or not. If we gamble, we earn a value drawn from \(\mathcal{N}(0,1)\), the standard Gaussian. Otherwise, we earn nothing. We want to maximize our earnings.

Observe that an open-loop policy can't do better on average than earning \(0\), since the Gaussian has mean zero and we can't base our strategy on its outcome. On the other hand, if we gamble only when the Gaussian is positive, our expected earnings are \(\mathbb{E}[\max(X,0)]=\tfrac{1}{\sqrt{2\pi}}>0\), which is half the mean of the half-normal distribution.
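
As a purely illustrative numerical check of this claim, one can simulate both strategies: the best open-loop strategy averages about \(0\), while the closed-loop strategy averages about \(1/\sqrt{2\pi}\approx 0.399\).

```python
import math
import random

def simulate(num_runs=1_000_000, seed=0):
    """Compare the best open-loop strategy (always gamble, mean 0) with the
    closed-loop strategy 'gamble only if the draw is positive'."""
    rng = random.Random(seed)
    open_loop, closed_loop = 0.0, 0.0
    for _ in range(num_runs):
        x = rng.gauss(0.0, 1.0)
        open_loop += x                 # committed to gambling up front
        closed_loop += max(x, 0.0)     # gamble only when the draw is positive
    print("open loop   :", open_loop / num_runs)        # ~ 0
    print("closed loop :", closed_loop / num_runs)      # ~ 0.399
    print("1/sqrt(2*pi):", 1 / math.sqrt(2 * math.pi))  # ~ 0.3989

simulate()
```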