“Try not to become a man of success. Rather become a man of value.”

Imagine that you are driving a car down the street. Like with many other tasks, we can think about driving using an MDP, with the state being the configuration of your car and the other cars in relation to the road. The actions are all the different controls you have in the car (steering, how much gas, which gear), and normally the reward is $0$, except when you crash, in which case the reward will be $-1$. In this post I will explain some of the most important concepts when thinking about MDPs, and some tools we can use to solve them.


When you are driving, you are using some policy $\pi$. Your policy defines which action you will take in each state. In this post, our policies will be deterministic, so we can write $\pi(s) = a$. This means that given a state, we know with certainty what our action will be. However, that is not always the case. Sometimes policies are stochastic, and the policy maps states to a probability distribution over possible actions. The agent will then sample an action from the distribution at every step. However, there is always an optimal deterministic policy.

Let's define the value of a state $s$, $V^{\pi}(s)$, as the expected (discounted) future reward given that our agent starts in state $s$ and acts according to the policy $\pi$ for the rest of its life. If that seems complicated, imagine driving down a peaceful highway with no cars in sight. If you are a decent driver, the value of our state is very close to zero and we probably won't crash soon. However, if instead you look up at the road and see a car coming straight at you, the value of your state will be very negative. Intuitively, a good driver will try to take actions that result in the car being in high valued states.

Value functions are clearly an important building block for defining optimality, but first we need one more tool: the Q-value. For some state $s$ and action $a$, $Q^{\pi}(s, a)$ is equal to the expected future reward of the agent if it takes the action $a$ and then continues following its policy $\pi$. In reinforcement learning, an optimal policy is one that maximizes the agent's value at each state. It turns out that in order to maximize the agent's value, it should pick the action which has the highest optimal Q-value. Thus, we can transform the task of finding an optimal policy to finding the optimal Q-values $Q^*(s, a)$.$\newcommand{\E}{\mathrm{E}}$

It turns out that we are very close to having a real algorithm to find these $Q^*(s, a)$. Using our above optimal policy, we can see that: $$V^*(s) = \max_a Q^*(s, a)$$ Likewise, by definition: $$Q^*(s, a) = \E[r(s, a, s') + \gamma V^*(s')]$$ This leads us to our first algorithm:

Value Iteration

Repeat until convergence:
$Q(s, a) \leftarrow \E[r(s, a, s') + \gamma V(s')]$
$V(s) \leftarrow \max_a Q(s, a)$

The algorithm will continue to update $Q(s, a)$ and $V(s)$ for each state-action pair until it reaches the point where our $Q$ and $V$ are no longer being changed and it has reached $Q^*$ and $V^*$. It can be proven that since our values will grow in each step, this algorithm will eventually converge.

This works well when we know the exact dynamics, but it won't work without them. This is because:

$$\E[r(s, a, s') + \gamma V(s')] = \sum_{s'} \Pr(s, a, s')(r(s, a, s') + \gamma V(s'))$$

In most problems that we are interested in, we don't know these transition probabilities. However, in practice we can sample many trajectories in order to approximate $\E[r(s, a, s')]$. We can also approximate the value function $V(s) \approx \max_{a} Q(s, a)$. This leads to a second algorithm:


Take an action $a$ and observe $(s, a, s', r)$.
$Q(s, a) \leftarrow (1 - \alpha)Q(s, a) + \alpha(r + \gamma \max_{a'} Q(s', a'))$

The $\alpha$ above is called the learning rate, and is some value between zero and one. If $\alpha = 1$, the agent will completely disregard prior information in favor of its recent observation. If $\alpha = 0$, the agent will never update its beliefs. An $\alpha$ that decreases to zero is necessary for our algorithm to converge when the transition dynamics are not deterministic. Since we usually do not know about the transition dynamics, in practice a small constant learning rate is used. However, if we know that transitions are deterministic, then $\alpha = 1$ is optimal.


When will Q-Learning find $Q^*(s_0, a)$? Suppose it takes five actions to reach the optimal reward in some MDP. Since our algorithm will propogate the reward backward by one state each iteration, our algorithm will need to take these five actions five times until the Q-value even reaches the initial state $s_0$. Therefore, in order for our algorithm to find the optimal Q-values, it must try the optimal $(s, a, s', r)$ transitions at least a few times. Notice that there are no restrictions on how to choose actions during learning. The $(s, a, s', r)$ that we need to sample are independent of any policy, so we call Q-Learning an off-policy algorithm. We can be smart in how we choose actions to take during learning so that we are likely to find a good solution.$\DeclareMathOperator*{\argmax}{arg\max}$

One of the most popular methods for choosing actions is $\epsilon$-greedy. In this, we choose actions according to our current policy, $\pi(s) = \argmax_a Q(s, a)$ with some probability $1 - \epsilon$, and to sample actions randomly with probability $\epsilon$. Intuitively, this lets us propogate the current policy's Q-values while also exploring to see if there are better actions to take. As we get further into training, we are more confident that there are no better actions and we gradually lower $\epsilon$.

To get a sense for what these parameters $\epsilon$ and $\alpha$ mean intuitively, I included an example Q-Learning agent below. You can use the arrow keys to manually provide actions, or click the blue button to use the $\epsilon$-greedy method. Each tile has four numbers, representing the Q-values for each action in that state. The agent will get ten points for reaching the honey, lose ten for falling in the hole, or get one through six points for reaching the die. As the agent gathers more experience, these values will become more and more accurate. Can you calculate what the optimal Q-values are in this example? What would the effect be of letting $\alpha = 1$? If the bear only figures out how to reach the die instead of optimally going towards the honey, what parameter might be too low?

The above is an example of Tabular Q-Learning. In the implementation, I initialize Q-values to zero for every state-action pair. Tabular Q-Learning has good convergence properties, but in practice there are some problems. In many MDPs there are too many states to store a separate value for each state-action pair. Even if we could store every Q-value, there is another problem. After the bear has learned to reach the honey without falling in the hole, try pausing the learning and moving it with the arrow keys to a state which it hasn't seen much in training.

A human with the same experience as this bear would hypothesize that reaching the honey gives the bear reward and the actions result in the bear changing its location, so the best action is to move to upwards. However, the bear has no idea what to do. This is because there is no generalization in Tabular Q-Learning. Ideally, we want our agent to be able to use the information it has received during its training to make predictions about the Q-values for other states that it hasn't seen. This is what modern Q-Learning algorithms do, and it will be the topic of the next post in this series.

Read the previous post: MDPs

Show Comments