In 2014, I was listening to an instructor talk about the history of artificial intelligence. (This was just two years after Hinton's group at the University of Toronto published AlexNet, which revitalized interest in deep neural networks and was followed by several huge advances in AI.) The instructor was giving examples of problems that had been solved by AI. He told us about Deep Blue, the first computer to defeat a reigning world champion in a standard chess match. It did this by searching extensively through possible futures and choosing the moves that gave the best odds of winning according to that search. Enumerating every possibility from a given board position is intractable in Chess, since the state space is estimated to contain $\sim 10^{50}$ configurations. The creators of Deep Blue used several techniques to narrow down the search, including hand-crafted heuristics that told the computer which configurations were more desirable than others. While computers had proven extremely good at Chess, there was one game that he said we were decades away from solving: Go.

Go has an estimated state space of $\sim 10^{170}$, making an exhaustive search impossible; we cannot attack it the same way we attack Chess. Computers are better than humans at Chess because they don't get tired: they can evaluate millions of positions per second, and combined with some hard-coded knowledge about the game, machines have reached a point where humans cannot even hope to win. In the game of Go, however, pattern recognition and intuition seem to matter more. With exponentially more states, even the cleverest search algorithms did not yield an AI that could beat the best human players. A major breakthrough would be needed to ever beat humans at this game.

Just two years later, an almost unstoppable AI emerged in DeepMind's AlphaGo, which in 2016 beat one of the all-time greatest Go players four games to one with no handicap. The technique behind it is called reinforcement learning, and this is the first post of many in a series where I go over some of its basics.

In essence, in a reinforcement learning algorithm, we hand the computer a task, which it attempts over and over and improves itself based on the feedback it gets. In the case of Go, AlphaGo would play millions of games and try to improve itself based on the results of each game.

It might not be obvious at first what it even *means* to have a computer do a task or play a game. There is a common formalism in the field that can represent many different tasks: the **Markov Decision Process** (MDP). An MDP is represented by the following:

- A set of states $S$. In the game of Go, this set would contain every combination of possible board positions, times remaining, and the player whose turn it is.
- A set of actions $A$. In Go, the actions include any location where we could place a stone.
- We define some probabilities $P(s, a, s') = \Pr(s_{t+1} = s' | s_t = s, a_t = a)$. This is the probability that we end up in state $s'$ given that we start in state $s$ and take action $a$.
- We define rewards $\operatorname{r}(s, a, s')$, the reward our agent receives when it takes action $a$ in state $s$ and lands in state $s'$. Classically, a higher reward is better.
- A discount factor $\gamma$. This parameter controls how much our agent cares about immediate reward versus long-term reward. There may be two ways to solve a given task which take different amounts of time but yield the same reward. If $\gamma < 1$, then we will prefer the solution that takes less time.

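To make the pieces $(S, A, P, r, \gamma)$ concrete, here is a minimal sketch of an MDP as plain Python data structures. The tiny two-state, two-action example is made up purely to show the shape of the definition above; it is not the Go MDP from the text.

```python
# A hypothetical MDP with two states and two actions, just to illustrate
# the five components (S, A, P, r, gamma) defined above.

states = ["s0", "s1"]
actions = ["left", "right"]

# P[(s, a)] maps each successor state s' to Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# r[(s, a, s')] is the reward for taking action a in state s and landing in s'
r = {
    ("s0", "left", "s0"): 0.0, ("s0", "left", "s1"): 1.0,
    ("s0", "right", "s0"): 0.0, ("s0", "right", "s1"): 1.0,
    ("s1", "left", "s0"): 0.0, ("s1", "right", "s1"): 0.0,
}

gamma = 0.9  # discount factor: a reward t steps away is scaled by gamma**t

def expected_reward(s, a):
    """One-step expected reward E[r | s, a] under the transition model P."""
    return sum(p * r[(s, a, s2)] for s2, p in P[(s, a)].items())

print(expected_reward("s0", "right"))  # 0.8
```

Dictionaries keyed by $(s, a)$ pairs work here only because the state and action sets are tiny; the point is just that the five components are ordinary data we can write down.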
Of course, we are interested in finding the optimal actions to take for every state. To see a concrete example of how it is possible to find the optimal action, let's consider a game of rock-paper-scissors where we know our opponent only plays rock and paper, both with equal probability. We will have the following states: "start" and one state for each possible (our move, opponent move) pair. The actions will be playing rock, paper, and scissors. We will be rewarded one point if we win the game, half a point if we tie, and zero if we lose. Our goal is to find the action to take at the state "start" such that our expected reward is maximized. Let's enumerate over every possibility to find out the expected reward for each possible action.

- If we play rock, then with $\frac{1}{2}$ probability, our opponent will also pick rock and we will tie. Otherwise, our opponent will pick paper and we will lose. Therefore, $\newcommand{\E}{\mathrm{E}}$$\E[r|a = \text{rock}] = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot0 = \frac{1}{4}$.
- If we play paper, then with $\frac{1}{2}$ probability, our opponent will play rock and we will win. Otherwise, our opponent will pick paper and we will tie. Therefore, $\E[r|a = \text{paper}] = \frac{1}{2}\cdot1 + \frac{1}{2}\cdot\frac{1}{2} = \frac{3}{4}$.
- If we play scissors, then with $\frac{1}{2}$ probability, our opponent will play rock and we will lose. Otherwise, our opponent will pick paper and we will win. Therefore, $\E[r|a = \text{scissors}] = \frac{1}{2}\cdot0 + \frac{1}{2}\cdot1 = \frac{1}{2}$.

Since we want to maximize our expected reward, the optimal choice is to play paper. This is obviously a very simple example, but anything more complicated gets exponentially more expensive to compute.
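The enumeration above is short enough to sketch in a few lines of code: compute the expected reward of each action against the opponent's known policy, then take the argmax. (The `opponent` policy and helper names are just for this illustration.)

```python
# Our opponent's (known) policy: rock or paper with equal probability.
opponent = {"rock": 0.5, "paper": 0.5}

def reward(ours, theirs):
    """1 point for a win, 0.5 for a tie, 0 for a loss."""
    beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
    if ours == theirs:
        return 0.5
    return 1.0 if beats[ours] == theirs else 0.0

# E[r | a] for each of our possible actions a.
expected = {
    a: sum(p * reward(a, theirs) for theirs, p in opponent.items())
    for a in ["rock", "paper", "scissors"]
}
best = max(expected, key=expected.get)

print(expected)  # {'rock': 0.25, 'paper': 0.75, 'scissors': 0.5}
print(best)      # paper
```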

When we know the exact values of $(S, A, P, r, \gamma)$ for an MDP, it is technically possible to solve exactly for the optimal action at each state using the procedure we just applied to rock-paper-scissors. However, we often don't know the exact dynamics of a problem. And imagine enumerating every possible state in Chess: sometimes, even for seemingly simple problems, there are so many states that even with full knowledge of the problem, it would take far too long to solve. Reinforcement learning can find approximate solutions to MDPs even when we don't have access to the exact transition probabilities or reward function. Imagine a black-box reinforcement learning algorithm that takes as input an MDP (even without the transition probabilities or a reward function) and figures out the optimal action to take at each state. To convince you that this would be extremely powerful, let me give you an example of what we could do with such a tool.

Let's earn some money by defining an MDP for trading stocks. For simplicity, assume there is only one stock; since I am writing about AI, let's trade Google. The set of states $S$ can include the current price history of Google, some recent news or tweets about the company, how much money we have in the bank, and how many shares we own. An example of a state would then be $(p_0, \ldots, p_{t - 1}, p_t, b, g)$, where $p_t$ is the price of a share of Google at time $t$, $b$ is how many dollars we have in the bank, and $g$ is how many shares of Google we own. The actions include buying or selling any number of shares. We will give a reward proportional to the change in the money in our bank plus the value of our shares. Since humans don't live forever, we prefer to earn money sooner rather than later, so we will set our discount factor $\gamma$ to be less than one.
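A sketch of this trading MDP might look like the following. Everything here is an assumption for illustration: the class name, the `step` interface, the starting balance, and especially the price model, which is a made-up random walk. Real market dynamics are unknown, which is exactly why we would want a reinforcement learning algorithm here rather than an exact solution.

```python
import random

class TradingMDP:
    """A toy one-stock trading MDP with invented random-walk price dynamics."""

    def __init__(self, start_price=100.0, bank=1000.0, gamma=0.99):
        self.prices = [start_price]  # price history (p_0, ..., p_t)
        self.bank = bank             # b: dollars in the bank
        self.shares = 0              # g: shares we own
        self.gamma = gamma           # prefer earning money sooner

    def state(self):
        return (tuple(self.prices), self.bank, self.shares)

    def step(self, action):
        """action > 0 buys that many shares, action < 0 sells; returns reward."""
        price = self.prices[-1]
        # Clamp: we cannot sell shares we don't own or spend money we don't have.
        action = max(-self.shares, min(action, int(self.bank // price)))
        self.bank -= action * price
        self.shares += action
        wealth_before = self.bank + self.shares * price
        # Made-up dynamics: the next price takes a small Gaussian step.
        new_price = max(1.0, price + random.gauss(0.0, 1.0))
        self.prices.append(new_price)
        # Reward: the change in total wealth (bank + value of shares).
        return self.bank + self.shares * new_price - wealth_before

env = TradingMDP()
total_reward = sum(env.step(random.choice([-1, 0, 1])) for _ in range(10))
```

Note that `step` only samples transitions; nothing in the class hands the agent $P$ or $r$ in closed form, which mirrors the setting where reinforcement learning is useful.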

If we could set our algorithm loose to experiment and figure out the optimal times to buy and sell our shares of Google, we would be extremely rich. Of course, the above problem is not so easily solved, and even state-of-the-art reinforcement learning algorithms would not be able to trade stocks very effectively. But many problems in the real world can be modeled with MDPs, and humans have learned to solve them efficiently, which is evidence that an efficient learning algorithm can exist. This is why I think that researching reinforcement learning is worthwhile. Hopefully you find it interesting as well. In the next post in this series, I will write about one of the most popular reinforcement learning algorithms, which made the news for outperforming humans on several Atari games.

Read the next post: Value Functions and Q-Learning