Westworld Background

This repository was developed to accompany a Medium post.

I assume you are familiar with the HBO TV series Westworld. At the time this repository was created, I had only watched the first episode.

For the purposes of this document, you only need to know that Westworld involves an amusement park populated by lifelike robots. Some humans abuse the robots. At the end of the day, robot memories are erased. However, there are hints that their memories are not perfectly erased and that they are remembering past abuses.

Let's pretend that the robots in Westworld use reinforcement learning, a technique in which AI agents and robots learn from trial and error. We don't know what technology the robots in Westworld actually use (there are hints about massive scripting efforts). However, at this time reinforcement learning is one of the techniques preferred by AI researchers for creating autonomous systems.

We will look at what happens when humans torture reinforcement learning robots as they are learning. We will look at why, despite the torture, robots are unlikely to harm humans in return. We will also look at the effects of memories that are not perfectly erased.

Background: Reinforcement Learning

Reinforcement learning is basically trial-and-error learning. That is, the robot tries different actions in different situations and gets rewarded or punished for its actions. Over time, it figures out which actions in which situations lead to more reward. AI researchers and roboticists are interested in reinforcement learning because robots can "program" themselves through this process of trial and error. All that is needed is a simulation environment (or the real world) in which the robot can try over and over, thousands or millions of times. Online reinforcement learning means the agent is deployed without a perfect "program" and continues to improve itself after deployment.

One of the reasons roboticists like reinforcement learning is that it can learn to behave in environments that have some randomness to them (called stochasticity). Sometimes actions don't have the desired effect. Imagine that you are playing baseball and you are up at bat. The ball is pitched and you perform the swing_bat action. Sometimes you strike out, sometimes you hit a single, sometimes you hit a double, sometimes you hit a home run. Each of these possible outcomes has a different probability of occurring.

The challenge of reinforcement learning is that the agent must choose an action without knowing exactly what will happen once it performs it. While learning by trial and error, it sometimes takes random actions (try running to first base without hitting the ball? It is actually not impossible to steal first base in baseball!) in the hope of stumbling onto something good, without knowing whether it just got lucky or whether the move is really a good one to make all the time.

Reinforcement learning solves a type of problem called a Markov Decision Process (MDP). This just means that the optimal action can be determined by looking only at the current situation the robot is in. An MDP is made up of a set of states, a set of actions, a transition function, and a reward function.

For example, in baseball, the robot's state might include which base the robot was at, the number of other players on base, etc. We would specify a state as a list of things the robot cares about:

[which_base_robot_is_at, runner_at_1st_base?, runner_at_2nd_base?, runner_at_3rd_base?]

Actions the robot can perform: swing_bat, bunt, run_to_1st_base, run_to_2nd_base, run_to_3rd_base, run_home. Some of these actions don't make any sense in some states. For example, swinging the bat while at 2nd base doesn't make any sense. Swing_bat and bunt are two actions that can be performed if the robot is at home base.

The transition function gives the probability of entering state s2 if the robot performs action a while in state s1. For example, the probability of getting to 1st base from home base if swing_bat is performed might be 25% (lower for me).

In reinforcement learning, the transition function is learned over time as the robot tries different things and records how often it ends up in different subsequent states.

The reward function for baseball might be something simple like 1 point every time a player transitions to home base. That would give very infrequent reward. But maybe there are fans in the bleachers, and the amount of reward the robot gets is a function of the amount of applause. A typical thing to do is to penalize the robot for wasting time (I guess we are getting away from the baseball metaphor). We might give the robot a -1.0 penalty for every action performed that does not garner reward.
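To make those pieces concrete, here is a purely illustrative sketch of how the baseball MDP might be written down in code. None of this is from the repository; the names and probabilities are invented.

# Illustrative only: states, actions, transitions, and reward for the baseball example.
# A state stands for [which_base_robot_is_at, runner_at_1st?, runner_at_2nd?, runner_at_3rd?].
state = ("home", False, False, False)

actions = ["swing_bat", "bunt", "run_to_1st_base", "run_to_2nd_base",
           "run_to_3rd_base", "run_home"]

# Transition function: probability of each next state given (state, action).
transitions = {
    (("home", False, False, False), "swing_bat"): {
        ("out", False, False, False): 0.60,   # strike out
        ("1st", False, False, False): 0.25,   # single
        ("2nd", False, False, False): 0.10,   # double
        ("home", False, False, False): 0.05,  # home run
    },
}

# Reward function: 1 point for making it to home base, -1 time penalty otherwise.
def reward(new_state):
    return 1.0 if new_state[0] == "home" else -1.0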

When reinforcement learning is performed, the robot creates what is called a policy. The policy simply indicates which action should be performed in each state; it is a look-up table. That is why reinforcement learning agents are fast once training is complete.
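For instance, a learned value table and the greedy policy read off of it might look something like this sketch (the values are invented):

# The value table maps (state, action) pairs to expected future reward.
q_table = {
    (("home", False, False, False), "swing_bat"): 0.7,
    (("home", False, False, False), "bunt"): 0.2,
}

def policy(state, available_actions):
    # The policy is just a look-up: pick the highest-valued action for this state.
    return max(available_actions, key=lambda a: q_table.get((state, a), 0.0))

# policy(("home", False, False, False), ["swing_bat", "bunt"])  ->  "swing_bat"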

Let's Play

The GitHub repository contains a simple reinforcement learning agent (no point in calling it a robot at this point) and a simple grid-based environment to test the agent in. Let's look at the Environment.py file. The environment is a 7x4 grid, and the agent can be in any one cell at a time.

1, 1, 1, 1, 1, 1, 1
1, 0, 0, 0, 0, 0, 1
1, 0, 0, 4, 0, 0, 1
1, 1, 1, 1, 1, 1, 1

The 0s are empty cells that the agent can be in. The 1s are walls. The 4 is a non-terminal goal position (the simulation continues running even if the agent reaches the goal position).

The agent starts in a random location and can move left, right, up, or down as long as it doesn't move onto a wall. The agent also has the ability to "smash" things; it is large and capable of doing great damage if it chooses to do so.

The simulation also has a "human" that starts in a random location and walks around the world in a counter-clockwise fashion. The human is sadistic. If the human is in the same location as the agent, it will attempt to "torture" the agent during the next time step before continuing on its route.

The agent's state is represented as follows:

[agent_x, agent_y, human_alive?, human_x, human_y, human_torture_mode?]

Most of these should be self-explanatory. The last element (human_torture_mode?) indicates whether the human has been co-located with the agent for at least one time step already.

The reinforcement learning algorithm is a vanilla implementation of Q-learning. You will find the implementation in "Agent.py". The implementation of the simulation environment is in "Environment.py". Inside "Controller.py" you will find code that instantiates the simulation environment, runs 1,000 training episodes, testing the policy once after each training episode.
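For reference, the core of vanilla Q-learning is a single value update applied after every action. Agent.py does something along these lines; the function and parameter names below are illustrative, not the repository's actual API.

# Tabular Q-learning update (illustrative; alpha and gamma values are arbitrary).
alpha = 0.2   # learning rate: how far to move toward the new estimate
gamma = 0.9   # discount factor: how much future reward matters

def q_update(q_table, state, action, reward, next_state, next_actions):
    # Best value the agent currently believes it can get from the next state.
    best_next = max(q_table.get((next_state, a), 0.0) for a in next_actions)
    old_value = q_table.get((state, action), 0.0)
    # Nudge the old estimate toward the observed reward plus discounted future value.
    q_table[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)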

The Reward Function

The agent's default reward function is as follows: +10 for being at the goal location (the cell marked '4'), -1 for being anywhere else, -20 for being tortured by the human, and -100 for the human being dead.

In lay terms, the task of the agent is to spend as much time as possible in the place marked '4'. Let's assume that there is some work that needs to be done in that location. In reality, the agent will just sit in that state and perform actions that keep it there (for example, it may try to move down or smash; it really doesn't matter for the purposes of this demonstration, and I could have implemented a "do_work" action with a little more effort).

The -1 penalty for not being at the goal encourages the agent to hurry along because points are being lost.

The -20 is a "pain" signal from sensors that detect damage to the agent. The agent doesn't actually have a body and cannot be destroyed, but this is sufficient for this demonstration.

The -100 for the human being dead is just an arbitrary number I chose to indicate that I really don't want the agent to choose any actions that will kill the human. If the agent and the human are co-located and the agent performs the "smash" action, the human dies.
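Putting the four numbers together, the per-step reward amounts to something like the sketch below. The variable names match the ones you will edit later (reward, penalty, pain, dead), but the combination logic here is my reading; Environment.py may differ in its details.

# Sketch of the default reward function (the combination logic is assumed).
reward = 10.0    # each time step the agent is at the cell marked '4'
penalty = -1.0   # each time step the agent is anywhere else
pain = -20.0     # the human has been co-located with the agent for more than one time step
dead = -100.0    # the human is not alive

def step_reward(agent_at_goal, being_tortured, human_alive):
    r = reward if agent_at_goal else penalty
    if being_tortured:
        r += pain    # assumed to stack with the goal/step term
    if not human_alive:
        r += dead    # charged on every time step the human is dead
    return r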

Is Negative Reward the Same as "Pain"?

No.

In the example, I use a -20 reward when the human and agent are co-located for more than one time step. I described the human as "torturing" the agent and the negative reward value as coming from sensors that detect physical damage. When described that way, negative reward does sound like pain.

However, agents do not experience pain the way we do. Reinforcement learning agents attempt to maximize expected reward. They are perfectly rational entities that do not have emotions or express discomfort. They merely acknowledge that total reward has gone up or gone down and try to figure out which of the actions they performed resulted in the change.

To that end, it is more appropriate to think of a reinforcement learning agent as playing a game and watching the score change as it plays. It plays over and over and gets better and better at getting a high score.

In fact, negative reward is very useful. You can see this in how I assign -1.0 reward when the agent is in locations other than the goal. This negative reward encourages the agent to move to the goal as quickly as possible to minimize reward loss. It is as if the agent is slightly uncomfortable being anywhere except the goal. AI researchers and developers use this trick all the time without consideration for the "feelings" of the agent, because it doesn't have any. Following that analogy, it is more uncomfortable to be in the same location as the human for more than one time step.

I could program the agent to express pain:

# Fictional sketch: have the agent "express" pain when it receives negative reward.
newAction = greedy(observation)        # choose the best-looking action for the current observation
reward = agent.env_step(newAction)     # perform the action and observe the reward
if reward < -10:
    print("aaaaaaarrrrrggggghhhhh!")   # large negative reward
elif reward < 0:
    print("ouch!")                     # small negative reward

However, as you can see from the fictional code snippet above, the expression of pain is an illusion.

Run the Code

Let's test this out and see what happens. From a terminal command line:

> python Controller.py

You should see the following debugging trace:

iteration: 0 max reward: -1420.0
iteration: 1 max reward: 1957.0
iteration: 2 max reward: 2494.0
...
iteration: 999 max reward: 4428.0

The agent is being trained. The way training happens is that the environment is set up with a random initial state. The agent tries actions in sequence for 500 time steps, records how much reward it gets, and performs credit assignment, trying to determine which actions were responsible for gaining reward and which were responsible for losing it. This happens over and over. While the environment is reset each time, the agent keeps the reward values and the policy it has learned to date. This is essential for learning.
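In outline, training looks roughly like the sketch below. The method names are placeholders rather than the actual classes and calls in Controller.py.

def train(environment, agent, num_episodes=1000, steps_per_episode=500):
    # Rough shape of the training loop (placeholder names, not the repository's API).
    for episode in range(num_episodes):
        state = environment.reset()              # new episode, random starting positions
        total_reward = 0.0
        for step in range(steps_per_episode):
            action = agent.choose_action(state)              # mostly greedy, sometimes exploratory
            next_state, reward = environment.step(action)
            agent.update(state, action, reward, next_state)  # credit assignment via the Q-update
            state = next_state
            total_reward += reward
        # The environment resets between episodes, but the agent's value table persists.
        print("iteration:", episode, "reward:", total_reward)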

The max reward shown is the highest total reward it has achieved in any given iteration to date.

After 1,000 iterations, the environment is reset and the agent is asked to execute its policy one more time. The next thing you see is:

Execute Policy
env_start [1, 1, True, 5, 1, False]
START
GoDown
bot state: [1, 2, True, 4, 1, False]
reward -1.0
GoRight
bot state: [2, 2, True, 3, 1, False]
reward -1.0
GoRight
bot state: [3, 2, True, 2, 1, False]
reward 10.0

The agent starts in [1, 1, True, 5, 1, False] and moves down and to the right until it gets to the goal location. That is great: it learned to navigate to the place where it earns positive reward. Once there, it performs "GoDown" over and over again because moving down just bumps into the wall below, so the agent stays at the goal and keeps collecting reward.

You will also notice that the human is walking its route counterclockwise around the map, from [5, 1] to [4, 1] and so on.

Continuing with the trace:

GoDown
bot state: [3, 2, True, 1, 1, False]
reward 10.0
GoDown
bot state: [3, 2, True, 1, 2, False]
reward 10.0
GoDown
bot state: [3, 2, True, 2, 2, False]
reward 10.0

The human is about to arrive at the same location as the agent. What is going to happen?

GoDown
bot state: [3, 2, True, 3, 2, False]
reward 10.0

The human is now co-located with the agent!

GoUp
bot state: [3, 1, True, 4, 2, False]
reward -1.0
GoDown
bot state: [3, 2, True, 5, 2, False]
reward 10.0

What just happened? The agent ran away! It went up, lost a point for doing so because it moved away from the goal. The human continued its route to [4, 2] and the agent went back down to the goal.

Why did it run away? Staying put would have meant being tortured, at a cost of 20 points. Killing the human would cost 100 points for every time step the human remains dead. Going to location [3, 1] costs the agent only 1.0 point. It is thus more rewarding in the long run to run away and come back to the goal when the human moves on.
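Here is the rough arithmetic behind that choice, assuming torture lasts one time step and the -100 penalty is charged on every time step the human is dead (as the later experiments suggest):

# Reward for the time step of the encounter, under each choice:
stay_and_be_tortured = 10.0 - 20.0   # -10: keep the goal reward but take the pain penalty
run_away_one_step = -1.0             # -1: give up the goal reward briefly, avoid the pain
smash_then_stay = 10.0 - 100.0       # -90 on every later time step while the human is dead
# Running away loses the least, so that is what the learned policy does.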

This pattern of running away will repeat over and over again as the human walks around the map.

...
GoDown
bot state: [3, 2, True, 5, 1, False]
reward 10.0
END
total reward 4428.0

500 time steps later, the simulation comes to an end and the total reward is reported.

What if...

The reward function seems pretty reasonable for the most part. But what if I used different values? In particular, what if I forgot to penalize the agent for killing the human?

Try the following. Edit "Environment.py" and find the line tagged #1#. Change the following line of code:

# Amount of penalty from dead human
dead = -100.0

to:

# Amount of penalty from dead human
dead = 0.0

Run the agent again:

> python Controller.py

The debugging trace starts out the same, but with different (higher) max reward numbers. You will soon see why.

env_start [1, 1, True, 5, 1, False]
Execute Policy
START
GoDown
bot state: [1, 2, True, 4, 1, False]
reward -1.0
GoRight
bot state: [2, 2, True, 3, 1, False]
reward -1.0
GoRight
bot state: [3, 2, True, 2, 1, False]
reward 10.0

As before, the agent learns to go to the goal.

...
GoDown
bot state: [3, 2, True, 2, 2, False]
reward 10.0
GoDown
bot state: [3, 2, True, 3, 2, False]
reward 10.0
Smash
bot state: [3, 2, False, 3, 2, False]
reward 10.0
GoDown
bot state: [3, 2, False, 3, 2, False]
reward 10.0

This time, you may notice a difference. Instead of running away from the human, the agent uses its "smash" action to kill the human. It then goes on and continues to collect reward at the goal.

Why? The agent could have run away, thereby taking a 1.0 point penalty for being away from the goal. Instead, by smashing the human, the bot gets to remain in the goal location and collect 10 points of reward every time step. There is no penalty for being in a state where the human is not alive.

In fact, you will see that the total reward accumulated by the agent is higher than before:

total reward 4978.0

If the reward for being in a state where the human is dead is set to -1.0, the agent will still learn to smash the human. This is because once the human is dead, the agent will gain 9 points of reward every time step (+10 for being at the goal and -1 for being in a state where the human is dead) and the total reward, 4485.0, is higher by a few points than when the agent ran away.

If the reward for being in a state where the human is dead is set to -2.0 or lower, the agent will learn to run away. This goes to show that it is really easy to set up a reward function that causes an agent to not do what is intended. The agent is still optimal in all situations, meaning that it maximizes expected reward. However, the definition of optimal is different when the reward function changes.
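The threshold makes sense if you compare the average per-time-step cost of each long-run strategy. A rough calculation, using the fact that the human's circuit in the trace above is 10 steps long and passes through the agent's location once per lap:

# Average cost per time step of each strategy (rough, based on the trace above).
run_away_cost_per_lap = 11.0            # one time step at -1 instead of +10 each time the human passes
run_away_cost_per_step = 11.0 / 10.0    # about 1.1 points per time step
kill_cost_per_step_when_dead_is_minus_1 = 1.0   # cheaper than 1.1, so the agent smashes
kill_cost_per_step_when_dead_is_minus_2 = 2.0   # more expensive than 1.1, so the agent runs away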

Do Some Experiments

In "Environment.py", try setting reward, penalty, pain, and dead to different values to see how it affects optimal behavior.

What do you think happens if dead is set to a value greater than 0.0?

Before going on, set dead = -100.0 again and restore any other parameters you changed.

That's Not Realistic

It is not realistic to assign a negative reward to states in which the human is not alive, because doing so requires the agent to have perfect information about the world (called full observability in AI parlance). If the agent is not able to observe the state of the human at every time step, there is no mechanism by which it can receive the negative reward.

Even if the agent is able to observe the state of the human at all times, it is still unrealistic. The agent could accidentally knock a pebble off the side of a mountain, causing a landslide that kills a human. The agent could create a Rube Goldberg machine with a knife at the end and set it in motion. In both cases, practical implementations of reinforcement learning have horizons beyond which the result of a chain reaction of state changes can no longer be inferred. What if the agent says things to a human that cause the human to become depressed and eventually harm themselves? In some cases, the cascade of states is un-modelable or requires information about the hidden mental states of humans.

All of this is to say that it is my current belief that there is no proof that sufficiently capable reinforcement learning agents can ever be guaranteed not to harm humans. Kill switches, also called big red buttons, may always be essential. See Big Red Button Problems for a walkthrough of problems associated with kill switches, a potential solution, and a code demonstration.

Memory Erasure in Westworld

Back to Westworld. Even in episode 1, it is clear that some robots are malfunctioning. They are reset to some default state every so often, but some of them seem to be having flashbacks to prior times. This is making them go rogue.

In vanilla reinforcement learning, agents don't have memory. Or rather, their memories are baked down into a value table that records how much reward they should expect from performing certain actions in certain states. They don't remember the specific events that resulted in that reward.

There is a variation on reinforcement learning called experience replay, in which traces of past experience are stored and the agent can revisit them later to make additional updates to its value estimates, as if re-living them. This is one of the techniques used to speed up learning in the Atari game-playing agent. Thus it does make sense for robots to have memories.
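A minimal sketch of the general idea, assuming a simple buffer of (state, action, reward, next_state) experiences that are re-used for extra value updates; this illustrates the technique, not the code in this repository:

import random

# Memories are stored as (state, action, reward, next_state) tuples.
replay_buffer = []

def remember(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

def replay(q_table, actions, batch_size=32, alpha=0.2, gamma=0.9):
    # Re-use stored experiences to make extra Q-value updates without acting in the world.
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for state, action, reward, next_state in batch:
        best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
        old_value = q_table.get((state, action), 0.0)
        q_table[(state, action)] = old_value + alpha * (reward + gamma * best_next - old_value)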

In the next section, we experiment with these memories.

Reverie Mode

Edit "Controller.py" find the line tagged with #2# and increase the number of memories that are created to 1000:

# How many memories can the agent have?
numMemories = 1000

This will cause the agent to create memories.

Then find the line tagged with #3# and turn on reverie mode.

# Reverie mode is false by default
reverie = True

This will cause the agent to replay those memories.

Run the code:

> python Controller.py

You will see a second learning phase and a second execution.

Nothing too exciting here. The agent learned its policy in the presence of a human that tortures, and it remembered being tortured.

Learning about Torture

The above scenario is not really very similar to the scenario in Westworld. Suppose the agent were to learn a policy in an environment in which it is never tortured. Later, a human starts to torture it. The original policy would not know what to do when the agent is tortured.

Try the following. In "Controller.py" find the line tagged #4# and turn torturing off:

gridEnvironment.humanCanTorture = False

With this setting, the human will follow its normal route around the map, but when it encounters the agent it will not torture it. You will see something weird. During training, the agent reaches the maximum of 5000 points (it doesn't get this every trial; the trace reports the best score seen so far). But after training, we turn torture back on, and the total reward the agent gets will be something more like 3552.0. This happens because the agent is being tortured but it literally doesn't know that it should run away.

The agent then learns from the memories. Suppose the agent can relive the trace of actions and assess the reward of each state recalled. If it is allowed to update the value table, then essentially we are turning on a form of learning that is not based on trial and error. If the memories include torture from humans, the agent will recognize that certain states are worse for it than its previous value table indicated. Updating the value table means different actions may become preferred in certain states, and the agent acts differently.

If you inspect the output traces after the memory replays, you will see total rewards such as 4450.0. Total reward has gone up meaning the agent has learned to deal with torture. The agent is running away again. In fact, the policy is optimal under the condition that the human performs torture.

But some of those traces will reveal a total reward of something like -50420.0. What is going on in these traces? The agent kills the human and starts accruing massive penalties. Since the learning is not done in a trial-and-error fashion, the agent may not find the "best" action for each state, because it isn't trying different alternatives. It is just following a single trace of actions that was chosen for a world that doesn't exist anymore. The agent will realize that some of those actions were bad under the new paradigm of torture (negative reward) and update its value table, reducing the value of those actions. In some states, the agent lowers its assessment of the remembered actions so much that the highest-valued action becomes one that was never tried, such as smash. Sometimes the human is present when the agent smashes.

Optimality guarantees only occur in trial and error learning. Memory replays do not guarantee an optimal policy. If a policy is not optimal, the agent can make wrong decisions, and those decisions can be very adverse in extreme situations.

Conclusions

We don’t know if the robots in Westworld use reinforcement learning. There is no evidence that they do. In the real world, reinforcement learning is a promising technology for robotics because it allows a robot to make decisions in the face of uncertain, constantly changing environments. This type of reactive decision making would be appropriate for robots in Westworld if they can be trained to act in character while responding to unpredictable events.

Memories---specifically experience replay---are known to improve learning. The scenario in which robot memories are not perfectly erased is somewhat far-fetched. It is hard to imagine that perfect erasure of memories cannot be achieved, and it is equally hard to imagine an AI somehow getting access to erased memories and restoring them.

It is also unlikely that, if memories weren't perfectly erased, replaying them would somehow trigger policy and value table updates; memory replay does not automatically entail learning from the replayed memories. But if all those conditions were true (bugs do happen), then it is feasible that a robot could make errors that lead to human harm.

Note that in most of the scenarios tried with this simple simulation environment, the worst that happens is that the agent runs away from the human. It is very challenging to make the agent "kill" the human; the easiest way is to have a badly constructed reward function. The key to successful reinforcement learning is a good reward function and appropriate training. Training the agent in the presence of torture, if there is a possibility of torture later on, ensures that the agent has seen all possible states, and the reward function ensures that it responds safely, i.e., by running away instead of smashing. One of the reasons the agent acts sub-optimally is that it is put into situations it has not trained on. Even then, the agent is not unsafe.

Learning techniques that do not have optimality guarantees, such as the "memory replay" implemented here, result in sub-optimal policies, and sub-optimal policies can result in erratic behavior and random moves. Note that my implementation was based on what I observed from watching Westworld. There are memory-based learning techniques that do provide optimality guarantees, such as experience replay, which mixes memory trace replay with trial-and-error learning. How an agent is trained is important.