Offline Off-Policy Reinforcement Learning with Contextual Bandits

In healthcare, it is not always possible to create a simulation environment (e.g. a patient receiving a treatment) or to interact with the environment by letting the policy choose actions. Running a reinforcement learning algorithm online to learn or evaluate a policy may be infeasible because doing so is expensive, risky, unethical, or illegal. However, we do have a wealth of retrospective data stored in electronic medical records (EMR) that we can use both to learn a target policy and to evaluate it. Training and evaluating a target policy this way is referred to as the offline, off-policy setting.

Offline learning: training samples come from a retrospective dataset (aka logged data). The agent is not allowed to explore, meaning it never observes the reward or next state for an action of its own choosing; it only sees the outcomes of the actions that were logged.

Off-policy evaluation: because we cannot let the agent choose actions and collect rewards, we estimate the expected reward of the agent's (target) policy from logged data that was collected under a different, possibly very different, behavior policy.
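One common way to compute such an estimate is importance sampling (also called inverse propensity scoring), which reweights each logged reward by how much more or less likely the target policy is to take the logged action than the behavior policy was. The sketch below is a minimal illustration, assuming the behavior policy's action probabilities were logged; the function names and the toy data are hypothetical.

```python
def ips_value(logged, target_prob):
    """Importance-sampling estimate of the target policy's expected reward.

    logged: list of (state, action, reward, behavior_prob) tuples, where
            behavior_prob is the probability the behavior policy gave
            to the logged action (assumed to be recorded and > 0).
    target_prob: function (state, action) -> probability under the target policy.
    """
    total = 0.0
    for state, action, reward, behavior_prob in logged:
        weight = target_prob(state, action) / behavior_prob
        total += weight * reward
    return total / len(logged)


# Toy logged data (made up): the behavior policy chose uniformly
# between two actions, so every logged probability is 0.5.
logged = [
    ("s1", "a1", 1.0, 0.5),
    ("s1", "a2", 0.0, 0.5),
    ("s2", "a1", 0.0, 0.5),
    ("s2", "a2", 1.0, 0.5),
]

# Hypothetical deterministic target policy: a1 in s1, a2 in s2.
PREFERRED = {"s1": "a1", "s2": "a2"}


def target_prob(state, action):
    return 1.0 if PREFERRED[state] == action else 0.0


estimate = ips_value(logged, target_prob)
print(estimate)  # 1.0 for this toy data
```

When the two policies differ a lot, the weights become large and the estimate gets noisy, which is one reason the behavior policy matters so much in this setting.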

What is challenging about off-policy learning?

Take Q-learning as an example. In Q-learning we aim to learn the long-term value of each State-Action pair. During evaluation, given a State, we choose the Action with the highest value.

Q-table	Action 1	Action 2
State 1	0.5	0.1
State 2	1.3	?
State 3	?	2.8

The Q-function can be as simple as a lookup table filled with value estimates.

\[ Q(s,a) \leftarrow r + \gamma Q(s',a') \]

\[ (s,a,r,s') \sim \text{Dataset} \]

\[ a' \sim \pi(s') \]

where \(s\), \(a\), \(r\), and \(s'\) are known from the dataset and \(a'\) is generated by our target policy.
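As a rough illustration, the update can be applied to a lookup-table Q-function over logged transitions. This is a minimal sketch, assuming a greedy target policy, a discount of 0.9, and a default value of 0.0 for any pair that has not been updated yet; all names and the two toy transitions are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (assumed value)


def greedy_action(q, state, actions):
    """Target policy: pick the action with the highest current Q estimate."""
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, transition, actions, gamma=GAMMA):
    """Apply Q(s,a) <- r + gamma * Q(s',a') for one logged transition."""
    s, a, r, s_next = transition
    a_next = greedy_action(q, s_next, actions)  # a' comes from the target policy
    q[(s, a)] = r + gamma * q[(s_next, a_next)]


actions = ["a1", "a2"]
q = defaultdict(float)  # pairs never updated silently default to 0.0

# Toy logged transitions (s, a, r, s'), made up for illustration.
dataset = [
    ("s2", "a1", 1.0, "s3"),
    ("s1", "a1", 0.0, "s2"),
]

for _ in range(3):  # a few sweeps over the logged data
    for transition in dataset:
        q_update(q, transition, actions)

print(q[("s2", "a1")], q[("s1", "a1")])
```

Note that no transition in the toy dataset starts from `s3`, so `Q(s3, ·)` is never trained and the update for `(s2, a1)` relies entirely on the arbitrary default.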

The value estimate for \((s',a')\) will have a large error if that State-Action pair never appears in the dataset, because the model has never been trained on it. In other words, the missing cells in the Q-table above cannot be filled in with value estimates if we have never observed the State-Action pair. Because the update bootstraps \(Q(s,a)\) from \(Q(s',a')\), the error also propagates to pairs that were observed.

\[ (s',a') \notin \text{Dataset} \rightarrow Q(s',a') = \text{bad} \]

\[ (s',a') \notin \text{Dataset} \rightarrow Q(s,a) = \text{bad} \]

One solution is to only choose actions that were observed in the dataset for a given state. This is the idea behind batch-constrained Q-learning (BCQ).
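Restricting the policy to actions that were observed for a state can be sketched with the Q-table from above. This is a simplified tabular illustration under stated assumptions, not the published algorithm; the function names and the logged transitions (including their rewards) are hypothetical.

```python
def seen_actions(dataset):
    """Map each state to the set of actions logged for it."""
    seen = {}
    for state, action, _reward, _next_state in dataset:
        seen.setdefault(state, set()).add(action)
    return seen


def constrained_greedy(q, state, seen):
    """Pick the highest-valued action among those observed for this state.

    Returns None when the state never appears in the dataset.
    """
    candidates = sorted(seen.get(state, ()))
    if not candidates:
        return None
    return max(candidates, key=lambda a: q.get((state, a), float("-inf")))


# The Q-table from above; the "?" cells are simply absent.
q = {
    ("State 1", "Action 1"): 0.5,
    ("State 1", "Action 2"): 0.1,
    ("State 2", "Action 1"): 1.3,
    ("State 3", "Action 2"): 2.8,
}

# Hypothetical logged transitions (state, action, reward, next state).
dataset = [
    ("State 1", "Action 1", 0.0, "State 2"),
    ("State 1", "Action 2", 0.0, "State 3"),
    ("State 2", "Action 1", 1.0, "State 3"),
    ("State 3", "Action 2", 2.0, "State 1"),
]

seen = seen_actions(dataset)
choices = {s: constrained_greedy(q, s, seen)
           for s in ["State 1", "State 2", "State 3", "State 4"]}
print(choices)
```

With this constraint the policy never has to rely on a "?" cell: in State 2 it can only pick Action 1, and in State 3 only Action 2.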

Contextual Bandits

Contextual bandits, which can be thought of as a 1-state RL model, offer a convenient and easy-to-implement way to create a target policy. Here are some ways contextual bandits differ from other RL methods:

Bandits	Reinforcement learning
Actions don’t change the state of the world: State, Action, Reward	Actions change the state of the world and the new state is observed: State, Action, Reward, Next State
Assumes samples (S, A, R tuples) are IID	Models a sequential decision process
Optimizes for immediate reward	Optimizes for cumulative (discounted) long-term reward so it needs to explore all possible sequences of actions
1-state or stateless RL	Models the State-Next State transition probability matrix
Multi-armed bandits (stateless RL): current state is unknown, only Action & Reward are observed; learn to choose the best action according to the reward. Contextual bandits (1-state RL): current state is known; learn to choose the best action according to the state and reward	Markov decision process, Actor-Critic, Temporal difference learning, Q-learning

One remarkable finding is that Contextual Bandits can be expressed as a cost-sensitive supervised learning problem.
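One way to picture this reduction: each logged (context, action, reward) sample is turned into a vector of estimated costs, one per action, and a cost-sensitive learner then picks the action with the lowest cost for each context. The sketch below is a toy version, assuming importance-weighted cost estimates and a lookup table standing in for the classifier; the names and data are hypothetical.

```python
def to_cost_vector(action, reward, behavior_prob, actions):
    """Importance-weighted cost estimate for every action.

    Only the logged action has an observed reward, so it gets
    cost -reward / behavior_prob; all other actions get 0.
    """
    return {a: (-reward / behavior_prob if a == action else 0.0)
            for a in actions}


def fit_tabular_policy(logged, actions):
    """Stand-in for a cost-sensitive classifier: per context,
    sum the estimated costs and choose the cheapest action."""
    totals = {}
    for context, action, reward, behavior_prob in logged:
        costs = to_cost_vector(action, reward, behavior_prob, actions)
        ctx_totals = totals.setdefault(context, {a: 0.0 for a in actions})
        for a in actions:
            ctx_totals[a] += costs[a]
    return {ctx: min(actions, key=lambda a: c[a]) for ctx, c in totals.items()}


actions = ["a1", "a2"]

# Toy logged data (context, action, reward, behavior probability), made up.
logged = [
    ("x1", "a1", 1.0, 0.5),
    ("x1", "a2", 0.0, 0.5),
    ("x2", "a1", 0.0, 0.5),
    ("x2", "a2", 1.0, 0.5),
]

policy = fit_tabular_policy(logged, actions)
print(policy)  # {'x1': 'a1', 'x2': 'a2'}
```

In practice the lookup table would be replaced by any supervised model that accepts per-class costs, which is what makes the reduction convenient.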

Paper	Description
Off-policy Evaluation and Learning. Agarwal A.	How to do offline learning with contextual bandits? Reduces Contextual Bandits to a supervised learning problem with cost-sensitive classification.
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. Wang Y-X, Agarwal A, Dudík M.	How to evaluate contextual bandit models? Explains and compares importance sampling, direct method, and doubly robust off-policy evaluation.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. Jiang N, Li L.	Extends the doubly robust estimator from contextual bandits to sequential decision-making models like RL.
Off-Policy Deep Reinforcement Learning without Exploration. Fujimoto S, Meger D, Precup D.	Addresses the problem of large errors in value estimates for unobserved State-Action pairs. Introduces batch-constrained Q-learning (BCQ): the agent only chooses actions that were observed in the dataset.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Kumar A, Fu J, Tucker G, Levine S.	Improves on BCQ with a new off-policy Q-learning algorithm (BEAR).