Two Small Extensions To Contextual Bandits

Researcher in Machine Learning, Causality and Recommender Systems. On a career break to write a book on Reward Optimizing Recommender Systems. Previously led a team of researchers and engineers at Criteo.
I have spoken quite a bit about how real system is a multi turn problem, but a lot of causal approaches to recommender systems use a single turn contextual bandit. This simplification occurs in part because the multi turn problem is so very complicated. Here I will talk about some simple extensions to the contextual bandit, that can take this simplified framework quite a bit further.
Traditional contextual bandits assume that “reward” is an observed variable (and it is the only thing that is observed after a recommendation. Let’s extend this framework in a way that is more in line with traditional Bayesian decision theory. The model then takes the form:
P(response|action, context)
The utility (reward) takes the form: u(response).
These can be combined to compute the optimal decision for a given context is computed:
best_action(context)
\= argmax_action ∫ u(response) P(response|action, context) d response,
where ∫ is the integral sign. That is the best action is the one that has the maximum expected utility, following the expected utility hypothesis. How is this an improvement on just treating the reward (utility) as observed? Firstly, there can be aspects of the response that help in model fitting, but do not contribute to reward. This is the case in the probabilistic rank and reward model. Another reasons is the response could contain both a value and a cost, in computational advertising targeting someone will cost some money, but bring some value, the response can keep track of both cost and the value.
There is a second advantage, which occurs when we think about the utility not of a single users, but a group of users. Let’s use capital U to denote the utility over all users from 1 to N that we are interested in. If we have the form:
U(response_1, …, response_N) = ∑_i u(response_i)
Then maximizing U() can be accomplished by maximizing each u() individually. In some problems U(response_1, …, response_N) can not be written as a sum. Think of a computational advertising problem where the goal is to maximize value subject to a fixed marketing spend. In this case there is a knapsack style problem where the value of targeting the N users can be computed and sorted and the finite budget is used to determine which users ought to be targeted and which ones should not be.
I think of this “extended contextual bandit” as the richest sort of modelling that can be done without going to the complexity of multi turn problems.




