The Great Horvitz-Thompson Estimator Controversy and Implications for Causal Inference

Researcher in Machine Learning, Causality and Recommender Systems. On a career break to write a book on Reward Optimizing Recommender Systems. Previously led a team of researchers and engineers at Criteo.
Thirteen years ago a remarkable controversy appeared in the statistical blogosphere. James Robins and Larry Wasserman wrote a blog article titled “Robins and Wasserman Respond to a Nobel Prize Winner”, which disagreed with arguments made by Christopher Sims about an estimator that is often used for off-policy estimation (a branch of causal inference).
A policy is the process that delivers actions to users. Estimating the expected reward of a novel policy is often one of the most practically interesting causal questions, as you usually want to select the policy with the highest expected utility. This is the question that off-policy estimation targets, and it is usually answered with the Horvitz-Thompson estimator. The idea behind the estimator is to compare a proposed new policy with the past logging policy. If the new policy takes an action more often than the logging policy did, the rewards for that action are scaled upward (often by a lot); if it takes the action less often, or not at all, the rewards are scaled downward or ignored. The reward of the new policy is then computed by weighting each logged reward by a ratio of propensity scores: the probability of the logged action under the new policy divided by its probability under the logging policy.
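To make the reweighting concrete, here is a minimal sketch in Python. The function name horvitz_thompson_value, the two-action simulation and all the numbers in it are assumptions made up for illustration; they do not come from the Robins-Wasserman post. The sketch also assumes that the logging propensities are known exactly and are strictly positive for every logged action.

```python
import numpy as np


def horvitz_thompson_value(rewards, logging_propensities, target_propensities):
    """Estimate the expected reward of a new policy from logged data.

    Each logged reward is weighted by the propensity ratio
    P_new(logged action) / P_logging(logged action), then averaged.
    """
    rewards = np.asarray(rewards, dtype=float)
    p_log = np.asarray(logging_propensities, dtype=float)
    p_new = np.asarray(target_propensities, dtype=float)
    if np.any(p_log <= 0):
        raise ValueError("logging propensities must be strictly positive")
    weights = p_new / p_log  # > 1 scales a reward up, < 1 scales it down
    return float(np.mean(weights * rewards))


# Hypothetical simulation: two actions, no context, binary rewards.
rng = np.random.default_rng(0)
n = 200_000
logging_policy = np.array([0.8, 0.2])  # P(action) under the logging policy
target_policy = np.array([0.1, 0.9])   # P(action) under the new policy
reward_rate = np.array([0.1, 0.5])     # P(reward = 1 | action), unknown in practice

actions = rng.choice(2, size=n, p=logging_policy)
rewards = rng.binomial(1, reward_rate[actions])

estimate = horvitz_thompson_value(
    rewards, logging_policy[actions], target_policy[actions]
)
true_value = float(target_policy @ reward_rate)
print(f"estimate={estimate:.3f}, true value={true_value:.3f}")
```

In this toy setting the new policy favours an action that the logging policy rarely took, so the few logged examples of that action receive a weight of 4.5. That is the "scaled upward, often by a lot" behaviour described above, and it is also why the estimator can have high variance.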
Drawing upon a 1997 paper by Robins and Ritov, the post observes that the Horvitz-Thompson estimator is frequentist; in fact, its use of the propensity scores violates the likelihood principle. The central argument is that any estimator that ignores the propensity scores will not have uniform convergence, yet using the propensity scores is contrary to Bayesian principles; hence no Bayesian method will have uniform convergence.
In causal inference vocabulary: if we have a stratified randomized experiment in which the covariate is sufficiently high dimensional and the logging policy is sufficiently complex, then estimators of the average treatment effect must use the logging policy in order to have uniform convergence. Yet Bayesian principles insist that the propensity scores must be ignored.
In fact, the last sentence above is controversial, and the Wasserman-Sims debate mostly hinges on whether Bayesian estimators that incorporate the propensity score are valid. If we accept the Robins-Ritov argument, then a Bayesian who ignores the score has to let go of uniform convergence in order to keep other properties they like, such as coherence, admissibility and the likelihood principle.
What are the practical implications of this rather academic “epic debate”? Essentially this: if you have a sufficiently complex logging policy (the stratification is complex) and you want to estimate the expected utility of a new policy, then you may have to choose between good frequentist properties and good Bayesian properties. At least, this is a subject for debate. The Bayesian approach (or “direct method”) is usually dismissed in the literature for other reasons; the Robins-Ritov result would seem to provide additional support for this position.
But do we need to know the expected utility of a policy? If there is only a single turn, then it is sufficient to know the expected reward of every action in every context. Bayesian methods are quite appealing here, as they can incorporate relevant information (the context and the action properties) and ignore irrelevant information (the propensity score).
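As a rough illustration of the single-turn case, the sketch below estimates the expected reward of every action in every context with about the simplest Bayesian model available: binary rewards and an independent Beta(1, 1) prior for each (context, action) pair. The function posterior_mean_reward, the prior and the tiny log are all assumptions chosen for illustration. A realistic model would share information across contexts and actions through their properties, which this one does not do. Note that the logging propensities never enter the calculation.

```python
import numpy as np


def posterior_mean_reward(contexts, actions, rewards, n_contexts, n_actions,
                          prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of P(reward = 1 | context, action).

    Assumes binary rewards and an independent Beta(prior_alpha, prior_beta)
    prior for every (context, action) cell.
    """
    successes = np.zeros((n_contexts, n_actions))
    trials = np.zeros((n_contexts, n_actions))
    # np.add.at accumulates correctly when the same cell appears repeatedly.
    np.add.at(successes, (contexts, actions), rewards)
    np.add.at(trials, (contexts, actions), 1)
    return (prior_alpha + successes) / (prior_alpha + prior_beta + trials)


# Hypothetical log of (context, action, reward) triples.
contexts = np.array([0, 0, 0, 1, 1, 1, 1])
actions = np.array([0, 1, 1, 0, 0, 1, 1])
rewards = np.array([0, 1, 1, 1, 0, 0, 0])

means = posterior_mean_reward(contexts, actions, rewards, n_contexts=2, n_actions=2)
best_action = means.argmax(axis=1)  # action with highest posterior mean per context
print(means)
print(best_action)
```

With a table like means in hand, choosing an action in a given context only requires comparing the entries in that context's row; no estimate of the overall utility of a policy is needed.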
First appeared in the Causal Python Weekly Newsletter.




