Confounding is a Pervasive Problem in Real World Recommender Systems

Last week I wrote about why you shouldn’t have to worry about confounding if you operate an interactive system (such as a recommender system). This week I want to talk about my upcoming talk at the CONSEQUENCES workshop at RecSys 2025: Confounding is a Pervasive Problem in Real World Recommender Systems. Although these statements might sound contradictory, they aren’t, and the reason is the word “shouldn’t”.

Put simply, there are no unobserved confounders; however, common practices in productionizing machine learning models can lead to a confounder being ignored.

Consider a recommender system that fits a click model and uses it to find the recommendations with the highest click-through rate for a given covariate (often called a context). If the production system is currently using the covariate x1, then the causal graph will have a link x1→a (where a is the recommendation). In practice this means future recommendations will be influenced by x1. We could also add another covariate x2, and then the graph delivering the recommendations will have the links x1→a←x2. The personalization of the recommendations becomes more fine-grained, now being determined by both x1 and x2.
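To make this concrete, here is a minimal sketch of such a click model, assuming a log of (x1, x2, a, click) tuples and a model that simply counts clicks per context. The names fit_click_model and recommend, the toy log, and the covariate values are all invented for illustration; a real system would use a proper model rather than raw counts.

```python
from collections import defaultdict

# Hypothetical toy log: (x1, x2, recommendation a, click). Values are invented.
LOGS = [
    ("news", "mobile", "a", 1),
    ("news", "mobile", "a", 1),
    ("news", "mobile", "a", 1),
    ("news", "mobile", "b", 0),
    ("news", "desktop", "a", 0),
    ("news", "desktop", "b", 1),
    ("news", "desktop", "b", 1),
]
ACTIONS = ["a", "b"]


def fit_click_model(logs, use_x2):
    """Estimate the click-through rate per (context, action) by counting."""
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for x1, x2, action, click in logs:
        # The context is x1 alone, or the pair (x1, x2) once x2 is added.
        context = (x1, x2) if use_x2 else (x1,)
        shows[(context, action)] += 1
        clicks[(context, action)] += click
    return {key: clicks[key] / shows[key] for key in shows}


def recommend(ctr, context):
    """Return the action with the highest estimated CTR for this context."""
    return max(ACTIONS, key=lambda action: ctr.get((context, action), 0.0))


ctr_x1 = fit_click_model(LOGS, use_x2=False)
ctr_x1_x2 = fit_click_model(LOGS, use_x2=True)

print(recommend(ctr_x1, ("news",)))
print(recommend(ctr_x1_x2, ("news", "mobile")))
print(recommend(ctr_x1_x2, ("news", "desktop")))
```

With this toy log the x1-only model recommends a for every news context, while the model that also sees x2 recommends a on mobile and b on desktop. Once the second model is deployed, the recommendations in all future logs depend on x2.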

What if you decide that adding x2 isn’t helpful? Can you just remove it from the model? Unfortunately, even though exactly the same x1-only model fit correctly and without confounding in the past, the link x2→a is now present in the logged data, which means that x2 can’t simply be ignored. The bottom line is that you can add features to the model, but you can’t take them out without doing something complicated like implementing the backdoor rule (which would give up the simplification and reduced variance that dropping the covariate was supposed to provide).
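A small simulation can illustrate the problem. This is a sketch under assumptions I have made up: binary x1, x2 and a, a logging policy that depends only on x2, and a click probability where x2 shifts the baseline click rate but does not change which recommendation is best (so x2 really is unhelpful for personalization). It compares a naive x1-only estimate with a backdoor adjustment over x2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Hypothetical binary covariates.
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)

# Assumed logging policy: the deployed model used x2, so x2 -> a.
p_a1 = np.where(x2 == 1, 0.9, 0.1)
a = (rng.random(n) < p_a1).astype(int)


def click_prob(x1, x2, a):
    """Assumed ground truth: x2 raises clicks, a=1 adds only 0.02."""
    return 0.05 + 0.05 * x1 + 0.30 * x2 + 0.02 * a


click = (rng.random(n) < click_prob(x1, x2, a)).astype(int)


def naive_ctr(x1_val, a_val):
    """CTR from a model that sees only x1 and a (x2 dropped)."""
    return click[(x1 == x1_val) & (a == a_val)].mean()


def backdoor_ctr(x1_val, a_val):
    """Backdoor adjustment: average the x2-specific CTRs over P(x2 | x1)."""
    total = 0.0
    for x2_val in (0, 1):
        stratum = (x1 == x1_val) & (x2 == x2_val)
        weight = stratum.sum() / (x1 == x1_val).sum()
        total += weight * click[stratum & (a == a_val)].mean()
    return total


def true_ctr(x1_val, a_val):
    """Ground truth under the assumed model, with P(x2 = 1) = 0.5."""
    return 0.5 * click_prob(x1_val, 0, a_val) + 0.5 * click_prob(x1_val, 1, a_val)


true_effect = true_ctr(0, 1) - true_ctr(0, 0)
naive_effect = naive_ctr(0, 1) - naive_ctr(0, 0)
adjusted_effect = backdoor_ctr(0, 1) - backdoor_ctr(0, 0)
print(true_effect, naive_effect, adjusted_effect)
```

In this setup the naive estimate of the effect of a is many times larger than the assumed true effect of 0.02, because a=1 was mostly shown when x2=1. The adjusted estimate recovers the truth, but only by fitting a model that still contains x2.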

There are other common situations that can result in confounding in real systems. If the recommendation engine is built from a hierarchy of models and you add a feature to one of them, that model will improve, but every other model in the system will suffer from confounding. Similarly, if there are separate click and sales models with different features, these models will also confound each other.

So-called “good engineering practices” (modularization) and “good machine learning practices” (feature engineering, A/B testing) are all potential causes of confounding. There are “easy fixes”, such as using the same features in all models, but such an approach is likely unacceptable from both an engineering and a model-fitting point of view. Avoiding all confounding will not only take a dedicated company-wide effort, it will also likely require new machine learning best practices.

You might also be wondering if propensity-score-based methods can help here. They aren’t necessary, but they are an option. One possible approach is to use the propensity score as a balancing score, a low-dimensional substitute for the high-dimensional covariates. The other is to bypass the model altogether and use the inverse propensity score estimator (or Horvitz-Thompson estimator), which directly estimates the expected clicks for a new recommendation algorithm (policy).
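The second option can be sketched in a few lines. This example assumes the logging propensities are known exactly, and it reuses an invented setup: a binary covariate x2 that drives the logging policy, and a new policy that always recommends a=1. The function name ips_estimate and all the numbers are hypothetical.

```python
import numpy as np


def ips_estimate(clicks, target_probs, logging_probs):
    """Inverse propensity score estimate of expected clicks under a new policy.

    target_probs[i]  : probability the new policy picks the logged action i
    logging_probs[i] : probability the logging policy picked the logged action i
    """
    return np.mean(clicks * target_probs / logging_probs)


rng = np.random.default_rng(1)
n = 400_000

# Assumed logging policy that depends on x2.
x2 = rng.integers(0, 2, size=n)
p_log_a1 = np.where(x2 == 1, 0.9, 0.1)
a = (rng.random(n) < p_log_a1).astype(int)
logging_probs = np.where(a == 1, p_log_a1, 1.0 - p_log_a1)

# Assumed click model: x2 raises clicks, a=1 adds 0.02.
clicks = (rng.random(n) < 0.05 + 0.30 * x2 + 0.02 * a).astype(int)

# New policy: always recommend a=1, ignoring every covariate.
target_probs = (a == 1).astype(float)

ips_value = ips_estimate(clicks, target_probs, logging_probs)
naive_value = clicks[a == 1].mean()     # ignores how a=1 was chosen
true_value = 0.5 * 0.07 + 0.5 * 0.37    # known only because this is simulated
print(ips_value, naive_value, true_value)
```

Here the estimator never fits a click model, so there is no feature set to get wrong. The price is the variance that comes from dividing by small propensities, and the need to log the propensity of every recommendation.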

First appeared in the Causal Python Weekly Newsletter.