Does "Large Scale AI" Need Bayesian Deep Learning?


Researcher in Machine Learning, Causality and Recommender Systems. On a career break to write a book on Reward Optimizing Recommender Systems. Previously led a team of researchers and engineers at Criteo.

A 2024 position paper argues: Bayesian Deep Learning is Needed in the Age of Large-Scale AI.

This paper’s argument might seem like something that would resonate with me: it combines Bayes and deep learning, and it comes from very well-known academics promoting Bayes as important for the hottest topic right now.

I am not convinced.

At least, I am not convinced about the use of Bayes in the initial part of training, which is about learning a joint distribution over tokens. I do agree, though, that the problem they describe is in fact a very big one. Let’s start with their illustration:

The way LLMs generate text that looks authoritative but is actually garbage is perhaps their main limitation. Why does this happen, though? Mainly because the (initial) training involves learning a joint distribution over tokens (factorized using the product rule and parameterized by a massive transformer architecture). In the large-data limit, the next-token prediction model will simply output the observed frequency with which each token completes a fragment. If this exact question appeared in the training data numerous times, the next-token prediction will converge to the fraction of times the fragment is followed by a “yes”, a “no” or some other response. This is the frequentist notion of probability.

What does Bayes bring? A model trained with Bayesian deep learning will also output that frequency when there are many examples of the fragment in the training data. If the fragment is rare, though, the next-token prediction becomes more uncertain and more “smoothed” towards similar fragments. It does not mean the model will respond with “I am not at all certain” when prompted.
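To make the contrast concrete, here is a minimal toy sketch. It is not Bayesian deep learning: it assumes a three-token vocabulary and a Dirichlet prior whose mean stands in for "similar fragments". The counts, the prior and the `prior_strength` value are all made up for illustration.

```python
def empirical_frequency(counts):
    """Frequentist estimate: fraction of times each token followed the fragment."""
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def dirichlet_posterior_mean(counts, prior_probs, prior_strength):
    """Posterior mean under a Dirichlet prior with pseudo-counts
    prior_strength * prior_probs[tok] for each token."""
    total = sum(counts.get(tok, 0) for tok in prior_probs)
    return {
        tok: (counts.get(tok, 0) + prior_strength * p) / (total + prior_strength)
        for tok, p in prior_probs.items()
    }


# Hypothetical prior: how "similar" fragments were completed in the corpus.
similar_fragment_prior = {"yes": 0.5, "no": 0.4, "other": 0.1}

# Hypothetical counts for a fragment seen very often and one seen only twice.
common = {"yes": 9000, "no": 900, "other": 100}
rare = {"yes": 2, "no": 0, "other": 0}

common_freq = empirical_frequency(common)
common_bayes = dirichlet_posterior_mean(common, similar_fragment_prior, prior_strength=10.0)
rare_freq = empirical_frequency(rare)
rare_bayes = dirichlet_posterior_mean(rare, similar_fragment_prior, prior_strength=10.0)

for name, freq, bayes in [("common", common_freq, common_bayes),
                          ("rare", rare_freq, rare_bayes)]:
    print(name)
    for tok in similar_fragment_prior:
        print(f"  {tok:5s} frequency={freq[tok]:.3f}  posterior mean={bayes[tok]:.3f}")
```

For the common fragment the two estimates are nearly identical. For the rare fragment the frequency estimate puts all its mass on "yes", while the posterior mean is pulled towards the similar-fragment prior. In both cases the output is still just a distribution over next tokens, which is the point made above.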

The big advantage of Bayes is that it allows you to be very confident about things you have seen frequently, and more uncertain, and more influenced by similar data, about everything else. I don’t think this property is particularly useful when pre-training on massive corpora of text. In the causal part of LLM training, however, Bayes may be more useful. In these stages training is done using a form of reinforcement learning from human feedback (RLHF). In causal scenarios you know a lot about the actions you have taken often (that is, you know a lot about how the world responds to them). You know little, or perhaps nothing, about other actions. Bayes could well help here.
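A bandit-style toy illustrates the causal case. It is only a sketch under simple assumptions, not RLHF: each action has a binary reward, the prior on its success rate is a uniform Beta(1, 1), and the trial counts are invented.

```python
import math


def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior over an action's
    success rate, given observed successes and failures."""
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical logs: one action tried 1000 times, one tried 3 times, one never.
frequent_mean, frequent_std = beta_posterior(successes=700, failures=300)
rare_mean, rare_std = beta_posterior(successes=2, failures=1)
untried_mean, untried_std = beta_posterior(successes=0, failures=0)

print(f"frequent action: mean={frequent_mean:.3f} std={frequent_std:.3f}")
print(f"rare action:     mean={rare_mean:.3f} std={rare_std:.3f}")
print(f"untried action:  mean={untried_mean:.3f} std={untried_std:.3f}")
```

The posterior is tight for the action tried often and wide for the actions tried rarely or never, which is the kind of information a decision rule can use when choosing what to try next.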

But as for the fact that LLMs quite often produce nonsense outputs while reporting very high confidence, I don’t see how Bayesian deep learning can help.