Date: 7 December 2025
Large-scale interactive systems such as large language models and recommender systems are arguably the largest living body of machine learning, and so they are a very natural testbed for the principles and practices that have been developed in academia. This workshop asks the following questions: Which of these academic ideas survive in a production setting? Which are lost in translation? And what additional ideas are needed in order to have a whole-of-system understanding of best practices for machine learning in production?
These systems have quickly developed into highly complex combinations of models that combine content (image and text), collaborative filtering, click models, and other signals, with model performance measured at A/B test time. However, the speed at which these methods have been developed and then embraced in production has left us playing catch-up in establishing solid best practices, perhaps informed by relevant theories such as reinforcement learning, causal inference, Bayesian inference, or indeed the theory of randomized controlled trials.
Can all disputes about relative performance really be settled using Moritz Hardt’s Iron Rule: agree on a metric, benchmark, compete? Or should we pay heed to the ghost of John Maynard Keynes (paraphrased): "Practical men who believe themselves to be quite exempt from any intellectual influence are usually the slaves of some defunct learning theorist"? Indeed, applying the Iron Rule seems relatively straightforward for image classification, but it does not transfer so easily to large-scale industrial systems, where there are many subsystems, many online and offline metrics that famously do not agree, and the problem of drawing correct causal conclusions. How should the need to modularize a system be traded off against establishing whole-system best practices? How much do we need to worry about causal issues such as SUTVA, confounding, or M-bias? Should we embrace inferential principles such as the likelihood principle, or must we use propensity scores to make valid causal claims? Is A/B testing the correct way to implement the Iron Rule in production settings? How much do we have to learn from adjacent fields with complex protocols of best practice, such as epidemiology?
| Name |
|---|
| David Rohde (Criteo) |
| James McInerney (Netflix) |
| Mingzhang Yin (University of Florida) |
| Srivas Chennu (Apple) |
| Kaiwen Wang (Cornell Tech) |