Weak Baselines and Bad Management

The pressure is high in the AI world right now and baselines are wobbling.

Weak Baselines and Bad Management
Via Dall-E

Let’s say I had a model that identifies typos in text and I wanted to prove that it is the world’s best typo model. I would need to: 

  1. Find or build a large dataset of text with all the typos labeled 
  2. Run my model and the existing best typo models on the dataset
  3. Demonstrate that my model correctly finds more true typos (and doesn’t flag as many non-typos) than the other models 

If the results are good, I can write a paper, get it reviewed by my peers and I will temporarily be able to claim that I have the “state-of-the-art” typo identification model until someone detrones me. 

This process of running head-to-head comparisons against competitive models is the core of machine learning research, and it’s how effective companies deploy machine learning in products. In an appropriation of terminology from the hard sciences, we call the process an experiment and the models we compare against the baseline models, or simply the baselines

A baseline need not be a fancy machine learning model. Asking my neighbor’s 12-year-old daughter to identify typos on the dataset would be a valid baseline. I couldn’t claim that I had the best typos model in the world, but I could claim that my model is better than a well educated middle school student. 

This may all sound obvious — of course you need to compare against your competition to claim you’re better than your competition! But people mess it up all the time. Sometimes by accident, but often on purpose. The pressure is high in the AI world right now, new practitioners are streaming into the field and it can be hard to know what even is the “state-of-the-art” with so many results flying around. Things are getting messy and people are losing the thread. 

What I want MoP readers to remember is that experimental results with botched or (horrifyingly) missing baselines are useless. For the same reasons, algorithmically powered product features that have not been evaluated against a baseline are also useless. If you are an executive who thinks that adding a bunch of unevaluated — or un-evaluatable — AI features into your product is the path for fame and glory in this Age of AI, I urge you to reconsider. Every one of those features is instant tech debt: You will not be able to improve them if you have no way of measuring them and they will quietly degrade in performance until someone pulls the plug. 

Let’s take some examples.

Google’s response to OpenAI’s series of GPT models is a suite of generative AI models called Gemini. The announcement of the models includes experimental results that appear to indicate that Gemini Ultra outperforms GPT-4 on the large MMLU dataset of multiple-choice questions. But on close inspection, it’s clear that Google used a different, more powerful strategy to generate the prompts they sent the Gemini model (CoT@32) than they used to generate the prompts for GPT-4. 

Critics quickly picked up that this was not a one-to-one baseline, “If you look at the main marketing material it showed Gemini with Cot-32 vs GPT-4 5-shot which is a bit dishonest (yes, there is a more detailed table, but the headline result is rubbish).” said Ilya Venger of Microsoft. The 90-page, 1.2k+ author paper that accompanied the release of the models contains detailed experimental results and I am not suggesting any of these numbers are incorrect. But I assume that if they had created a more powerful model than GPT-4, they would have reported clearer baselines. 

Here in NYC, Mayor Eric Adams has decided to dive into the AI game. Last fall, the city shipped its first AI-powered chatbot called The MyCity Chatbot. It was designed to help business owners navigate city government, but an investigation from The Markup found that it will sometimes advise business owners to do all kinds of illegal things. “Yes, you can keep your funeral home pricing confidential,” it confidently told them, even though the FTC begs to differ. 

A traditional search system may be less flashy, but would likely perform better against the metric COUNT(returns advice that is illegal). My guess is that the city didn’t run this safety baseline. 

I don’t mean to be a downer; the new AI technology is exciting. I can’t wait to use it to replace existing algorithmic systems that make low-stakes, often incorrect predictions such as whether an article is a good “morning read” or if a shopper would like to buy socks with their shoes. I am not excited about having it replace high-stakes, highly accurate interactions such as conversations with a lawyer or HR business partner. 

AI will make boring enterprise software more efficient and it will make junior engineers write simple code faster, and that’s enough to change the way we work. But it is not a salve for every product or business model problem, and for the love of god, if you are experimenting with putting it in your product, run some baselines before you ship. 


Sign in or become a Machines on Paper member to join the conversation.
Just enter your email below to get a log in link.