In defense of vibe-checking
As AIs get closer to humans, so does the way we evaluate how well they are working. “Vibe-checking” is the act of evaluating an LLM and deploying it to production just by eyeballing the outputs and *feeling* whether it is good enough. But is that really a problem?
Engineering is regarded as applied science for solving practical problems. We think of it like applied physics, where knowledge is derived from first principles, but in reality it looks much more like the natural sciences, where constant tinkering leads us to new discoveries. Just look at all the recent advances in AI and the papers coming out of them: they are all about tinkering, trying things until they work. Attempts to explain *why* they work come much later.
Even though this has always been the case across engineering (or do you think humans only started building houses *after* we had geometry? Nope, practice comes first, theory after), in retrospect we have this general feeling that engineers first do the calculations and make something work in theory before they make it work in practice. This makes us feel weird and funny about people deploying all those chatbots without knowing how they will act and what they will do, and having no answer whatsoever on whether they are good or not other than a few example queries and some “vibe-checking”.
How do we know that something is working well? If you go back to civil engineering, you can actually do the calculations for the structure and run a simulation of the whole building, which matches real life pretty closely. If you go to a car manufacturer, they have all those crash test dummies; that’s how they know their theorized safety really works in real life. You plan, you build, you run a battery of tests trying to cover all real-life scenarios, and you deploy, with a high success rate.
In software, it’s even easier. Computers follow completely predictable logic, that’s literally their definition, so unless you are in outer space, a computer program will execute the same way and behave the same way 100% of the time. This means that if we have a financial application, we can guarantee the transaction calculations are correct, not only by writing unit tests with real-world examples, but by deriving from first principles to prove the algorithm is correct if need be. How come software has bugs then? Well, because we can tolerate bugs; we are comfortable with a certain level of them, so we increased the complexity of our software enormously (the whole world is connected), and our testing and safety practices followed just enough behind.
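To make that contrast concrete, here is a minimal sketch of the kind of deterministic test traditional software allows; `apply_transaction` and its exact signature are hypothetical, purely for illustration:

```python
# A deterministic function: same inputs, same output, every single time.
from decimal import Decimal

def apply_transaction(balance: Decimal, amount: Decimal, fee: Decimal) -> Decimal:
    """Debit `amount` plus a flat `fee` from `balance`."""
    return balance - amount - fee

def test_apply_transaction():
    # Because the behavior is fully predictable, we can assert exact values.
    assert apply_transaction(
        Decimal("100.00"), Decimal("30.00"), Decimal("0.50")
    ) == Decimal("69.50")
```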
Then, in the past decade, as ML gained popularity, we started adapting to a certain level of uncertainty in technology, but still a measured uncertainty. When a classification model has 90% accuracy, you know how often to expect it to be wrong, and in what domain. When a regression model has a mean absolute error of $50K for predicting apartment prices in New York, you know that’s still a pretty damn accurate model.
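As a sketch of what that “measured” part looks like in practice (the toy labels and prices below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Classification: accuracy tells you how often to expect the model to be wrong.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(accuracy_score(y_true, y_pred))  # 0.9, i.e. wrong about 1 time in 10

# Regression: mean absolute error tells you how far off you are on average.
true_prices = [850_000, 1_200_000, 640_000]
pred_prices = [820_000, 1_260_000, 600_000]
print(mean_absolute_error(true_prices, pred_prices))  # ~43,333 dollars off on average
```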
But as AIs get more and more general, executing more and more tasks, how do you evaluate how good they are? You cannot cover all real-life scenarios; if you could, you wouldn’t need a general-purpose agent in the first place.
Well, it turns out that, besides technology, there was something else all along with wildly unpredictable performance: humans. Yes, we have been managing billions of these completely unpredictable creatures for a while now. They have an advantage, though: they are very general and can execute just about every type of task there is.
How can you uniformly evaluate humans? Is there even a way? Well, it turns out there is, and we do it all the time with children: by scoring their school tests, and even with standardized exams at a national level like the SATs, or law school and medical school exams.
This is pretty much what we have been doing with AI now: we give them standardized benchmarks, and as they approach human level and benchmarks quickly become obsolete, we come up with more and more benchmarks to somehow measure how good they are, even the actual SATs we give humans.
But both you and I know that’s not enough. If it were, why would companies still interview candidates before hiring them as employees? If they passed all the standardized tests, why do we still need to talk to them? What are companies doing there? Well, they are vibe-checking, of course!
As the work becomes more general and more nuanced, so does the evaluation of what good or bad means, and the effort to capture all of it increases exponentially. Once you remove the biggest uncertainties, what’s left brings steeply diminishing returns. That’s why companies don’t keep evaluating forever: they deploy employees, put them on probation, put some guardrails around them, and off they go.
This is not to say that standardized exams are not useful. Of course they are; that’s why all LLMs compete on the leaderboards to prove their value, and why companies prefer to hire employees from highly regarded institutions rather than little-known ones. Before you have any insight into how well it is going to work out for your specific scenario, having higher priors is a safer bet, and then you vibe-check on top of that.
That’s what a lot of companies have been doing when deploying their LLMs right now, but not without guilt: everybody is ashamed to admit that they are really just vibe-checking. It makes us uncomfortable; there is a lack of confidence, a lack of control, it’s an unpredictable employee posing too high a risk to brand reputation.
So I wanted to build up to this thought: in many ways an LLM is like an employee of yours. You give it the job description, you interview it with some test scenarios, and then you deploy it, without knowing what it is going to say or do. It’s okay, it’s not a problem, as long as you also add some guardrails around it to keep it in check with respect to the most important rules of your business, and monitor how it is doing its work, as a manager does: evaluating its performance and giving feedback to improve (iterating), not only based on cold, hard metrics, but also on further vibe-checking for constant growth (with regression-prevention tests being possible with LLMs, as sketched below).
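Here is a minimal sketch of what those guardrails and regression-prevention tests could look like; `call_llm`, the banned-topics list, and the refund-policy expectation are all hypothetical placeholders, not any specific library’s API:

```python
# Hypothetical placeholders, for illustration only.
BANNED_TOPICS = ["competitor pricing", "legal advice"]

def call_llm(prompt: str) -> str:
    # Stand-in for whatever client you use to reach your model.
    raise NotImplementedError

def guarded_answer(prompt: str) -> str:
    """A simple keyword guardrail layered on top of the raw model output."""
    answer = call_llm(prompt)
    if any(topic in answer.lower() for topic in BANNED_TOPICS):
        return "Sorry, I can't help with that. Let me connect you to a human."
    return answer

# Regression-prevention test: once a prompt is known to behave well (by vibe-check!),
# pin the expectation so future model or prompt changes don't silently break it.
def test_refund_policy_mentions_30_days():
    answer = guarded_answer("What is your refund policy?")
    assert "30 days" in answer
```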
On one hand, being machines, AIs can of course scale and be way more impactful than a single individual, for better or for worse, so we justifiably hold them to higher standards than we would most people. On the other hand, being machines, there is maybe a lot we can automate moving forward to raise those standards in a scalable way, as we expect. Meanwhile, iterative vibe-checking is still a very valid way for our brains to align on whether general-purpose machines are good or not for our specific scenario.