<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Reasoning Engine]]></title><description><![CDATA[Musings on LLMs for Software Engineers]]></description><link>https://reasoningengine.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!F4OX!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30fd94e4-bea0-409d-821d-4fcd2a0074c0_400x400.jpeg</url><title>The Reasoning Engine</title><link>https://reasoningengine.ai</link></image><generator>Substack</generator><lastBuildDate>Sat, 18 Apr 2026 00:49:55 GMT</lastBuildDate><atom:link href="https://reasoningengine.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rogerio Chaves]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[reasoningengine@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[reasoningengine@substack.com]]></itunes:email><itunes:name><![CDATA[Rogerio Chaves]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rogerio Chaves]]></itunes:author><googleplay:owner><![CDATA[reasoningengine@substack.com]]></googleplay:owner><googleplay:email><![CDATA[reasoningengine@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rogerio Chaves]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Data Flywheel]]></title><description><![CDATA[Using your production data to build better LLM products]]></description><link>https://reasoningengine.ai/p/data-flywheel</link><guid isPermaLink="false">https://reasoningengine.ai/p/data-flywheel</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Fri, 28 Jun 2024 10:21:04 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!4L8T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post was originally shared here: <a href="https://langwatch.ai/blog/data-flywheel">https://langwatch.ai/blog/data-flywheel</a></em></p><p>It comes as no shock to anybody when I say that data is one of the most valuable assets a company can have; this has been true for some years now. Valuable data can be used to build very powerful products, and with the advent of LLMs, it became even more obvious that companies that already hold valuable data internally can leverage AI to create a bigger edge.</p><p>But internal data is not what I want to talk about today. I want to talk about the data produced as a byproduct of your AI being in production: all the interactions your users have with it, with both happy and unhappy outcomes. This is extremely valuable data; it is literally the best insight into how users expect your product to behave, and since it comes in conversational text format, not just numbers like traditional product analytics, it can feed directly back into the AI.</p><p>The thing with LLMs is that they are extremely powerful and general; this is the exact reason we want to use them: they can answer any question and execute a wide range of tasks. But precisely because of that, their surface area of real-world usage is extremely large, and the more they do, the larger it gets, resulting in countless different ways they can fall short. 
It is hopeless to try and predict all the ways your users will want to use your product, and try covering all the possibilities and edge cases beforehand.</p><h4>How can a Data Flywheel help LLM apps?</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4L8T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4L8T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 424w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 848w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 1272w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4L8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png" width="928" height="844" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:928,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48115,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4L8T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 424w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 848w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 1272w, https://substackcdn.com/image/fetch/$s_!4L8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7ff8160-4aa0-4ee9-ae13-1ce0a29e4424_928x844.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We call it a flywheel when an improvement brings business value that then loops back into further improvements. Since the number of possibilities of an LLM product is too vast - think of all the possible ways a user can try using your product, or even compose a sentence - nothing is better at selecting what is most valuable and relevant than the real world. This makes it essential to monitor the product and gather insights on where it is suboptimal, or where it could bring even more value.</p><p>Then, by using those insights to improve the product and make it more valuable, it receives more usage and more users; it gains more trust and covers more cases, so it can be used in more places. More usage reveals even more scenarios, new paths to bring even more value, and new shortcomings to overcome that make the product more robust, which leads to even more usage, and so on. 
That&#8217;s the flywheel.</p><p>The advantage AI has over traditional software improvements is that this connection is even more direct: LLMs can also be used to evaluate, classify and synthesize the data that will improve the product, so the flywheel is closed even tighter.</p><p>There are, however, no shortcuts. If your solution has been uncovering new, unexpected edge cases for months or years, it is not possible for someone else to simply guess what they are and port them over without going through the same discovery process. The improvement process you have done on your product for your specific domain thus creates an edge that sets you apart from others as soon as you start it.</p><p>This is good news, if you are doing it.</p><h4>LangWatch</h4><p>I started building <a href="https://langwatch.ai/">LangWatch</a> on top of this realization, to help you maximize your data flywheel. Our platform helps both on the collection side, with monitoring, evaluations, insights, user feedback and annotations from the domain experts on your team, and on the improvement side, running experiments and validating and guaranteeing the improved quality of each iteration.</p><p>If you are interested, help us out with a <a href="https://github.com/langwatch/langwatch">star on GitHub</a> &#11088;</p>]]></content:encoded></item><item><title><![CDATA[Adventures with DSPy]]></title><description><![CDATA[LLM development getting to its maturity stage]]></description><link>https://reasoningengine.ai/p/adventures-with-dspy</link><guid isPermaLink="false">https://reasoningengine.ai/p/adventures-with-dspy</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Sat, 01 Jun 2024 14:22:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VMeh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>In the past few weeks I&#8217;ve become obsessed with <a href="https://dspy-docs.vercel.app/">DSPy</a>; when I first saw the video <a href="https://www.youtube.com/watch?v=41EfOY0Ldkc">DSPy Explained</a> by Connor Shorten, everything immediately clicked.</p><p>This is the LLM framework I was waiting for, the idea I was trying to get to but didn&#8217;t know exactly how to reach, and oh, how coherent all the pieces are! It&#8217;s a great conceptual abstraction; I noticed myself not needing the docs all the time, even though the library is not properly typed right now, because things just *make sense* together.</p><p>If you haven&#8217;t dug into it yet, DSPy&#8217;s main selling point is that you just define your pipeline declaratively, without worrying about the prompts to the LLM, and then it can optimize the prompts for you automatically. And of course: why should humans be doing the prompt engineering, when we have the best machines for the job: LLMs and trial-and-error.</p><p>And that&#8217;s kinda obvious, right? Of course I thought of that before, and of course a lot of people thought of that before: having another LLM improve your prompts. In fact, we already have plenty of tools for it, like <a href="https://promptperfect.jina.ai">PromptPerfect</a>. However, things still felt iffy, kinda unreliable, relying just on feeling, or on a bit of uncanny abstraction. 
But DSPy looked at PyTorch for inspiration, and that, it turns out, was the right abstraction.</p><p>Many devs have ventured into this problem, and different angles have been attempted. I personally took a <a href="https://github.com/langwatch/langevals/#unit-test-evaluations-with-pytest">TDD approach</a>, which I still think is the right one for sanity checks; I&#8217;ve also tried a <a href="https://github.com/rogeriochaves/langstream">Reactive Streams</a> FP-style approach; some invented this weird <a href="https://www.langchain.com/">&#8220;chains&#8221; concept</a>, which nobody really likes, and are now attempting a <a href="https://python.langchain.com/v0.1/docs/langgraph/">Graphs</a> angle.</p><p>But it turns out that coming from the machine learning angle with a PyTorch-like abstraction, as DSPy did, was the actual right approach, the one that makes me and every dev I talk to go &#8220;oooh, that is genius! It makes a lot of sense&#8221;, with a sense of relief that someone finally had the right jab at the problem and you&#8217;ll be able to build something reliable.</p><p>Of course, DSPy doesn&#8217;t seem to be the best fit for every problem right now; it makes the most sense for problems with well-defined discrete outputs, like traditional supervised learning, for example when you want to use an LLM as a classifier. 
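</p><p>To make that concrete, here is a toy sketch of the core idea in plain Python. This is deliberately not DSPy&#8217;s real API, and the &#8220;LLM&#8221; is a stubbed-out keyword matcher so the example runs offline: you define the task by labeled examples and a metric, and let trial and error pick the winning prompt.</p>

```python
# Toy illustration of metric-driven prompt optimization (NOT DSPy's real API).
# A stub stands in for the LLM so the sketch runs offline.

def make_fake_llm():
    """Stand-in for a real LLM call; keyword matching keeps the demo offline."""
    def fake_llm(prompt, text):
        if "sentiment" in prompt.lower():
            positive = {"great", "love", "excellent"}
            return "positive" if set(text.lower().split()) & positive else "negative"
        return "unknown"  # an off-task prompt yields useless answers
    return fake_llm

def optimize_prompt(llm, candidates, trainset):
    """Score each candidate prompt against labeled examples, keep the best one."""
    def accuracy(prompt):
        return sum(llm(prompt, x) == y for x, y in trainset) / len(trainset)
    return max(candidates, key=accuracy)

llm = make_fake_llm()
trainset = [("I love this", "positive"), ("terrible service", "negative")]
candidates = ["Summarize the text.", "Classify the sentiment of the text."]
print(optimize_prompt(llm, candidates, trainset))  # the sentiment prompt scores 2/2 and wins
```

<p>Real DSPy optimizers are far more sophisticated (bootstrapping demonstrations, proposing instructions), but the shape of the loop is the same: candidate prompts, a trainset, a metric, and a search for the best fit.</p><p>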
However, it is a clear step in the right direction, and you can see a near future where more &#8220;feeling-based&#8221; evaluators will also be very effective and ubiquitous, for example if you are writing a children&#8217;s stories chatbot and want to reliably evaluate how boring or exciting a story is.</p><p>The main value DSPy brings to the industry, then, is not (only) that it optimizes prompts for you, allowing you to switch models whenever you like, but that it actually turns the problem on its head: it forces you to change your mindset, thinking first about what a good output means, with examples (like TDD promotes!), before the prompt, freeing you to think at a more abstract level about the structure, and letting the machine do the machine-thing of finding the best fit for the problem at hand.</p><p>Being so bullish on DSPy, I really want to help it grow and contribute to the community, because I think it has a big future and it is the way ahead. And I mean conceptually, even if something else with a different name were to supersede DSPy.</p><p>It turns out I already have a startup in the LLM space, <a href="https://langwatch.ai/">LangWatch</a>, so as I played with DSPy for LangWatch experiments, I naturally noticed a missing piece that could help me a lot. 
Thus, DSPy Visualizer was born.</p><h2>DSPy Visualizer</h2><p>DSPy Visualizer is open-source, and it allows you to log your DSPy training sessions, track the performance, costs, compare runs and debug them in detail.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VMeh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VMeh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 424w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 848w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 1272w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VMeh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png" width="1456" height="900" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VMeh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 424w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 848w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 1272w, https://substackcdn.com/image/fetch/$s_!VMeh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfcaad3b-33f8-43be-ab1d-f50b79c9dd09_1573x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DSPy Visualizer</figcaption></figure></div><p>It&#8217;s still early days for DSPy or any kind of automated LLM optimizer, so it helps a lot to be able to understand what the optimizer is trying to do, where it is going, and especially how much it is costing, as too many examples and too-complex pipelines can make your OpenAI credits go down the drain in a heartbeat. 
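</p><p>A quick back-of-the-envelope shows why: a typical optimizer scores every candidate prompt against every training example, one LLM call each. Here is a sketch with purely illustrative numbers and pricing:</p>

```python
# Rough cost model for an optimization run; all numbers are illustrative.
def optimization_cost(trainset_size, candidate_prompts, tokens_per_call, price_per_1k_tokens):
    """Each candidate prompt is scored against each example, one LLM call apiece."""
    calls = trainset_size * candidate_prompts
    return calls * tokens_per_call / 1000 * price_per_1k_tokens

# 200 examples x 30 candidate prompts x ~1500 tokens/call at $0.01 per 1k tokens:
print(f"${optimization_cost(200, 30, 1500, 0.01):.2f}")  # prints $90.00
```

<p>Two hundred examples and thirty candidates already means six thousand calls, which is why watching the spend per run matters.</p><p>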
Even though <a href="https://reasoningengine.ai/p/llms-price-reduction-timeline">LLMs are getting cheaper</a>, we are pushing the automation limits here.</p><p>Iteration is the name of the game with DSPy, although it is a much more comfortable, reliable iteration than &#8220;classical&#8221; prompt engineering, of course. We found that looking at the examples being used, the LLM calls being made, and what is happening across all those optimization attempts to push the score up or down gave us lots of ideas about what to try next, or about what is wrong, so we can stop an iteration early and fix it.</p><p>I&#8217;ve posted an official blogpost on the DSPy Visualizer on our <a href="https://langwatch.ai/blog/dspy-visualizer">LangWatch blog</a>; if you want to try it out, check out the <a href="https://docs.langwatch.ai/dspy-visualization/quickstart">quickstart guide</a>.</p><p>If you are in for more DSPy content, <a href="https://twitter.com/_rchaves_">follow me on twitter</a> and subscribe to this substack!</p>]]></content:encoded></item><item><title><![CDATA[In defense of vibe-checking]]></title><description><![CDATA[As AIs approach humans, so does our evaluation of how well they are working.]]></description><link>https://reasoningengine.ai/p/in-defense-of-vibe-checking</link><guid isPermaLink="false">https://reasoningengine.ai/p/in-defense-of-vibe-checking</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Tue, 02 Apr 2024 10:30:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DPM5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DPM5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DPM5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!DPM5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DPM5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DPM5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DPM5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg" width="577" height="432" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:577,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89714,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DPM5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!DPM5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DPM5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DPM5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2873a72-6a08-4eea-9244-9376fe410120_577x432.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As AIs approach humans, so does our evaluation of how well they are working. &#8220;Vibe-checking&#8221; is the act of evaluating an LLM and deploying it to production just by eyeballing and *feeling* whether it is good enough. But is that a problem?</p><p>Engineering is regarded as applied science to solve practical problems. We think of it like applied physics, where knowledge is derived from first principles; in reality, it looks much more like the natural sciences, where constant tinkering leads us to new discoveries. Just look at all the recent advances in AI and the papers coming out of them: they are all about tinkering, trying stuff until it works. Attempts to explain *why* it works come much later.</p><p>Even though this has always been the case in all areas of engineering (or do you think humans only started building houses *after* we had geometry? Nope, practice comes first, theory after), we have this general feeling, in retrospect, that engineers first do the calculations and make something work in theory before they do it in practice. This makes us feel weird and funny about people deploying all those chatbots without knowing how they will act and what they will do, with no answer whatsoever as to whether they are good or not beyond a few query examples and some &#8220;vibe-checking&#8221;.</p><p>How do we know that something is working well? If you go back to civil engineering, you can actually do the structural calculations and a simulation of the whole building, which matches real life pretty closely. If you go to a car manufacturer, they have all those crash test dummies; that&#8217;s how they know their theorized safety really works in real life. You plan, you build, you run a battery of tests trying to cover all real-life scenarios, and you deploy, with a high success rate.</p><p>In software, it&#8217;s even easier. 
Computers follow a completely predictable logic, that&#8217;s literally their definition, so <a href="https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors">unless you are in outer space</a>, a computer program will execute the same way and behave the same way 100% of the time. This means that if we have a financial application, we can guarantee the transaction calculations are working correctly, not only by having real-world examples in the unit tests, but by deriving from first principles to prove the algorithm correct, if need be. How come software has bugs, then? Well, because we can tolerate bugs; we are comfortable with a certain level of them, so we increased the complexity of our software by a lot (the whole world is connected), and our testing and safety practices followed just closely enough behind.</p><p>Then, in the past decade, as ML gained popularity, we started to adapt to some level of uncertainty in technology, but still a measured uncertainty. When a classification model has 90% accuracy, you know how often to expect it to be wrong, and in what domain. When a regression has a mean absolute error of $50K for predicting apartment prices in New York, you know that&#8217;s still a pretty damn accurate model.</p><p>But as AIs get more and more general, executing more and more tasks, how do you evaluate how good they are? You cannot cover all real-life scenarios; if you could, then you wouldn&#8217;t need a general-purpose agent.</p><p>Well, it turns out that, besides technology, there was something else all along with highly unpredictable performance: humans. Yes, it turns out we have been handling billions of those completely unpredictable creatures for a while now. They have an advantage, though: they are very general and can execute just about every type of task there is.</p><p>How can you uniformly evaluate humans? Is there even a way? 
Well, it turns out there is; we do it all the time with children, scoring their school tests, and we even have standardized exams at a national level, like the SATs, or the law school and medical school exams.</p><p>This is pretty much what we have been doing with AI now: we give models standardized benchmarks, and as they approach human level and the benchmarks quickly become obsolete, we come up with more and more benchmarks to somehow measure how good they are, even giving them the actual SATs we give humans.</p><p>But both you and I know that&#8217;s not enough. If it were, why would companies still interview candidates before hiring them as employees? If they passed all the standardized tests, why do we still need to talk to them; what are companies doing there? Well, they are vibe-checking, of course!</p><p>As the work becomes more general and more nuanced, so does the evaluation of what good or bad means, and the effort to capture all of it increases exponentially. As you remove the biggest uncertainties, what&#8217;s left brings steeply diminishing returns. That&#8217;s why companies don&#8217;t keep evaluating forever: they deploy employees, put them on probation, put some guardrails around them, and off they go.</p><p>This is not to say that standardized exams are not useful, of course they are; that&#8217;s why all LLMs compete on <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">the leaderboard</a> to prove their value, and why companies prefer to hire employees from highly regarded institutions rather than little-known ones. 
Before you have any insight into how well it is going to work out for your specific scenario, higher priors are a safer bet, and then you vibe-check on top of that.</p><p>That&#8217;s what a lot of companies have been doing when deploying their LLMs right now, but not without guilt: everybody is ashamed to admit that they are really just vibe-checking. It makes us uncomfortable; there is a lack of confidence, a lack of control, it&#8217;s an unpredictable employee with too high a risk for brand reputation.</p><p>So I wanted to build up to this thought: in many ways an LLM is like an employee of yours. You give it the job description, you interview it with some test scenarios, and then you deploy it, without knowing what it is going to say or do. It&#8217;s okay, it&#8217;s not a problem, as long as you also add some guardrails around it to keep it in check on the most important rules of your business, and monitor how it is doing its work, as a manager does: evaluating its performance and giving feedback to improve (iterating), not only based on cold, hard metrics, but also on further vibe-checking for constant growth (with regression-prevention tests being possible with LLMs).</p><p>On one hand, being machines, AI can of course scale and be way more impactful than a single individual, for better or for worse, so we justifiably hold it accountable to higher standards than we would most people. On the other hand, being machines, there is maybe a lot we can automate moving forward to raise those standards in a scalable way. 
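That employee-with-guardrails setup fits in a few lines of code. This is a minimal sketch, assuming a hypothetical `call_llm` stand-in for any model API and an illustrative forbidden-topics rule; it is not a real guardrail library:

```python
# Sketch of the "employee with guardrails" pattern: answer, check the
# hard business rules, and log everything for later review / evals.
# `call_llm`, FORBIDDEN, and the canned reply are all illustrative.

reviewed: list[tuple[str, str]] = []  # transcript log, vibe-checked later

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model provider.
    return "Our refund policy allows returns within 30 days."

FORBIDDEN = ("legal advice", "medical advice")

def passes_guardrail(answer: str) -> bool:
    """Hard business rule the 'employee' must never break."""
    return not any(topic in answer.lower() for topic in FORBIDDEN)

def answer_user(prompt: str) -> str:
    answer = call_llm(prompt)
    if not passes_guardrail(answer):
        return "Sorry, I can't help with that one, let me find a human."
    reviewed.append((prompt, answer))  # monitored, like a manager reviewing work
    return answer
```

The `reviewed` list is the hook for the manager role: it is what you later score with metrics, regression tests, or plain vibe-checking.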
Meanwhile, iterative vibe-checking is still a very valid way for our brains to align on whether general-purpose machines are good or not for our specific scenario.</p>]]></content:encoded></item><item><title><![CDATA[LLMs price reduction timeline]]></title><description><![CDATA[The fastest price reduction in technology]]></description><link>https://reasoningengine.ai/p/llms-price-reduction-timeline</link><guid isPermaLink="false">https://reasoningengine.ai/p/llms-price-reduction-timeline</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Wed, 06 Mar 2024 10:30:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4f3ab7c9-1521-4ee5-a11a-7dba44c8af93_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As ChatGPT launched the world into the AI and LLM race, enormous amounts of effort and capital went into improving the technology, not only in quality but also in speed and price. 
As I wrote earlier, LLMs saw an <a href="https://reasoningengine.ai/i/140093213/sharp-decline-of-cost">86.5% reduction in cost just in 2023</a> (Mar/2023 to Nov/2023) if we take GPT-3.5-turbo as a baseline (and if we go a step before turbo, to plain GPT-3.5 on Nov/2022, this actually jumps to a 92.5% reduction for the first year!). For contrast, IT equipment historically dropped &#8220;just&#8221; <a href="https://en.wikipedia.org/wiki/Moore%27s_law#Density_at_minimum_cost_per_transistor">23% per year in its best years</a>; even technologies that faced impressively sharp cost declines, like storage, solar, gene sequencing or LED bulbs, were not that fast. </p><p>And the price cuts continue steadily as new challengers try to take both OpenAI&#8217;s and NVIDIA&#8217;s crowns. Sure, a lot of the price cuts might come from VCs&#8217; and Big Tech&#8217;s deep pockets, but given that open-source LLMs are definitely here to stay, predatory pricing just to gain a monopoly might not be the best investment, and providers need to recover the costs somehow. So it certainly feels like the price cuts are in large part due to real optimizations, not just hype dynamics, and the papers coming out provide good supporting evidence for that.<br></p><h3>Timeline with GPT-3.5 as a baseline<br></h3><p><strong>Nov/2022</strong> - <a href="https://en.wikipedia.org/wiki/GPT-3#GPT-3.5">OpenAI launches GPT-3.5 (text-davinci-003)</a> for <a href="https://platform.openai.com/docs/deprecations/instructgpt-models">$20 / 1M tokens</a></p><p><strong>Mar/2023</strong> - <a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">OpenAI launches GPT-3.5-turbo for $2 / 1M tokens</a>, a 10x reduction</p><p><strong>Jun/2023</strong> - <a href="https://openai.com/blog/function-calling-and-other-api-updates">OpenAI reduces GPT-3.5-turbo input cost by 25%</a>, to $1.5 / 1M tokens, keeping output at $2 / 1M tokens</p><p><strong>Nov/2023</strong> - <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">OpenAI announces at its DevDay another reduction in GPT-3.5-turbo input cost</a>, to $1 / 1M tokens (output still at $2 / 1M tokens), while increasing the context window to 16K</p><p><strong>Dec/2023</strong> - <a href="https://mistral.ai/news/mixtral-of-experts/">Mistral announces Mixtral 8x7B</a>, matching GPT-3.5&#8217;s performance at $0.7 / 1M tokens. 
Given its open-source nature, providers rush to host it at lower prices, with <a href="https://twitter.com/JosephJacks_/status/1735756308496667101">DeepInfra offering it at $0.27 / 1M tokens</a></p><p><strong>Jan/2024</strong> - <a href="https://openai.com/blog/new-embedding-models-and-api-updates">OpenAI announces another GPT-3.5 price reduction</a>, to $0.5 / 1M tokens input and $1.5 / 1M tokens output (lower than Mistral for input, but no longer the lowest at this quality)</p><p><strong>Feb/2024</strong> - <a href="https://wow.groq.com/news_press/groq-opens-api-access-to-real-time-inference/">Groq opens its API access with its new LPU engine</a>, offering a <a href="https://wow.groq.com/">lower price guarantee</a> per million tokens, with Mixtral 8x7B at the same $0.27 / 1M tokens, but at ~480 tokens/s</p><p><strong>Mar/2024</strong> - <a href="https://www.anthropic.com/news/claude-3-family">Anthropic announces the new Claude 3 family</a>, with its smaller Haiku model at GPT-3.5&#8217;s capability but half the input price: $0.25 / 1M tokens for input and $1.25 / 1M tokens for output</p><p>In summary, this represents a ~92.5% reduction from Nov/2022 to Nov/2023, and ~86.5% from Mar/2023 to Mar/2024. I&#8217;ll keep this timeline up to date; let&#8217;s see how it looks this November, and whether the cost reduction trend slows down or keeps accelerating!</p><p>Now, you might be wondering, do we even care about GPT-3.5 when GPT-4 is out there? 
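The summary percentages above follow directly from the listed price points. A quick sanity check, pairing Nov/2022's $20 with the $1.5 input price reached during 2023, and turbo's $2 launch price with DeepInfra's $0.27 for Mixtral, as the timeline does:

```python
def pct_reduction(old_price: float, new_price: float) -> float:
    """Percentage drop going from old_price to new_price."""
    return (old_price - new_price) / old_price * 100

# $ per 1M input tokens, taken from the timeline above
davinci_nov_2022 = 20.0   # GPT-3.5 (text-davinci-003)
turbo_2023 = 1.5          # GPT-3.5-turbo after the Jun/2023 cut
turbo_mar_2023 = 2.0      # GPT-3.5-turbo at launch
mixtral_deepinfra = 0.27  # Mixtral 8x7B hosted on DeepInfra

print(round(pct_reduction(davinci_nov_2022, turbo_2023), 1))       # → 92.5
print(round(pct_reduction(turbo_mar_2023, mixtral_deepinfra), 1))  # → 86.5
```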
Well, yes: it is quite a capable model and a good baseline for looking at price reductions, especially now that other models are catching up in performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://twitter.com/mattshumer_/status/1764738098389225759" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EycA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 424w, https://substackcdn.com/image/fetch/$s_!EycA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 848w, https://substackcdn.com/image/fetch/$s_!EycA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 1272w, https://substackcdn.com/image/fetch/$s_!EycA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EycA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png" width="338" height="437.27272727272725" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:572,&quot;resizeWidth&quot;:338,&quot;bytes&quot;:193158,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://twitter.com/mattshumer_/status/1764738098389225759&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EycA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 424w, https://substackcdn.com/image/fetch/$s_!EycA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 848w, https://substackcdn.com/image/fetch/$s_!EycA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 1272w, https://substackcdn.com/image/fetch/$s_!EycA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999c1c27-7f22-40ac-a422-7d0567db6120_572x740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, let&#8217;s also take a look at a timeline with GPT-4 as a baseline.<br></p><h3>Timeline with GPT-4 as a baseline<br></h3><p><strong>Mar/2023</strong> - OpenAI launched GPT-4 behind a waitlist for developers, <a href="https://en.wikipedia.org/wiki/GPT-4#Usage">costing $30 / 1M tokens for input and $60 / 1M tokens for output</a></p><p><strong>July/2023</strong> - <a href="https://openai.com/blog/gpt-4-api-general-availability">GPT-4 API becomes generally available to all paying customers</a></p><p><strong>Nov/2023</strong> - <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">OpenAI announces GPT-4-turbo at DevDay</a>, with a 128K context window and prices reduced to $10 / 1M tokens for input and $30 / 1M tokens for output</p><p><strong>Mar/2024</strong> - <a href="https://www.anthropic.com/news/claude-3-family">Anthropic seems to be the first to really beat GPT-4 with Claude 3 Opus</a>, but at a slightly saltier price of $15 / 1M tokens for input and $75 / 1M tokens 
for output. However, the runner-up Claude 3 Sonnet does beat GPT-4 in some benchmarks and gets quite close on others, at lower prices of $3 / 1M tokens for input and $15 / 1M tokens for output</p><p>That&#8217;s a ~33% to ~90% input price reduction from Mar/2023 to Mar/2024, depending on how you look at it.<br></p><p>Now, you might notice that Google&#8217;s Gemini is missing from these lists. This is because Google is being weird with their prices: saying that the <a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/">prices for Gemini Ultra are &#8220;coming soon&#8221;</a>, offering Gemini Pro <a href="https://blog.google/technology/ai/gemini-api-developers-cloud/">completely for free</a> (desperate much?), and losing public confidence by, for example, <a href="https://www.reddit.com/r/ChatGPT/comments/18c76c6/google_gemini_claim_to_outperform_gpt4_5shot/">comparing 5-shot with CoT</a> and trying to pull marketing stunts with <a href="https://mashable.com/article/google-gemini-demo-video-editing">very heavily edited demo videos</a>, and so on. When they can actually show comparable performance and prices without gimmicks, then we can include them as a good comparison point as well.<br></p><p>That&#8217;s it: GPT-4 seems like a tougher game to play, with Anthropic reaching it just now, but regardless, the LLM game doesn&#8217;t seem to be slowing down anytime soon. The remaining question is whether the GPU capacity freed up by these optimizations will be transformed into business value and cost recovery, or whether it means we can go stronger and bigger, to GPT-5 and beyond.<br><br>Cheers!</p>]]></content:encoded></item><item><title><![CDATA[The Reasoning Engine]]></title><description><![CDATA[Intelligence got cheaper]]></description><link>https://reasoningengine.ai/p/the-reasoning-engine</link><guid isPermaLink="false">https://reasoningengine.ai/p/the-reasoning-engine</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Wed, 27 Dec 2023 11:31:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2eebd25c-f930-42cb-a508-e2ca435edd9f_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the world of software development, we&#8217;re accustomed to connecting many components together to build complex systems, and those components have been stable for a while: you have a database as your storage engine for transactional data, a queue to transport data reliably and at scale to different parts of the system, load balancers and CDNs, and the other <a href="https://docs.aws.amazon.com/whitepapers/latest/aws-overview/introduction.html">200+ AWS services at your disposal</a>. However, never before have we had a component for reasoning: an engine we throw data at, it reasons for a bit, and gives an answer back. Now we do.</p><p>Some may argue that what LLMs do is not really reasoning; even so, they have definitely pushed the boundaries of what is possible to automate. 
As we automated away mechanical and repetitive tasks, we moved more and more toward becoming knowledge workers, doing non-repetitive, or at least very flexible, types of work, which cannot be described by a formula to be automated, or at least are not worth automating that way. Now, for the first time, we can also automate tasks that require knowledge and intelligence and that are somewhat non-repetitive, tasks that, we always said, required <em>reasoning.</em></p><p>For the first time we can, as part of our software, have a function call that executes a task specified in natural human language, a task that requires some level of intelligence, and have the computer carry it out for us instead of a fellow co-worker.</p><p>Let&#8217;s take the example of a retail e-commerce, which has a catalog of over 40 thousand items from various suppliers and adds a special &#8220;playful and funny&#8221; branding on top. To have this branding throughout the whole store, the e-commerce needs to rewrite the description given by the suppliers for each and every item. To do that, it pays a handsome amount to an agency, which has an army of humans rewriting those product descriptions in the business&#8217;s tone of voice. This costs the business millions per year, but there is no other way: rewriting requires a level of intelligence. Turns out, rewriting in a different tone of voice can now be automated with LLMs, and the cost can be brought down from a few hundred dollars per item to <strong>cents</strong>. Even if there is a trade-off in quality, it has to be huge to cover the difference in price, and as LLMs get smarter, the gap is definitely shrinking. You can now execute tasks that require intelligence at a cheaper price. 
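That rewrite task is small enough to sketch. Here the model call is abstracted behind a hypothetical `call_llm` parameter, and only the prompt construction is concrete; the tone string and prompt wording are illustrative, not a tested production prompt:

```python
# Illustrative sketch of the rewrite-in-our-tone-of-voice task.
# `call_llm` is a hypothetical stand-in for any LLM API.
from typing import Callable

TONE = "playful and funny"

def build_rewrite_prompt(supplier_description: str) -> str:
    return (
        f"Rewrite the following product description in a {TONE} tone of "
        f"voice, keeping every factual detail intact:\n\n{supplier_description}"
    )

def rewrite_description(supplier_description: str,
                        call_llm: Callable[[str], str]) -> str:
    return call_llm(build_rewrite_prompt(supplier_description))
```

Covering the whole 40-thousand-item catalog is then just a loop (or a batched job) over `rewrite_description`.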
Intelligence got cheaper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g2YY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g2YY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 424w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 848w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 1272w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g2YY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp" width="1024" height="672" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156460,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g2YY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 424w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 848w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 1272w, https://substackcdn.com/image/fetch/$s_!g2YY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa375514-d7eb-4e6b-a3e3-b2f2b8bb32db_1024x672.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Reasoning Engine as imagined by Dalle-3</figcaption></figure></div><p>Humans need not be displaced for the reasoning engine to add value either; automation is probably the most clich&#233; example, but there is actually much more to be gained in the accretion of new ideas: tiny gaps that require intelligence but would not be filled unless it were really cheap to do so.</p><p>For example, consider an online travel agency selling tickets to attractions and things to do around the world, which wants to help travelers have fun even if they got unlucky with the weather and their vacation fell under heavy rain. The agency doesn&#8217;t have information on which things would be fun to do in bad weather. They could ask the suppliers, but suppliers may have quite biased opinions of themselves. They could crowdsource it from their community, but it could take years until they have a decent amount of data worldwide to give out recommendations. 
They could ask their own travel experts and employees, or hire an agency, to mark each attraction as &#8220;good for heavy rain&#8221; or not, but with a hundred thousand products the task becomes incredibly boring and cost-prohibitive for such a minor feature. There is no return on investment, so this feature would never happen. Guess what: LLMs have world knowledge comparable to a travel expert&#8217;s, they can be good estimators of whether an attraction is good for rainy days or not, and they can carry out the task for a modest sum. Suddenly, this feature becomes ROI-positive and pops into existence.</p><p>Now where does traditional Machine Learning fit into this? This seems like a task that could easily be carried out by a classification model, and moreover, isn&#8217;t ML already this engine capable of more than just the logical processing of traditional software? Well, yes and no. Sure, it can carry out a task that requires some intelligence, but it doesn&#8217;t reason about the task itself, it just mindlessly executes this one thing. 
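A back-of-the-envelope estimate shows why the economics of the rainy-day labeling above flip with an LLM. The token count per item and the price are illustrative assumptions, not quotes from any provider:

```python
# Rough cost of labeling every attraction as "good for heavy rain" or
# not with an LLM. All numbers below are illustrative assumptions.

ITEMS = 100_000            # attractions in the catalog
TOKENS_PER_ITEM = 300      # instructions + short description + yes/no answer
PRICE_PER_1M_TOKENS = 0.5  # dollars, roughly GPT-3.5-class pricing

total_tokens = ITEMS * TOKENS_PER_ITEM
cost = total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS
print(f"${cost:.2f} for the whole catalog")  # → $15.00 for the whole catalog
```

Tens of dollars instead of an agency contract is what turns a never-built feature into an ROI-positive one.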
This means that for this specific example, unless there is some off-the-shelf attractions-to-rain classification model out there, there would also be a high cost associated with hiring data scientists to develop, train and run a model specific to your business, with timelines measured in months, making it still cost-prohibitive.</p><p>Those examples are just scratching the surface; the abstraction of having a Reasoning Engine anywhere can take software much further. Think of a system that dynamically integrates with any other system on the fly, simply generating the API integrations as users need them; or an exception handler that auto-recovers by understanding the cause of an issue and patching the data to get it to pass through; maybe UIs generated on the fly for user amusement; and a supervision-tree agent that quarantines and reviews all actions executed by the thousands of mini automated agents across the system, reporting to the human in charge. The possibilities are much larger than my creativity can contain.</p><h3>Sharp Decline of&nbsp;Cost</h3><p>The two biggest blockers to having Reasoning Engines more and more ingrained in our software are quality and cost. Quality is where the biggest challenge and the most popular race lie, of course; however, even with lower-quality models, we can play to computers&#8217; advantages: being easily scalable and never getting tired. 
Take software development, for example: even GPT-4 is often no better than a junior developer; however, there is no reason why we can&#8217;t employ an army of 1,000 almost-junior LLM developers to work 24/7 until they crack a problem. The main limiting factor right now is cost, and this has been in sharp decline.</p><p>This year, OpenAI announced a <a href="https://openai.com/blog/new-models-and-developer-products-announced-at-devday">3x reduction in GPT-4</a> cost even as they massively increased the context window size; there is a <a href="https://fortune.com/2023/11/19/sam-altman-was-fundraising-for-chip-venture-to-rival-nvidia-before-openai-ouster/">race starting</a> to try to outperform NVIDIA GPUs with <a href="https://www.etched.ai/">specialized transformer chips</a>; and open source is catching up on performance, with tiny models that run on your phone and architectural innovations such as mixture-of-experts reducing inference cost to ~25% for the same number of total parameters, as in the <a href="https://mistral.ai/news/mixtral-of-experts/">8x7B MoE model by Mistral</a>. 
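The ~25% figure follows from the routing itself. A rough sketch for Mixtral-style top-2-of-8 routing (this ignores the shared attention parameters, which make the real active fraction somewhat higher):

```python
# Per token, a sparse mixture-of-experts layer runs only the routed
# experts, so the inference FLOPs are roughly the active-expert
# fraction of a dense model with the same total expert parameters.

TOTAL_EXPERTS = 8
ACTIVE_EXPERTS = 2  # top-2 routing, as in Mixtral 8x7B

active_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS
print(active_fraction)  # → 0.25
```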
The race to the bottom is so intense that Google decided to provide <a href="https://blog.google/technology/ai/gemini-api-developers-cloud/">Gemini Pro completely for free</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://twitter.com/JosephJacks_/status/1735756308496667101" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P_4K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 424w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 848w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 1272w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P_4K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png" width="492" height="416.2391304347826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:934,&quot;width&quot;:1104,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:188522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://twitter.com/JosephJacks_/status/1735756308496667101&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P_4K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 424w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 848w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 1272w, https://substackcdn.com/image/fetch/$s_!P_4K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79c02536-4041-4c3d-8644-fb61e5b7ee3e_1104x934.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Consider that GPT-3.5, which launched the LLM race at the beginning of 2023, came at a cost of $2.00 / 1M tokens (advertised as <a href="https://www.reddit.com/r/OpenAI/comments/126mrss/am_i_doing_my_math_right_is_the_gpt_35_api_really/">$0.002 / 1K tokens</a>), while Mixtral 8x7B, released at the end of 2023, is <a href="https://mistral.ai/news/mixtral-of-experts/">comparable or even better</a> in performance and now costs $0.27 / 1M tokens at <a href="https://twitter.com/DeepInfra/status/1735468890413776932">DeepInfra</a>: an <strong>86.5% reduction in cost</strong> in just one year, an impressive feat.
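</p><p>As a quick sanity check on that figure (a sketch; the prices are simply the two quoted above):</p>

```python
# Prices per 1M tokens, as quoted above
gpt_3_5_price = 2.00   # GPT-3.5 at launch, early 2023
mixtral_price = 0.27   # Mixtral 8x7B via DeepInfra, late 2023

# Relative cost reduction over the year
reduction = (gpt_3_5_price - mixtral_price) / gpt_3_5_price
print(f"{reduction:.1%}")  # 86.5%
```

<p>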
For context, IT equipment historically got cheaper by &#8220;just&#8221; <a href="https://en.wikipedia.org/wiki/Moore%27s_law#Density_at_minimum_cost_per_transistor">23% per year in its best years</a>.</p><h3>LLM as the&nbsp;Kernel</h3><p>Taking this a step further, some people, most notably <a href="https://twitter.com/karpathy">Andrej Karpathy</a>, came up with the conceptual idea of an LLM OS: the LLM sits at the very core purely for its reasoning capabilities, similar to a CPU or our brain, with other components connected to extend past its limitations, such as longer-term memory, peripherals like audio and video, communication with the network and with other LLMs, and so on.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7KmJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7KmJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 424w, https://substackcdn.com/image/fetch/$s_!7KmJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 848w, https://substackcdn.com/image/fetch/$s_!7KmJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7KmJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7KmJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png" width="1456" height="824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:824,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7KmJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 424w, https://substackcdn.com/image/fetch/$s_!7KmJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 848w, https://substackcdn.com/image/fetch/$s_!7KmJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7KmJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77faa323-4bd5-465b-8cc1-89ed8c5c16c9_1600x906.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://youtu.be/zjkBMFhNj_g?si=OBJ_t0lFOc0VpvI9&amp;t=2535">LLM OS by Andrej&nbsp;Karpathy</a></figcaption></figure></div><p>This has at its very core the idea that the LLM is capable of the general reasoning needed to orchestrate all those elements, which is only true at a shallow level as of now, but as context windows and multi-modal
capabilities expand, models are getting better and better at it, already controlling <a href="https://github.com/rogeriochaves/driver">computers</a> and <a href="https://twitter.com/_akhaliq/status/1738050817100325354">mobile phones</a> while doing all the reasoning needed to carry out a task.</p><p>Kernel or not, at the core or at the edge of software, I cannot even begin to imagine what an engine capable of reasoning can lead to in the creative hands of developers, and I&#8217;m starting this blog precisely to follow and write about it as it develops.</p><h3>This Blog and Newsletter</h3><p>Hello, my name is Rogerio <a href="https://emojipedia.org/waving-hand">&#128075;</a></p><p>I&#8217;ve been a software engineer for 14 years now, and I&#8217;ve started this blog to write about the musings and developments of the Reasoning Engine, through the lens of a developer, as it eats software. It doesn&#8217;t really matter whether this will indeed be driven by LLMs or whether some better idea will replace them; the fact is, we won&#8217;t go back from here. The journey ahead is exciting for builders, and I will be following it closely.</p><p>My goal is to understand how devs are using LLMs, how both the foundational technology and the libraries and techniques on top of it advance, for dev productivity but especially when embedded in the software itself, and what possibilities will be unlocked that we could never have guessed before.</p><p><a href="https://reasoningengine.substack.com/welcome">Subscribe to the newsletter</a> if you want to enjoy the ride too, and <a href="https://twitter.com/_rchaves_">follow me on Twitter</a> if you prefer AI memes instead.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://reasoningengine.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div
class="preamble"><p class="cta-caption">Thanks for reading The Reasoning Engine! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is The Reasoning Engine.]]></description><link>https://reasoningengine.ai/p/coming-soon</link><guid isPermaLink="false">https://reasoningengine.ai/p/coming-soon</guid><dc:creator><![CDATA[Rogerio Chaves]]></dc:creator><pubDate>Tue, 26 Dec 2023 15:00:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!F4OX!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30fd94e4-bea0-409d-821d-4fcd2a0074c0_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is The Reasoning Engine.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://reasoningengine.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://reasoningengine.ai/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>