There is a trifecta of variables that just about everyone building generative AI systems cares about — cost, quality, and latency.
This is most noticeable in medium and large companies building products with large language models (LLMs). And it may be true of many other kinds of software system on the internet.
As is often the case in computing, there is a trade-off between these properties. And all three are bought with the currency of performance.
With a fixed budget of performance, you have to decide if you want your system to be good, be fast, or be affordable. If you’re clever, you can get a bit of all three in the right balance for your needs.
Sometimes, the most important aspect is cost — if you have a B2C app that runs on a freemium model and is triggered thousands of times a day by hundreds of thousands of users, it better be cheap as chips. But if you’re selling a $5000/seat legal co-pilot to law firms, it better be more reliable than the paralegal. And if you’re making an intelligent autocomplete tool for software developers, it needs to be lightning fast (and also pretty cheap to run).
But once you’re optimally located on the three-dimensional Pareto front, the only way to improve anything is to spend more performance. That’s an investment of engineering time and risk. Perhaps it’s not something you can even do in-house. Many are the companies built on GPT-3 who realise they need to spend their performance budget in a different dimension to the one OpenAI prioritises, but don’t have the expertise to do so.
But how to gain performance?
Sometimes the best strategy is just to wait and let the underlying technology get better. This is what happened with CPUs between 1970 and 2005. It’s what’s still happening with GPUs. It’s what happened with LLMs from GPT-2 to GPT-3 to GPT-3.5-turbo to GPT-4.
And in the meantime you work on making your UI better, understanding your customers, hiring, and raising capital.
But eventually, the curve hits a plateau. Or sometimes your timeline and use case mean you can’t afford to wait. In those cases, you typically have three options: optimise, scale vertically, or scale horizontally.