In the 20th century, energy efficiency was one of the most important factors determining whether a factory would make a profit: a higher-than-expected electricity bill could lead to a serious financial crisis, halting all production. In the 2010s, this was replaced by “cloud costs”. A very high AWS bill, for example, could be the death knell for a startup.
Now, these have been replaced by “AI model costs”, and no, this isn’t related to the “training costs” of the model used. The problem with AI models is recurrence: many minor design issues can lead to extremely large operating expenses. Let’s take a closer look at this and see how it affects startups.
Cloud Costs vs Model Costs
In the cloud model, costs were simply tied to the number of users. A startup offering services via the cloud would have to migrate to a larger infrastructure as the number of users increased, which would increase bills. An unexpected decrease in the number of users or a sudden increase in service fees could become a serious profitability problem.
The problem with the AI model, however, isn’t the number of users. Even a single user can put as much strain on the infrastructure as a hundred users, because this model differs from others in the following ways:
- The load on the infrastructure is determined by user actions, not the number of users.
- Even a single user can make hundreds of calls in a very short time, especially if they’re chatting with the AI.
- Unfortunately, this is unpredictable, as each user’s interaction is different from the next.
- For the same reason, the actual cost is determined by the content, UX design, and token length.
| WHAT IS TOKEN LENGTH IN AI? | Every question you ask an AI or every text you write consists of a certain number of characters. Roughly, every 4 characters correspond to 1 token. A paragraph of 75-80 words is approximately 100 tokens. The more tokens, the more processing power the AI uses to understand, process, and answer that query. |
Even a single sloppy prompt can inflate the bill unexpectedly. In the traditional cloud model, startups don’t need to worry about service costs until the number of users reaches a certain threshold, and by then, they’ve already reached a certain level of profitability. In the AI model, the situation is the opposite: from day one, your infrastructure can be strained beyond expectations. Even a startup with just 200 users may have to burn through thousands of dollars every month before it can even become profitable.
The Problem Is Not the Training Costs
Training an AI model can be expensive, but that’s not the problem. An AI startup will use an already trained model. Even if it wants to train its own model, it does so before launching its services, so these costs are already part of the planned budget—they’re not unexpected.
The problem begins when users start interacting with the AI. Every click a user makes carries a cost, and if the system isn’t properly configured, these costs climb even higher. For example, if a “premium” AI model is used even for trivial tasks, the infrastructure load increases further.
Unfortunately, this isn’t a problem with a simple solution. The classic engineering dilemma is simple: fast, cheap, and good—you can only choose two; you can’t have them all. In AI, this dilemma becomes the following: model quality, low latency, low cost—you can only choose two.
| WHAT IS LATENCY IN AI? | This is simply the response time to user queries. It determines how quickly you’ll get a response when you ask AI something. |
So, if you want a model as capable as ChatGPT and a response time of under one second, you’ll have to bear the cost. You could also opt for low cost and low latency, but then the model quality, and the results it delivers, would be subpar.
There may be no simple solution to this problem, but there is perhaps a way around it: innovative and creative startups can do this with “model routing.”
Load Balancing via Model Routing
In the early 2000s, web developers invented many tricks to ensure servers could handle incoming requests. These techniques, known as “load balancing,” distributed incoming traffic across multiple servers so that no single machine was overwhelmed. A similar balancing technique can be applied to AI: this is called “model routing.” We can explain what model routing is with examples:
- Using a small and inexpensive model to identify keywords.
- Switching to a mid-tier model for classification.
- Using a premium model for complex reasoning.
- Leveraging cached results for repetitive tasks.
The biggest problem AI startups face is using the same model for every type of query. This means both very simple and very complex tasks are often solved using the same model, creating a never-ending workload and increasing costs.
In a building with 100 employees, you don’t need to heat the entire building just because one person is cold—this would be prohibitively expensive. Simply placing a heater next to that employee would solve the problem cost-effectively. AI model routing is based on this very principle.
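Putting the routing ideas above together, a minimal sketch might look like the following. The model names, the `classify_complexity` heuristic, and the cache are all hypothetical illustrations, not a real provider’s API:

```python
# A minimal model-routing sketch: cheap model for simple queries,
# premium model only for complex reasoning, cache for repeats.
CACHE: dict[str, str] = {}

def classify_complexity(query: str) -> str:
    """Crude heuristic: short queries are simple; reasoning words mark complex ones."""
    if len(query) < 40:
        return "simple"
    if any(word in query.lower() for word in ("why", "explain", "compare")):
        return "complex"
    return "medium"

def route(query: str) -> str:
    """Pick the cheapest tier that can plausibly handle the query."""
    if query in CACHE:  # repeated task -> serve the cached answer, zero model cost
        return "cache"
    tier = classify_complexity(query)
    return {"simple": "small-model",
            "medium": "mid-tier-model",
            "complex": "premium-model"}[tier]

print(route("List 3 colors"))  # -> small-model
```

In production the classifier itself would usually be a small, cheap model rather than a keyword check, but the cost logic is the same: the expensive model is the last resort, not the default.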
Tokens Are the New Kilowatts
We mentioned above that one of the biggest problems for businesses in the 20th century was electricity bills. Tokens, in this sense, are the “kilowatt-hours” of the AI economy. Each prompt consumes a varying number of tokens, and compute (and thus electricity) is spent processing them. This means that the fewer tokens a prompt uses, the lower the bill.
Unmanaged, token costs can reach brutal levels. Therefore, token optimization could be another solution to the problems faced by the AI model. This solution could even be so effective that it could create a new field of work in its own right: in the near future, a profession called “cost engineering” could emerge, focusing solely on token optimization.
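As a back-of-the-envelope illustration of why token counts dominate the bill, consider a rough monthly estimate. The usage figures and the $0.01-per-1K-tokens price below are made-up examples, not real provider rates:

```python
# Back-of-the-envelope monthly token cost. All numbers are hypothetical.
def monthly_cost(users: int, prompts_per_user_per_day: int,
                 tokens_per_prompt: int, price_per_1k_tokens: float,
                 days: int = 30) -> float:
    """Total token spend: users x prompts x tokens x days, priced per 1K tokens."""
    total_tokens = users * prompts_per_user_per_day * tokens_per_prompt * days
    return total_tokens / 1000 * price_per_1k_tokens

# 200 users, 20 prompts a day, 500 tokens each, at $0.01 per 1K tokens:
print(monthly_cost(200, 20, 500, 0.01))  # roughly $600 per month
```

Halving `tokens_per_prompt` halves the bill, which is exactly the lever a cost engineer would pull first.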
This profession is different from “prompt engineering”. For the past few years, almost every AI startup has been hiring prompt engineers, and their primary goal is to make AI “smarter.” However, as mentioned above, smarter means higher costs. A cost engineer, on the other hand, would focus on keeping the system smart without additional costs: this could become one of the most in-demand professions in the near future.
Simple UX, Less Load
Another solution is to make better UX choices. UX design can be so convoluted that it drives up operating costs, because every user interaction carries a cost within the AI system. So, if a user needs 5 clicks instead of 1 to perform a certain action, and each click triggers an AI call, the cost to the system is 5x higher. Most AI startups don’t pay much attention to UX design and simply want it to look “as impressive as possible.” However, a UX that is “as simple as possible” can significantly reduce costs.
In a market where model costs are paramount to success, the winners won’t be the startups with the flashiest interfaces or the highest-quality AI models. On the contrary, the startups that think about model routing, apply token optimization techniques, and design their UX to put as little load on the system as possible are the ones that will survive.