languid-photic 23 hours ago [-]
We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1]
Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside - This is also why TPS isn't a great metric).
We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
[1] https://voratiq.com/leaderboard?x=cost
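To make the metric concrete, here's a minimal sketch of "all-in cost per completed task" in R (not Voratiq's actual methodology; the prices and token counts are illustrative assumptions):

    # All-in cost of one task: sum token spend across every agentic turn.
    # Prices and token counts below are assumed, purely for illustration.
    price_in  <- 1.25 / 1e6    # $ per input token (assumed)
    price_out <- 10.00 / 1e6   # $ per output token (assumed)

    # One task = several turns; the context is re-sent and grows each turn.
    turns <- data.frame(
      input_tokens  = c(12000, 30000, 55000),
      output_tokens = c(900, 1500, 700)
    )

    task_cost <- sum(turns$input_tokens  * price_in +
                     turns$output_tokens * price_out)
    task_cost  # ~$0.15 for this task; per-token price alone can't tell you this

Two models with identical per-token prices can land at very different task costs if one burns more turns or more reasoning tokens.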
I feel that the recent iterations of LLMs haven't provided an intuitive qualitative leap. Have they entered a bottleneck period so quickly?
olao99 1 day ago [-]
For what it's worth, I find GPT 5.5 qualitatively different from 5.4 and 5.3.
If I had to collapse the nature of the difference into one sentence, it'd be that 5.5 does more of what I'm asking it to do, versus doing a small aspect of what I'm asking and then stopping.
5.4 required a lot of "continue" encouragement. 5.5 just "gets it" a bit more
What it boils down to for me is that even though it's more expensive, I would much rather use 5.5 on low than 5.4/5.3 on high/medium.
2ndorderthought 24 hours ago [-]
I am delighted to see the ceiling on small models increase exponentially. I think the "make models unsustainably large because the benchmark improved by 1%" practice is ending. I think whatever is boosting small models will be what makes LLMs actually useful. The main driver is research.
patates 24 hours ago [-]
Considering my use case (web apps), there already wasn't anything I couldn't do with Opus 4.5. The same will be, or already was, true for more people with other releases, and at some point, which may have already passed, most people will stop finding qualitative leaps.
This doesn't always mean that there is a bottleneck in terms of raw power, it may also mean that your use cases (or the lower hanging fruits among them) are already covered.
aurareturn 23 hours ago [-]
They likely entered the same compute-constrained scenario as Anthropic.
I.e., they had 100 compute units while demand is 200 units, so they have to do some combination of buying more compute, increasing prices, lowering limits, etc.
eiekek11 23 hours ago [-]
Bunch of nonsense.
If that were true, then they should all invest resources into projects that yield more efficient use of compute. The most efficient producer then gains a huge cost advantage AND the capacity to serve more… so yeah, that logic doesn't hold.
cyanydeez 23 hours ago [-]
capitalism convinced you that line goes up unless you don't let it eat all the resources.
helloplanets 24 hours ago [-]
Are you running gpt-5.5 on xhigh reasoning? Because I'm seeing a clear difference between that and gpt-5.4 on xhigh.
gchamonlive 1 day ago [-]
My take is that demand is also increasing, so maybe they are making incremental improvements to model quality while focusing on improving inference costs. Prices are increasing, though, because even if they achieve a very efficient model, they are still selling at a loss.
cyanydeez 23 hours ago [-]
it's a sigmoid, not a bottleneck.
SecretDreams 24 hours ago [-]
> Have they entered a bottleneck period so quickly?
So quickly - this industry has had trillions thrown around to get here so quickly, heh.
But, yes, capability seems somewhat stagnant. It's either iso-perf with cost improvements, or iso-cost with perf improvements, plus agentic capabilities.
This doesn't seem to be controlling for the number of turns in any way. Am I missing something?
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
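A toy cost model (numbers are my assumptions, not measured data) of why turn count can dominate: each turn re-sends the ever-growing context.

    # Toy model: context is re-sent (and grows linearly) every turn, while
    # output tokens are paid per turn. All numbers are assumptions.
    turn_cost <- function(n_turns, ctx_tokens, out_tokens, p_in, p_out) {
      sum(ctx_tokens * seq_len(n_turns) * p_in + out_tokens * p_out)
    }
    turn_cost(10, 20000, 1000, 1.25e-6, 1e-5)  # chattier model, 10 turns: ~$1.48
    turn_cost(4,  20000, 1500, 1.25e-6, 1e-5)  # stronger model, 4 turns: ~$0.31

Even with 50% longer replies per turn, the four-turn run comes out far cheaper, because the repeated context re-sends dwarf the output tokens.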
jfim 1 day ago [-]
They also don't mention what their sample size is, or anything about the distribution of input and response lengths.
It'd be interesting to see the distributions if the author actually plotted the data, so we could see if their analysis holds water or not.
Plotting the input lengths with ggplot2 geom_density (color and fill by model, 0.1 alpha, and an appropriate bandwidth adjustment) would show whether the input distributions look similar across the two models. Doing the same for the output lengths, faceted by input-length bins, would give us an idea of whether those look the same too.
Edit: Or even a faceted plot, over the same input bins, of the ratio output length / input length.
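Something like this, assuming a data frame d with columns model, input_len, and output_len (hypothetical names, since the author hasn't published the data):

    library(ggplot2)

    d$input_bin <- cut(d$input_len, breaks = 5)  # bin inputs for faceting

    # Input-length distributions, one density per model.
    ggplot(d, aes(input_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5)

    # Output-length distributions, faceted by input-length bin.
    ggplot(d, aes(output_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5) +
      facet_wrap(~ input_bin)

    # The edit above: ratio of output to input length, faceted the same way.
    ggplot(d, aes(output_len / input_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5) +
      facet_wrap(~ input_bin)

If the input densities overlap across models, the per-token comparison is at least apples to apples; if not, the analysis is confounded.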
sarjann 24 hours ago [-]
I think it should be tested on goals.
E.g. crack this puzzle, or fix this code so these tests pass. (A human can verify it doesn't cheese things.)
gertlabs 19 hours ago [-]
We observed slightly smaller outputs over long-horizon agentic coding for GPT 5.5, with a significant improvement in overall response scores. For one-shot coding responses, GPT 5.5 was actually more verbose than GPT 5.4, but again, the responses were significantly stronger. The expected cost increases reported by OpenRouter seem reasonably accurate (perhaps a bit optimistic), but in my opinion, highly worth it. GPT 5.5 has a pretty wide lead on the #2 model for understanding complex scenarios.
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.
New model releases are now like new iPhones--mostly imperceptible improvements with a higher price tag. That's one of the major benefits of open source: you can "freeze" the model you're using. Often the model you know wins over one that is different enough that you have to start from scratch with every major update. Most businesses require cost control and predictability over a cutting edge with limited evidence of profitable output outside of tech.
32dsfa 9 hours ago [-]
Really bad comparison.
If I skip 2 models of iPhone upgrades, there is definitely a difference in how the thing feels - and it feels worth the money.
If I skip 2 models of upgrades of the frontier models now, I highly doubt I can discern what the difference is and what exactly I'd be paying more for.
DeathArrow 3 hours ago [-]
In terms of work done per dollar, new models from OpenAI and Anthropic are worse than the older models. They are trying to squeeze the customers.
For personal use I switched to coding plans containing GLM 5.1, Kimi K2.6 and Xiaomi MiMo V2.5 Pro, and I've never been happier. I said goodbye to both Claude Max and Cursor.
degutemesgen 19 hours ago [-]
I do think recent models are too expensive to be used for customer-facing agentic workflows.
coalhouse 1 day ago [-]
It does seem like a step change in token efficiency, though based on the earlier Artificial Analysis reporting it's also quite the cost lottery, and I'm not sure I'm comfortable with that.
i_think_so 1 day ago [-]
Has any enterprising hacker here yet graphed price vs "output" over time since 2023, taking "quality" into account?
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.
helloplanets 1 day ago [-]
Quality would be performance against a given set of benchmarks, I assume?
There are multiple open-weight models you can run on a pretty standard computer at home that match the quality of GPT-4. I guess that would also change the equation.
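For instance, a rough sketch of that graph, assuming you hand-collect (release date, price, benchmark score) rows. Every number below is made up for illustration:

    library(ggplot2)

    # Hypothetical hand-collected data; dates, prices, and scores are made up.
    models <- data.frame(
      released     = as.Date(c("2023-03-01", "2024-05-01", "2025-08-01")),
      usd_per_mtok = c(60, 15, 10),   # blended $ per million tokens (assumed)
      bench_score  = c(40, 65, 80)    # score on some fixed benchmark (assumed)
    )

    # Dollars per benchmark point over time: a falling line = cheaper "quality".
    ggplot(models, aes(released, usd_per_mtok / bench_score)) +
      geom_line() +
      geom_point() +
      labs(x = "release date", y = "$/Mtok per benchmark point")

The hard part is the denominator: benchmarks saturate and get swapped out, so no single score stays comparable from 2023 to now.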
coalhouse 1 day ago [-]
Anything that compares proprietary models will be very miscalibrated and may not be indicative; there have been too many model changes in both the chat products and the API where model providers didn't say a word before it got too noticeable.