Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
Rustic, homemade, amateur, etc. might align better with the tagging.
Original Prompt: "Man with Trapezoid Head"
AI Expansion:
Portrait of a man with a trapezoid-shaped head, sharp geometric facial structure, angular jawline wider at the top and narrowing toward the chin, realistic skin texture, detailed pores, dramatic studio lighting, ultra-detailed, 85mm lens, shallow depth of field, dark neutral background, cinematic, photorealistic, 8k resolution.
Note: Most people (outside the generative space) won't pick up on this, but in many cases, if you don't prompt otherwise, you'll often end up with a prompt that's better suited to older, keyword-based models like Stable Diffusion, which rely heavily on specific sets of positive and negative prompt keywords, more akin to magical incantations, to improve the output.

Because if I wanted a spiral of little "buttons" like the last one at the end (and they don't look very much like sweets), I'd be able to knock that out in Blender in an afternoon, and I'm not very good at Blender.
I think part of the problem is that pretty much all the tutorial material for Blender seems to be in video form, which is easily my least effective way to learn, even leaving aside the "I've only got one screen" issue.
If you were good at GLSL, you could maybe do it in that.
Someone somewhere is going to write something that directly draws it to a framebuffer in Brainfuck, you just know it, don't you?
1. Prompt to make SVG - review in browser, iterate.
2. Prompt to write image prompt - review in editor, refine.
3. Send to Gemini, get image.
So maybe 5-10 mins.
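For anyone who wants to script that loop instead of bouncing between apps, here's a minimal sketch. ask_llm() and the file names are hypothetical placeholders (swap in whatever LLM client you use), cairosvg is just one way to rasterize, and the Gemini call is elided since APIs vary:

    # Hypothetical sketch of the three steps above; ask_llm() stands in
    # for any chat-LLM call that returns plain text.
    import cairosvg  # pip install cairosvg

    def ask_llm(prompt: str) -> str:
        """Placeholder for your LLM client of choice."""
        raise NotImplementedError

    # 1. Prompt for an SVG layout; open layout.svg in a browser, iterate.
    svg = ask_llm("Output only an SVG, 1024x1024: the numbers 1-20 in a spiral.")
    with open("layout.svg", "w") as f:
        f.write(svg)

    # 2. Prompt for the image prompt separately; refine it in an editor.
    style_prompt = ask_llm("Write a concise image prompt: hand-made candies, warm light.")

    # 3. Rasterize the SVG and send image + style prompt to the image model
    #    (e.g. Gemini): the raster pins the composition, the text sets the look.
    cairosvg.svg2png(url="layout.svg", write_to="layout.png")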
I don’t know how to use Blender.
Also, this method can be done over WhatsApp/Telegram, which is another plus over the Blender-type approach.
Also, there might be other new 3D software with better UX. I am not a Blender fanboy, but I do love 3D art and graphics programming and want as many people as possible to get into it :^)
I’m surprised the image models aren’t already doing this, so I wanted to share, since I’m finding this so useful.
Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
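A rough sketch of that depth-guided setup, using diffusers' SDXL ControlNet depth pipeline (the model IDs, conditioning scale, and file names are illustrative assumptions, not the exact tooling described above):

    # Sketch: a depth map from a blocked-out 3D scene pins object placement
    # while SDXL handles the actual rendering.
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    depth = Image.open("scene_depth.png").convert("RGB")  # exported from the 3D tool
    image = pipe(
        "living room interior, couch against the far wall, photorealistic",
        image=depth,                        # the depth map constrains the layout
        controlnet_conditioning_scale=0.7,  # how strongly geometry is enforced
    ).images[0]
    image.save("layout_guided.png")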
It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.
That said, I spent plenty of time doing both, and yet it would probably take me a while to arrive at this approach. For some reason, the "draw a sketch, have a model flesh it out" approach got bucketed with Stable Diffusion in my mind, and multimodal LLMs with "take detailed content, make targeted edits to it". So I'm glad the OP posted it.
Another thing I’ve gotten very used to doing is avoiding the “one-shot” approach. If I generate something and don’t like the results, I bring it into Krita, move things around, redraw some elements, and then send it back in with instructions to just clean it up (remove any smudges or imperfections). The state-of-the-art models can do an astonishing job with that workflow.
I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.
That's a scary thought.
Hey Claude, why haven't you finished yet? ... Because the human I'm holding hostage hasn't finished the drawing yet.
Models don't have intelligence, even less so creative thinking.
But it's not universally true, particularly among artists working in the last 100 years or so. Certainly Jackson Pollock (whether one regards his work as good or not) didn't sketch out how he was going to distribute paint onto canvas. Another example is Morris Louis (and other "stain painters"), who didn't sketch out how he applied paint to canvas.
Your comment is largely correct; just pointing out that more than a few "decent artists" didn't (or don't) work that way.
Such is the LLM/GenAI craze: an entire article to show that it's nearly there, yet it's not, despite convoluted effort to make it just so on a very, very niche example.
But I'm foreseeing the opposite. This kind of tool use will soon be integrated and hidden, such that people will eventually say "see, we solved the problem that AI can't do 123+456, now we are really, really close to AGI." Yeah, no: with an AGI, it would have been the AGI itself that came up with needing a tool, building the tool, and then using the tool. But that's not what LLMs are. They are statistical machines that predict tokens. They are very good at it, but that's not AGI.
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
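A toy version of steps 1 and 2 might look like this (the shape type, sizes, and description template are arbitrary choices for illustration):

    # Steps 1-2: random shapes as an underdrawing, plus a matching description.
    import random
    from PIL import Image, ImageDraw

    W = H = 512
    img = Image.new("RGB", (W, H), "white")
    draw = ImageDraw.Draw(img)
    facts = []
    for _ in range(5):
        n = random.randint(0, 9)
        x, y = random.randint(32, W - 96), random.randint(32, H - 96)
        draw.rectangle([x, y, x + 64, y + 64], outline="black", width=3)
        draw.text((x + 28, y + 24), str(n), fill="black")
        facts.append(f"there is a square containing the number {n} near ({x}, {y})")
    img.save("underdrawing.png")  # step 3: feed this to an image+text-to-image model
    print("; ".join(facts))       # step 4: pair this text with the generated image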
But by using the LLM to generate code (the markup an SVG graphic is made up of), and then using a rasterized image of that SVG as an input to the diffusion model, the raster takes the place of the raw noise input and guides the denoising process of the diffusion model to put the numerical parts in the right spots.
The LLM is putting the SVG in the right order because the code that drives the SVG is just that, code, and the numerical order is easily defined there, even if it has to follow something like a spiral.
Edit: although LLMs may now also be using thinking modes, with their feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral-number SVG, and it did so interactively via thinking; the generated code is incredibly explicit about positions, so that must help. But the underlying two-step SVG-to-diffusion idea is the real key here.
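In concrete terms, the two-step idea can be approximated with an img2img pass seeded by the rasterized SVG (this uses diffusers; the model ID and strength value are illustrative, not from the article):

    # Sketch: start denoising from the LLM's rasterized layout instead of
    # pure noise. Lower strength keeps the numbers where the SVG put them;
    # higher strength lets the model repaint more freely.
    import torch
    from diffusers import AutoPipelineForImage2Image
    from PIL import Image

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16).to("cuda")

    layout = Image.open("layout.png").convert("RGB")  # rasterized LLM-written SVG
    result = pipe(
        "numbered artisanal candies arranged in a spiral, soft studio light",
        image=layout,
        strength=0.55,
    ).images[0]
    result.save("candies.png")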
It seems to be a very effective pattern. Curious if there are other examples out there. Or other names for this?
Still emotionally unrelatable, but it definitely produced something that matched the specifications where they were explicit and systematically enforced through deterministic means. For now I maintain that LLM limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
It should be fairly trivial to fix any logic errors in the structured output, too.
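For example, a deterministic check over the generated SVG (the file name and expected labels are made-up for illustration) can catch missing labels before the image model ever sees the layout:

    # Verify an LLM-generated SVG actually contains every expected label.
    import xml.etree.ElementTree as ET

    SVG_NS = "{http://www.w3.org/2000/svg}"
    root = ET.parse("layout.svg").getroot()
    found = {t.text.strip() for t in root.iter(f"{SVG_NS}text") if t.text}
    missing = {str(n) for n in range(1, 21)} - found
    if missing:
        print("regenerate or patch these labels:", sorted(missing))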
LLMs are evolving so fast I wouldn’t be surprised if this technique were no longer needed in <6 months.
At the end of the day we can get so much done just by breaking down a problem into smaller problems.
But won't it be fun when we can cloud-burst a grid/tiled sampling of multiple code implementations/architectures in parallel, and interactively explore, navigate, and blend points in the latent design space. Multiples embodying different trade-offs, styles, clarity vs performance, etc. Code as generative art. What might the software engineering equivalent of designer moodboards be?
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them on coding tasks: prompt to draft the outline, and then prompt to implement each function. There were even a few blog posts on HN showing off this approach, with terms inspired by animation work.
You can go from the highest level and all the way down to the lowest level with LLMs, you just have to work at it iteratively one level at a time.
People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.
The models now waste a vast number of neurons memorising the character counts of the entire English language, so that people can ask how many r's are in "strawberry" and check a tickbox in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how a model has encoded that ability in its neurons, you would never, ever trust any arithmetic it outputs, even if it seems to "know" it (unless it called a calculator MCP to get there).
There are fundamental limitations, but we're currently brute forcing ourselves through problems we could trivially solve with a different tool.
No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
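A quick way to see why per-token knowledge suffices (using tiktoken's cl100k_base vocabulary as an example; the model of course does this implicitly, not via lookup code):

    # Count r's from per-token knowledge alone: spell each token, count
    # within tokens, sum. No per-word memorisation required.
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pieces = [enc.decode([t]) for t in enc.encode("strawberry")]
    print(pieces)                                     # e.g. ['str', 'aw', 'berry']
    print(sum(piece.count("r") for piece in pieces))  # 3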
Those "tolkens" humans "count" are translated to a ~2048 (depends on model) floating point vector.
bird => {mamal, english, noun, Vertebrate, aviant} has one r but what if you make it 20% more "french". Is is still 1 r? That could be the word "bird" in french, or it could be a french speaking bird or a bird species common in france.
If nearest neibour distance to the vocabulary of every language makes the vector no longer map to "bird"; then the amount of rs' must change, using a series of trained conditional checks (with some efficiency where languages have some general spelling patterns).
That is such an unreasonable amount of compute, that it is likley faar cheaper, easier and more reliable to train the model to memorise the output:
{"MCP":"python", "content":"len((c for c in 'strawberry' if c='r'))"}
The attention mechanism allows LLMs to learn these kinds of absurdly inefficient calculations. But we really shouldn't use LLMs where they're outperformed by trivial existing solutions.
Some limitations are not rigorously demonstrated to be fundamental, but have been continuously present since the earliest LLMs, yes. Shouldn't the burden of proof be on those who say it can be done?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
https://arxiv.org/abs/2401.11817
What they really prove is that it's impossible to extrapolate an unconstrained non-continuous function from a finite subset of its values. Good for them, I guess.
It's like saying that the no-free-lunch theorems prove that LLMs can't be the best optimizers, while what they prove (roughly) is that the best optimizer doesn't exist. That is, even people aren't the best optimizers, but we manage somehow, so LLMs can too.
"Specifically, we define a formal world where bungling is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably bungle if used as general problem solvers."
Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
People have an <opinion> which hasn't been rigorously proven, while <not rigorously proven counteropinion>.
As such, I am not sure what you're trying to achieve here.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.
There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.
Here's one: https://arxiv.org/abs/2401.11817