Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
Rustic, homemade, amateur, etc. might align better with the tagging.
Original Prompt: "Man with Trapezoid Head"
AI Expansion:
Portrait of a man with a trapezoid-shaped head, sharp geometric facial structure, angular jawline wider at the top and narrowing toward the chin, realistic skin texture, detailed pores, dramatic studio lighting, ultra-detailed, 85mm lens, shallow depth of field, dark neutral background, cinematic, photorealistic, 8k resolution.
Note: Most people (outside the generative space) won't pick up on this, but in many cases, if you don't prompt otherwise, you'll often end up with a prompt that's better suited to older, keyword-based models like Stable Diffusion, which rely heavily on specific sets of positive and negative prompt keywords, more akin to magical incantations, to improve the output.

Because if I wanted a spiral of little "buttons" like the last one at the end (and they don't look very much like sweets), I'd be able to knock that out in Blender in an afternoon, and I'm not very good at Blender.
I think part of the problem is that pretty much all the tutorial material for Blender seems to be in video form, which is easily my least effective way to learn, even leaving aside the "I've only got one screen" issue.
If you were good at GLSL, you could maybe do it in that.
Someone somewhere is going to write something that directly draws it to a framebuffer in Brainfuck, you just know it, don't you?
1. Prompt to make SVG - review in browser, iterate.
2. Prompt to write image prompt - review in editor, refine.
3. Send to Gemini, get image.
So maybe 5-10 mins.
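For anyone who wants to script that loop instead of bouncing between apps, here's a minimal sketch. ask_llm() and the file names are hypothetical placeholders (swap in whatever LLM client you use), cairosvg is just one way to rasterize, and the Gemini call is elided since APIs vary:

    # Hypothetical sketch of the three steps above; ask_llm() stands in
    # for any chat-LLM call that returns plain text.
    import cairosvg  # pip install cairosvg

    def ask_llm(prompt: str) -> str:
        """Placeholder for your LLM client of choice."""
        raise NotImplementedError

    # 1. Prompt for an SVG layout; open layout.svg in a browser, iterate.
    svg = ask_llm("Output only an SVG, 1024x1024: the numbers 1-20 in a spiral.")
    with open("layout.svg", "w") as f:
        f.write(svg)

    # 2. Prompt for the image prompt separately; refine it in an editor.
    style_prompt = ask_llm("Write a concise image prompt: hand-made candies, warm light.")

    # 3. Rasterize the SVG and send image + style prompt to the image model
    #    (e.g. Gemini): the raster pins the composition, the text sets the look.
    cairosvg.svg2png(url="layout.svg", write_to="layout.png")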
I don’t know how to use Blender.
Also, this method can be done over WhatsApp/Telegram, which is another plus over the Blender-type approach.
Also, there might be other new 3D software with better UX. I am not a Blender fanboy, but I do love 3D art and graphics programming and want as many people as possible to get into it :^)
I’m surprised the image models aren’t already doing this, so I wanted to share, since I’m finding this so useful.
Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
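A rough sketch of that depth-guided setup, using diffusers' SDXL ControlNet depth pipeline (the model IDs, conditioning scale, and file names are illustrative assumptions, not the exact tooling described above):

    # Sketch: a depth map from a blocked-out 3D scene pins object placement
    # while SDXL handles the actual rendering.
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    depth = Image.open("scene_depth.png").convert("RGB")  # exported from the 3D tool
    image = pipe(
        "living room interior, couch against the far wall, photorealistic",
        image=depth,                        # the depth map constrains the layout
        controlnet_conditioning_scale=0.7,  # how strongly geometry is enforced
    ).images[0]
    image.save("layout_guided.png")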
It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.
That said, I spent plenty of time doing both, and yet it would probably take me a while to arrive at this approach. For some reason, the "draw a sketch, have a model flesh it out" approach got bucketed with Stable Diffusion in my mind, and multimodal LLMs with "take detailed content, make targeted edits to it". So I'm glad the OP posted it.
Another thing I’ve gotten very used to doing is avoiding the “one-shot” approach. If I generate something and don’t like the results, I bring it into Krita, move things around, redraw some elements, and then send it back in with instructions to just clean it up (remove any smudges or imperfections). The state-of-the-art models can do an astonishing job with that workflow.
I believe Nano Banana and gpt-image-2 have a little of this going on, but it's like asking a model to one-shot some code vs having an agentic harness with tools do it. Even the most basic agent can produce better code than ChatGPT can.
That's a scary thought.
Hey Claude, why haven't you finished yet? ... Because the human I'm holding hostage hasn't finished the drawing yet.
Models don't have intelligence, even less so creative thinking.
But it's not universally true, particularly among artists working in the last 100 years or so. Certainly Jackson Pollock (whether one regards his work as good or not) didn't sketch out how he was going to distribute paint onto canvas. Another example is Morris Louis (and other "stain painters"), who didn't sketch out how he applied paint to canvas.
Your comment is largely correct; just pointing out that more than a few "decent artists" didn't (or don't) work that way.
Such is the LLM/GenAI craze: an entire article to show that it's nearly there, yet it's not, despite convoluted effort to make it just so on a very, very niche example.
But I'm foreseeing the opposite. This kind of tool use will soon be integrated and hidden, such that people will eventually say "see, we solved the problem that AI can't do 123+456, now we are really, really close to AGI." Yeah, no: with an AGI, it would have been the AGI itself that came up with needing a tool, building the tool, and then using the tool. But that's not what LLMs are. They are statistical machines that predict tokens. They are very good at it, but that's not AGI.
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
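A toy version of steps 1 and 2 might look like this (the shape type, sizes, and description template are arbitrary choices for illustration):

    # Steps 1-2: random shapes as an underdrawing, plus a matching description.
    import random
    from PIL import Image, ImageDraw

    W = H = 512
    img = Image.new("RGB", (W, H), "white")
    draw = ImageDraw.Draw(img)
    facts = []
    for _ in range(5):
        n = random.randint(0, 9)
        x, y = random.randint(32, W - 96), random.randint(32, H - 96)
        draw.rectangle([x, y, x + 64, y + 64], outline="black", width=3)
        draw.text((x + 28, y + 24), str(n), fill="black")
        facts.append(f"there is a square containing the number {n} near ({x}, {y})")
    img.save("underdrawing.png")  # step 3: feed this to an image+text-to-image model
    print("; ".join(facts))       # step 4: pair this text with the generated image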
But by using the LLM to generate code (the markup an SVG graphic is made up of), and then using a rasterized image of that SVG as an input to the diffusion model, the raster takes the place of the raw noise input and guides the denoising process of the diffusion model to put the numerical parts in the right spots.
The LLM is putting the SVG in the right order because the code that drives the SVG is just that, code, and the numerical order is easily defined there, even if it has to follow something like a spiral.
Edit: although LLMs may now also be using thinking modes, with their feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral-number SVG, and it did so interactively via thinking; the generated code is incredibly explicit about positions, so that must help. But the underlying two-step SVG-to-diffusion idea is the real key here.
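In concrete terms, the two-step idea can be approximated with an img2img pass seeded by the rasterized SVG (this uses diffusers; the model ID and strength value are illustrative, not from the article):

    # Sketch: start denoising from the LLM's rasterized layout instead of
    # pure noise. Lower strength keeps the numbers where the SVG put them;
    # higher strength lets the model repaint more freely.
    import torch
    from diffusers import AutoPipelineForImage2Image
    from PIL import Image

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16).to("cuda")

    layout = Image.open("layout.png").convert("RGB")  # rasterized LLM-written SVG
    result = pipe(
        "numbered artisanal candies arranged in a spiral, soft studio light",
        image=layout,
        strength=0.55,
    ).images[0]
    result.save("candies.png")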
It seems to be a very effective pattern. Curious if there are other examples out there. Or other names for this?
Still emotionally unrelatable, but it definitely produced something that matched the specifications where they were explicit and systematically enforced through deterministic means. For now I maintain that LLM limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
It should be fairly trivial to fix any logic errors in the structured output, too.
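For example, a deterministic check over the generated SVG (the file name and expected labels are made-up for illustration) can catch missing labels before the image model ever sees the layout:

    # Verify an LLM-generated SVG actually contains every expected label.
    import xml.etree.ElementTree as ET

    SVG_NS = "{http://www.w3.org/2000/svg}"
    root = ET.parse("layout.svg").getroot()
    found = {t.text.strip() for t in root.iter(f"{SVG_NS}text") if t.text}
    missing = {str(n) for n in range(1, 21)} - found
    if missing:
        print("regenerate or patch these labels:", sorted(missing))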
LLMs are evolving so fast I wouldn’t be surprised if this technique were no longer needed in <6 months.
At the end of the day we can get so much done just by breaking down a problem into smaller problems.
But won't it be fun when we can cloud-burst a grid/tiled sampling of multiple code implementations/architectures in parallel, and interactively explore, navigate, and blend points in the latent design space. Multiples embodying different trade-offs, styles, clarity vs performance, etc. Code as generative art. What might the software engineering equivalent of designer moodboards be?
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them on coding tasks: prompt to draft the outline, and then prompt to implement each function. There were even a few blog posts on HN showing off this approach, with terms inspired by animation work.
You can go from the highest level and all the way down to the lowest level with LLMs, you just have to work at it iteratively one level at a time.
People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.
The models now waste a vast number of neurons memorising the character counts of the entire English language, so that people can ask how many r's are in "strawberry" and check a tickbox in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how a model has encoded that ability in its neurons, you would never, ever trust any arithmetic it outputs, even if it seems to "know" it (unless it called a calculator MCP to get there).
There are fundamental limitations, but we're currently brute forcing ourselves through problems we could trivially solve with a different tool.
No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
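A quick way to see why per-token knowledge suffices (using tiktoken's cl100k_base vocabulary as an example; the model of course does this implicitly, not via lookup code):

    # Count r's from per-token knowledge alone: spell each token, count
    # within tokens, sum. No per-word memorisation required.
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pieces = [enc.decode([t]) for t in enc.encode("strawberry")]
    print(pieces)                                     # e.g. ['str', 'aw', 'berry']
    print(sum(piece.count("r") for piece in pieces))  # 3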
Those "tolkens" humans "count" are translated to a ~2048 (depends on model) floating point vector.
bird => {mamal, english, noun, Vertebrate, aviant} has one r but what if you make it 20% more "french". Is is still 1 r? That could be the word "bird" in french, or it could be a french speaking bird or a bird species common in france.
If nearest neibour distance to the vocabulary of every language makes the vector no longer map to "bird"; then the amount of rs' must change, using a series of trained conditional checks (with some efficiency where languages have some general spelling patterns).
That is such an unreasonable amount of compute, that it is likley faar cheaper, easier and more reliable to train the model to memorise the output:
{"MCP":"python", "content":"len((c for c in 'strawberry' if c='r'))"}
The attention mechanism allows LLMs to learn these kinds of absurdly inefficient calculations. But we really shouldn't use LLMs where they're outperformed by trivial existing solutions.
Some limitations are not rigorously demonstrated to be fundamental, but have been continuously present since the earliest LLMs, yes. Shouldn't the burden of proof be on those who say it can be done?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
https://arxiv.org/abs/2401.11817
What they really prove is that it's impossible to extrapolate an unconstrained non-continuous function from a finite subset of its values. Good for them, I guess.
It's like saying that the no-free-lunch theorems prove that LLMs can't be the best optimizers, while what they prove (roughly) is that the best optimizer doesn't exist. That is, even people aren't the best optimizers, but we manage somehow, so LLMs can too.
"Specifically, we define a formal world where bungling is defined as inconsistencies between a computable LLM and a computable ground truth function. By employing results from learning theory, we show that LLMs cannot learn all the computable functions and will therefore inevitably bungle if used as general problem solvers."
Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
People have an <opinion> which hasn't been rigorously proven, while <not rigorously proven counteropinion>.
As such, I am not sure what you're trying to achieve here.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
We have improved hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.
There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.
Here's one: https://arxiv.org/abs/2401.11817