I know Yann LeCun is working on a completely different architecture, and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
People (researchers, investors, etc.) probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context across different requests when it doesn't need to use the full context.
Could also be that they have internal use cases which require this amount of context.
Yann LeCun has been very wrong in the past about LLMs. [0] The approach he wants to take is to train on sensor data from the physical world. I think it's going to fail because there's a near-infinite amount of physical data, down to the Schrödinger equation governing how particles behave. The signal-to-noise ratio is too low. My guess is that they'll need orders of magnitude more compute to even get something useful, but they do not have more compute than OpenAI and Anthropic. In other words, I think LLMs will generate revenue as a stepping stone for OpenAI and Anthropic, such that they will be the ones who ultimately train the AI that LeCun dreams of.
[0] https://old.reddit.com/r/LovingAI/comments/1qvgc98/yann_lecu...
The most high-profile example is the latest set of Qwen models, which replace most of the attention mechanisms with Gated DeltaNet (which uses constant memory with respect to sequence length).
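For intuition, here is a minimal, hedged sketch of a delta-rule recurrence with gating, in the spirit of what Gated DeltaNet does; the shapes, the decay term `alpha`, and the write strength `beta` are simplifying assumptions, not Qwen's actual implementation:

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Toy gated delta-rule recurrence.

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) in [0, 1].
    The recurrent state S is a fixed-size (d_k, d_v) matrix, so memory
    is constant in sequence length, unlike the (N, N) score matrix of
    softmax attention.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))              # constant-size "memory"
    out = np.empty((T, d_v))
    for t in range(T):
        S = alpha[t] * S                  # gate: decay old associations
        # delta rule: correct the state toward the new (k, v) pair
        S = S + beta[t] * np.outer(k[t], v[t] - k[t] @ S)
        out[t] = q[t] @ S                 # read out with the query
    return out
```

Whatever the production kernel looks like, the key property is visible here: the per-step cost and the state size don't grow with how many tokens came before.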
Test-time training architectures are also getting a lot of attention, and have shown great performance in the academic setting. It's only a matter of time before we start getting open TTT models.
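For a sense of what "test-time training" means, here is a heavily simplified sketch: the layer's hidden state is the weight matrix of a tiny inner model, updated by one gradient step per token on a self-supervised loss. The reconstruction loss and shapes below are illustrative assumptions, not any specific paper's exact recipe:

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Toy TTT-style layer over tokens of shape (T, d).

    The "state" W is trained at inference time: each incoming token
    provides a self-supervised target, and W takes one gradient step
    before producing the layer output. Memory stays constant in T.
    """
    T, d = tokens.shape
    W = np.zeros((d, d))
    outs = np.empty_like(tokens)
    for t in range(T):
        x = tokens[t]
        pred = W @ x
        grad = np.outer(pred - x, x)   # d/dW of 0.5 * ||W @ x - x||^2
        W -= lr * grad                 # inner-loop "training" step
        outs[t] = W @ x
    return outs
```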
Models aren't trained across their context; their context is their short-term memory at runtime, right? It has nothing to do with training. They are trained on a static dataset.
However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder if that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long contexts than their predecessors, even though, e.g. Qwen already had a 256k context. It just didn’t work like it does now.
When you train, you teach the model to, among other things, ‘self-attend’ to the input vector, ultimately projecting that vector into a large embedding space.
Thought experiment: if 99% of the time the last 100,000 entries of your vector were zero, how likely is it that you’d end up with high-quality embeddings by doing gradient descent on those outputs?
That’s what the paper is referring to.
I've noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
If your model only ever sees 8K-token samples during training, it won’t be as good at 128K context length as if you had trained on samples ranging from 8K to 128K.
Or to say it differently: the LLM is trained on static data, but in the process it is also trained on the capability of handling context itself.
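As a toy illustration of that point, a training pipeline might deliberately mix sample lengths rather than fixing one; the log-uniform choice below is an assumption for illustration, not any lab's documented recipe:

```python
import math
import random

def sample_training_lengths(n, min_len=8_192, max_len=131_072):
    """Draw n sequence lengths, log-uniformly between 8K and 128K,
    so the model actually practices long-context behaviour instead
    of only ever seeing 8K samples."""
    lo, hi = math.log2(min_len), math.log2(max_len)
    return [int(2 ** random.uniform(lo, hi)) for _ in range(n)]
```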
Kimi introduced this (https://github.com/MoonshotAI/Attention-Residuals), but I'm pretty sure closed labs like Google have had something like this for a while.
Shockingly, we seem to have found a self-attention mechanism of that quality; it just has the sad property of growing at O(N^2), where N is the context length.
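The quadratic cost is easy to see in a minimal single-head sketch (illustrative only, no optimizations):

```python
import numpy as np

def softmax_attention(q, k, v):
    """Causal scaled dot-product attention for q, k, v of shape (N, d).
    The scores matrix is (N, N), which is exactly where the O(N^2)
    time and memory cost comes from."""
    N, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (N, N)
    causal = np.tril(np.ones((N, N), dtype=bool))     # no peeking ahead
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (N, d)
```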
Nvidia uses ML for fine-tuning and architecting their chips; this might be one use case.
Another one would be to put EVERYTHING from your company into the context window. It would make it easier to create 'THE' model for every company or person. It might also be safer than training a model on your data, because then you don't have a model with all your data baked in, only in memory.
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?
Current approaches require fancy tricks to fit tokens into memory, and they spread attention thinner over larger numbers of tokens. The new approach tries to find a way to keep everything in a single shared memory and process the tokens in parallel across multiple GPUs.
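The article doesn't publish the mechanism, but one standard way to parallelize attention over a huge context is to shard the keys/values across devices and merge the partial softmax results exactly via their log-sum-exp normalizers (the trick behind ring/blockwise attention). A hedged single-query sketch, not a description of the actual system:

```python
import numpy as np

def sharded_attention(q, k_shards, v_shards):
    """q: (d,); k_shards/v_shards: lists of (n_i, d) arrays, one pair
    per (imagined) GPU. Each shard attends locally, then the partial
    outputs are combined with weights proportional to each shard's
    share of the total softmax mass."""
    partial_outs, partial_lse = [], []
    for k, v in zip(k_shards, v_shards):
        scores = k @ q / np.sqrt(q.shape[0])      # shard-local scores
        m = scores.max()
        w = np.exp(scores - m)
        partial_outs.append(w @ v / w.sum())      # shard-local attention
        partial_lse.append(m + np.log(w.sum()))   # shard-local normalizer
    lse = np.array(partial_lse)
    mix = np.exp(lse - lse.max())
    mix /= mix.sum()                              # exact softmax merge
    return sum(m_i * o_i for m_i, o_i in zip(mix, partial_outs))
```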
For 2 or 3 newspapers it works; my idea was to use it as grounding to discover relationships between people, companies and jobs.
As for the "everyone's life", I have always assumed that there would be a graph system to point to "forgotten" documents.
Gemini said my idea was amazing and new in its implementation, even if not in spirit, but I'm assuming it was being sycophantic as usual.
My sense is that this is sort of accurate, but more likely it's a result of two things:
1. LLMs are still next-token predictors, and they are trained on texts written by humans, who mostly collaborate. Staying on topic is more likely than diverging into a new idea.
2. LLMs are trained via RLHF which involves human feedback. Humans probably do prefer agreeable LLMs, which causes reinforcement at this stage.
So yes, kinda. But I'm not sure it's as clear-cut as "the researchers found humans prefer agreeableness and programmed it in."
* With Claude's 1-million-token context window I have been doing some slightly longer-range tasks (~1-3 days of work) with RPI/QRSPI frameworks (see my comments elsewhere on HN over the last few days) in one context window. They involve a grill-me session with 20-60 (sometimes more) questions per task to get alignment, which produces the design and the plan in one window.
My experience with this has been that it front-loads a lot of the LLM interaction, which can be exhausting without a reward (i.e., output). And then, when I get the output, it's so large as to be hard to review/grok.
In other words, it feels a bit like when my coworker delivers me a month's worth of work in a single PR.
We don't, no. But wouldn't it be great if we did? I'd sure love to be able to hold the entirety of the code of my organisation's monolith in my head at once. It would make everything so much easier. It would definitely also cut down on the bugs I write!
It'd be similar if I could recall all of my organisation's Confluence pages; I'd probably be a lot better at my job. Same with all the Slack history, all the HR documents, press releases, meeting transcripts. There's practically no end to useful context, even just in text form, and even if much of it is not relevant to any one task, having all of it in working memory would be fantastic, if only it were possible. I could probably find incredible cross-organisational efficiencies, and probably be far wealthier, if I were some savant who could hold all of this in my head at once.
I get that we have agent harnesses that try to fetch only the relevant information. But most of the failures I see result either from breakdowns in that process or from previous things falling out of context. I very rarely see failures where the agent forgets stuff already in context. The harnesses are making up for this exact limitation!
That sounds like the beginning of a sci-fi story where the conclusion is forgetting is not such a bad thing.
It seems far more likely that it would all get baked into the LLM during training, but maybe it will turn out to be really useful to train up a "generic robot controller LLM" and pass in a huge number of tokens to better optimize it.
I do not think it is the direction for everything.
Generally, we need consolidation of experiences and memories to just remember the important conclusions, ideas, and concepts, and then the ability to remember the full details if they are relevant (which they usually are not.)
But for some applications I am sure a billion token context would be useful.
It is likely that most people need only a 10-core CPU or whatever for most tasks, but for some applications you want a supercomputer with 1M cores.
So we need a taxonomy, we need memory layers, we need summary/details. If there is one thing I have learned about how these LLMs work, it's that if you give them a few flexible tools they can work the shit out of them to achieve objectives. We just need the right tools and the right structure for context.
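To make "memory layers" concrete, here is a hypothetical sketch of the kind of two-tier tool an agent could be handed; the names (MemoryStore, recall_summary, recall_details) are invented for illustration, not any shipping framework:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Two memory layers: cheap consolidated summaries by default,
    full details only on explicit request."""
    summaries: dict = field(default_factory=dict)
    details: dict = field(default_factory=dict)

    def remember(self, key: str, summary: str, full_text: str) -> None:
        # consolidate: keep the conclusion, archive the rest
        self.summaries[key] = summary
        self.details[key] = full_text

    def recall_summary(self, key: str) -> str:
        # the tool an agent calls first
        return self.summaries.get(key, "nothing stored")

    def recall_details(self, key: str) -> str:
        # escalation: fetch full details only when they matter
        return self.details.get(key, "nothing stored")
```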
We simply don’t know how to reliably incorporate new information without losing old capabilities. Labs handle this through extensive evaluation, heuristics, and experience.
What we do know is that models can adapt to their context, and extending the context window is an infrastructure and capex problem first. A billion useful tokens would obviate the need for any out-of-band memory structures.