Hello, my good readers!
Stanford guys continue the "Let's make LLMs friendly with each other" arc.
Overall, the concept is certainly excellent, and humanity will sooner or later reach some impressive solution here thanks to the synergy of accumulated research, as usually happens.
What raises questions for me is that they've again arrived at an approach where a "large, smart, expensive cloud model" generates pieces of code, and "local, small, cheaper, and less intelligent" Minions execute that code locally, for example over data from PDF documents, then return the results to the big model so it can generate the answer.
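To make the flow concrete, here is a minimal sketch of that loop as I read it - not the actual Minions code; `cloud_generate` and `local_execute` are hypothetical stand-ins for the real model calls:

```python
# A minimal sketch of the described flow, NOT the actual Minions protocol code.
# `cloud_generate` and `local_execute` are hypothetical stand-ins for the real
# cloud-LLM call and the local (e.g. Ollama-hosted) worker.

from typing import Callable

def minions_round(
    question: str,
    local_chunks: list[str],                      # e.g. text extracted from local PDFs
    cloud_generate: Callable[[str], str],         # big model: prompt -> code or answer
    local_execute: Callable[[str, str], str],     # small worker: (code, chunk) -> result
) -> str:
    # 1. The big cloud model writes a small "program" describing what to extract.
    extraction_code = cloud_generate(f"Write code to answer: {question}")

    # 2. Cheap local Minions run that code over private data; the raw data never leaves the machine.
    partial_results = [local_execute(extraction_code, chunk) for chunk in local_chunks]

    # 3. Only the aggregated results go back to the cloud model for the final answer.
    return cloud_generate(f"Question: {question}\nLocal results: {partial_results}")
```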
Well, it's generally cool, except for one big issue: how strongly this still reeks of terrible non-determinism.
First, from the user's perspective - what about security? By what means and in what environments will these models run locally? In Ollama? Okay. Do we trust the code generated by LLMs?
No. I don't trust it.
Despite all the benefits of speeding up work and augmenting mere-mortal reasoning with various excellent tools like Claude, I don't believe that such "protocols" will be viable and *successfully* adopted even somewhat widely in the near future.
Now, if we develop a specific product with locally running models that clearly "understand" their responsibilities and capabilities (tool-calls?), process the user's request (potentially understanding it better than the user themselves), and then when necessary, reach out to an LLM, perhaps providing the results of their work and the original request to predict the final answer - well... that sounds interesting! Very much so!
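Roughly, something like this hedged sketch - all names here are hypothetical, just to pin down the shape of the idea:

```python
# A rough sketch of the "local-first" product idea above; every name is hypothetical.

from typing import Callable

def local_first_answer(
    request: str,
    local_model: Callable[[str], dict],      # small model: request -> {"tool": ..., "args": ..., "confident": bool}
    tools: dict[str, Callable[[str], str]],  # well-defined local capabilities (search, read file, ...)
    cloud_llm: Callable[[str], str],         # expensive model, called only when needed
) -> str:
    plan = local_model(request)
    tool_result = tools[plan["tool"]](plan["args"])   # deterministic, auditable tool-call

    if plan["confident"]:
        # The local model "understands" the request and its own limits; no cloud round-trip.
        return tool_result

    # Escalate: hand the cloud LLM the original request plus the local results,
    # and let it predict the final answer.
    return cloud_llm(f"Request: {request}\nLocal findings: {tool_result}")
```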
Second, from the perspective of a developer of such a system - how do you even debug this? Is the complexity of such a non-deterministic system justified? I won't elaborate here, as it seems obvious.
The deeper you dig into this, the more purely practical problems arise which, in my opinion, are better solved through familiar and proven methods.
Third, trying to communicate via even small pieces of code - I mean not tool-calls, but code as in the work above - is already a kind of minimal engineering. And LLMs are not ready for engineering.
Don't get me wrong: even Cursor with its latest update, when it tries to "analyze" previous results in a loop, analyzes complete nonsense. It might mindlessly suggest you delete your migrations - let's just regenerate them! - or blatantly hallucinate methods and classes that don't exist in the imported packages. It can now also suggest that the user run a command in the project directory (in the terminal) and continue working with that command's output.
The result is the same.
Just don't. I can see your objections coming. Far too many people go crazy praising the quality of LLM-assistant code.
And there's a small number of computer-science folks and folks who have been coding for 20+ years. These people acknowledge all the real benefits of coding with LLMs while mercilessly criticizing its capabilities in software design.
I'm in this camp, even though I'm not a degreed scientist and not yet an old-timer (am I?).
Debugging with Cursor goes similarly: it often copes well with simple stuff - type inference and the like - but as soon as we hit a problem of even slightly higher complexity, where you need to reason across 2-3 dependencies, it hits the glass ceiling of its "cognitive capabilities."
And yet - these are all very cool things! Magical! I am fully in love with machine learning and neural models of all sizes. They greatly help and speed up work! And this is indeed an applied information technology, a computer-science discipline of a New World. No doubt.
However, at the current time, LLMs cannot analyze or think **logically** to any significant degree.
All the "thinking" we currently observe in modern LLMs is a desperate attempt, a respectable attempt, to somehow make LLMs generate "intermediate tokens."
The first attempts (well, basically the ones we're observing now) at intermediate reasoning in natural, human language shatter to pieces.
This is now known as the tokenization problem: for LLMs, reasoning in our languages is limited by the discreteness of the tokens they operate on. As a result, the meaning space of the problem being solved gets quantized into these very tokens - wherever temperature-driven sampling happens to land, that's where we fall.
In real, human analysis, logic and reasoning don't work like that.
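Here's a toy sketch of what that "quantization" means mechanically: at every step the model's continuous state collapses onto a single discrete token drawn by temperature-scaled sampling. The vocabulary and logits below are made up for the example.

```python
# Toy illustration: the "thought" at each step collapses onto ONE discrete token,
# chosen by temperature-scaled sampling over the vocabulary.

import math, random

def sample_next_token(logits: dict[str, float], temperature: float = 0.8) -> str:
    # Softmax with temperature: lower T sharpens, higher T flattens the distribution.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(v - m) for tok, v in scaled.items()}
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok            # the continuous "meaning" is quantized to this single token
    return tok

# Subtly different continuations of a reasoning step become hard, discrete choices:
print(sample_next_token({"therefore": 2.1, "however": 1.9, "maybe": 0.5}))
```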
In this light, these two works seem more interesting to me:
Large Language Models (LLMs) are increasingly employed in complex workflows, where different LLMs and fine-tuned variants collaboratively address complex tasks. However, these systems face significant inefficiencies due to redundant context processing of the shared context. We propose DroidSpeak, a framework that optimizes context sharing between fine-tuned LLMs derived from the same foundational model. DroidSpeak identifies critical layers in the KV cache and selectively recomputes them, enabling effective reuse of intermediate data while maintaining high accuracy. Our approach balances computational efficiency and task fidelity, significantly reducing inference latency and throughput bottlenecks. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 3x higher throughputs and 2.6x faster prefill times with negligible accuracy loss compared to full recomputation.
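To illustrate the idea from that abstract - this is my sketch, not the authors' implementation; the shapes, the layer choice, and the recompute callback are all assumptions - reuse a sibling fine-tune's KV cache wholesale and recompute only the "critical" layers:

```python
# Sketch of the idea from the abstract, not DroidSpeak itself: reuse most of a
# sibling fine-tune's KV cache and recompute only the "critical" layers.

import numpy as np

def merge_kv_caches(
    donor_cache: list[np.ndarray],     # per-layer KV tensors cached by fine-tune A
    critical_layers: set[int],         # layers whose activations diverge too much to reuse
    recompute_layer,                   # callable(layer_idx) -> np.ndarray, run on fine-tune B
) -> list[np.ndarray]:
    merged = []
    for i, kv in enumerate(donor_cache):
        if i in critical_layers:
            merged.append(recompute_layer(i))   # pay compute only where accuracy demands it
        else:
            merged.append(kv)                   # reuse the shared context's cached KV as-is
    return merged

# Toy usage: 4 layers, only layer 0 gets recomputed.
cache = [np.zeros((2, 8)) for _ in range(4)]
print(len(merge_kv_caches(cache, {0}, lambda i: np.ones((2, 8)))))
```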
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters with 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed-vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
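Again, just a sketch of the patching idea from the abstract, not BLT itself - the per-byte entropies would come from a small byte-level model; here they are simply an assumed input:

```python
# Sketch: cut a new patch wherever the next-byte entropy spikes, so predictable
# runs stay in one long patch and hard-to-predict regions get more compute per byte.

def entropy_patches(data: bytes, next_byte_entropy: list[float], threshold: float = 2.0) -> list[bytes]:
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy[i] > threshold:   # high entropy -> start a new, smaller patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

data = b"aaaaaaaaXybzzzzzz"
ent  = [0.1] * 8 + [3.5, 3.2, 3.1] + [0.2] * 6
print(entropy_patches(data, ent))   # long predictable patches, short high-entropy ones
```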
In short, DroidSpeak makes models share intermediate state (the KV cache) with each other instead of re-processing shared text token by token; the Byte Latent Transformer is likewise about abandoning tokens, in favor of byte-level representations.
Both approaches create a more "adequate," less discontinuous space for model reasoning: get rid of tokens and "think" in continuous vectors!
On top of this, continuing to train models on mainstream codebases full of CRUD apps and the like, as practice has shown, doesn't improve the quality of either the generated code or the "thinking" at all (meaning beyond the point we've already reached).
On the other hand, training models on logic programming in a language like Prolog looks very, very promising.
I want to believe these directions will be carried through and won't die out. And that belief comes without much effort.