Large Language Models 5: Win Some, Lose Some
This is the fifth post in a series. You’re probably best starting here to get the background.
WHERE THE LANDSCAPE METAPHOR CAN BREAK DOWN
Scale
The 2000:1 scale of the landscape, versus the 1:1 scale of the training data, is a bit of a hand wave. It simply comes from comparing the likely size of the training data (all the human language the AI lab can get its hands on) with the size of the model weights.
Now, the hand wave is that it’s not, strictly speaking, comparing apples to apples. The model weights are a relational representation of the training data; it’s not exactly “compressing” the data so much as transforming it. Training stores statistical patterns and relationships in language, not the actual information.
I still think it’s close enough for the metaphor to work. The key benefit is in thinking about a language model as a solution space. The actual scale of that solution space relative to the original data is kind of irrelevant - the 2000:1 figure is essentially an arbitrary sci-fi flourish to make the story more interesting and to suggest the vast scale of information contained in a model.
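If you want to play with the hand wave yourself, here’s a minimal back-of-envelope sketch. Every number in it is an illustrative assumption (how much text you think a lab scraped, how many parameters, how many bytes per weight), not a measurement from any real model - the point is only that you can plug in your own guesses and see what ratio falls out.

```python
# Back-of-envelope sketch of the scale hand wave. Every number here is an
# assumption you can swap for your own guess - none of it is measured.
training_text_bytes = 100e12     # assume ~100 TB of raw training text
n_params = 1e12                  # assume a model with ~1 trillion weights
bytes_per_weight = 2             # assume 16-bit weights

weights_bytes = n_params * bytes_per_weight
ratio = training_text_bytes / weights_bytes

print(f"Weights take up ~{weights_bytes / 1e12:.1f} TB")
print(f"Training data : weights ratio is roughly {ratio:.0f}:1")
```

Whatever you plug in, the exact number isn’t the interesting part - it’s that the solution space is a transformed, much smaller representation of the original text.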
Dimensionality
The landscape metaphor simplifies the model’s actual high-dimensional space into just three dimensions for visualization purposes. This is the key simplification that allows me to get my head around what’s going on. However, it is an abstraction. It’s worth remembering that the actual mathematical space is vastly more complex than our physical Alps model can represent.
When I say “billions” of dimensions in the training data, I’m referring to the relationships between all the words and all the other words - somewhere in the high billions to low trillions of pairwise relationships. The models themselves have a variety of dimension scales: simple models might have embedding dimensions below 1,000, while something like OpenAI’s GPT-4, I think, is in the low five-figure range. These sizes are largely chosen to optimise for compute.
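To make the “three dimensions for visualisation” simplification concrete, here’s a minimal sketch. The vectors are random stand-ins for real embeddings and the 1,024-dimensional size is an assumption; it just projects a high-dimensional cloud of points down to three coordinates with plain PCA. With unstructured random data, three directions retain almost none of the variance - real embedding spaces have far more structure than this, but the gap between “what the terrain shows” and “what the space contains” is the point.

```python
# A minimal sketch of the visualisation simplification: collapsing a
# high-dimensional embedding space down to three "terrain" coordinates.
# The vectors here are random stand-ins, not embeddings from a real model.
import numpy as np

rng = np.random.default_rng(0)

n_tokens, n_dims = 2000, 1024            # assumed sizes, for illustration only
embeddings = rng.normal(size=(n_tokens, n_dims))

# Classic PCA via SVD: centre the data, keep the 3 directions of largest variance.
centred = embeddings - embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
terrain = centred @ components[:3].T     # shape (2000, 3): x, y and "altitude"

explained = (singular_values[:3] ** 2).sum() / (singular_values ** 2).sum()
print(f"3 of {n_dims} dimensions retain about {explained:.1%} of the variance")
```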
Physical effects
This seems a bit pedantic to note, but the metaphor includes no gravity, friction or any of the other physical properties of terrain and movement.
Maybe, if you wanted to push the metaphor further, things like “friction” could represent the computational cost to the rover of considering certain paths, and “momentum” might explain why LLMs sometimes continue along established narrative directions even when small contradictions appear.
These extensions might help predict how LLMs behave when fine-tuned or when exploring less common parts of their training distribution. But… maybe not.
Rover Scale
The real Curiosity rover is about 3m long. If our Alp Explorer rovers are the size of an ant (a few millimetres) in our model, then they are at roughly 1,000:1 scale, not the 2,000:1 scale of the rest of the model - about 2x larger than if they were built to the model’s scale.
However, since these aren’t really scale models of Curiosity but a new invention, we can skip over this!
PREDICTIVE OUTCOMES OF THE METAPHOR
So there are some interesting predictions that arise from this metaphor.
Things to look out for:
Confabulation clusters
Areas where the compressed terrain is particularly distorted would create clusters of related hallucinations - not just random errors but systematically related misunderstandings.
Contour following behavior
The rovers would tend to follow existing paths/contours in the terrain, predicting that LLMs would have difficulty generating truly novel ideas that don’t follow established patterns in their training data.
This would explain why temperature is extremely important (this is my suggestion - I’ve not heard anyone else say it, so it might be nonsense).
Temperature directly controls the likelihood of rovers deviating from well-established paths. At low temperatures they always follow the most prominent trails (producing predictable, sometimes repetitive outputs - often junk), while higher temperatures allow them to occasionally explore less-travelled routes (producing more creative but potentially less coherent outputs - too hot/random and it’s junk again).
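Here’s a minimal sketch of the mechanics behind that knob. The five-word vocabulary and the logits are made up for illustration - they aren’t scores from any real model - but the mechanism is the standard one: dividing the logits by the temperature before the softmax sharpens or flattens the distribution, which is exactly the “stick to the main trail vs. wander off it” trade-off.

```python
# A minimal sketch of temperature scaling. The vocabulary and logits are
# made up for illustration - they aren't scores from any real model.
import numpy as np

rng = np.random.default_rng(0)

vocab  = ["the", "well-trodden", "path", "uphill", "detour"]
logits = np.array([4.0, 3.5, 3.0, 0.5, 0.2])   # one prominent trail, progressively fainter ones

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())         # subtract the max for numerical stability
    return exp / exp.sum()

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    picks = rng.choice(vocab, size=1000, p=probs)
    off_trail = np.mean(np.isin(picks, ["uphill", "detour"]))
    print(f"T={t}: p = {np.round(probs, 2)}, faint trails taken {off_trail:.1%} of the time")
```

At T=0.2 the faint trails are essentially never taken; at T=2.0 they come up a noticeable fraction of the time.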
The concept is that the human language utterances the solution space represents have specific structures and grooves that do not encode meaning, but have arisen through historical accident and random events.
Valley locking
Once rovers enter certain deep valleys or travel along certain ridges (narrative patterns), they might find it difficult to escape those patterns - predicting that LLMs would show strong “path dependency” in generating text.
I imagine this is what you see on long journeys (long context windows): the rovers struggle to go uphill and get out, since it requires much higher temperatures than you would normally use. This metaphor would also explain why techniques like “system message reframing” or “prompt prefixing” work better than mid-generation attempts to change direction - they essentially place the rovers at different starting coordinates before the journey begins, rather than trying to make them climb out of an established valley.
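To see why “climbing out” needs a hotter rover, here’s a toy analogy - not actual LLM decoding, just a Metropolis-style random walk over a made-up one-dimensional terrain with two valleys (the altitude function, step size and temperatures are all arbitrary choices for illustration). Uphill moves are only accepted with probability exp(-rise / temperature), so at low temperature the rover stays locked in the valley it started in, and only at higher temperatures does it spend meaningful time in the other one.

```python
# A toy analogy, not actual LLM decoding: a "rover" doing a Metropolis-style
# random walk over a made-up one-dimensional terrain with two valleys.
# Uphill moves are only accepted with probability exp(-rise / temperature).
import numpy as np

rng = np.random.default_rng(0)

def altitude(x):
    return (x**2 - 1.0) ** 2          # double-well terrain: valleys at x = -1 and x = +1

def fraction_in_other_valley(temperature, steps=20_000, start=-1.0):
    x, time_across = start, 0
    for _ in range(steps):
        proposal = x + rng.normal(scale=0.1)
        rise = altitude(proposal) - altitude(x)
        if rise <= 0 or rng.random() < np.exp(-rise / temperature):
            x = proposal              # downhill always accepted, uphill only sometimes
        time_across += x > 0          # count time spent in the valley it didn't start in
    return time_across / steps

for t in (0.05, 0.3, 1.0):
    print(f"T={t}: time spent in the other valley = {fraction_in_other_valley(t):.0%}")
```

At the lowest temperature the rover essentially never crosses the ridge; only as temperature rises does it start splitting its time between the two valleys.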
Semantic plate tectonics
Different domains of knowledge are represented as different “mountain ranges” that sometimes meet at awkward boundaries. These boundaries are arbitrary, simply encoded in the structure of language utterances in the training data. This predicts that LLMs would struggle most when trying to connect concepts across distant domains - which is what you see. For example, LLMs struggle with interdisciplinary subjects like quantum biology and (ironically!) computational linguistics, can get locked into default explanations within single domains, or struggle with historical paradigm shifts (e.g. late republic to early-medieval history).