Large Language Models 2: Mountains and Mars Rovers
THE MODEL MOUNTAIN RANGE
GROUNDWORK
Broadly:
The mountainous landscape = the solution space of language
Valleys = high probability word sequences - well worn concepts
Mountain Peaks = low probability word sequences - less likely concepts
The 2,000:1 compression ratio = the ratio of training data file size to model weights file size
The rovers’ exploration = the inference process
Communication between rovers = attention mechanisms in the transformer architecture
Temperature setting = the temperature parameter that adds randomness to sampling
Expedition logs = Context window
Words don’t equal tokens (but they’re close). Peaks, valleys and terrain shapes don’t equal “concepts” (but they’re close).
Peaks and valleys are probability distributions found in the training data. “Concepts” are philosophically problematic and people are still arguing about what they are anyway (if they even exist). These simplifications make what the model is easier to understand without a PhD in Wittgenstein and statistics.
This is a two part system:
1. the scale model of the Alps containing information
2. the interaction system: rover teams with sensors and programming that send their scans to NASA mission control tokenizer computers in Houston, which translate the log of the terrain into words and send it to the user.
Code completion tools, search augmentation systems, and document summarization engines are just rover teams with slightly different exploration programming parameters, occasionally running on specially modified models of the terrain.
THE MOUNTAIN RANGE MODEL
Imagine NASA took detailed LIDAR scans of the Alps and then built a scale model in the desert as accurately as possible at 27 km × 4.5 km size.
This physical model of the Alps represents the solution space of the relationships between all the instances of human language the model has been shown (i.e. trained on). That is trillions of words across billions of documents, compressed at roughly 2,000 to 1.
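To get a feel for that 2,000:1 figure, here is a back-of-envelope check. Every number in it (corpus size, bytes per word, parameter count) is an illustrative assumption, not a published statistic:

```python
# Back-of-envelope check of the ~2,000:1 compression ratio.
# All figures below are illustrative assumptions, not published numbers.

words_in_corpus = 5e12          # assume ~5 trillion words of training text
bytes_per_word = 6              # assume ~6 bytes per English word incl. space
corpus_bytes = words_in_corpus * bytes_per_word   # ~30 TB of raw text

params = 7.5e9                  # assume a ~7.5 billion parameter model
bytes_per_param = 2             # 16-bit weights
weights_bytes = params * bytes_per_param          # ~15 GB weights file

ratio = corpus_bytes / weights_bytes
print(f"compression ratio ~ {ratio:,.0f}:1")  # prints "compression ratio ~ 2,000:1"
```

Plug in different assumptions and the ratio moves, but for realistic corpus and model sizes it lands in the same ballpark: far too small to store the text verbatim, which is why the compression must be lossy.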
In this terrain, the landscape reflects probability distributions in language:
VALLEYS represent commonly used patterns, well-worn conceptual paths that language frequently travels. The deeper and wider the valley, the more common the pattern. These are the linguistic equivalents of highways and major thoroughfares - phrases, constructions, and associations that appear frequently in human writing.
PEAKS represent rare or unusual language patterns - the statistical outliers that rarely appear in training data. These are difficult to reach, just as rare expressions or unusual word combinations are less likely to be generated.
Exploration naturally tends to follow valleys rather than climb peaks, just as language models favor common patterns over rare ones (which explains why LLMs excel at generating conventional text but may struggle with highly specialized or unusual expressions).
A more complex explanation is that the physical model represents compressed relationships between all language patterns the model was trained on. The terrain doesn’t directly represent individual concepts, but rather probability distributions over possible words based on statistical patterns found in the training data.
This is conceptually similar to (though technically different from) a vector database where information is stored as mathematical relationships rather than raw data. LLMs encode not just the relationships but generative patterns that predict how those relationships evolve in sequences.
In this landscape, semantically similar language patterns tend to appear near each other. The height (Z-coordinate) at any point represents the probability of specific word sequences, with peaks indicating low-probability patterns and valleys representing common ones.
The big thing to take away is that the “model” bit of a large language model is basically a lossy compressed file of all the language OpenAI (or whoever) could find - plus some more they generated themselves. They build the model, then extract information from this lossy compressed file, using probabilities to reconstruct it.
SO WHAT IS UP AND DOWN?
X and Y coordinates (horizontal position) represent the semantic space - different regions correspond to different topic areas, domains of knowledge, or conceptual categories. Moving across this plane means shifting between different subject matters or contexts.
Z coordinate (height) represents probability density in that region of semantic space - how likely certain patterns of language are to occur based on the training data. Note that this is inverted - high does not mean high probability. Just like in the mountains, low elevations are high-traffic, well-worn areas; higher elevations see much less traffic and use.
To be specific, higher elevations indicate lower probability regions where the model has seen very few examples and has weak statistical patterns to follow. Lower elevations (valleys) represent high-probability regions where the model has more training data and more confidence.
In this framework:
Terrain represents probabilities in the semantic landscape
Valley systems represent clusters of related high-probability patterns
Peaks represent low-probability patterns
Ridges represent boundaries between semantic domains - the dividing lines that separate different conceptual territories
Paths around the landscape represent common ways concepts connect in language
When a rover is at any specific (x,y,z) coordinate, it encounters a probability distribution over possible next tokens. This distribution isn’t directly visible in our 3D metaphor (as it exists in the very high-dimensional token space), but the rover samples from this distribution to determine its next step.
FUN WITH LANDSCAPE CONCEPTUALISATION
PLOT SPOILERS!
Theoretically, you can see narrative features in this model. If you were to zoom in you might be able to notice:
Escarpments: where revelations force a complete re-contextualisation (Sixth Sense, Fight Club, The Prestige; any Kishotenketsu-structured story has an escarpment at the ten (twist) act)
Mystery Genre: Deep valley systems with deliberately obscured pathways, occasional false escarpments, and final convergent basins
Comedy Genre: Rapidly oscillating terrain with frequent small escarpments (setups/punchlines) and unexpected connectivity between seemingly unrelated semantic regions
Individual stories could notionally be seen on the model - they’d be features about 2mm across - so the escarpment the rover would examine would be tiny!
ANYWAY
The key simplification is that the metaphor collapses the solution space of human information, with its billions of dimensions, down to three: our scale model of the Alps.
ROVERS EXPLORE AND RESPOND
INSERTION
When you provide a prompt to a chatbot, you send NASA a request to explore their model of the Alps from a precise location.
Their tokenizer computers translate your message (your prompt) into a set of co-ordinates on the model. A short prompt is vague and could land anywhere; a long, specific prompt gives NASA a much better idea of where to start.
For example:
1. “Tell me about space ships.” = “Deploy somewhere near Zermatt”.
2. “Explain the technical challenges of orbital rendezvous operations for the Soyuz spacecraft docking with the International Space Station, particularly regarding approach vectors during Earth’s shadow periods and the automated KURS navigation system’s reliability at distances under 200 meters.” = “Deploy rovers precisely at the northeast ridge of the Matterhorn, 300 meters below the summit, where the Hörnli and Zmuttgrat ridges converge, facing the Dent d’Hérens.”.
A team of ant-sized Mars Curiosity rovers is then air-dropped at the co-ordinates with instructions to start exploring.
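The tokenizer step can be sketched with a toy word-level vocabulary. Real tokenizers use subword schemes such as byte-pair encoding, and the vocabulary below is invented, but the round trip is the same idea: text in, token ids ("co-ordinates") out, and back again:

```python
# A toy word-level tokenizer. Real systems use subword schemes
# (e.g. byte-pair encoding); this invented vocabulary just shows
# the round trip: text -> token ids -> text.

vocab = ["<unk>", "tell", "me", "about", "space", "ships", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for i, tok in enumerate(vocab)}

def encode(text):
    words = text.lower().replace(".", " .").split()
    return [token_to_id.get(w, 0) for w in words]  # 0 = <unk>

def decode(ids):
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Tell me about space ships.")
print(ids)          # [1, 2, 3, 4, 5, 6]
print(decode(ids))  # "tell me about space ships ."
```

Anything outside the vocabulary maps to `<unk>` - the equivalent of NASA shrugging and dropping the rovers somewhere generic.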
EXPLORATION - A CRACK TEAM OF ROVERS ON A MISSION
In the first example prompt from above (i.e. “Deploy somewhere near Zermatt”), the rovers could start almost anywhere in a vast region, with many possible directions to explore.
The rovers would be dropped in a general area and each begins exploring in slightly different directions - some heading toward the valleys representing rocket technology, others toward the peaks of space history, and others toward the ridges of astronaut experiences. With such a vague starting point, the rovers spread out widely across the conceptual terrain.
In the second example, rovers are placed at an exact location with clear contextual constraints that guide their exploration.
Even with this precise starting location, each rover still explores in slightly different directions - one investigating the crevice representing KURS system failures, another descending toward the valley of approach vector mathematics, and others examining the shadow terrain of visual navigation limitations. But their paths remain much more closely clustered due to the specific starting coordinates.
The rovers' movement computers are programmed with some basic functions: climb up or down hills depending on the terrain context (follow probability gradients), record observations, communicate with other rovers about similar terrain features, avoid revisiting the same areas, maintain coherence with the expedition log, and optimize for efficient paths that satisfy the mission parameters (i.e. exploring the co-ordinates given in the prompt). They also use “self-attention” to look back at the path they have taken in their log.
Specifically, imagine a rover that's been dropped in and is now positioned at point A on our Alps model.
1. Point A corresponds to the word “hat.”
2. The ant-sized rover fires up its own LIDAR system and scans the surrounding terrain at sub-millimeter resolution, detecting probability distributions for possible next words (e.g., “fedora” (downhill), “store” (similar terrain), and “collection” (higher terrain))
3. Using its CPU exploration rules (concerning terrain types such as hill gradients and ridges, scans from other rovers in the team, etc.) and influenced by temperature (some randomness, not just the next most likely option), the rover selects one option (in this case, “fedora”)
4. The selection is logged.
5. The rover moves to the new position that represents “hat fedora” in the Alps model (the conceptual landscape)
6. It repeats this process for each word selection
The rover never plans a long route in advance - it only knows its next destination after sampling at each stop.
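The six steps above amount to a simple loop: scan, sample, log, move, repeat. Here is a sketch using an invented toy "terrain" (each position maps to a distribution over next steps; real models compute this distribution on the fly rather than looking it up):

```python
import random

# A sketch of the rover's step loop. The "terrain" below is an
# invented toy: for each current token, a distribution over
# possible next tokens.
terrain = {
    "hat":    {"fedora": 0.6, "store": 0.3, "collection": 0.1},
    "fedora": {"style": 0.5, "brim": 0.3, "shop": 0.2},
    "style":  {"icon": 0.7, "guide": 0.3},
}

def explore(start, steps, seed=None):
    rng = random.Random(seed)
    log = [start]                 # the expedition log
    position = start
    for _ in range(steps):
        options = terrain.get(position)
        if not options:           # edge of the mapped terrain
            break
        # Scan the surroundings, then sample one direction;
        # no long route is planned in advance.
        position = rng.choices(list(options),
                               weights=list(options.values()))[0]
        log.append(position)
    return log

print(explore("hat", 3, seed=42))
```

Note that the rover only ever looks one step ahead - the whole route emerges from repeating the local sample, which is exactly why the loop matches the "never plans a long route in advance" point above.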
The sensor samplings taken through this landscape become your answer - each sample the rovers recorded on their journey across the mountain range model corresponds to a specific token in the output.
The rovers radio back their data from the model to Mission Control in Houston.
TEMPERATURE - A FUNNY THING
The rovers have a temperature-sensitive CPU that becomes less deterministic as it heats up. This was discovered by accident to be necessary.
So NASA built a small heater onto the CPU that interferes with the output and introduces randomness. If you want a more random, less rule-following exploration, turn up the heat.
This is user-controllable, though NASA (OpenAI/Anthropic etc.) will have an optimised temperature baseline. It controls how strictly the rovers follow their basic programming rules about following the terrain - whether they stick closely to instructions about heading to the destination, or how often they take random detours (hill-climb search! random walk!).
The rovers have a temperature setting that introduces controlled randomness into their exploration. When the temperature is low, rovers strictly follow the highest-probability paths (producing consistent but potentially repetitive outputs).
When temperature is higher, they occasionally explore less obvious routes (producing more creative but sometimes less coherent responses). The useful range of settings was found by testing rather than derived from theory.
Without it, the outputs tend to become excessively predictable and repetitive junk. With too much, they become chaotic and incoherent. Finding the optimal temperature setting is crucial for generating natural-sounding language.
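Under the hood, temperature is just a divisor applied to the model's raw scores ("logits") before they are turned into probabilities. The logits below are invented for illustration, but the mechanics are standard softmax-with-temperature:

```python
import math

# Temperature in practice: divide the raw scores by the
# temperature before converting them to probabilities.
# The logits below are invented for illustration.

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # e.g. scores for "fedora", "store", "collection"

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperature sharpens the distribution (at 0.2 the top option takes nearly all the probability mass - the rover hugs the valley floor); high temperature flattens it (at 2.0 the options are much closer together - more detours).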
MISSION CONTROL TO USER
NASA’s Houston de-tokenizing computer then translates the log data from tokens back to English and sends you that in an email.
Crucially, of course, the rovers don't “understand” the terrain - they just explore it and log what they find. This explains why small changes in position (or in temperature, as above) can sometimes cause completely different word choices.
CONVERSATIONS
In a multi-turn conversation, the rovers maintain an expedition log of their previous journeys, which constrains where they can start their next deployment and exploration - ensuring the conversation remains coherent and connected.
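That expedition log has a fixed capacity - the context window - and when it fills up, the oldest entries fall off. A sketch, with an invented toy window size and a crude one-word-per-token count:

```python
# A sketch of the expedition log (context window). Each turn is
# appended, and when the log exceeds the window, the oldest
# entries are dropped. The window size here is an invented toy value.

MAX_CONTEXT_TOKENS = 20  # real models use thousands

def count_tokens(text):
    return len(text.split())  # crude: one word ~ one token

def build_context(turns, max_tokens=MAX_CONTEXT_TOKENS):
    kept = []
    total = 0
    for turn in reversed(turns):      # keep the most recent turns
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break
        kept.append(turn)
        total += cost
    return list(reversed(kept))

turns = [
    "User: tell me about space ships",
    "Assistant: space ships are vehicles designed for spaceflight",
    "User: what about docking",
]
print(build_context(turns))
```

Shrink `max_tokens` and the earliest turns disappear first - which is why a long conversation can "forget" how it started while staying coherent about recent exchanges.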