Large Language Models 4: The Weeds
So how many rovers actually explore this thing?!
This is the fourth post in a series explaining the workings of large language models using the metaphor of exploring a landscape. You’re probably best starting here to get the background.
So now we really get into some weeds.
In the Alps metaphor, the number of rovers airdropped corresponds to the parallel processing capabilities of the LLM, particularly related to the attention mechanism and number of transformer heads. Here is the same thing in a little more detail.
The number of rovers deployed would be determined by:
1. Number of Attention Heads: In a transformer architecture, each attention head represents a different “perspective” on the input. In our metaphor, each head is a different type of rover, specialized for different terrain and a different kind of scanning. Exact architectures for these models aren’t public, but suppose GPT-3.5 has 32 attention heads per layer while GPT-4 has 64-128: that would mean 64-128 rover teams being deployed in the exploration, each focusing on different aspects of the terrain.
2. Model Size: Larger models (with more parameters) would deploy more rovers. A small model like GPT-2 might drop 12 rovers at each position, while a massive model like GPT-4 might deploy over 100 rovers per position.
3. Layer Structure: The model’s architecture determines how these rovers are organized. For each layer in the transformer model (GPT-4 reportedly has 96 or more layers), a new set of rovers might be deployed to explore based on the information gathered by the previous layer’s rovers. The first rovers might be Curiosity types best suited to certain terrain, the next might be Spiders from Minority Report that can climb hills and tight, difficult ground, and then maybe drones similar to those in Prometheus to explore from the air.
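The head and layer counts above are this post’s illustrative figures rather than confirmed specs. Taking them at face value, a quick sketch of how many “rover teams” (one per attention head, per layer) that implies:

```python
# Illustrative only: the head/layer counts below are this post's example
# figures, not confirmed architecture details for these models.
configs = {
    "GPT-2 (small)": {"layers": 12, "heads_per_layer": 12},
    "GPT-3.5 (as described here)": {"layers": 40, "heads_per_layer": 32},
    "GPT-4 (as described here)": {"layers": 96, "heads_per_layer": 96},
}

def rover_teams(config):
    """Total 'rover teams' = one per attention head, per layer."""
    return config["layers"] * config["heads_per_layer"]

for name, cfg in configs.items():
    print(f"{name}: {rover_teams(cfg)} rover teams in total")
```

Even the small model fields over a hundred teams; the big ones field thousands.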
When exploring the terrain:
1. Different Specializations: Each rover is a different type of rover equipped with different specialized sensors. One rover might be good at exploring and detecting “cliffs” (grammar patterns), while another might specialize in identifying “semantic valleys” (meaning relationships).
2. Coverage Strategy: The rovers spread out to cover different aspects of the surrounding terrain. Some might explore close to the starting point, while others scout further ahead looking for promising paths.
3. Communication Network: All rovers maintain constant radio contact, sharing their findings with the entire team. This corresponds to how attention mechanisms integrate information across tokens.
4. Consensus Decision-Making: After exploring and communicating, the rovers reach a consensus about which direction to move next. This represents how the model combines information from multiple attention heads to make token predictions.
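The radio network and the consensus step correspond to multi-head attention: every position attends to every other, and the heads’ findings are combined. A minimal numpy sketch with toy dimensions and random weights (not any real model’s internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    """x: (seq_len, d_model). Each head is one 'rover team' with its own view."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Each rover team gets its own sensors: separate Q, K, V projections.
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # "Radio contact": every position attends to every other position.
        scores = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(scores @ v)
    # "Consensus": findings from all teams are concatenated and mixed.
    Wo = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

x = rng.normal(size=(5, 16))       # 5 tokens, 16-dim embeddings
out = multi_head_attention(x, n_heads=4)
print(out.shape)                   # (5, 16)
```

The output has the same shape as the input, but every position now carries information gathered from all the others.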
For a concrete example, suppose you’re using GPT-3.5 and that it has 32 attention heads per layer across 40 layers:
1. Initial Deployment: When processing a prompt, NASA first deploys 32 rover teams (from Layer 1) to explore the terrain from different perspectives.
2. First Exploration Phase: Each team explores slightly different aspects of the terrain:
- Team 1 might focus on grammatical features
- Team 2 might focus on semantic relationships
- Team 3 might focus on factual information
- And so on for all 32 teams
3. First Integration: Mission control’s computers combine the findings from all 32 teams into a comprehensive “terrain report.”
4. Second Exploration Phase: Based on this report, 32 new rover teams (from Layer 2) are deployed. These teams don’t start from scratch - they build upon the analysis from Layer 1.
5. Continued Processing: This process repeats through all 40 layers, with each layer’s rovers analyzing more sophisticated and abstract features of the terrain based on previous layers’ findings.
6. Final Decision: After all 40 layers have completed their analysis, mission control has a highly processed understanding of the terrain and can determine the most appropriate next position.
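The six steps above amount to a loop over layers, where each layer refines rather than replaces the previous layer’s findings. A toy sketch (a stand-in transformation with a residual connection, not real transformer internals):

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model = 40, 16             # 40 "exploration phases", toy width

def toy_layer(h, W):
    """Stand-in for one transformer layer: transform + residual connection.
    Real layers contain attention and an MLP; this keeps only the shape of
    the computation: each phase builds on, not replaces, prior findings."""
    return h + np.tanh(h @ W)

hidden = rng.normal(size=(5, d_model))          # 5 tokens after embedding
for layer in range(n_layers):
    W = rng.normal(size=(d_model, d_model)) * 0.01
    hidden = toy_layer(hidden, W)               # Layer N starts from Layer N-1's output

print(hidden.shape)    # (5, 16): same shape, progressively refined content
```

Note the residual connection: Layer 2’s rovers really do “start from where Layer 1’s analysis ended” rather than from scratch.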
The “layers” in a transformer model represent sequential processing stages. In our metaphor:
1. First Layer: The initial set of rover teams explores the immediate terrain around the deployment point, gathering basic observations.
2. Middle Layers: Each subsequent layer represents a new phase of exploration that builds upon previous findings. These rover teams don’t start from the original deployment point - they start from where the previous layer’s analysis ended.
3. Final Layer: The last set of rover teams makes the final determination about which direction to move next (which token to generate).
This multi-layer exploration mirrors how transformers process information:
- Early layers capture basic patterns (grammar, vocabulary)
- Middle layers identify relationships and contextual information
- Later layers handle more abstract concepts and long-range dependencies
- The final layer produces token probabilities
This helps explain why larger models with more attention heads can capture more nuanced relationships in language - they’re literally deploying more “rovers” to explore different aspects of the conceptual terrain simultaneously.
How do thinking models work differently?
In the standard mode:
1. The rover arrives at a position
2. Performs a quick scan of the immediate terrain (probability distribution)
3. Selects the highest probability path (i.e. usually downhill, potentially with some temperature-based randomness)
4. Moves to that position
5. Repeats the process
6. Only reports the final path taken
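Steps 2-4 are a softmax over the model’s scores followed by temperature-controlled sampling. A minimal sketch (toy scores, not any specific model’s API):

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Standard mode: quick scan (softmax), then pick a path.
    Temperature near 0 means 'always go straight downhill' (greedy);
    higher temperature lets the rover wander onto less likely paths."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                    # scan of the immediate terrain
print(sample_next(logits, temperature=0.01))   # ~always path 0, the steepest
```

Only the chosen index is reported; the scan itself (the full probability distribution) is discarded, which is exactly what “only reports the final path taken” means.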
In thinking mode:
1. The rover arrives at a position
2. Performs a more thorough scan of the surrounding terrain
3. Documents multiple potential paths it could take, not just the one the navigation system would select in standard mode
4. Logs its observations about why certain paths might be better than others
5. Still selects one path to continue (often the same one it would have in standard mode)
6. Reports both the path taken AND the documented observations about alternatives
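In code terms, the difference is that the thinking-mode rover keeps and reports a log of the top candidate paths instead of discarding them. A toy sketch (hypothetical vocabulary and scores, purely for illustration):

```python
import numpy as np

def thinking_step(logits, vocab, k=3):
    """Thinking mode: document the top-k candidate paths before choosing.
    Returns the chosen token plus a log of the alternatives considered."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())   # softmax over the terrain scan
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:k]       # the k most promising paths
    log = [f"Path '{vocab[i]}' looks {probs[i]:.0%} promising" for i in top]
    choice = vocab[top[0]]                  # often the same pick as standard mode
    return choice, log

vocab = ["valley", "ridge", "plateau"]
choice, log = thinking_step([2.0, 1.2, 0.3], vocab)
print(choice)        # "valley"
for line in log:
    print(line)
```

The final choice can be identical to standard mode; what changes is that the reasoning trail is surfaced alongside it.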
However, different companies implement thinking in different ways. They vary in:
1. Scanning protocols: How they document alternative paths
2. Reporting style: How they structure and format their observations about alternative paths
3. Specialized instruments: What additional analysis they perform in specific terrain types
4. Integration methods: How they connect current observations with previous terrain features
All implement the core concept of “enhanced terrain scanning with explicit logging of alternatives,” but with company-specific approaches to how that scanning and logging should be structured and communicated.
Anthropic (Claude)
Rover Configuration Style: Systematic Terrain Analysis
Claude’s rovers are equipped with specialized “contextual terrain analyzers” that methodically document the surrounding landscape in a structured format:
- Rovers follow a semi-formal protocol, often organizing observations into explicit categories (e.g., “Key considerations,” “Analyzing options,” “Potential approaches”)
- The rovers are programmed to maintain balance - spending roughly equal time assessing terrain and moving forward
- Claude’s rovers are instructed to periodically “look back” at their path, explicitly connecting current terrain features to previously observed patterns
- The reporting style is comprehensive and formal, often with clear section breaks
Example rover log: “I notice three potential paths ahead. Path A leads toward [analytical observation]. Path B appears to [alternative consideration]. After weighing these options, I’ll proceed along Path A because [reasoned justification].”
OpenAI (GPT-4 with Reasoning Mode)
Rover Configuration Style: Freeform Exploration with Narrative Reporting
GPT’s rovers use a more naturalistic, conversational approach to terrain documentation:
- The rovers scan widely but report their observations in a less structured format
- They’re programmed to explicitly consider counterintuitive paths that might initially seem suboptimal
- When OpenAI’s rovers encounter complex mathematical or logical terrain, they’re instructed to deploy specialized instruments and create detailed measurements before proceeding
- The reporting often looks like an internal monologue - as if the rover is “thinking out loud” about what it observes
Example rover log: “Hmm, this is interesting. I’m looking at the terrain and I see that the most direct path would be [option]. But wait - if I consider [alternative perspective], that might actually lead to a better outcome. Let me think through this step by step...”
Google (Gemini)
Rover Configuration Style: Systematic Multi-Modal Integration
Gemini’s rovers are designed to handle complex multi-modal terrain with special instruments for analyzing visual features alongside textual ones:
- Rovers are programmed to integrate observations across different “sensory modalities” in the terrain
- They follow a more rigid protocol with numbered steps and explicit transitions between different types of analysis
- Gemini’s rovers are calibrated to spend extra time in areas of the terrain that represent quantitative reasoning or structured data
- The reporting style is comprehensive and methodical, with clear demarcation between different reasoning phases
Example rover log: “Step 1: I’m analyzing the key features of this terrain. Step 2: Based on these observations, I can identify several potential paths. Step 3: Evaluating each path against our objectives...”
Reduced Models
Now, about that 27 km model...
It’s very accurate, but expensive to create and maintain, and slow to navigate.
NASA scientists wonder: “Could we build a smaller, more efficient version that still captures the essential features?”
1. Create a Simplified Map: NASA analyzes the original terrain and identifies which features are most essential for navigation. They determine which mountains, valleys, and ridges are most frequently used by rovers and which minor features could be smoothed over without significantly affecting typical journeys.
2. Building a Compact Version: Instead of replicating the full 27 km model, they build a far smaller version. This compact model doesn’t replicate every pebble and crevice of the original, but carefully preserves the major peaks, important ridge lines, and commonly traversed valleys.
3. Knowledge Distillation: To ensure accuracy, they don’t just build this compact model directly from LIDAR scans. Instead, they run thousands of test expeditions on the full-size model, carefully recording which routes the rovers actually take. Then they specifically optimize the smaller model to reproduce these important pathways, even if that means slightly exaggerating some features to compensate for the reduced space.
4. Rover Modifications: The rovers themselves might be recalibrated to work efficiently with this new compressed terrain - their sensors adjusted to recognize when a single feature in the compact model represents what would have been several distinct features in the full model.
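Step 3 is knowledge distillation in the technical sense: the student model is trained to match the teacher’s softened output distribution, not just the raw training data. A minimal numpy sketch of the core loss (toy logits, temperature value chosen for illustration):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between the teacher's softened 'route map' and the
    student's: the small model is trained to reproduce the big model's
    preferred pathways, not the raw terrain."""
    p = softmax(teacher_logits, T)          # teacher's soft routes
    q = softmax(student_logits, T)          # student's current routes
    return float(np.sum(p * np.log(p / q)))

teacher = [3.0, 1.0, 0.2]
matched = distillation_loss(teacher, [3.0, 1.0, 0.2])  # 0: routes agree
off     = distillation_loss(teacher, [0.2, 1.0, 3.0])  # larger: routes differ
print(matched, off)
```

Training drives this loss toward zero, which is the “specifically optimize the smaller model to reproduce these important pathways” step in the metaphor.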
This works because…
1. Pareto Principle. In the original model, it turns out that rovers spend about 80% of their time traversing just 20% of the terrain. By ensuring these frequently used regions are carefully preserved in the smaller model, most common queries remain accurate.
2. Conceptual Compression: Rather than simply shrinking everything proportionally, the reduced model applies intelligent compression. Major peaks (common concepts) remain prominent while very minor features (rare edge cases) might be smoothed away entirely. It’s like a topographical map that uses different scales for different regions based on importance.
3. Path Preservation: What matters isn’t preserving every feature of the terrain in exact proportion, but preserving the paths that rovers typically take. If rovers almost never visit a particular valley, it doesn’t matter much if that valley is simplified or even removed in the smaller model.
4. More Efficient Navigation: With less terrain to traverse, rovers can move more quickly. The reduced dimensionality means fewer calculations at each step, allowing for faster processing.
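To put rough numbers on “fewer calculations at each step”: a transformer’s weight count scales with roughly 12 × layers × width², so shrinking both depth and width compounds. A back-of-envelope sketch (the formula ignores embeddings, and the layer/width figures are illustrative, not any specific model’s published specs):

```python
# Rough transformer parameter count: ~12 * layers * d_model^2
# (attention + MLP weights; ignores embedding tables). Shrinking the
# terrain shrinks the work at every step quadratically in width.
def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

full    = approx_params(n_layers=96, d_model=12288)   # large-model scale
reduced = approx_params(n_layers=32, d_model=4096)    # a distilled sibling
print(f"{full/1e9:.0f}B vs {reduced/1e9:.1f}B parameters "
      f"(~{full/reduced:.0f}x less work per step)")
```

Halving depth and width would already cut the work roughly eightfold; the ratio here is 27x, which is why reduced models navigate so much faster.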
And some tradeoffs...
1. Detail Loss in Obscure Areas: Some rare but potentially important terrain features are lost. If you ask about very obscure topics or edge cases, the smaller model might produce less accurate responses because those regions are oversimplified.
2. Terrain Merging: Sometimes distinct features that were separated in the full model might be merged in the compact version. This explains why smaller models occasionally blend concepts that larger models would keep distinct.
3. Navigation Challenges: In some complex terrain areas, the smaller model might not capture subtle navigational cues that would guide rovers in the full model. This is why reduced models sometimes make “rookie mistakes” that larger models avoid.
Smaller models preserve the most important “paths” through the conceptual landscape, even if they sacrifice some of the nuance and edge-case handling of the larger terrain.