A Hitchhiker's Guide to Large Language Models 1

I had to explain the structure of large language models to some people with no technical knowledge. This framework is what I came up with.

To rephrase Douglas Adams: it may have many omissions and probably much that is wildly inaccurate, but it has zero mathematics and therefore should have “Don’t Panic” in large friendly letters to calm those who feel this technology is scary and incomprehensible.

MOUNTAIN LANDSCAPES AS SOLUTION SPACES

I regularly use Anthropic's Claude for everything from coding "assistance" (I can't code, Claude does everything) to perhaps the most entertainingly successful use: crafting new cocktails.

But there's considerable voodoo talk about self-awareness and brains floating around, and I haven't found good explanations of how these systems actually work. Stephen Wolfram wrote some excellent deep dives some time back, though they’re quite maths-heavy.

So, trying to figure things out, I reached back to my AI degree and dug out one of my favourite intuition pumps: imagining solution spaces as landscapes.

It’s new levels of computing power that enable these systems: GPU and tensor computation let us analyze language at massive scale. We can now capture relationships between all words at useful resolution.

LLMs are simultaneously wildly complicated and relatively simple. At their core, these models are data files holding compressed patterns of language use. Your chat message becomes a request to examine that data file in a specific way. What you get back is the prediction generated from those compressed patterns.
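If it helps to see that shape as code, here is a deliberately crude toy in Python - word-pair counts standing in for billions of learned weights, nothing like the real architecture - purely to show the "compressed patterns in, prediction out" idea:

# A crude toy, not a real LLM: "train" by compressing a tiny corpus into
# counts of which word follows which, then treat a prompt as a request to
# examine those compressed patterns and return a prediction.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ate the fish".split()

# "Training": boil the text down to a small pattern file.
patterns = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    patterns[current_word].append(next_word)

# "Inference": the prompt is a lookup into those patterns, and the reply
# is a prediction drawn from them, not a stored answer.
prompt = "the"
print(prompt, random.choice(patterns[prompt]))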

TL;DR - ROVERS EXPLORING A MINIATURE LANDSCAPE

Imagine NASA has built a 27 km × 4.5 km scale model of the Alps from LIDAR scans in the Nevada desert.

The Alps here represent all the human language OpenAI/Anthropic/whoever could get their hands on, compressed at about 2000:1 (that's very roughly the ratio of training data to the resulting weights file, though it varies a lot from model to model).

The landscape encodes concepts: valleys are well-worn conceptual paths, while peaks are obscure ideas rarely visited. The compression unfortunately removes some detail, and in this mini-Alps you can't just look things up as you would in a database; you have to explore its surface to examine and reconstruct the information.

When you interact with the model in a chatbot (there's a code sketch of these steps after the metaphor key below):

1. You message NASA in the chat window.

2. The NASA tokenizer computer turns your prompt into co-ordinates on the model.

3. NASA air drops a bunch of ant sized Mars rovers to those co-ordinates.

4. The rovers explore the model and log paths through the terrain.

5. The rovers radio back the log to NASA.

6. NASA translates the log of the expedition back to words with the tokenizer computer.

7. NASA sends you the words in the chat window.

In this metaphor:

  • The Alps landscape = The trained model weights/parameters

  • Valleys and low terrain = Well-worn high-probability token sequences

  • Mountain peaks and high terrain = Rarely visited low-probability token sequences

  • Rovers = The inference process of exploring the model

  • Radio communication = Attention mechanisms

  • Temperature setting = Randomness control in path selection

  • Expedition logs = Context window
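If you want the same expedition without the metaphor, here is a minimal sketch of those seven steps in Python. It assumes the open-source Hugging Face transformers library and the small public GPT-2 model as stand-ins for whatever the big labs actually run, so treat it as an illustration rather than a peek inside Claude:

# Minimal sketch of the seven steps above (requires: pip install transformers torch).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # the tokenizer computer
model = AutoModelForCausalLM.from_pretrained("gpt2")  # the miniature Alps (the weights)

prompt = "Mix me a cocktail with gin and"             # 1. you message NASA
inputs = tokenizer(prompt, return_tensors="pt")       # 2. words become co-ordinates (tokens)

output_ids = model.generate(                          # 3-5. the rovers explore and log a path
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,                                  # how adventurous the rovers are
)

reply = tokenizer.decode(output_ids[0], skip_special_tokens=True)  # 6. the log becomes words
print(reply)                                          # 7. the words arrive in your chat window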

A WEIRD DATABASE LOOKUP

Alternatively, if you want an even more reductive 2D version: think vector database. It's like a weird database where the information is reduced to statistical patterns and has to be reconstructed and de-noised before you get your answer. Information is stored as derived relationships between things rather than as the information itself.

Here we are doing the following:

  1. Human input is translated into tokens, which is the translation to database-ese.

  2. Embedding the tokens sets the query up in a form that can interact with the database.

  3. Query/Key/Value attention creation makes the actual queries.

  4. Attention heads are then a fuzzy search with that query for grammatical patterns, semantic relationships, context associations, all the way up through moral values, other linguistic reasoning patterns and so on (there's a toy sketch of this fuzzy search just after this list).

  5. It then outputs what it finds as tokens using a denoise/decompression kind of algorithm.

  6. Tokens are turned back into words.
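The "fuzzy search" in steps 3 and 4 is, at heart, a similarity calculation. Here is a toy sketch of the standard Query/Key/Value attention step with made-up random numbers - it shows the mechanism, not values from any real model:

# Toy scaled dot-product attention: queries are compared against keys, and
# the values are blended according to how well they matched. Everything
# here is random and purely illustrative.
import numpy as np

def attention(queries, keys, values):
    # How well does each query match each key? (the fuzzy part)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Turn the match scores into weights that sum to one (a softmax).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The answer is a weighted blend of the values, not an exact lookup.
    return weights @ values

rng = np.random.default_rng(0)
tokens, dim = 5, 8                    # five pretend tokens, eight numbers each
Q = rng.normal(size=(tokens, dim))    # queries: "what am I looking for?"
K = rng.normal(size=(tokens, dim))    # keys:    "what do I contain?"
V = rng.normal(size=(tokens, dim))    # values:  "what do I hand over if matched?"
print(attention(Q, K, V).shape)       # (5, 8) - one blended result per token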

Or perhaps for those who used to play on their ZX Spectrum:


5 REM Intelligence is not the code
6 REM Language = Externalised human reasoning patterns
10 INPUT "What is your query?"; query$
20 LET tokens$ = query$ : REM tokenise the words
30 LET embedded$ = tokens$ : REM embed the tokens as co-ordinates
40 LET QUERY = STRUCTURE_QUERY_FOR_DB(embedded$)
45 REM TEMPERATURE AFFECTS SEARCH + DECOMPRESSION RANDOMNESS
50 LET temperature = 0.7 * RND
60 LET results$ = FUZZY_SEARCH(QUERY, temperature, BIG_RELATIONAL_HUMAN_LANGUAGE_DB)
70 LET decompressed$ = DECOMPRESS(temperature, results$)
80 LET answer$ = decompressed$
90 PRINT answer$
100 GO TO 10
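That temperature line deserves one non-joke note: in a real model, temperature reshapes the probability distribution over candidate next tokens before one is sampled. A toy sketch with invented scores:

# Toy illustration of temperature: the same made-up scores ("logits") for four
# candidate next tokens, turned into probabilities at different temperatures.
# Low temperature hugs the valley floor; high temperature lets the rovers
# wander up the slopes.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [score / temperature for score in logits]
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]   # invented scores for four candidate tokens
for t in (0.2, 0.7, 1.5):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])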

Hopefully I'm getting the point across that the technology is basically a weirdly constructed and strangely queried database. The magic is in the search and the generation of results: it can never return an exact record, only what is likely to be the answer.

There is no ghost in the machine any more than there is a ghost in your Excel spreadsheet macro (and in fact, you could export the Claude model as an Excel CSV file, though its terabyte size would blue-screen-of-death Excel - thanks Microsoft!).

IN CONCLUSION

There is no intentionality; they're not brains in the way most people would think. They're not aware or alive in any sense I think most people mean by "aware" and "alive". To plant a flag and be specific, I'd be happy to place a big bet that LLMs do not experience qualia. It's a generative decompression of information compressed in language.

Is that what we do? Stochastic parroting? Kind of. No. Sometimes. When you’re being lazy.

This is not to say LLMs can’t do very interesting, very powerful and very useful things. But be on high alert for anthropomorphic thinking. It’s easy when something seems to talk back to you.

Stephen Wolfram argued that the big surprising insight LLMs demonstrate is that language is fundamentally simpler and more law-like than previously suspected.

Philosopher Andy Clark would probably be less surprised. He has long argued that human intelligence relies heavily on cognitive offloading onto external systems, with language being perhaps the most fundamental. The collective reasoning traces built into language can structure our thoughts, which - without going into the deep end - is what I think Foucault and Heidegger were on about.

These models mirror how our thinking emerges from environmental-wetware interaction in a distributed system, highlighting how much our mental processes depend on external structures – this is just the first time we’re seeing this scaffolding manipulated effectively by something other than a human.

However, there are limits to the ways language shapes and creates thought. You might always already be a subject, but even subjects have been known to think different - which is where the embodied, embedded, enacted, extended model is crucial. Human cognition is a dynamic system; these models are essentially static.

We built something (a working Chinese Room!) that passed the Turing Test, but that just demonstrates the inability of the Turing Test to define intelligence in the way most people want it defined.
