For all the talk of machine minds, the engine inside a modern chatbot rests on an idea a statistician from the 1940s would recognise: guess the next word. A large language model, or LLM, is a system trained to predict the unit of text most likely to follow the text it has already seen. Type “the capital of France is” and the model assigns a high probability to “Paris” because, across the trillions of words it learned from, that is overwhelmingly what came next. Everything else — the essays, the code, the apparent reasoning — is that single mechanism, repeated and scaled almost beyond intuition.
Understanding that mechanism is not a pedantic exercise. It explains, in one stroke, why these systems are so fluent, why they are so useful, and why they fail in the specific and sometimes dangerous ways they do. This is the mechanism behind the tools that now dominate technology coverage, and grasping it separates informed users from credulous ones.
Context: Tokens, Parameters and the Prediction Game
An LLM does not read words the way we do. Text is first chopped into tokens — chunks that are often whole words but sometimes fragments, so “unbelievable” might split into “un”, “believ” and “able”. The model’s entire job is to take a sequence of tokens and output a probability distribution over which token comes next.
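To make those two steps concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any production tokenizer works: the vocabulary is hand-written, the greedy longest-match rule is a simplification, and the scores for "Paris", "Lyon" and "big" are invented for illustration.

```python
import math

# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"un", "believ", "able"}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No piece matched: emit the character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

pieces = tokenize("unbelievable")

# Invented scores for what might follow "the capital of France is".
next_token_probs = softmax({"Paris": 6.0, "Lyon": 2.0, "big": 1.0})
```

With this toy vocabulary, `pieces` comes out as "un", "believ" and "able", and nearly all of the probability in `next_token_probs` lands on "Paris". A real model produces such a distribution over its entire vocabulary, typically tens of thousands of tokens, at every step.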
That ability is stored in numbers called parameters: the internal weights, adjusted during training, that encode the patterns the model has learned. Early models had millions of parameters; today’s frontier systems have hundreds of billions. Each parameter is a tiny dial, and training is the process of turning all of them, gradually, so that the model’s predictions match real text more and more closely.
Training works by showing the model an enormous corpus (much of the public web, digitised books, code repositories) with the next token hidden at each position. The model guesses the hidden token, compares its guess to the real answer, and nudges its parameters to do better next time. Repeat this across trillions of examples and the dials settle into a configuration that has, in effect, compressed a vast amount of human writing into statistical structure. No one programs in the rule that adjectives precede nouns in English; the model infers it because that is what the data overwhelmingly shows.
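The guess, compare, nudge loop can be sketched at miniature scale. The example below is an illustration under heavy assumptions, not a real LLM: the "model" is a bare table with one score per pair of tokens, the corpus is twelve words, and the learning rate and number of passes are arbitrary choices. What it shares with the real thing is the update rule, which lowers the cross-entropy between the predicted distribution and the token that actually came next.

```python
import math

corpus = "the cat sat on the mat the cat sat on the mat".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}

# Training examples: (previous token, token that actually followed it).
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

# The "parameters": one adjustable score per (previous, next) token pair.
weights = [[0.0] * len(vocab) for _ in vocab]

def predict(prev):
    """Probability distribution over the next token, given the previous one."""
    row = weights[prev]
    peak = max(row)
    exps = [math.exp(w - peak) for w in row]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean cross-entropy: how surprised the model is by the real next token."""
    return sum(-math.log(predict(p)[n]) for p, n in pairs) / len(pairs)

learning_rate = 0.5  # arbitrary choice for this toy
loss_before = average_loss()

for _ in range(200):
    for prev, nxt in pairs:
        probs = predict(prev)  # guess
        for k in range(len(vocab)):
            target = 1.0 if k == nxt else 0.0  # compare with the real answer
            # Nudge: gradient of cross-entropy w.r.t. score k is probs[k] - target.
            weights[prev][k] -= learning_rate * (probs[k] - target)

loss_after = average_loss()
```

After training, `loss_after` is lower than `loss_before`, and the table has absorbed the corpus's statistics: "cat" is almost always followed by "sat", while the probability after "the" is split roughly between "cat" and "mat", because the data itself is split. Nobody wrote those rules in.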
How the Transformer Made It Work
The leap that made today’s models possible arrived in 2017 with an architecture called the transformer, introduced by researchers at Google. Its central innovation is a mechanism called attention, and it solves a problem older systems struggled with: how to let distant words influence each other.
Consider the sentence, “The trophy did not fit in the suitcase because it was too big.” What does “it” refer to — the trophy or the suitcase? A human knows instantly. Attention gives the model a way to learn this: as it processes “it”, it can weigh every earlier word and assign more importance to “trophy”. Multiply this across many layers and many parallel attention “heads”, each learning different kinds of relationships, and the model builds a rich internal representation of how the parts of a passage relate.
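For readers who want to see the arithmetic, the sketch below computes attention for the single word "it" over three earlier words. It uses the scaled dot-product form described in the original transformer paper, but everything else is an assumption for illustration: the two-dimensional vectors are hand-picked so that "trophy" wins, whereas a trained model learns its vectors from data and uses far more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Score each earlier word by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hand-picked illustrative vectors; a trained model would learn these.
words = ["trophy", "fit", "suitcase"]
keys = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
query_for_it = [2.0, 0.0]

weights, output = attend(query_for_it, keys, values)
```

Here `weights` holds one number per earlier word, summing to 1, with the largest share going to "trophy". The resulting `output` vector is therefore dominated by the "trophy" value, which is how information about the distant word reaches the position of "it".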
Crucially, the transformer processes a whole sequence at once rather than word by word, which means training can be parallelised across thousands of specialised chips. That efficiency is why scaling became feasible, and scaling, it turned out, was most of the story. As models grew larger and were fed more data, they did not just get marginally better at predicting text; they began to display abilities, such as translation or step-by-step arithmetic, that smaller versions could not manage. Researchers call these emergent capabilities, and they are the reason the field shifted from hand-crafting rules to building ever-bigger statistical engines. The Stanford Institute for Human-Centered AI has tracked this scaling dynamic as the defining feature of the modern era of the field.
Why Fluency Is Not Truth
Here is the part most users miss, and it matters enormously. An LLM is optimised to produce plausible text, not true text. Those are different targets that happen to overlap most of the time. When they diverge, the model will cheerfully choose plausibility.
This is the root of hallucination — the tendency to generate confident, well-formed statements that are simply false. Ask for a citation it has not seen and the model may invent one that looks perfectly real, because a plausible-looking citation is exactly what its training rewarded. The model has no internal ledger of what it knows versus what it is improvising; it is generating the next likely token either way.
A second limit follows from how training data is gathered. A model trained on the internet absorbs the internet’s biases, gaps and stale information. Techniques such as fine-tuning, reinforcement learning from human feedback, and retrieval — feeding the model verified documents at query time — reduce these problems but do not eliminate them. Standards bodies including the US National Institute of Standards and Technology have published risk-management frameworks precisely because these failure modes are structural, not incidental. This is also why the debate over AI governance centres less on stopping the technology than on bounding where its confident guesses are allowed to make decisions.
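The retrieval technique mentioned above can be sketched in a few lines. This is a deliberately crude illustration, not a description of any real product: the function names are hypothetical, the two documents are invented, and ranking by shared words stands in for the learned similarity measures that production systems generally use.

```python
def score(question, document):
    """Count how many distinct question words also appear in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)

def build_prompt(question, documents, top_k=1):
    """Put the best-matching documents in front of the question."""
    ranked = sorted(documents, key=lambda doc: score(question, doc), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# Invented stand-ins for a store of verified documents.
documents = [
    "Paris is the capital of France.",
    "The transformer architecture was introduced in 2017.",
]

prompt = build_prompt("What is the capital of France?", documents)
```

The resulting `prompt` carries the relevant source text alongside the question, so the model's next-token predictions are conditioned on a verified document rather than on memory alone. As the article notes, this narrows the room for invention without closing it.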
What Comes Next, and What to Watch
The near future of LLMs is being shaped by a few clear pressures. Cost and energy are one: training and running these models consumes vast computing resources, pushing research toward smaller, more efficient models that approach frontier quality at a fraction of the footprint. Reliability is another, with growing investment in systems that can cite sources, flag uncertainty, or verify their own outputs against trusted data.
For readers, the practical takeaway is durable regardless of which model is in the headlines. These systems are extraordinary pattern-matchers and poor authorities. They excel at drafting, summarising, translating and brainstorming — tasks where a knowledgeable human checks the result. They are risky wherever an unverified output is treated as fact, which is why their spread through the wider economy is best understood as a powerful new tool requiring human judgement, not a replacement for it. Knowing that the machine is, at bottom, predicting the next token is the single most useful thing a non-specialist can hold on to.