

I’m writing this because one of my editors woke up in the middle of the night and scribbled a question on his bedside notepad: “What is a parameter?” Unlike most thoughts that surface at 4 a.m., it’s a genuinely interesting one, and it gets at the heart of how large language models work. And I’m not just saying that because he’s my boss. (Hi, Boss!)
The parameters of a large language model are often described as the knobs and switches that control its behavior. Picture a pinball machine the size of a planet, firing balls across that vast expanse via a multitude of finely tuned paddles and bumpers. Change those settings and the balls behave differently.
OpenAI’s GPT-3, released in 2020, had 175 billion parameters. Google DeepMind’s newest LLM, Gemini 3, may have at least a trillion (some speculate it’s closer to 7 trillion), but the company isn’t saying. (With competition fierce, AI companies have stopped sharing details about how their models are built.)
But what parameters are, and how they let LLMs do such remarkable things, is much the same across models. Ever wondered what actually makes an LLM tick, what sits behind all those pinball metaphors? Let’s take a look.
What is a parameter?
Think back to middle school algebra: 2a + b. Those letters are parameters: give them values and you get a result. In math and in programming, parameters set bounds or shape output. The parameters inside an LLM work the same way, just at a staggering scale.
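To make that concrete, here’s the same idea as a few lines of Python, a toy sketch rather than anything from a real model:

```python
# A parameterized expression: the output depends entirely on
# the values assigned to the parameters a and b.
def f(a, b):
    return 2 * a + b

print(f(3, 4))  # 2*3 + 4 = 10
print(f(1, 1))  # same formula, different parameter values: 3
```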
How are their values determined?
The short answer: with an algorithm. During training, each parameter starts out with a random value. Training then consists of a repeated series of calculations (known as training steps) that adjust those values. Early on, the model makes mistakes. The training algorithm measures each mistake and goes back through the model, tweaking the values of its many parameters so that the next mistakes are smaller. This repeats until the model performs the way its makers want it to. At that point, training stops and the parameters’ values are locked in.
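Here’s a minimal sketch of that loop in Python, using plain gradient descent to fit two parameters. Real LLM training follows the same pattern (measure the error, nudge every parameter to shrink it) but with backpropagation, fancier optimizers, and billions of parameters; the data and learning rate below are made up for illustration:

```python
# Learn parameters a and b so that a*x + b fits some example data.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # targets follow 2x + 1

a, b = 0.5, -0.3   # parameters start out with (pseudo)random values
lr = 0.01          # learning rate: how far each nudge moves a parameter

for step in range(5000):                  # each pass is one training step
    grad_a = grad_b = 0.0
    for x, target in data:
        error = (a * x + b) - target      # how wrong the model currently is
        grad_a += 2 * error * x           # how the error changes if a changes
        grad_b += 2 * error               # ... and if b changes
    a -= lr * grad_a                      # adjust parameters to shrink the error
    b -= lr * grad_b

print(round(a, 2), round(b, 2))  # ~2.0 and ~1.0: training done, values locked in
```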
Sounds simple …
In theory! In practice, the sheer amount of data LLMs are trained on and the sheer number of parameters they contain mean training takes a colossal number of steps and a staggering amount of computing power. Over the course of training, the 175 billion parameters in a medium-sized LLM like GPT-3 will each be updated tens of thousands of times. All told, that adds up to quadrillions (a number with 15 zeros) of calculations. That’s why training an LLM takes so much energy. We’re talking thousands of specialized, high-speed computers running around the clock for months.
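A quick back-of-envelope check on that figure (the per-parameter update count is a rough estimate, not an exact number):

```python
parameters = 175_000_000_000   # GPT-3's parameter count
updates_each = 10_000          # "tens of thousands" of updates per parameter

print(f"{parameters * updates_each:.2e}")  # 1.75e+15: quadrillions of updates
```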
So, what are all these parameters for, exactly?
Inside an LLM there are three main types of parameters that get their values through training: embeddings, weights, and biases. Let’s take each in turn.
Alright! So, what are embeddings?
An embedding is a mathematical representation of a word (or part of a word, known as a token) in an LLM’s vocabulary. That vocabulary, which can run to several hundred thousand distinct tokens, is fixed by the model’s designers before training begins. But the words start out with no inherent meaning. The meaning comes from training.
As a model trains, each word in its vocabulary is assigned a numerical value that captures what that word means relative to all the other words, based on how it appears across countless examples in the model’s training data.
So each word gets swapped out for a kind of code?
Yes, but there’s a bit more to it than that. The numerical value that represents each word, the embedding, is in fact a list of numbers, where each number in the list captures a different facet of meaning that the model has pulled from its training data. The length of that list, which LLM designers can choose before training, is often around 4,096.
Every word in an LLM is represented by a list of 4,096 numbers?
Yes. That’s an embedding. And during training, every one of those numbers gets adjusted. An LLM whose embeddings contain 4,096 numbers is said to have 4,096 dimensions.
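In code, an embedding table is essentially a lookup from each token to its list of numbers. A minimal sketch (the four-word vocabulary and the random values are placeholders; a real model learns these values during training):

```python
import random

VOCAB = ["table", "chair", "astronaut", "moon"]  # real vocabularies run to hundreds of thousands of tokens
DIMENSIONS = 4096

# One list of 4,096 numbers per token. They start out random,
# exactly as they do before training begins.
embeddings = {word: [random.uniform(-1, 1) for _ in range(DIMENSIONS)]
              for word in VOCAB}

print(len(embeddings["table"]))  # 4096
```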
Why 4,096?
It does seem like an odd number. But LLMs (like anything that runs on a computer chip) work best with powers of two: 2, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capability and efficiency. Models with fewer dimensions lack capacity; models with more become too expensive or too slow to train and run.
More numbers let the LLM pick up far more nuanced information about how a word gets used in different contexts, what subtle connotations it can carry, how it relates to other words, and so on.
This past February, OpenAI released GPT-4.5, the company’s largest LLM to date (by some estimates it has more than 10 trillion parameters). Nick Ryder, a research scientist at OpenAI who worked on the model, told me at the time that larger models can take more information into account, such as emotional cues, like when a speaker’s words might signal hostility: “All of these subtle patterns that come up in human conversations: those are the things that these bigger and bigger models will capture.”
The takeaway is that all the words in an LLM get encoded into a high-dimensional space. Picture thousands of words floating in the air around you. Words that sit closer together have more similar meanings. “Table” and “chair” would be nearer to each other than either is to “astronaut,” which sits closer to “moon” and “Musk.” Way off in the distance you might spot “prestidigitation.” It’s like that, except that instead of relating to one another across three dimensions, the words inside an LLM relate to one another across 4,096.
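One standard way to measure that closeness is cosine similarity, which scores how aligned two embedding vectors are. The three-dimensional vectors below are invented purely to illustrate the idea:

```python
import math

# Hand-picked 3-D stand-ins for learned embeddings
# (real embeddings have ~4,096 dimensions).
table     = [0.9, 0.8, 0.1]
chair     = [0.8, 0.9, 0.2]
astronaut = [0.1, 0.2, 0.9]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity(table, chair))      # ~0.99: close together, similar meaning
print(cosine_similarity(table, astronaut))  # ~0.30: far apart
```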
Yikes.
It’s mind-bending stuff. In effect, an LLM compresses much of the internet into one enormous mathematical structure that encodes an unfathomable amount of interrelated information. That’s why LLMs can do remarkable things, and it’s also why they’re so hard to fully understand.
Okay. So much for embeddings. What about weights?
A weight is a parameter that represents the strength of a connection between different parts of a model, and it’s one of the most common kinds of knobs for adjusting how a model works. Weights come into play when an LLM reads text.
When an LLM takes in a sentence (or a chapter of a book), it first looks up the embeddings for all the words and then runs those embeddings through a series of neural networks, called transformers, designed to process sequences of data (such as text) all at once. Each word in the sentence gets processed in relation to every other word.
This is where weights come in. An embedding captures what a word means in isolation. When that word shows up in a particular sentence, transformers use weights to work out what it means in that new context. (In practice, this involves multiplying each embedding by the weights of all the other words.)
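Here’s a highly simplified sketch of that idea: a word’s vector gets updated as a weighted mix of the vectors of the words around it. Real transformers compute these weights with learned matrices (the attention mechanism); the two-dimensional vectors and the weights below are made up:

```python
# A tiny sentence, each word as a 2-D vector (stand-ins for embeddings).
sentence = {
    "bank":  [0.5, 0.1],   # ambiguous on its own
    "river": [0.1, 0.9],
    "the":   [0.2, 0.2],
}

# How strongly "bank" should weight each word, itself included
# (made-up numbers; a real model computes these from learned weights).
attention = {"bank": 0.5, "river": 0.4, "the": 0.1}

bank_in_context = [
    sum(attention[word] * vector[i] for word, vector in sentence.items())
    for i in range(2)
]
print(bank_in_context)  # [0.31, 0.43]: "bank" now leans toward "river"
```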
And biases?
Biases are another kind of knob, one that tweaks the effects of the weights. Weights set the thresholds at which different parts of a model fire (and so pass information along to the next part). Biases adjust those thresholds so that an embedding can trigger activity even when its value is low. (Biases are values that get added to an embedding rather than multiplied with it.)
By shifting the thresholds at which parts of a model fire, biases let the model capture information it might otherwise miss. Imagine trying to make out what someone is saying in a noisy room. Weights turn up the loudest voices the most; biases are like a dial on a sound system that lifts the quieter voices in the mix.
Here’s the short version: Weights and biases are two different ways an LLM squeezes as much information as it can out of the text it processes. Both kinds of parameters get adjusted throughout training to make sure they do that well.
Okay. And what about neurons? Are they a kind of parameter too?
No. Neurons are more like a structure for organizing all this math: containers for the weights and biases, connected to one another by networks of pathways. The setup is loosely inspired by biological neurons in animal brains, where signals from one neuron trigger new signals in the next, and so on.
Each neuron in a model holds a single bias plus one weight for each of the model’s dimensions. Put another way, if a model has 4,096 dimensions, meaning its embeddings are lists of 4,096 numbers, then each neuron in that model has one bias and 4,096 weights.
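Put together, a single neuron’s job looks like this: multiply the incoming embedding by the weights, add the bias, and fire only if the result clears the threshold. A sketch, with random stand-ins for trained values and a ReLU-style activation (one common choice among several):

```python
import random

DIMENSIONS = 4096

# One neuron: 4,096 weights plus a single bias
# (random stand-ins for values that training would set).
weights = [random.uniform(-0.01, 0.01) for _ in range(DIMENSIONS)]
bias = 0.1

def neuron(embedding):
    total = sum(w * x for w, x in zip(weights, embedding)) + bias
    return max(0.0, total)  # fire only if the biased sum clears zero

embedding = [random.uniform(-1, 1) for _ in range(DIMENSIONS)]
print(neuron(embedding))
```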
Neurons are organized into layers. In most LLMs, every neuron in one layer is connected to every neuron in the layer above. A 175-billion-parameter model like GPT-3 might have around 100 layers with tens of thousands of neurons in each. And each neuron carries out tens of thousands of calculations at once.
Dizzy again. That’s a lot of computing.
It is. A truly vast amount of math.
How does it all fit together? How does an LLM take in a bunch of words and decide what to spit out?
As an LLM works through a chunk of text, the numerical representation of that text, the embedding, passes up through the layers of the model. At each layer, the embedding’s value (that list of 4,096 numbers) is updated over and over via a sequence of calculations involving the model’s weights and biases (attached to the neurons) until it reaches the top layer.
The idea is that after this astonishing sequence of calculations, all the meaning, nuance, and context of the input text is captured in the embedding’s final value. That value is then used to determine the next word the LLM should produce.
No surprise: this step is more complicated than it sounds. What the model actually does is calculate how likely every single word in its vocabulary is to come next and rank the results. Then it picks the top one. (Sort of. See below …)
That word gets appended to the existing block of text, and the whole process runs again, repeating until the LLM determines that the most likely next word marks the end of its output.
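In schematic Python, the loop looks something like this. The model_scores function is a stand-in for the entire stack of layers described above (its name and the canned numbers are invented for the demo, not any real model’s API):

```python
END = "<end>"  # special token the model emits when it's done

def model_scores(text):
    # Stand-in for the whole network: a real LLM runs the text's
    # embeddings through every layer and produces one likelihood
    # score for every word in its vocabulary.
    if text.endswith("Paris"):
        return {"Paris": 0.01, "London": 0.01, END: 0.98}
    return {"Paris": 0.90, "London": 0.08, END: 0.02}

def generate(prompt):
    text = prompt
    while True:
        scores = model_scores(text)              # rank every vocabulary word
        next_word = max(scores, key=scores.get)  # pick the top one (sort of; see below)
        if next_word == END:
            return text
        text = text + " " + next_word

print(generate("The capital of France is"))  # The capital of France is Paris
```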
That’s it?
Well …
Go on.
LLM makers can also set a small handful of other parameters, known as hyperparameters. The main ones are temperature, top-p, and top-k.
You’re inventing this.
Not at all. Temperature is a parameter that works like a creativity dial. It affects how the model picks its next word. I said above that the model ranks the words in its vocabulary and picks the top one. The temperature setting can instead push the model toward the most likely next words, giving output that’s more factual and on-topic, or toward less likely words, giving output that’s more surprising and less robotic.
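Under the hood, temperature divides the model’s raw scores before they’re converted to probabilities, sharpening or flattening the distribution. A sketch with made-up scores:

```python
import math

# Made-up raw scores (logits) for three candidate next words.
logits = {"the": 2.0, "a": 1.0, "prestidigitation": -1.0}

def softmax_with_temperature(logits, temperature):
    scaled = {w: s / temperature for w, s in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {w: math.exp(s) / total for w, s in scaled.items()}

print(softmax_with_temperature(logits, 0.5))  # low temp: probability piles onto "the"
print(softmax_with_temperature(logits, 2.0))  # high temp: long shots get a real chance
```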
Top-p and top-k are two more controls that affect which word the model picks. These settings make the model choose a word at random from a pool of the most likely candidates rather than always taking the single top-ranked word. Parameters like these shape how a model comes across: quirky and imaginative, or dependable and dull.
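A sketch of both filters, picking up the probabilities from the temperature step above. Top-k keeps the k highest-ranked words; top-p keeps the smallest set of top words whose probabilities add up to at least p; the model then samples from whatever survives:

```python
import random

probs = {"the": 0.55, "a": 0.33, "prestidigitation": 0.12}  # from the temperature step

def top_k(probs, k):
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return {w: probs[w] for w in kept}

def top_p(probs, p):
    kept, running = {}, 0.0
    for w in sorted(probs, key=probs.get, reverse=True):
        kept[w] = probs[w]
        running += probs[w]
        if running >= p:   # stop once the pool covers probability mass p
            break
    return kept

pool = top_p(top_k(probs, k=2), p=0.9)            # apply both filters
words, weights = zip(*pool.items())
print(random.choices(words, weights=weights)[0])  # sample from the surviving pool
```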
One last question! There’s been a lot of buzz about smaller models that can outperform larger ones. How does a smaller model do more with fewer parameters?
That’s one of the hottest questions in AI right now, and there are a number of ways it can happen. Researchers have found that the amount of training data makes a big difference to performance. First, you need to make sure a model sees enough data: an LLM trained on too little text won’t make good use of all its parameters, and a smaller model shown the same amount of data could beat it.
Another trick researchers have hit on is overtraining. Feeding models far more data than was once thought necessary seems to make them better, so a smaller model trained on a lot of data can outperform a larger model trained on less. Take Meta’s Llama LLMs: the 70-billion-parameter Llama 2 was trained on around 2 trillion words, while the 8-billion-parameter Llama 3 was trained on around 15 trillion. The far smaller Llama 3 is the better model.
A third technique, called distillation, uses a larger model to teach a smaller one. The smaller model is trained not just on the raw training data but also on the outputs of the larger model’s internal computations. The idea is that the hard-won knowledge encoded in the big model’s parameters rubs off on the smaller model’s parameters, giving it a head start.
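One common recipe (a standard approach in the research literature, not necessarily what any particular lab does) trains the small model to match the large model’s full probability distribution over next words, not just the single correct answer. A sketch of the loss being minimized:

```python
import math

def cross_entropy(target_probs, student_probs):
    # How far the student's distribution is from the target one.
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs))

# Over a three-word vocabulary: the "hard" label says exactly one word
# is correct; the teacher's "soft" predictions carry extra nuance.
hard_label    = [1.00, 0.00, 0.00]
teacher_probs = [0.70, 0.25, 0.05]   # big model's output
student_probs = [0.60, 0.30, 0.10]   # small model's current output

# Distillation loss: mostly imitate the teacher, partly match the label
# (the 0.7/0.3 mix is an arbitrary choice for the demo).
loss = 0.7 * cross_entropy(teacher_probs, student_probs) \
     + 0.3 * cross_entropy(hard_label, student_probs)
print(loss)  # training nudges the student's parameters to shrink this number
```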
Honestly, the era of single monolithic models may already be behind us. Even the biggest models around today, such as OpenAI’s GPT-5 and Google DeepMind’s Gemini 3, can be thought of as several smaller models bundled together. Using a technique called “mixture of experts,” a large model can fire up only the parts of itself (the “experts”) that are needed to process a given piece of text. That combines the capability of a large model with the speed and lower power demands of a smaller one.
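A bare-bones sketch of the routing idea. Real mixture-of-experts layers use a small learned network as the router and many experts per layer; the two experts and the keyword rule here are pure invention:

```python
# Two tiny "experts," each a stand-in for a chunk of the network.
def code_expert(text):
    return f"[code expert handled {text!r}]"

def prose_expert(text):
    return f"[prose expert handled {text!r}]"

def router(text):
    # A real router is itself a small trained network that scores the
    # experts; this keyword test is purely for illustration.
    return code_expert if "def " in text or "import " in text else prose_expert

for snippet in ["def hello(): pass", "Once upon a time"]:
    expert = router(snippet)  # only one expert runs per input,
    print(expert(snippet))    # so most of the parameters sit idle
```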
That’s not the end of the story, though. Researchers are still figuring out how to get the most out of a model’s parameters. As the gains from simple scaling level off, piling on more parameters no longer pays the way it used to. It’s less about how many parameters you have and more about how well you use them.
Can I see one?
You want to see a parameter? Go ahead: here’s an embedding.