Skip to content

Have you wondered how a typical LLM is built?

Published:

Look no further — I’m going to take just 5 minutes of your time to explain how Llama 3 was built. Llama 3 from Meta AI is a good approximation for a typical LLM model in Q1 2025 and its my go-to model because its technical paper is comprehensive, covering everything we need to know.

When I say typical, I am specifically refering to Foundation LLMs. Foundation LLMs are generic LLMs for a variety of general purpose AI tasks. Think of summarization, translation, code generation, general knowledge, math solving etc.. You can also think of these as your infrastructure or foundation upon which your specific use cases can be built in the form of Generative AI apps (ex: chatGPT, Cursor) or AI agents..

TL;DR

How LLMs are built?

Broadly speaking, LLMs are built through a series of steps within these two main phases:

  1. Pre-training phase
  1. Post-training phase

1. Pre-training phase

Lets start from ground zero. In this step, a few things happen such as defining the LLM (neural network), finding the data for training, preparing the data for training and then performing the Pre-training.

1. Defining the LLM (neural network)

Let me explain this in simpler terms. All modern LLMs are neural networks. A specific deep neural network architecture called the Transformers is widely used for these models because it’s incredibly good at tasks like understanding language and predicting the next meaningful word in a sentence. So “Defining the LLM” step is nothing but deciding what architecture to use and what hyperparameters to use to create the LLM - the neural network. Think of hyperparameters as the settings or specifications for these neural networks. These do not change during the training process. Some examples of hyperparameters include size of the model, learning rate, max output size, context window size (the amount of tokens the model can “see” at one time as input to understand and make predictions) etc..

you ask: what are tokens? A token refers to a unit of text that is treated as a single element by a language model. Tokens can be as small as characters, subwords, or entire words, depending on the tokenization method used.

For those of you hearing Transformers or neural networks for the first time, I would highly encourage you to review 3Blue1Brown’s video series here.

Now that we understand what happens in Pre-Training, lets look at the Llama 3’s pre training details. Ripped out of the technical paper, below is an illustration of the Llama 3 architecture.

transformer model archictecture diagram

Hyperparameters for Llama 3 ends up looking like this the one in the below table (source Llama 3 technical paper).

llama parameters

Let me explain each hyperparameter in this table:

you ask: what are embeddings? LLMs are neural networks which are like huge and complex math equations if you will. They cannot directly read text, they need to be numbers. That is where the process of embedding comes in. Text embedding is a process where text (words, sentences, or entire documents) is converted into a numerical vector (a list of numbers). These vectors capture the semantic meaning of the text, making it easier for models to understand and work with text.

Meta developed a range of models with different sizes, measured by the number of parameters. Parameters in neural networks primarily refer to the weights and biases that the model learns during training. Each connection between neurons has an associated weight, which is a parameter. Each neuron (except typically those in the input layer) has a bias, which is also a parameter. Typically, total parameters is the sum of count of weights and count of biases. The three models—8B, 70B, and 405B—are scaled versions of the same architecture. The size impacts how powerful the model is.

Sample neural network that explains the concept of parameters

If a neural network has 3 layers. Lets say, 1 input layer, 1 hidden layer and 1 output layer with 4 neurons in each layer. Then typically the number of parameters in this model is computed as 40. These 40 parameters will be learned by the model as part of the model training process. You can think of the number of layers, what algorithm to use for making the model learn etc.. as hyperparameters.

network parameters

# of parameters = Count of weights + Count of biases = 32 + 8 = 40

2. Preparing the data to train

Now that we have defined the model, lets look at the process of getting the data it needs. The data used to train a model must be carefully curated and balanced. In Meta’s case, they used 15 trillion (15T) multilingual tokens.

To train the model, these 15T tokens were generated from an enormous dataset sourced from the internet and fed into the neural network. The goal is for the model to learn enough about language, context, and patterns to generate meaningful responses to new unseen user inputs (this process is often referred to as next-token prediction).

Meta trained Llama 3 using data collected from various sources on the internet, with the data being up-to-date as of December 2023 (this is referred to as the cutoff date). Before training, the data went through a cleaning process to remove duplicates, personal information, and other unnecessary elements.

Since much of the data came from web pages, Meta created a custom parser to efficiently process and extract information. They also used two types of tools for filtering and cleaning the data:

  1. Fast and simple text classifiers to identify and filter out bad or repetitive data.
  2. Llama2-based models for more robust classification, such as by assigning scores to documents and deciding their relevance.

Specialized pipelines were developed to handle specific types of data, such as math and code, separately. To ensure a balanced dataset for training, Meta decided on the following mix of content:

This careful curation of data helped create a model with a well-rounded understanding across different areas of knowledge. Now that we have prepared the data, lets move on to the actual training in the pre training phase.

The exact original data set size of the data used for Llama 3 training is unknown. But its estimated to be about 45TB of data.

3. Training in Pre-Training phase

Imagine this process as if you were accelerating a human brain’s learning from infancy to 30 years old in a matter of months. The brain would “experience” vast amounts of information, gaining knowledge and understanding as it processes it. This is essentially what happens with the LLM. It “sees” and “learns” from a massive amount of data to develop its knowledge and generate intelligent responses. For example, after this training process, if you ask, “What is the square root of 16?” the model should understand the question and respond appropriately. Keep in mind, this isn’t just retrieving an answer from a database—this is more comparable to human intelligence. The LLM “thinks” by activating a complex network of artificial neurons and functions, similar to how a human brain processes and responds to questions.

Training Compute Power

The speed and efficiency of training a model depend heavily on the compute power available. Compute is often measured in FLOPs (Floating Point Operations Per Second). For example, Meta used 3.8 × 10²⁵ FLOPs to train their 405B model on 15.6 trillion text tokens. In AI, scaling up the compute power (FLOPs), the size of the model (parameters), and the amount of training data (tokens) often leads to better results. However, scaling requires careful budgeting because compute power is expensive. Meta followed “scaling laws” to determine the optimal size of their model based on the available budget and compute resources.

Llama as a Multimodal Model

Llama is a multimodal model, meaning it can process not just text, but also images, audio, and videos. To achieve this, Meta used separate encoders for each modality:

To integrate these different inputs (image, audio) into the main language model, Meta used an adapter with cross-attention layers. This bridges the gap between the modality-specific encoders and the core language processing capabilities of the model.

AdamW was the learning algorithm used in pretraining.

2. Post training

This is where we tackle the alignment problem—at least partially. The goal is to teach the model to behave appropriately and give responses that are useful and practical. For example, in Meta’s approach, they used Supervised Fine-Tuning (SFT), which means improving the model using real human feedback. At this stage, the model is also trained to handle specific tasks, like generating code or using tools effectively. Meta’s approach of using a reward model and then combining it with SFT and channeling it through Direct Preference Optimization is detailed out below.

llama parameters

As part of post training, Task specific items are trained to improve on specific tasks such as reasoning, coding, tool use etc.. One particular thing I want to highlight is the tool use. Tools expand the portfolio of the LLM by giving them access to systems that they can leverage to perform tasks or answer questions.

Remember in Meta’s case, the knowledge cutoff was Dec 2023. But in the real world, if you ask the LLM what is today’s weather to the LLM, it should be able to answer it. This is where tool use comes into play. In the case of Llama 3, it is trained to use Brave Search to answer questions about recent events. Llama 3 is also trained to use other tools like a python interpreter or Wolfram API for math etc..

A bit about Hardware

Llama 3 was training on the following hardware. Imagine the scale and cost associated with it.

Benchmarking

After Post training, LLMs also go through a Benchmarking exercise to just report out real world performance evaluations on a bunch of tasks. Evaluations shown in the below table were done for Llama3.

llama parameters

Read the full Llama 3technical paper from Meta here. I’m thankful for the groundbreaking work by Meta’s AI research team. Its about 70 pages of content if you remove the citations. I wanted to share this because, if you haven’t had the chance to read it yet, now is the perfect time to dive in. With advancements like Llama 4 and even more powerful LLMs on the horizon, staying ahead of the curve has never been more crucial.

Hopefully this answers the question “Have you wondered how a typical LLM is built?”


Next Post
Would Model Context Protocol be the catalyst for AI Agent Revolution?