How your very average laptop can run large language models

One of the biggest blockers I have had is getting my head around how you can run a large language model that is good enough for useful tasks on a regular laptop. Before I looked into the details of how the magic worked, this just didn’t seem possible!

But there are some very smart people out there, and your starting point is Ollama, downloaded from https://ollama.com (many of the underlying model weights are published on sites like https://huggingface.co). Once you have it installed and running on your laptop, you can start pulling models and seeing how far you can take them based on your machine spec.
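
To give a feel for how little glue code is needed, here is a minimal sketch that talks to a locally running Ollama instance over its default HTTP API on port 11434. It assumes you have already run ollama pull llama3; the prompt is just a placeholder.

```python
# Minimal sketch: send a prompt to a locally running Ollama instance.
# Assumes Ollama is running and `ollama pull llama3` has completed.
import json
import urllib.request

payload = {
    "model": "llama3",                     # any model you have pulled locally
    "prompt": "Why is AI visibility important?",
    "stream": False,                       # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```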

What Happens When You Run ollama run llama3

1. Tokenization

  • Your text is broken into tokens (subword units).
  • Example: ["Why", " is", " AI", " visibility", " important", "?"]
  • Each token is mapped to an integer ID.
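
A toy sketch of the idea (not the real Llama 3 tokenizer, which uses a learned subword vocabulary of roughly 128k entries; the tiny hand-made vocabulary below is purely for illustration):

```python
# Toy tokenizer: maps pieces of text to integer IDs from a hand-made vocabulary.
# Real tokenizers (e.g. BPE) learn these subword pieces from data.
vocab = {"Why": 0, " is": 1, " AI": 2, " visibility": 3, " important": 4, "?": 5}

def tokenize(text: str) -> list[int]:
    ids = []
    while text:
        # Greedily match the longest known piece at the start of the text.
        piece = max((p for p in vocab if text.startswith(p)), key=len, default=None)
        if piece is None:
            raise ValueError(f"no token for: {text!r}")
        ids.append(vocab[piece])
        text = text[len(piece):]
    return ids

print(tokenize("Why is AI visibility important?"))  # [0, 1, 2, 3, 4, 5]
```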

2. Embedding Lookup

  • Each token ID is converted into an embedding vector (e.g. 4096 numbers).
  • This places your words into the model’s high-dimensional semantic space.
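
In code, the embedding table is literally a big matrix indexed by token ID. A minimal numpy sketch with made-up sizes (Llama 3 8B uses a hidden size of 4096 and a vocabulary of roughly 128k tokens):

```python
import numpy as np

# Toy sizes for the sketch; real models are far larger (e.g. ~128k x 4096).
vocab_size, hidden_dim = 1_000, 64
rng = np.random.default_rng(0)

# One learned vector per token ID; in a real model these weights come from training.
embedding_table = rng.standard_normal((vocab_size, hidden_dim)).astype(np.float32)

token_ids = [0, 1, 2, 3, 4, 5]                 # output of the tokenization step
embeddings = embedding_table[token_ids]        # a plain row lookup per token ID

print(embeddings.shape)                        # (6, 64)
```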

3. Transformer Layers (stacked N times)

Each layer does:

  1. Self-Attention – tokens look at each other, producing an attention matrix.
  2. Weighted Sum of Values – information is mixed according to attention scores.
  3. Feed-Forward Transformation – nonlinear layers refine the representation.

-> After N layers, your prompt becomes a dense contextual representation of “the meaning so far.”
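
A stripped-down numpy sketch of one layer’s core math (single head, queries/keys/values all taken as the input itself for brevity; a real layer also has learned projection matrices, residual connections and normalisation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_layer(x, w1, w2):
    # 1. Self-attention: every token scores every other token.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # attention matrix (seq_len x seq_len)
    attn = softmax(scores, axis=-1)
    # 2. Weighted sum of values: mix token representations by attention weight.
    mixed = attn @ x
    # 3. Feed-forward transformation: a small nonlinear MLP applied per token.
    hidden = np.maximum(0, mixed @ w1)     # ReLU here for simplicity (Llama uses SwiGLU)
    return hidden @ w2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 64, 256
x = rng.standard_normal((seq_len, d_model))
w1 = rng.standard_normal((d_model, d_ff)) * 0.02
w2 = rng.standard_normal((d_ff, d_model)) * 0.02

out = x
for _ in range(4):                         # "stacked N times"
    out = transformer_layer(out, w1, w2)
print(out.shape)                           # (6, 64) -- same shape, richer representation
```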

4. Output Layer (Logits)

  • The final hidden state is passed through a large matrix (hidden_dim × vocab_size).
  • This produces one score (logit) per vocabulary entry, and a softmax turns those scores into a probability distribution.
  • Example (see the sketch below):
    • "critical" = 0.27
    • "essential" = 0.18
    • "important" = 0.12

5. Sampling the Next Token

  • The model chooses the next token from the distribution (see the sketch below):
    • Greedy = always pick the highest-probability token
    • Top-k / Top-p = sample from a truncated set of likely tokens, adding controlled randomness
  • Example choice: "critical".
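
Roughly how those strategies differ, sketched over a toy distribution (the probabilities below are made up to match the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["critical", "essential", "important", "vital", "key"])
probs = np.array([0.27, 0.18, 0.12, 0.22, 0.21])     # toy distribution, sums to 1.0

# Greedy: always take the single most likely token (deterministic).
greedy = vocab[np.argmax(probs)]

# Top-k: keep only the k most likely tokens, renormalise, then sample.
def top_k_sample(probs, k):
    idx = np.argsort(probs)[-k:]                     # indices of the k largest probabilities
    p = probs[idx] / probs[idx].sum()
    return idx[rng.choice(len(idx), p=p)]

# Top-p (nucleus): keep the smallest set of tokens whose probability mass reaches p.
def top_p_sample(probs, p):
    order = np.argsort(probs)[::-1]                  # most likely first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    q = probs[keep] / probs[keep].sum()
    return keep[rng.choice(len(keep), p=q)]

print("greedy:", greedy)                             # 'critical'
print("top-k :", vocab[top_k_sample(probs, k=3)])
print("top-p :", vocab[top_p_sample(probs, p=0.9)])
```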

6. KV Caching (Efficiency Trick)

  • Keys and Values from attention are cached.
  • Each new token only attends against the cached keys and values, rather than recomputing them for every earlier token.
  • Prevents re-computing the entire sequence every time.
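
A rough sketch of the idea, using plain Python lists as the cache (real implementations keep a cache per layer and per attention head):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Grows by one entry per generated token instead of recomputing everything."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, query, key, value):
        # Store this token's key/value once; reuse all earlier ones for free.
        self.keys.append(key)
        self.values.append(value)
        K = np.stack(self.keys)                      # (tokens_so_far, dim)
        V = np.stack(self.values)
        scores = K @ query / np.sqrt(len(query))     # one score per cached token
        return softmax(scores) @ V                   # attention output for the new token

rng = np.random.default_rng(0)
cache = KVCache()
for step in range(4):                                # pretend we generate 4 tokens
    q = k = v = rng.standard_normal(64)              # stand-ins for projected vectors
    out = cache.attend(q, k, v)
print(len(cache.keys), out.shape)                    # 4 (64,)
```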

7. Loop Until Done

  • Append chosen token to the sequence.
  • Repeat Steps 2–6 until a stop token (e.g., <eos>) or the maximum length is reached.
  • Text streams out token by token.
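
Putting the whole loop together in sketch form, where model_step is a made-up stand-in for Steps 2–4 above (it just returns a random distribution, so the "text" is nonsense, but the control flow is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["critical", "essential", "important", "for", "business", ".", "<eos>"]
EOS_ID, MAX_TOKENS = vocab.index("<eos>"), 20

def model_step(token_ids):
    """Stand-in for embedding lookup + transformer layers + output projection.

    Returns a probability distribution over the vocabulary for the next token.
    """
    logits = rng.standard_normal(len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

generated = [0]                                     # start from some prompt tokens
while len(generated) < MAX_TOKENS:
    probs = model_step(generated)                   # Steps 2-4: forward pass
    next_id = int(rng.choice(len(vocab), p=probs))  # Step 5: sample the next token
    generated.append(next_id)                       # Step 7: append and loop
    print(vocab[next_id], end=" ", flush=True)      # text streams out token by token
    if next_id == EOS_ID:                           # stop token ends generation
        break
print()
```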