The 5 AI Terms That Actually Matter — A Deep Dive


Most AI glossaries give you a one-liner and move on. That's not what this is. If you're building with AI, investing in it, or just trying to cut through the noise, you need to understand how these concepts work under the hood — because that's where the real leverage is.


1. Tokens: The Atomic Unit of AI Thought

When you type "I'm building a fintech app," the AI doesn't read that sentence the way you do. It breaks it down into tokens — chunks that are roughly words or pieces of words. "fintech" might become two tokens: fin and tech. A common punctuation mark, like a comma, is its own token. The word "the" is one token. A long or unusual word might get split into three or four.
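To make the splitting concrete, here is a toy greedy tokenizer over a tiny hand-picked vocabulary. It is only a sketch: real tokenizers such as BPE learn their vocabularies from data, and the vocabulary below is invented purely for illustration.

```python
# Toy greedy longest-match tokenizer over a tiny hand-made vocabulary.
# Real tokenizers (BPE, WordPiece) learn their vocabularies from data;
# this only illustrates how words split into subword pieces.
VOCAB = {"fin", "tech", "I", "'m", "building", "a", "app", ",", " "}

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("fintech"))  # ['fin', 'tech']
```

The same greedy idea, scaled up to a learned vocabulary of tens of thousands of entries, is roughly what happens to every prompt you send.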

Why does this matter?

Cost. Every API call to models like GPT-4 or Claude is priced per token, both input and output. If you're building a product that processes thousands of user requests per day, the difference between a 500-token prompt and a 2,000-token prompt is a 4x multiplier on your inference bill. Prompt engineering is, at its core, token economics.
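The arithmetic is worth doing explicitly. A back-of-envelope sketch (the per-token price is an invented placeholder, not any provider's actual rate):

```python
# Back-of-envelope inference cost. The per-token price below is an
# ILLUSTRATIVE placeholder, not any provider's real pricing.
PRICE_PER_INPUT_TOKEN = 0.00001  # assumed: $10 per 1M input tokens

def daily_input_cost(prompt_tokens: int, requests_per_day: int) -> float:
    return prompt_tokens * requests_per_day * PRICE_PER_INPUT_TOKEN

lean = daily_input_cost(500, 10_000)     # 500-token prompt
heavy = daily_input_cost(2_000, 10_000)  # 2,000-token prompt
```

At 10,000 requests a day, the 2,000-token prompt costs four times as much as the 500-token one: the multiplier described above, compounding daily.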

Speed. Models generate tokens sequentially. A 100-token response streams back noticeably faster than a 1,000-token response. This directly affects user experience — especially in real-time applications like chat interfaces or trading bots.

Language bias. Tokenizers are trained predominantly on English text. This means languages like Mandarin, Swahili, or Hindi often require more tokens to express the same idea. A sentence that costs 10 tokens in English might cost 25 in another language. If you're building for multilingual markets, this is a hidden cost multiplier most people overlook.

The practical takeaway: tokens are the currency of AI. Every decision — from prompt design to model selection to product pricing — flows downstream from how efficiently you use them.


2. Context Window: The AI's Working Memory

The context window is the total number of tokens a model can "see" at once — your input, any system instructions, conversation history, and the model's own output, all combined. Think of it as RAM, not a hard drive. It's volatile, finite, and when it's full, something has to go.
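One practical consequence: a chat application has to budget that finite window across the system prompt, the conversation history, and the new message. A minimal sketch that drops the oldest turns first when the budget would overflow (whitespace word counts stand in for a real tokenizer):

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: a real system counts with the model's own tokenizer.
    return len(text.split())

def fit_context(system: str, history: list[str], user_msg: str,
                budget: int) -> list[str]:
    used = count_tokens(system) + count_tokens(user_msg)
    kept: list[str] = []
    # Walk history newest-first, keeping turns while they still fit.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    return [system] + kept + [user_msg]

msgs = fit_context("You are helpful.",
                   ["turn one here", "turn two here"],
                   "new question", budget=10)
```

With a 10-token budget, the oldest turn is silently evicted. This is exactly the "something has to go" moment, and deciding *what* goes is a product decision, not a detail.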

The original GPT-3 had a context window of 2,048 tokens (roughly 1,500 words). Claude's latest models offer 200,000 tokens. Google's Gemini models push even further, into the millions. The trend line is clear: context windows are expanding fast.

But bigger isn't automatically better. Here's why:

The "lost in the middle" problem. Research has shown that models pay the most attention to information at the beginning and end of the context window. Content buried in the middle gets degraded recall. If you dump a 150-page document into a 200K context window and ask a question about page 73, the model may miss it entirely — not because it can't see it, but because attention is unevenly distributed.

Cost scales linearly (or worse). A 200K-token input costs roughly 50x more than a 4K-token input. For production systems handling thousands of requests, this matters enormously.

Latency increases with context size. More tokens means more computation, which means slower time-to-first-token. In real-time applications, this is a direct UX penalty.

The smart approach isn't to maximise context usage — it's to be surgical about what goes in. This is where techniques like summarisation, chunking, and retrieval (more on that in #5) become essential. The context window is expensive real estate. Treat it like you're paying rent per token, because you are.


3. Temperature: The Creativity Dial

Temperature is a parameter (typically between 0.0 and 2.0, though usable range is usually 0.0–1.0) that controls the randomness of the model's output. It operates on the probability distribution the model generates for each next token.

At temperature 0, the model always picks the highest-probability token. The output is deterministic — run the same prompt twice, get (nearly) the same result. This is what you want for code generation, data extraction, structured outputs, and anything where consistency matters.

At temperature 1.0, the model samples across the full probability distribution. Lower-probability tokens get a real chance of being selected. The output becomes more varied, surprising, and creative — but also less reliable.

What's actually happening mathematically: the model produces a set of logits (raw scores) for every possible next token. Temperature divides those logits before applying the softmax function. Low temperature sharpens the distribution — the top candidate dominates. High temperature flattens it — more candidates compete.
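That logits-over-temperature step can be written out directly. A small sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    # Divide the logits by the temperature before applying softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # raw scores, invented
cold = softmax_with_temperature(logits, 0.2)  # sharpened distribution
hot = softmax_with_temperature(logits, 1.5)   # flattened distribution
```

At temperature 0.2 the top candidate takes almost all of the probability mass; at 1.5 the same three candidates compete far more evenly.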

Practical guidance:

  • Factual Q&A, code, structured data: Temperature 0–0.3. You want precision.
  • Marketing copy, brainstorming, creative writing: Temperature 0.7–1.0. You want variety.
  • Above 1.0: Output starts to degrade. Sentences lose coherence. Rarely useful in production.

A common mistake is leaving temperature at the default (usually 1.0) for every use case. If your chatbot occasionally gives unhinged responses, the temperature setting is the first thing to check.

Related parameters worth knowing: top_p (nucleus sampling) limits the pool of tokens the model can pick from, and top_k limits it to the k most likely tokens. These work alongside temperature to give you fine-grained control over output behavior.
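Both filters are easy to sketch over a toy distribution (the token probabilities below are invented):

```python
# Sketches of top_k and top_p (nucleus) filtering applied to a
# probability distribution over candidate tokens, before sampling.
def top_k(probs: dict[str, float], k: int) -> dict[str, float]:
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1],
                       reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def top_p(probs: dict[str, float], p: float) -> dict[str, float]:
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1],
                            reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers mass p
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
k_filtered = top_k(probs, 2)    # only "the" and "a" survive
p_filtered = top_p(probs, 0.9)  # "the", "a", "cat" cover the 0.9 nucleus
```

Note how top_p adapts to the shape of the distribution: when the model is confident, the nucleus is small; when it is uncertain, more candidates stay in play.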


4. Hallucination: The Fundamental Reliability Problem

Hallucination is when an AI model generates information that is incorrect, fabricated, or entirely made up — while presenting it with full confidence. It doesn't hedge. It doesn't say "I'm not sure." It states fiction as fact.

This isn't a bug that can be patched. It's a structural feature of how these models work. A language model is fundamentally a next-token predictor. It doesn't "know" anything — it generates statistically plausible sequences of tokens based on patterns in its training data. When the model encounters a question it doesn't have a strong training signal for, it doesn't say "I don't know." It generates the most plausible-sounding completion, which may be completely wrong.

Common hallucination patterns:

  • Fabricated citations. Ask a model for academic sources, and it will generate papers that don't exist, complete with plausible-sounding author names, journal titles, and DOIs.
  • Confident numerical errors. The model will produce statistics, dates, and figures that are close to right but not actually correct.
  • Entity confusion. Mixing up attributes between similar entities — attributing one company's revenue to another, or merging biographical facts of two different people.
  • Plausible but fictional events. Generating news stories, court cases, or historical events that never happened.

Why this matters for builders:

If you're shipping a product where users trust the AI's output — financial advice, medical information, legal research, educational content — hallucination is an existential risk. A single fabricated claim that a user acts on can destroy trust, trigger liability, or cause real harm.

Mitigation strategies:

  • Grounding with external data (see RAG below) is the strongest defense.
  • Structured output formats (JSON, specific templates) reduce free-form generation where hallucination thrives.
  • Multi-step verification — using a second model call to fact-check the first — adds cost but catches errors.
  • Confidence calibration — training or prompting models to express uncertainty — is an active research area but isn't fully solved.
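One of these mitigations, structured output, can be enforced mechanically: parse the response and reject anything that doesn't match the fields you expect. A minimal sketch (the field names are invented for illustration):

```python
import json
from typing import Optional

# Expected shape of the model's structured output.
# These field names are invented for illustration.
REQUIRED_FIELDS = {"claim": str, "source": str, "confidence": float}

def validate_response(raw: str) -> Optional[dict]:
    # Accept the response only if it is valid JSON with the expected
    # fields and types; otherwise return None so the caller can retry.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), ftype):
            return None
    return obj

good = validate_response(
    '{"claim": "X", "source": "doc 3", "confidence": 0.8}')
bad = validate_response("As an AI, I believe the answer is...")
```

Validation doesn't prove the content is true, but it narrows free-form generation into a shape you can audit, log, and retry.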

The honest take: hallucination is the single biggest barrier to AI deployment in high-stakes domains. Every serious AI product needs a hallucination mitigation strategy, not as an afterthought, but as a core architectural decision.


5. RAG (Retrieval-Augmented Generation): Grounding AI in Reality

RAG is the most practically important architectural pattern in applied AI right now. The idea is simple: instead of relying solely on what the model learned during training, you retrieve relevant information from external sources at query time and inject it into the prompt. The model then generates its response grounded in that retrieved context.

The basic RAG pipeline:

  1. User asks a question.
  2. The system converts the question into a vector embedding — a numerical representation of its meaning.
  3. That embedding is compared against a vector database containing pre-processed chunks of your documents, knowledge base, or data.
  4. The most semantically similar chunks are retrieved (typically 3–10 chunks).
  5. Those chunks are inserted into the model's prompt alongside the user's question.
  6. The model generates a response grounded in the retrieved context.
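The retrieval steps (2–4 above) can be sketched end to end. Here bag-of-words cosine similarity stands in for a learned embedding model, and the chunks are invented:

```python
import math
from collections import Counter

# Toy retrieval: bag-of-words counts stand in for learned vector
# embeddings, and the "vector database" is just a Python list.
chunks = [
    "Refunds are processed within 14 days of the request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)),
                    reverse=True)
    return ranked[:k]

# Steps 5-6: the retrieved chunk is injected into the prompt.
question = "How long do refunds take?"
context = retrieve(question)[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
```

A real system would use a trained embedding model and a vector database, but the shape of the pipeline is the same: embed, rank, retrieve, inject.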

Why RAG dominates over alternatives:

  • Fine-tuning bakes knowledge into model weights. It's expensive, slow to update, and the model can still hallucinate about the fine-tuned content. RAG keeps knowledge external and updatable in real time.
  • Long context windows let you stuff more data in, but they're expensive per query and suffer from the attention degradation problem mentioned earlier. RAG is surgical — it retrieves only what's relevant.
  • RAG provides citations. Because you know which chunks were retrieved, you can show users exactly where the answer came from. This is critical for trust and verifiability.

Where RAG gets hard:

  • Chunking strategy. How you split your documents matters enormously. Too small and you lose context; too large and you waste tokens on irrelevant content. Overlapping chunks, semantic chunking, and hierarchical approaches all have tradeoffs.
  • Retrieval quality. If the retriever fetches the wrong chunks, the model generates a confident answer grounded in irrelevant information. Garbage in, garbage out — but with better grammar.
  • Embedding model selection. The quality of your vector embeddings determines retrieval accuracy. Different embedding models perform differently across domains.
  • Hybrid search. Pure vector similarity search misses exact keyword matches. Pure keyword search misses semantic similarity. Production RAG systems increasingly combine both.
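Fixed-size chunking with overlap, the simplest of the strategies above, looks like this (word-based for simplicity; production pipelines usually chunk by tokens, sentences, or document structure):

```python
# Word-based fixed-size chunking with overlap. Each chunk repeats the
# last `overlap` words of the previous one so that sentences straddling
# a boundary still appear whole in at least one chunk.
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        if piece:
            chunks.append(" ".join(piece))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

parts = chunk("word " * 500, size=200, overlap=50)
print(len(parts))  # 3 overlapping chunks cover all 500 words
```

The overlap is the tradeoff in miniature: it buys boundary safety at the cost of indexing (and later retrieving) some tokens twice.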

The evolution beyond basic RAG:

The field is moving fast. Advanced patterns include:

  • Re-ranking retrieved results with a cross-encoder before feeding them to the model.
  • Agentic RAG, where the model decides what to retrieve and when.
  • Graph RAG, which traverses knowledge graphs instead of flat document stores.
  • Corrective RAG, which detects when retrieval fails and falls back to the model's parametric knowledge.

If you're building any AI product that needs to work with proprietary data, domain-specific knowledge, or information that changes over time, RAG isn't optional. It's the foundation.


The Bigger Picture

These five concepts aren't isolated — they form an interconnected system. Your token budget determines what fits in your context window. Your context window constraints drive the need for RAG. RAG is your primary defense against hallucination. And temperature controls how deterministically the model uses everything you've fed it.

Understanding these mechanics doesn't just make you more literate about AI. It makes you dangerous — in the best sense. You can evaluate tools, architect systems, spot bullshit vendor claims, and make build-vs-buy decisions from a position of actual understanding rather than hype.

That's worth more than being ahead of 90% of people. That's being in the 1% who can actually ship.

Madra David

Published April 09, 2026