Core AI Concepts

📖 15 min read

How Large Language Models Work

Large Language Models (LLMs) are neural networks trained on massive text datasets to understand and generate human-like language. Understanding their architecture helps explain both their capabilities and limitations.

The Transformer Architecture

Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. This architecture revolutionized natural language processing.

Why transformers changed everything:

  • Parallelization: Unlike earlier architectures (RNNs), transformers process entire sequences simultaneously, enabling massive GPU acceleration
  • Long-range dependencies: Captures relationships between distant words effectively
  • Scalability: Performance improves predictably as data, model size, and compute increase
  • Transfer learning: Pre-trained models can be adapted for specific tasks

Core Components

Component Function
Self-attention layers Allow each position to attend to all other positions
Feed-forward networks Process attention outputs
Positional encoding Provides sequence order information
Multi-head attention Multiple attention mechanisms working in parallel
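
To make the self-attention row above concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The shapes and random inputs are purely illustrative; real models add learned projections, masking, and multiple heads.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Every position attends to every other position, weighted by similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)     # each row is a probability distribution
    return weights @ V                     # weighted mix of value vectors

# Toy example: 4 token positions, 8-dimensional vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)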

How Generation Works

LLMs generate text one token at a time, predicting the most likely next token based on all previous tokens. The model doesn't "think" or "understand" in a human sense; it calculates probability distributions over possible continuations.

Input: "The capital of France is"
Model predicts: "Paris" (highest probability)

This autoregressive generation means the model can produce fluent text but can also confidently generate incorrect information.
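
A toy sketch of that loop, assuming a stand-in next_token_probs function in place of a real model's forward pass (the vocabulary and scores are invented for illustration):

import numpy as np

VOCAB = ["Paris", "London", "Berlin", "<eos>"]

def next_token_probs(context):
    """Stand-in for a real model: returns a probability distribution over
    the vocabulary, conditioned on everything generated so far."""
    if context[-1] == "Paris":
        logits = np.array([0.1, 0.1, 0.1, 4.0])   # after "Paris", end the sequence
    else:
        logits = np.array([3.0, 1.0, 0.5, 0.2])   # "Paris" is the most likely continuation
    e = np.exp(logits - logits.max())
    return e / e.sum()

context = ["The", "capital", "of", "France", "is"]
while True:
    token = VOCAB[int(np.argmax(next_token_probs(context)))]  # greedy decoding
    if token == "<eos>":
        break
    context.append(token)

print(" ".join(context))   # The capital of France is Paris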


Tokens and Tokenization

Tokens are the fundamental units LLMs use to process text. Understanding tokenization helps explain model behavior, costs, and limitations.

What Are Tokens?

Tokens can be words, parts of words, or individual characters, depending on the tokenization method. Most modern LLMs use subword tokenization like BPE (Byte Pair Encoding).

Examples:

"Hello world" β†’ ["Hello", " world"] (2 tokens)
"tokenization" β†’ ["token", "ization"] (2 tokens)
"πŸŽ‰" β†’ ["πŸŽ‰"] (1 token, but may vary by model)

Why Tokenization Matters

Aspect Impact
Cost API pricing is per token; understanding token count helps predict costs
Context limits Context windows are measured in tokens, not words or characters
Performance Rare words may tokenize into many pieces, affecting model behavior
Languages Non-English text often requires more tokens for the same content

Token Estimates

Content Approximate Ratio
English text ~1.3 tokens per word (1 token ≈ 0.75 words)
Code Varies widely by language and style; symbols and identifiers often split into multiple tokens
Non-Latin scripts 2-4x more tokens than English equivalent

Practical Implications

  • Long prompts consume more of your context window
  • Unusual words or technical terms may tokenize inefficiently
  • Token counts for the same meaning vary across models

Context Windows

The context window is the maximum amount of text (in tokens) a model can consider at once. Everything the model knows about your conversation must fit within this window.

How Context Windows Work

┌───────────────────────────────────────────────┐
│                Context Window                 │
│  ┌─────────────────────────────────────────┐  │
│  │ System prompt + conversation history    │  │
│  │ + current input + space for output      │  │
│  └─────────────────────────────────────────┘  │
└───────────────────────────────────────────────┘

Everything must fit: system instructions, previous messages, any context you provide, your current input, AND space for the model's response.

Context Window Sizes (2025)

Model Context Window
GPT-4 Turbo 128K tokens
Claude 3 200K tokens
Gemini 1.5 Pro 1M+ tokens
Llama 3 8K-128K tokens

Larger isn't always better; cost increases with context size, and some models perform worse on very long contexts.

Managing Context

Strategy When to Use
Summarization Compress old conversation history
Retrieval (RAG) Pull in only relevant context dynamically
Truncation Remove oldest messages when limit approached
Chunking Process long documents in pieces
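
A minimal sketch of the truncation strategy from the table, assuming a hypothetical count_tokens helper (in practice you would count with the model's tokenizer) and an illustrative token budget:

def count_tokens(message: dict) -> int:
    """Placeholder: in practice, count with the model's tokenizer (e.g. tiktoken)."""
    return len(message["content"].split())

def fit_to_budget(system: dict, history: list, budget: int) -> list:
    """Keep the system prompt, then keep the newest messages that still fit."""
    kept = []
    used = count_tokens(system)
    for message in reversed(history):          # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                              # everything older is dropped
        kept.append(message)
        used += cost
    return [system] + list(reversed(kept))     # restore chronological order

system = {"role": "system", "content": "You are a helpful assistant."}
history = [
    {"role": "user", "content": "First question ..."},
    {"role": "assistant", "content": "First answer ..."},
    {"role": "user", "content": "Latest question ..."},
]
print(fit_to_budget(system, history, budget=4000))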

The "Lost in the Middle" Problem

Research shows models pay more attention to the beginning and end of context windows. Information in the middle may be partially ignored. Place critical information at the start or end of your prompts.


Key Parameters

Understanding model parameters helps you tune outputs for specific use cases.

Temperature

Controls randomness in output generation. Lower values make output more deterministic; higher values make it more creative and varied.

Temperature Effect Use Case
0 Deterministic (same input → same output) Factual tasks, code generation
0.3-0.5 Mostly consistent with slight variation General tasks
0.7-1.0 Creative, varied outputs Creative writing, brainstorming
>1.0 Highly random, may become incoherent Experimental only

Top-p (Nucleus Sampling)

An alternative to temperature. The model samples only from the smallest set of most-likely tokens whose cumulative probability reaches the threshold.

  • Top-p = 0.9: Consider tokens until 90% probability mass is covered
  • Lower values = more focused, higher values = more diverse

Most practitioners use either temperature OR top-p, not both.
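
To see how these two knobs reshape the next-token distribution, here is a toy sampler over invented logits (real inference engines implement this far more efficiently, but the idea is the same):

import numpy as np

def sample(logits, temperature=1.0, top_p=1.0):
    """Temperature rescales the distribution; top-p keeps only the smallest
    set of most-likely tokens whose cumulative probability reaches top_p."""
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                 # most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    nucleus = order[:cutoff]                        # the candidate set

    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

logits = [2.0, 1.5, 0.3, -1.0]                      # invented scores for 4 tokens
print(sample(logits, temperature=0.1))              # almost always token 0
print(sample(logits, temperature=1.0, top_p=0.9))   # token 3 falls outside the nucleus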

Max Tokens

Limits the response length. Useful for controlling costs and preventing runaway responses.

  • Set based on expected response length
  • Too low may cut off responses mid-thought
  • Too high leaves room for longer, costlier responses than needed

System Prompt

Background instructions that shape the model's behavior throughout the conversation. Sets persona, constraints, and behavioral guidelines.

System: You are a helpful coding assistant. Always explain your reasoning.
        Use Python for examples unless asked otherwise. Be concise.
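
Putting the pieces together, a sketch using the OpenAI Python SDK (the model name is illustrative; other providers expose equivalent parameters under similar names):

from openai import OpenAI   # pip install openai

client = OpenAI()           # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",    # illustrative model name
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant. "
                                      "Use Python for examples. Be concise."},
        {"role": "user", "content": "Explain list comprehensions."},
    ],
    temperature=0.3,        # mostly deterministic output
    max_tokens=400,         # cap the response length
)
print(response.choices[0].message.content)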

Other Parameters

Parameter Purpose
Frequency penalty Penalizes tokens in proportion to how often they have already appeared, reducing verbatim repetition
Presence penalty Penalizes any token that has already appeared, encouraging new topics
Stop sequences Tokens that signal the model to stop generating

Embeddings

Embeddings are dense vector representations that capture semantic meaning. They're fundamental to many AI applications, especially retrieval and similarity search.

What Are Embeddings?

The name comes from the mathematical concept of embedding one space into another. An embedding model takes text, which lives in the messy, ambiguous space of human language, and maps it into a structured geometric space where meaning becomes measurable. The text is literally embedded into a vector space, and its position in that space encodes what it means.

The result is a fixed-size vector (array of numbers), typically 384-1536 dimensions. Similar concepts end up with similar vectors, enabling semantic comparison.

"How do I reset my password?"  β†’  [0.12, -0.45, 0.78, ...]
"I forgot my login credentials" β†’  [0.11, -0.43, 0.76, ...]
                                    (similar vectors!)

"The weather is nice today"     β†’  [0.89, 0.23, -0.15, ...]
                                    (different vector)

How Similarity Works

Vector similarity (usually cosine similarity) measures how close two embeddings are:

  • 1.0: Identical meaning
  • 0.0: Unrelated
  • -1.0: Opposite meaning (rare in practice)
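
A sketch using the open-source sentence-transformers library (the model name is a common choice, but any embedding model works, as long as the same one produced every vector you compare):

import numpy as np
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = SentenceTransformer("all-MiniLM-L6-v2")          # produces 384-dimensional vectors
vectors = model.encode([
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is nice today",
])

print(cosine_similarity(vectors[0], vectors[1]))   # high: similar meaning
print(cosine_similarity(vectors[0], vectors[2]))   # much lower: unrelated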

Embedding Use Cases

Use Case How Embeddings Help
Semantic search Find documents by meaning, not keywords
RAG retrieval Match queries to relevant knowledge chunks
Clustering Group similar content together
Classification Categorize text by comparing to examples
Deduplication Find near-duplicate content

Embedding Models

Model Dimensions Notes
OpenAI text-embedding-3-small 1536 High quality, easy API
OpenAI text-embedding-3-large 3072 Highest quality
Sentence Transformers 384-768 Open source, runs locally
Cohere embed-v3 1024 Strong multilingual

Key principle: Always use the same embedding model for indexing and querying. Vectors from different models are incompatible.

Storing and Searching Embeddings

An embedding is just a computed representation. Once generated, it's an array of numbers that can live anywhere: held in memory for a real-time comparison, written to a flat file for batch processing, or stored in a database column. A small application comparing a handful of documents might keep vectors in a simple list and compute cosine similarity directly. There's no requirement to use specialized infrastructure at small scale.

The storage question becomes interesting when the number of vectors grows. A real application might generate millions of embeddings across a document corpus, product catalog, or conversation history. At that scale, brute-force comparison (checking every stored vector against the query) becomes impractical, and standard databases aren't optimized for high-dimensional similarity search.

Vector databases solve this scaling problem. They store embeddings alongside metadata and provide fast similarity search using approximate nearest neighbor (ANN) algorithms. Instead of comparing against every stored vector, ANN algorithms build index structures that narrow the search space dramatically, trading a small amount of accuracy for orders-of-magnitude speed improvement.

Common options and where each fits best:

  • Pinecone (managed cloud service): teams that want zero infrastructure overhead
  • Weaviate (open source, self-hosted or cloud): flexible deployment with built-in hybrid search
  • Qdrant (open source, self-hosted or cloud): high-performance filtering alongside vector search
  • Milvus (open source): massive scale (billion+ vectors)
  • pgvector (PostgreSQL extension): teams already running Postgres who want to avoid a new database
  • ChromaDB (open source, lightweight): prototyping and small-scale applications

How the search works conceptually: when a user submits a query, it gets converted to a vector using the same embedding model that indexed the documents. The vector database then finds the stored vectors closest to this query vector, typically measured by cosine similarity or dot product, and returns the most semantically relevant results.

Metadata filtering adds precision beyond pure vector similarity. Most vector databases let you store metadata (source, date, category, access level) alongside each vector and filter on it during search. A query like "find documents similar to this question, but only from the engineering team's knowledge base created in the last 6 months" combines semantic search with structured filtering.
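
As a concrete sketch, here is how that combination might look with ChromaDB, the lightweight option from the list above (the collection name, documents, and metadata fields are invented for illustration):

import chromadb   # pip install chromadb

client = chromadb.Client()                        # in-memory instance, fine for prototyping
collection = client.create_collection(name="kb")  # uses Chroma's default embedding model

collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "How to rotate API keys for the payments service",
        "Quarterly sales summary for the EMEA region",
        "Runbook: restarting the ingestion pipeline",
    ],
    metadatas=[
        {"team": "engineering"},
        {"team": "sales"},
        {"team": "engineering"},
    ],
)

# Semantic search restricted to the engineering team's documents
results = collection.query(
    query_texts=["service credentials need to be replaced"],
    n_results=2,
    where={"team": "engineering"},
)
print(results["documents"])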

For production patterns around chunking, retrieval strategies, and building full pipelines with vector databases, see the RAG guide.


Model Capabilities and Limitations

Understanding what models can and cannot do helps set appropriate expectations.

What LLMs Do Well

Capability Why
Language fluency Trained on massive text corpora
Pattern recognition Statistical patterns in training data
Following instructions RLHF training on instruction-following
Code generation Extensive code in training data
Summarization Compressing information while preserving meaning
Translation Multilingual training data

What LLMs Struggle With

Limitation Why
Factual accuracy Generate plausible-sounding text, not verified facts
Current events Knowledge frozen at training cutoff
Math and logic Predict tokens, don't compute
Counting and precise tasks Tokenization obscures character/word boundaries
Consistent persona May drift across long conversations
Saying "I don't know" Trained to be helpful, may fabricate

Hallucination

Hallucination occurs when models generate information that appears plausible but is false. This happens because:

  • Models predict likely text, not verified facts
  • Training data contains errors
  • Models aim to be helpful, even when uncertain

Mitigation strategies:

  • Ask for sources (models may still fabricate them)
  • Use RAG to ground responses in real documents
  • Verify critical information independently
  • Lower temperature for factual tasks

Knowledge Cutoff

Models only know information from their training data. They have no awareness of events after their training cutoff date.

Implications:

  • Can't answer about recent events
  • May have outdated information about evolving topics
  • Use RAG or web search for current information

Skills vs Tools

When building on top of LLMs, every capability you expose falls into one of two categories: a skill that the model performs natively through prompting, or a tool that the model invokes to interact with an external system. The distinction matters because it drives architecture decisions, cost profiles, latency characteristics, and failure modes.

What Are Skills?

Skills are capabilities the model already has. They come from training data, and you access them entirely through prompt design. No external integration, no API calls, no infrastructure beyond the model itself.

Skill What the Model Does Natively
Summarization Compress long text into key points
Translation Convert between languages
Classification Sort inputs into categories
Extraction Pull structured data from unstructured text
Reasoning Draw conclusions, compare options, analyze tradeoffs
Code generation Write, explain, or refactor code
Creative writing Draft prose, marketing copy, or technical documentation
Reformatting Convert between formats like JSON, CSV, XML, or markdown

Skills require no external infrastructure, but they are not free. Every skill invocation runs through inference, which means every input token and output token costs money and takes time. Summarizing a 50-page document means sending the entire document through the model. Translating a large codebase means processing every file as tokens. For large inputs, skills can be the most expensive part of a pipeline because there is no shortcut around token consumption.

The other tradeoff is that skills are bounded by training data. A model can summarize a document you provide, but it cannot look up a document it hasn't seen. It can reason about data in context, but it cannot compute a precise financial projection across thousands of rows. It can generate code in languages it was trained on, but it cannot execute that code to verify it works.

What Are Tools?

Tools are external functions the model can call to extend beyond what it learned during training. When the model encounters a task that requires current data, precise computation, or interaction with the outside world, it generates a structured request to invoke a tool, receives the result, and incorporates that result into its response.

Common tool categories include:

  • Information retrieval: web search, database queries, file system access, API calls to external services
  • Computation: code execution, calculators, data analysis engines
  • State modification: creating files, sending messages, updating records, deploying code
  • Verification: running tests, checking URLs, validating schemas

Tools require infrastructure. Someone has to define the tool's interface, host the execution environment, handle authentication, and manage failures. The model doesn't "use" the tool directly; it generates a request (typically a function name and arguments), the orchestration layer executes it, and the result flows back into the model's context for the next inference step.
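
A minimal sketch of that orchestration step, with invented tool names and a hand-written request standing in for what a model would actually generate:

import json

def get_stock_price(symbol: str) -> dict:
    """Hypothetical tool: in practice this would call a market-data API."""
    return {"symbol": symbol, "price": 123.45}

def add_numbers(a: float, b: float) -> dict:
    """Hypothetical tool for exact arithmetic."""
    return {"sum": a + b}

TOOLS = {"get_stock_price": get_stock_price, "add_numbers": add_numbers}

def execute_tool_call(tool_call: dict) -> str:
    """The model emits a structured request; the orchestration layer runs the
    matching function and returns the result as text for the next inference step."""
    func = TOOLS[tool_call["name"]]
    result = func(**json.loads(tool_call["arguments"]))
    return json.dumps(result)

# The shape of request a model might generate mid-conversation
request = {"name": "get_stock_price", "arguments": '{"symbol": "AAPL"}'}
print(execute_tool_call(request))   # result is fed back into the model's context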

For a deeper look at how tools work in agent architectures, see the AI Agents guide. For the standard protocol that connects models to tools, see the MCP guide.

When to Use Each

The choice between a skill and a tool depends on what the task actually requires. Some tasks are clearly one or the other, but many sit in a gray area where either approach could work.

Dimension Skill (Native) Tool (External)
Latency Scales with input/output token count Tool execution is often instant; round-trip adds overhead
Cost All processing burns tokens (can be expensive for large inputs) Tool execution itself is often free; results re-enter token stream
Accuracy Probabilistic, may hallucinate Deterministic for computation and data retrieval
Current data Limited to training cutoff Can access real-time information
Computation Approximate reasoning Precise execution
Side effects None (read-only by nature) Can modify state in external systems
Failure modes Hallucination, reasoning errors Network failures, auth errors, timeouts, malformed requests
Infrastructure None beyond the model Requires tool definitions, hosting, error handling

Use a skill when the task is pattern recognition, language transformation, or reasoning over context that's already in the prompt. Summarizing a meeting transcript, classifying support tickets, extracting entities from an email, or drafting a response based on provided guidelines are all skill-native tasks.

Use a tool when the task requires something the model cannot do from memory: fetching live data, performing exact arithmetic, executing code, modifying external state, or verifying facts against an authoritative source.

Tradeoffs in Practice

The tension between skills and tools plays out in real system design decisions.

Over-relying on skills leads to hallucination risk. A model asked to "look up the current price of AAPL stock" will generate a plausible-looking number from training data rather than admitting it doesn't know. Without a tool to fetch the actual price, the output looks confident but is wrong. Any task where accuracy depends on data the model hasn't seen requires a tool.

Over-relying on tools leads to unnecessary complexity. If a model has a web search tool available and a user asks "what is a binary search tree?", the model might invoke the search tool to answer a question it already knows well from training. The tool result then enters the context window, consuming additional tokens on the next inference call and introducing a failure point that didn't need to exist. When the model can handle a task accurately from training data, a tool call adds infrastructure burden without improving quality.

The gray area is where it gets interesting. Consider math: a model can reason through simple arithmetic and get it right most of the time, but it will occasionally make errors on multi-step calculations. A code execution tool will always get the math right and runs instantly, but requires infrastructure to define and host. The right choice depends on how much accuracy matters for the use case. A rough estimate in a brainstorming session favors the skill; a financial calculation in a production system demands the tool.

Decision Framework

When deciding whether a capability should be a skill or a tool, work through these questions:

Does the task require information the model hasn't seen? If the answer depends on data after the training cutoff, data in a private database, or real-time state, you need a tool. No amount of prompt engineering gives a model access to information that isn't in its context window.

Does the task require deterministic precision? Mathematical calculations, date arithmetic, regex matching, and data aggregation across large datasets all benefit from tools. Models approximate these operations through pattern matching and will occasionally produce wrong results, especially as complexity increases.

Does the task require action in the outside world? Sending emails, creating files, updating databases, and deploying code are all side effects that require tools. Skills are inherently read-only: they transform input into output but cannot change state beyond the conversation.

Is the model already good at this from training? Summarization, classification, translation, code generation, and text analysis are tasks where models are strong out of the box. Adding a tool for these capabilities typically adds cost and latency without improving quality. Invest in better prompts before reaching for a tool.

How much does an error cost? For low-stakes tasks like drafting an email or generating test ideas, skill-level accuracy is usually sufficient. For high-stakes tasks like calculating dosages, generating legal documents, or making financial decisions, tool-backed verification is worth the added complexity.


Model Types and Sizes

Different models serve different purposes. Understanding the landscape helps with selection.

Model Size Impacts

Size Typical Params Characteristics
Small 1-7B Fast, cheap, basic tasks
Medium 7-30B Good balance, most tasks
Large 30-70B Complex reasoning, nuanced tasks
Frontier 100B+ State-of-the-art capabilities

Larger models generally perform better but cost more and run slower.

Model Types

Type Examples Best For
General purpose GPT-4, Claude, Gemini Wide range of tasks
Code-focused Codex, StarCoder, DeepSeek Coder Programming tasks
Instruction-tuned ChatGPT, Claude Following directions
Base models Llama base Fine-tuning starting point

Open vs. Closed Models

Aspect Open Models Closed Models
Access Download and run anywhere API access only
Cost Infrastructure costs Per-token pricing
Privacy Data stays local Data sent to provider
Customization Fine-tuning possible Limited or none
Examples Llama, Mistral GPT-4, Claude

Quick Reference

Key Metrics to Know

Metric What It Means
Tokens Processing units; 1 token ≈ 0.75 English words (~1.3 tokens per word)
Context window Max input + output size
Temperature Randomness control (0 = deterministic)
Embedding dimensions Vector size for semantic representation

Common Token Estimates

Content Type Tokens
1 page of text ~500-800 tokens
1 paragraph ~100-150 tokens
Average email ~200-400 tokens
Code file (100 lines) Varies widely by language and line length; often around 1,000 tokens

Parameter Cheat Sheet

Task Temperature Other Settings
Code generation 0-0.3 Clear, consistent output
Factual Q&A 0-0.3 Accuracy matters
Creative writing 0.7-1.0 Variety desired
Brainstorming 0.8-1.0 Many ideas wanted
General chat 0.5-0.7 Balance
