Core AI Concepts
How Large Language Models Work
Large Language Models (LLMs) are neural networks trained on massive text datasets to understand and generate human-like language. Understanding their architecture helps explain both their capabilities and limitations.
The Transformer Architecture
Modern LLMs are built on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. This architecture revolutionized natural language processing.
Why transformers changed everything:
- Parallelization: Unlike earlier architectures (RNNs), transformers process entire sequences simultaneously, enabling massive GPU acceleration
- Long-range dependencies: Captures relationships between distant words effectively
- Scalability: Performance improves with more data and compute, with no apparent ceiling yet
- Transfer learning: Pre-trained models can be adapted for specific tasks
Core Components
| Component | Function |
|---|---|
| Self-attention layers | Allow each position to attend to all other positions |
| Feed-forward networks | Process attention outputs |
| Positional encoding | Provides sequence order information |
| Multi-head attention | Multiple attention mechanisms working in parallel |
How Generation Works
LLMs generate text one token at a time, predicting the most likely next token based on all previous tokens. The model doesn't "think" or "understand" in a human sense; it calculates probability distributions over possible continuations.
Input: "The capital of France is"
Model predicts: "Paris" (highest probability)
This autoregressive generation means the model can produce fluent text but can also confidently generate incorrect information.
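To make the loop concrete, here is a minimal sketch of autoregressive decoding. The probability table is invented purely for illustration (a real model computes it from billions of parameters), and greedy decoding is just one of several sampling strategies:

```python
# Toy stand-in for a real model: returns a probability distribution
# over a tiny vocabulary given the tokens generated so far.
def fake_next_token_probs(tokens):
    return {"Paris": 0.92, "Lyon": 0.05, "London": 0.03}  # made-up numbers

def generate(prompt_tokens, max_new_tokens=1):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = fake_next_token_probs(tokens)
        # Greedy decoding: append the single most likely next token,
        # then feed the extended sequence back in for the next step.
        next_token = max(probs, key=probs.get)
        tokens.append(next_token)
    return tokens

print(generate(["The", " capital", " of", " France", " is"]))
# ['The', ' capital', ' of', ' France', ' is', 'Paris']
```

The same loop that makes the output fluent also means an incorrect continuation, once generated, becomes part of the context the model conditions on next.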
Tokens and Tokenization
Tokens are the fundamental units LLMs use to process text. Understanding tokenization helps explain model behavior, costs, and limitations.
What Are Tokens?
Tokens can be words, parts of words, or individual characters, depending on the tokenization method. Most modern LLMs use subword tokenization like BPE (Byte Pair Encoding).
Examples:
"Hello world" β ["Hello", " world"] (2 tokens)
"tokenization" β ["token", "ization"] (2 tokens)
"π" β ["π"] (1 token, but may vary by model)
Why Tokenization Matters
| Aspect | Impact |
|---|---|
| Cost | API pricing is per token; understanding token count helps predict costs |
| Context limits | Context windows are measured in tokens, not words or characters |
| Performance | Rare words may tokenize into many pieces, affecting model behavior |
| Languages | Non-English text often requires more tokens for the same content |
Token Estimates
| Content | Approximate Ratio |
|---|---|
| English text | ~1.3 tokens per word (1 token ~ 0.75 words) |
| Code | Varies widely by language and line length; typically several tokens per line |
| Non-Latin scripts | 2-4x more tokens than English equivalent |
Practical Implications
- Long prompts consume more of your context window
- Unusual words or technical terms may tokenize inefficiently
- Token counts for the same meaning vary across models
Context Windows
The context window is the maximum amount of text (in tokens) a model can consider at once. Everything the model knows about your conversation must fit within this window.
How Context Windows Work
┌─────────────────────────────────────────────┐
│               Context Window                │
│  ┌───────────────────────────────────────┐  │
│  │ System prompt + conversation history  │  │
│  │ + current input + space for output    │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
Everything must fit: system instructions, previous messages, any context you provide, your current input, AND space for the model's response.
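A rough sketch of the budgeting arithmetic, assuming an 8,000-token window and a crude character-based token estimate (real code would count with the model's actual tokenizer):

```python
CONTEXT_WINDOW = 8_000        # assumed model limit, in tokens
RESERVED_FOR_OUTPUT = 1_000   # space deliberately left for the response

def count_tokens(text):
    # Crude stand-in: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_messages(system_prompt, history, user_input):
    """Drop the oldest history messages until everything fits the budget."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT
    fixed = count_tokens(system_prompt) + count_tokens(user_input)
    kept = list(history)
    while kept and fixed + sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # simplest strategy: discard the oldest message first
    return kept
```

Dropping old messages outright is the bluntest option; the strategies in the table below (summarization, retrieval, chunking) trade extra work for keeping more of the useful information.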
Context Window Sizes (2025)
| Model | Context Window |
|---|---|
| GPT-4 Turbo | 128K tokens |
| Claude 3 | 200K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
| Llama 3 | 8K-128K tokens |
Larger isn't always better; cost increases with context size, and some models perform worse on very long contexts.
Managing Context
| Strategy | When to Use |
|---|---|
| Summarization | Compress old conversation history |
| Retrieval (RAG) | Pull in only relevant context dynamically |
| Truncation | Remove oldest messages when limit approached |
| Chunking | Process long documents in pieces |
The "Lost in the Middle" Problem
Research shows models pay more attention to the beginning and end of context windows. Information in the middle may be partially ignored. Place critical information at the start or end of your prompts.
Key Parameters
Understanding model parameters helps you tune outputs for specific use cases.
Temperature
Controls randomness in output generation. Lower values make output more deterministic; higher values make it more creative and varied.
| Temperature | Effect | Use Case |
|---|---|---|
| 0 | Deterministic (same input → same output) | Factual tasks, code generation |
| 0.3-0.5 | Mostly consistent with slight variation | General tasks |
| 0.7-1.0 | Creative, varied outputs | Creative writing, brainstorming |
| >1.0 | Highly random, may become incoherent | Experimental only |
Top-p (Nucleus Sampling)
Alternative to temperature. Only considers tokens whose cumulative probability exceeds the threshold.
- Top-p = 0.9: Consider tokens until 90% probability mass is covered
- Lower values = more focused, higher values = more diverse
Most practitioners use either temperature OR top-p, not both.
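A toy sketch of how the two knobs reshape the same next-token distribution. The probabilities are invented for illustration; real implementations apply temperature to logits before the softmax, which is mathematically equivalent to the renormalization shown here:

```python
# Illustrative next-token probabilities from a model (already normalized).
probs = {"Paris": 0.70, "Lyon": 0.15, "Nice": 0.10, "banana": 0.05}

def apply_temperature(probs, temperature):
    """Low T sharpens the distribution; high T flattens it."""
    if temperature == 0:
        best = max(probs, key=probs.get)
        return {best: 1.0}                      # deterministic: argmax only
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

print(apply_temperature(probs, 0.3))   # mass concentrates on "Paris"
print(top_p_filter(probs, 0.9))        # "banana" is cut from consideration
```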
Max Tokens
Limits the response length. Useful for controlling costs and preventing runaway responses.
- Set based on expected response length
- Too low may cut off responses mid-thought
- Too high can reserve context-window space unnecessarily and allow overly long, costly responses
System Prompt
Background instructions that shape the model's behavior throughout the conversation. Sets persona, constraints, and behavioral guidelines.
System: You are a helpful coding assistant. Always explain your reasoning.
Use Python for examples unless asked otherwise. Be concise.
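Putting the pieces together, an API call with the OpenAI Python SDK might look like the sketch below; the model name is illustrative, and other providers expose equivalent parameters under slightly different names:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": (
            "You are a helpful coding assistant. Always explain your reasoning. "
            "Use Python for examples unless asked otherwise. Be concise."
        )},
        {"role": "user", "content": "How do I reverse a list in Python?"},
    ],
    temperature=0.2,   # low temperature: consistent, factual output
    max_tokens=300,    # cap response length
)

print(response.choices[0].message.content)
```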
Other Parameters
| Parameter | Purpose |
|---|---|
| Frequency penalty | Reduces repetition of tokens already used |
| Presence penalty | Reduces repetition of topics already discussed |
| Stop sequences | Tokens that signal the model to stop generating |
Embeddings
Embeddings are dense vector representations that capture semantic meaning. They're fundamental to many AI applications, especially retrieval and similarity search.
What Are Embeddings?
The name comes from the mathematical concept of embedding one space into another. An embedding model takes text, which lives in the messy, ambiguous space of human language, and maps it into a structured geometric space where meaning becomes measurable. The text is literally embedded into a vector space, and its position in that space encodes what it means.
The result is a fixed-size vector (array of numbers), typically 384-1536 dimensions. Similar concepts end up with similar vectors, enabling semantic comparison.
"How do I reset my password?" β [0.12, -0.45, 0.78, ...]
"I forgot my login credentials" β [0.11, -0.43, 0.76, ...]
(similar vectors!)
"The weather is nice today" β [0.89, 0.23, -0.15, ...]
(different vector)
How Similarity Works
Vector similarity (usually cosine similarity) measures how close two embeddings are:
- 1.0: Identical meaning
- 0.0: Unrelated
- -1.0: Opposite meaning (rare in practice)
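A small sketch using the open-source sentence-transformers library. The model name all-MiniLM-L6-v2 is one common choice (384 dimensions), and the cosine function below is the same measure described on the scale above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is nice today",
]
vectors = model.encode(sentences)  # numpy array of shape (3, 384)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))  # high: same underlying intent
print(cosine_similarity(vectors[0], vectors[2]))  # low: unrelated topics
```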
Embedding Use Cases
| Use Case | How Embeddings Help |
|---|---|
| Semantic search | Find documents by meaning, not keywords |
| RAG retrieval | Match queries to relevant knowledge chunks |
| Clustering | Group similar content together |
| Classification | Categorize text by comparing to examples |
| Deduplication | Find near-duplicate content |
Embedding Models
| Model | Dimensions | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | High quality, easy API |
| OpenAI text-embedding-3-large | 3072 | Highest quality |
| Sentence Transformers | 384-768 | Open source, runs locally |
| Cohere embed-v3 | 1024 | Strong multilingual |
Key principle: Always use the same embedding model for indexing and querying. Vectors from different models are incompatible.
Storing and Searching Embeddings
An embedding is just a computed representation. Once generated, it's an array of numbers that can live anywhere: held in memory for a real-time comparison, written to a flat file for batch processing, or stored in a database column. A small application comparing a handful of documents might keep vectors in a simple list and compute cosine similarity directly. There's no requirement to use specialized infrastructure at small scale.
The storage question becomes interesting when the number of vectors grows. A real application might generate millions of embeddings across a document corpus, product catalog, or conversation history. At that scale, brute-force comparison (checking every stored vector against the query) becomes impractical, and standard databases aren't optimized for high-dimensional similarity search.
Vector databases solve this scaling problem. They store embeddings alongside metadata and provide fast similarity search using approximate nearest neighbor (ANN) algorithms. Instead of comparing against every stored vector, ANN algorithms build index structures that narrow the search space dramatically, trading a small amount of accuracy for orders-of-magnitude speed improvement.
| Database | Type | Good Fit |
|---|---|---|
| Pinecone | Managed cloud service | Teams that want zero infrastructure overhead |
| Weaviate | Open source, self-hosted or cloud | Flexible deployment with built-in hybrid search |
| Qdrant | Open source, self-hosted or cloud | High-performance filtering alongside vector search |
| Milvus | Open source | Massive scale (billion+ vectors) |
| pgvector | PostgreSQL extension | Teams already running Postgres who want to avoid a new database |
| ChromaDB | Open source, lightweight | Prototyping and small-scale applications |
How the search works conceptually: when a user submits a query, it gets converted to a vector using the same embedding model that indexed the documents. The vector database then finds the stored vectors closest to this query vector using cosine similarity or dot product distance, returning the most semantically relevant results.
Metadata filtering adds precision beyond pure vector similarity. Most vector databases let you store metadata (source, date, category, access level) alongside each vector and filter on it during search. A query like "find documents similar to this question, but only from the engineering team's knowledge base created in the last 6 months" combines semantic search with structured filtering.
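As one concrete sketch of this flow, ChromaDB can embed, store, and filter in a few lines; the collection name, documents, and metadata fields below are made up for illustration:

```python
import chromadb  # pip install chromadb

client = chromadb.Client()  # in-memory client; persistent options also exist
collection = client.create_collection("kb_articles")  # hypothetical collection name

# Index documents with metadata; Chroma embeds them with its default model.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "To reset your password, open Settings and choose 'Reset password'.",
        "Deploys to production run every Tuesday after the release meeting.",
    ],
    metadatas=[
        {"team": "support", "year": 2025},
        {"team": "engineering", "year": 2025},
    ],
)

# Semantic search plus a structured metadata filter on the same query.
results = collection.query(
    query_texts=["How do release deployments work?"],
    n_results=1,
    where={"team": "engineering"},  # filter alongside vector similarity
)
print(results["documents"])
```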
For production patterns around chunking, retrieval strategies, and building full pipelines with vector databases, see the RAG guide.
Model Capabilities and Limitations
Understanding what models can and cannot do helps set appropriate expectations.
What LLMs Do Well
| Capability | Why |
|---|---|
| Language fluency | Trained on massive text corpora |
| Pattern recognition | Statistical patterns in training data |
| Following instructions | RLHF training on instruction-following |
| Code generation | Extensive code in training data |
| Summarization | Compressing information while preserving meaning |
| Translation | Multilingual training data |
What LLMs Struggle With
| Limitation | Why |
|---|---|
| Factual accuracy | Generate plausible-sounding text, not verified facts |
| Current events | Knowledge frozen at training cutoff |
| Math and logic | Predict tokens, don't compute |
| Counting and precise tasks | Tokenization obscures character/word boundaries |
| Consistent persona | May drift across long conversations |
| Saying "I don't know" | Trained to be helpful, may fabricate |
Hallucination
Hallucination occurs when models generate information that appears plausible but is false. This happens because:
- Models predict likely text, not verified facts
- Training data contains errors
- Models aim to be helpful, even when uncertain
Mitigation strategies:
- Ask for sources (models may still fabricate them)
- Use RAG to ground responses in real documents
- Verify critical information independently
- Lower temperature for factual tasks
Knowledge Cutoff
Models only know information from their training data. They have no awareness of events after their training cutoff date.
Implications:
- Can't answer about recent events
- May have outdated information about evolving topics
- Use RAG or web search for current information
Skills vs Tools
When building on top of LLMs, every capability you expose falls into one of two categories: a skill that the model performs natively through prompting, or a tool that the model invokes to interact with an external system. The distinction matters because it drives architecture decisions, cost profiles, latency characteristics, and failure modes.
What Are Skills?
Skills are capabilities the model already has. They come from training data, and you access them entirely through prompt design. No external integration, no API calls, no infrastructure beyond the model itself.
| Skill | What the Model Does Natively |
|---|---|
| Summarization | Compress long text into key points |
| Translation | Convert between languages |
| Classification | Sort inputs into categories |
| Extraction | Pull structured data from unstructured text |
| Reasoning | Draw conclusions, compare options, analyze tradeoffs |
| Code generation | Write, explain, or refactor code |
| Creative writing | Draft prose, marketing copy, or technical documentation |
| Reformatting | Convert between formats like JSON, CSV, XML, or markdown |
Skills require no external infrastructure, but they are not free. Every skill invocation runs through inference, which means every input token and output token costs money and takes time. Summarizing a 50-page document means sending the entire document through the model. Translating a large codebase means processing every file as tokens. For large inputs, skills can be the most expensive part of a pipeline because there is no shortcut around token consumption.
The other tradeoff is that skills are bounded by training data. A model can summarize a document you provide, but it cannot look up a document it hasn't seen. It can reason about data in context, but it cannot compute a precise financial projection across thousands of rows. It can generate code in languages it was trained on, but it cannot execute that code to verify it works.
What Are Tools?
Tools are external functions the model can call to extend beyond what it learned during training. When the model encounters a task that requires current data, precise computation, or interaction with the outside world, it generates a structured request to invoke a tool, receives the result, and incorporates that result into its response.
Common tool categories include:
- Information retrieval: web search, database queries, file system access, API calls to external services
- Computation: code execution, calculators, data analysis engines
- State modification: creating files, sending messages, updating records, deploying code
- Verification: running tests, checking URLs, validating schemas
Tools require infrastructure. Someone has to define the tool's interface, host the execution environment, handle authentication, and manage failures. The model doesn't "use" the tool directly; it generates a request (typically a function name and arguments), the orchestration layer executes it, and the result flows back into the model's context for the next inference step.
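A sketch of that round trip using the OpenAI function-calling interface; the get_stock_price tool, its schema, and the lookup_price stub are hypothetical, and production orchestration code would add error handling for the failure modes discussed below:

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()

# 1. Describe the tool's interface so the model knows it can request it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Get the latest trading price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

def lookup_price(ticker: str) -> float:
    # Hypothetical stand-in for a real market-data API call.
    return 123.45

messages = [{"role": "user", "content": "What is AAPL trading at right now?"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]  # the model's structured request

# 2. The orchestration layer (your code) actually executes the tool.
args = json.loads(call.function.arguments)
price = lookup_price(args["ticker"])

# 3. Feed the result back so the model can compose the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": str(price)})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```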
For a deeper look at how tools work in agent architectures, see the AI Agents guide. For the standard protocol that connects models to tools, see the MCP guide.
When to Use Each
The choice between a skill and a tool depends on what the task actually requires. Some tasks are clearly one or the other, but many sit in a gray area where either approach could work.
| Dimension | Skill (Native) | Tool (External) |
|---|---|---|
| Latency | Scales with input/output token count | Tool execution is often instant; round-trip adds overhead |
| Cost | All processing burns tokens (can be expensive for large inputs) | Tool execution itself is often free; results re-enter token stream |
| Accuracy | Probabilistic, may hallucinate | Deterministic for computation and data retrieval |
| Current data | Limited to training cutoff | Can access real-time information |
| Computation | Approximate reasoning | Precise execution |
| Side effects | None (read-only by nature) | Can modify state in external systems |
| Failure modes | Hallucination, reasoning errors | Network failures, auth errors, timeouts, malformed requests |
| Infrastructure | None beyond the model | Requires tool definitions, hosting, error handling |
Use a skill when the task is pattern recognition, language transformation, or reasoning over context that's already in the prompt. Summarizing a meeting transcript, classifying support tickets, extracting entities from an email, or drafting a response based on provided guidelines are all skill-native tasks.
Use a tool when the task requires something the model cannot do from memory: fetching live data, performing exact arithmetic, executing code, modifying external state, or verifying facts against an authoritative source.
Tradeoffs in Practice
The tension between skills and tools plays out in real system design decisions.
Over-relying on skills leads to hallucination risk. A model asked to "look up the current price of AAPL stock" will generate a plausible-looking number from training data rather than admitting it doesn't know. Without a tool to fetch the actual price, the output looks confident but is wrong. Any task where accuracy depends on data the model hasn't seen requires a tool.
Over-relying on tools leads to unnecessary complexity. If a model has a web search tool available and a user asks "what is a binary search tree?", the model might invoke the search tool to answer a question it already knows well from training. The tool result then enters the context window, consuming additional tokens on the next inference call and introducing a failure point that didn't need to exist. When the model can handle a task accurately from training data, a tool call adds infrastructure burden without improving quality.
The gray area is where it gets interesting. Consider math: a model can reason through simple arithmetic and get it right most of the time, but it will occasionally make errors on multi-step calculations. A code execution tool will always get the math right and runs instantly, but requires infrastructure to define and host. The right choice depends on how much accuracy matters for the use case. A rough estimate in a brainstorming session favors the skill; a financial calculation in a production system demands the tool.
Decision Framework
When deciding whether a capability should be a skill or a tool, work through these questions:
Does the task require information the model hasn't seen? If the answer depends on data after the training cutoff, data in a private database, or real-time state, you need a tool. No amount of prompt engineering gives a model access to information that isn't in its context window.
Does the task require deterministic precision? Mathematical calculations, date arithmetic, regex matching, and data aggregation across large datasets all benefit from tools. Models approximate these operations through pattern matching and will occasionally produce wrong results, especially as complexity increases.
Does the task require action in the outside world? Sending emails, creating files, updating databases, and deploying code are all side effects that require tools. Skills are inherently read-only: they transform input into output but cannot change state beyond the conversation.
Is the model already good at this from training? Summarization, classification, translation, code generation, and text analysis are tasks where models are strong out of the box. Adding a tool for these capabilities typically adds cost and latency without improving quality. Invest in better prompts before reaching for a tool.
How much does an error cost? For low-stakes tasks like drafting an email or generating test ideas, skill-level accuracy is usually sufficient. For high-stakes tasks like calculating dosages, generating legal documents, or making financial decisions, tool-backed verification is worth the added complexity.
Model Types and Sizes
Different models serve different purposes. Understanding the landscape helps with selection.
Model Size Impacts
| Size | Typical Params | Characteristics |
|---|---|---|
| Small | 1-7B | Fast, cheap, basic tasks |
| Medium | 7-30B | Good balance, most tasks |
| Large | 30-70B | Complex reasoning, nuanced tasks |
| Frontier | 100B+ | State-of-the-art capabilities |
Larger models generally perform better but cost more and run slower.
Model Types
| Type | Examples | Best For |
|---|---|---|
| General purpose | GPT-4, Claude, Gemini | Wide range of tasks |
| Code-focused | Codex, StarCoder, DeepSeek Coder | Programming tasks |
| Instruction-tuned | ChatGPT, Claude | Following directions |
| Base models | Llama base | Fine-tuning starting point |
Open vs. Closed Models
| Aspect | Open Models | Closed Models |
|---|---|---|
| Access | Download and run anywhere | API access only |
| Cost | Infrastructure costs | Per-token pricing |
| Privacy | Data stays local | Data sent to provider |
| Customization | Fine-tuning possible | Limited or none |
| Examples | Llama, Mistral | GPT-4, Claude |
Quick Reference
Key Metrics to Know
| Metric | What It Means |
|---|---|
| Tokens | Processing units; ~1.3 per English word (1 token ~ 0.75 words) |
| Context window | Max input + output size |
| Temperature | Randomness control (0 = deterministic) |
| Embedding dimensions | Vector size for semantic representation |
Common Token Estimates
| Content Type | Tokens |
|---|---|
| 1 page of text | ~500-800 tokens |
| 1 paragraph | ~100-150 tokens |
| Average email | ~200-400 tokens |
| Code file (100 lines) | Varies widely by language and line length; often ~500-1,500 tokens |
Parameter Cheat Sheet
| Task | Temperature | Other Settings |
|---|---|---|
| Code generation | 0-0.3 | Clear, consistent output |
| Factual Q&A | 0-0.3 | Accuracy matters |
| Creative writing | 0.7-1.0 | Variety desired |
| Brainstorming | 0.8-1.0 | Many ideas wanted |
| General chat | 0.5-0.7 | Balance |