LLM Fine-Tuning
What Is Fine-Tuning?
Fine-tuning is the process of further training a pre-trained language model on a specific dataset to adapt it for particular tasks, domains, or behaviors. The model learns from examples in your data to modify its weights, changing how it generates outputs.
The key distinction: Prompting tells the model what to do. Fine-tuning changes what the model is.
When Fine-Tuning Makes Sense
Fine-tuning is appropriate when you need to change the model's behavior, style, or knowledge in ways that can't be achieved through prompting alone.
| Use Case | Why Fine-Tuning Helps |
|---|---|
| Consistent output format | Learn to always produce specific structure |
| Domain terminology | Internalize specialized vocabulary |
| Style/tone | Match specific writing style |
| Behavior patterns | Follow complex multi-step procedures |
| Efficiency | Replace long prompts with learned behavior |
When Not to Fine-Tune
| Situation | Better Alternative |
|---|---|
| Need current information | RAG |
| One-off customization | Prompting |
| Simple format changes | Structured output prompts |
| Access to specific data | RAG |
| Cost is primary concern | Start with prompting |
Decision Framework
Prompting vs. RAG vs. Fine-Tuning
                 ┌──────────────────┐
                 │ What do you need │
                 │ to change?       │
                 └────────┬─────────┘
                          │
       ┌──────────────────┼──────────────────┐
       ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Knowledge   │   │   Behavior   │   │     Both     │
│  (what it    │   │   (how it    │   │              │
│   knows)     │   │    acts)     │   │              │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│     RAG      │   │ Fine-tuning  │   │    RAG +     │
│              │   │     or       │   │ Fine-tuning  │
│              │   │  Prompting   │   │              │
└──────────────┘   └──────────────┘   └──────────────┘
Detailed Comparison
| Aspect | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| What changes | Instructions | Available knowledge | Model weights |
| Setup effort | Minimal | Moderate | Significant |
| Update frequency | Instant | Easy | Requires retraining |
| Cost | Per-token | Infrastructure + per-token | Training + per-token |
| Knowledge cutoff | Training date | Real-time | Training date |
| Behavior change | Limited | Limited | Significant |
| Consistency | Variable | Variable | High |
The Progression
Start simple, escalate only when needed:
- Prompting: Try clear instructions with examples first
- Advanced prompting: Chain-of-thought, persona, few-shot
- RAG: If knowledge access is the issue
- Fine-tuning: When behavior consistently doesn't match needs
Fine-Tuning Techniques
Full Fine-Tuning
Update all model parameters with your training data.
Pros: Maximum flexibility; can make significant changes.
Cons: Expensive; requires significant compute; risk of catastrophic forgetting.
Resource requirements: GPU memory for the full model weights plus gradients and optimizer states
- 7B model: ~56GB+ GPU memory
- 70B model: Requires multi-GPU clusters
Parameter-Efficient Fine-Tuning (PEFT)
Update only a small subset of parameters, keeping most frozen.
LoRA (Low-Rank Adaptation)
Injects small trainable matrices into model layers while keeping original weights frozen.
How it works:
- Original weights W remain frozen
- Add low-rank decomposition: W' = W + BA
- Only train B and A matrices (much smaller)
Benefits:
- 10-100x fewer trainable parameters
- Can fit on consumer GPUs
- Fast training
- Easy to swap adapters
Typical LoRA config:
rank (r): 8-64 (smaller = fewer params, larger = more capacity)
alpha: Usually 2x rank
target modules: Query, Key, Value projections
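For illustration, a minimal sketch of this configuration with the Hugging Face peft library, assuming a hypothetical Llama-style base model (module names and values vary by architecture):

```python
# Sketch: attach LoRA adapters to a frozen base model with peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # hypothetical base model

lora_config = LoraConfig(
    r=16,                                           # rank of the low-rank matrices B and A
    lora_alpha=32,                                  # scaling factor, commonly ~2x the rank
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections; names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```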
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantization for even lower memory.
How it works:
- Load base model in 4-bit precision
- Add LoRA adapters in higher precision
- Train only the adapters
Benefits:
- Fine-tune 65B models on a single 48GB GPU
- Minimal quality loss from quantization
- Dramatically lower hardware requirements
Trade-off: Slightly slower inference due to dequantization
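A minimal QLoRA-style setup sketch using transformers, bitsandbytes, and peft; the model name and hyperparameters are illustrative assumptions:

```python
# Sketch: load the base model in 4-bit, then train LoRA adapters on top of it.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # computation runs in bf16
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # hypothetical base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms/embeddings for stable k-bit training
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
```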
Comparison
| Technique | Memory | Quality | Speed | Flexibility |
|---|---|---|---|---|
| Full fine-tuning | Very high | Highest | Slow | Maximum |
| LoRA | Low | High | Fast | High |
| QLoRA | Very low | High | Medium | High |
Data Preparation
Fine-tuning results depend heavily on the quality of the training data.
Dataset Requirements
| Factor | Guidance |
|---|---|
| Size | Hundreds to thousands of examples (more for complex tasks) |
| Quality | Clean, accurate, representative |
| Diversity | Cover the range of expected inputs |
| Format | Consistent structure matching intended use |
Data Format
Most fine-tuning uses instruction-following format:
{
"messages": [
{"role": "system", "content": "You are a helpful customer service agent."},
{"role": "user", "content": "I need to return my order"},
{"role": "assistant", "content": "I'd be happy to help with your return..."}
]
}
Or simpler prompt-completion pairs:
{
"prompt": "Summarize this article: [article text]",
"completion": "The article discusses..."
}
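As a sketch, loading chat-formatted JSONL data with the datasets library and rendering it through the tokenizer's chat template (the file name and model are hypothetical):

```python
# Sketch: load messages-style JSONL and render it with the model's chat template.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")    # hypothetical model
dataset = load_dataset("json", data_files={"train": "train.jsonl"})["train"]  # hypothetical file

def to_text(example):
    # Turn the messages list into a single training string in the model's expected format.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text)
```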
Data Quality Guidelines
Do:
- Use real examples from your domain
- Include edge cases and difficult examples
- Maintain consistent formatting
- Have experts validate outputs
- Balance positive and negative examples
Avoid:
- Synthetic data without validation
- Biased or unrepresentative samples
- Inconsistent formatting
- Examples with errors
- Duplicates
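A sketch of two cheap hygiene checks, dropping exact duplicates and requiring each example to end with an assistant turn, assuming the messages format above and a hypothetical train.jsonl file:

```python
# Sketch: drop exact duplicates and malformed examples before training.
import json

seen, clean = set(), []
with open("train.jsonl") as f:                   # hypothetical file in the messages format above
    for line in f:
        example = json.loads(line)
        key = json.dumps(example, sort_keys=True)
        if key in seen:
            continue                             # skip exact duplicates
        seen.add(key)
        roles = [m["role"] for m in example["messages"]]
        if roles[-1] != "assistant":
            continue                             # each example should end with the target output
        clean.append(example)

print(f"Kept {len(clean)} of {len(seen)} unique examples")
```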
Data Quantity Guidelines
| Task Complexity | Minimum Examples | Recommended |
|---|---|---|
| Simple format | 50-100 | 200-500 |
| Domain adaptation | 200-500 | 1,000-5,000 |
| Complex behavior | 1,000+ | 5,000-10,000+ |
More data generally helps, but quality matters more than quantity.
Training Process
Training Configuration
Key hyperparameters to consider:
| Parameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 5e-4 | Too high = instability, too low = slow |
| Epochs | 1-5 | More = overfitting risk |
| Batch size | 4-32 | Limited by memory |
| Warmup steps | 5-10% of total | Stability early in training |
| LoRA rank | 8-64 | Capacity vs. efficiency |
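As an illustration, these ranges translate into Hugging Face TrainingArguments roughly as follows (values and paths are assumptions, not recommendations):

```python
# Sketch: hyperparameters from the table expressed as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetune-out",        # hypothetical output path
    learning_rate=2e-4,                 # common LoRA starting point; use lower values for full fine-tuning
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size of 16 without extra memory
    warmup_ratio=0.05,                  # ~5% of total steps for warmup
    logging_steps=10,
    save_strategy="epoch",
)
```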
Training Steps
- Prepare data: Format, clean, split (train/validation)
- Choose base model: Match capability to task
- Configure training: Set hyperparameters
- Train: Monitor loss and validation metrics
- Evaluate: Test on held-out data
- Iterate: Adjust based on results
Monitoring Training
Watch for:
| Signal | Meaning | Action |
|---|---|---|
| Loss decreasing | Model learning | Continue |
| Validation loss increasing | Overfitting | Stop early, reduce epochs |
| Loss stuck | Learning rate issue | Adjust learning rate |
| Loss unstable | Learning rate too high | Reduce learning rate |
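One way to act on a rising validation loss is the built-in EarlyStoppingCallback in transformers; a sketch, assuming the model and datasets were prepared earlier (the eval strategy argument is named evaluation_strategy in older releases):

```python
# Sketch: stop training when validation loss stops improving.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./finetune-out",          # hypothetical output path
    eval_strategy="steps",                # `evaluation_strategy` in older transformers releases
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,          # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,              # lower validation loss is better
)

trainer = Trainer(
    model=model,                          # model prepared earlier (e.g., with LoRA adapters)
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no improvement
)
trainer.train()
```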
Evaluation
Before deploying, evaluate on:
- Held-out test set: Data the model hasn't seen
- Task-specific metrics: Accuracy, F1, BLEU, etc.
- Human evaluation: Quality assessment by domain experts
- Comparison to base: Does fine-tuning actually help?
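A rough sketch of a side-by-side comparison on held-out prompts, assuming base_model, tuned_model, tokenizer, and test_prompts were defined earlier:

```python
# Sketch: generate base vs. fine-tuned responses for the same held-out prompts.
import torch

def generate(model, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

for prompt in test_prompts:
    print("PROMPT:    ", prompt)
    print("BASE:      ", generate(base_model, prompt))
    print("FINE-TUNED:", generate(tuned_model, prompt))
    print("-" * 60)
```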
Practical Implementation
Platform Options
| Platform | Type | Best For |
|---|---|---|
| OpenAI Fine-tuning | Managed | GPT models, easy API |
| Together AI | Managed | Open models, cost-effective |
| Hugging Face | Framework | Self-hosted, full control |
| Axolotl | Tool | Simplified local training |
| Unsloth | Tool | Optimized LoRA/QLoRA |
Local Fine-Tuning Setup
Minimal setup for LoRA fine-tuning:
Hardware: GPU with 16GB+ VRAM (RTX 4090, A100)
Software stack:
- transformers (model loading)
- peft (LoRA implementation)
- datasets (data handling)
- accelerate (training optimization)
- bitsandbytes (quantization for QLoRA)
Typical Workflow
1. Load base model (possibly quantized)
2. Apply LoRA configuration
3. Load and preprocess dataset
4. Configure trainer
5. Train
6. Evaluate
7. Merge LoRA weights or deploy adapter
Deployment Options
After training, you can:
Keep adapter separate:
- Load base model + adapter at inference
- Easy to swap adapters for different tasks
- Slightly slower inference
Merge into base model:
- Create single model with adapter merged
- Faster inference
- Larger storage
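A sketch of both options with peft, using hypothetical paths:

```python
# Sketch: load an adapter separately, or merge it into the base weights.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # hypothetical base model

# Option 1: keep the adapter separate (easy to swap per task, slight inference overhead)
model = PeftModel.from_pretrained(base, "./finetune-out/adapter")         # hypothetical adapter path

# Option 2: merge the adapter into the base weights (faster inference, larger artifact)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
```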
Common Challenges
Catastrophic Forgetting
Model loses general capabilities while learning specific ones.
Mitigations:
- Lower learning rate
- Mix in general data
- Use PEFT methods (preserves base weights)
- Regularization techniques
Overfitting
Model memorizes training data rather than learning patterns.
Signs: Low training loss, high validation loss, poor generalization
Mitigations:
- More training data
- Early stopping
- Regularization (dropout, weight decay)
- Data augmentation
Poor Quality Outputs
Fine-tuned model performs worse than expected.
Diagnose:
- Check data quality
- Verify format matches the base model's training
- Test with more/less training
- Compare to prompting baseline
Mode Collapse
Model produces same or very similar outputs.
Mitigations:
- Temperature adjustment at inference
- More diverse training data
- Check for data imbalance
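For the inference-time mitigation, a sketch of sampling with a higher temperature (model, tokenizer, and the prompt are assumed; values are illustrative):

```python
# Sketch: sample with a higher temperature to increase output diversity.
inputs = tokenizer("Summarize the return policy.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.9,       # higher temperature spreads probability mass more evenly
    top_p=0.95,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```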
Cost Considerations
Training Costs
| Factor | Impact |
|---|---|
| Model size | Larger = more compute |
| Dataset size | More data = longer training |
| Technique | Full > LoRA > QLoRA |
| Platform | Managed > self-hosted (usually) |
Managed Platform Pricing (Example)
| Platform | Approximate Cost |
|---|---|
| OpenAI | ~$0.008/1K tokens (varies by model) |
| Together AI | ~$0.50-2/hour (varies by model) |
| Self-hosted | GPU cost + time |
Break-Even Analysis
Fine-tuning makes economic sense when:
- Reduced prompt length saves token costs over time
- Improved quality reduces retry/correction costs
- Consistency reduces manual review costs
Calculate: Training cost vs. (per-query savings × expected query volume)
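A back-of-the-envelope sketch with purely hypothetical numbers:

```python
# Sketch: break-even volume, using purely hypothetical numbers.
training_cost = 500.00        # one-time fine-tuning cost, in dollars
saving_per_query = 0.002      # e.g., shorter prompts and fewer retries, in dollars per query

break_even_queries = training_cost / saving_per_query
print(f"Fine-tuning pays for itself after ~{break_even_queries:,.0f} queries")
# -> Fine-tuning pays for itself after ~250,000 queries
```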
Best Practices
Before Fine-Tuning
- Establish baseline: Measure prompting performance first
- Define success criteria: What improvement justifies the effort?
- Collect quality data: Don't fine-tune until data is ready
- Start small: Test with subset before full training
During Fine-Tuning
- Monitor metrics: Track loss and validation performance
- Save checkpoints: Enable recovery and comparison
- Test incrementally: Evaluate before training completes
- Document everything: Hyperparameters, data versions, results
After Fine-Tuning
- Comprehensive evaluation: Test across expected use cases
- A/B comparison: Fine-tuned vs. base with prompting
- Monitor in production: Track quality over time
- Plan for updates: How will you retrain as needs change?
Quick Reference
Decision Checklist
When to consider fine-tuning:
- Prompting doesn't achieve needed consistency
- RAG doesn't solve the problem (it's behavior, not knowledge)
- Have quality training data (hundreds+ examples)
- Can afford training compute
- Have evaluation strategy
- Justified by use case volume
Technique Selection
| Situation | Recommended Technique |
|---|---|
| Consumer GPU (16-24GB) | QLoRA |
| Cloud GPU (40GB+) | LoRA or full |
| Maximum quality needed | Full fine-tuning |
| Quick iteration | LoRA |
| Very large model (70B+) | QLoRA |
Training Quick Reference
| Model Size | LoRA Memory | QLoRA Memory | Training Time* |
|---|---|---|---|
| 7B | ~16GB | ~8GB | Hours |
| 13B | ~32GB | ~12GB | Hours |
| 70B | ~140GB | ~40GB | Days |
*Varies significantly based on dataset size and hardware
Common Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| Skipping prompting baseline | Don't know if fine-tuning helps | Always compare |
| Low quality data | Poor model performance | Invest in data quality |
| Too few examples | Insufficient learning | Collect more data |
| Too many epochs | Overfitting | Monitor validation loss |
| Ignoring evaluation | Unknown quality | Systematic testing |