Database Fundamentals
Why Databases Exist
Applications need to persist data beyond the lifetime of a single process. They need to share data across multiple processes, servers, and users. They need to query data in flexible ways, ensure data isn’t lost during failures, and prevent concurrent modifications from corrupting state.
A database is software that solves these problems. It provides durable storage (data survives crashes), concurrent access (multiple users can read and write safely), query capabilities (find data matching specific criteria), and often transactions (group multiple operations into atomic units).
Without databases, applications would need to implement all of these features themselves, and they’d do it poorly. Databases represent decades of engineering effort to solve these problems correctly.
Why So Many Database Types Exist
For decades, relational databases were the default choice for nearly every application. They worked well enough for most use cases, but “well enough” started breaking down as applications scaled and data patterns diversified.
Relational databases make certain trade-offs: they enforce schemas, maintain ACID transactions, and optimize for complex queries across related tables. These trade-offs work beautifully for some workloads and terribly for others.
Consider write throughput. A relational database must maintain indexes, enforce constraints, and coordinate transactions. Every write involves multiple disk operations. For an e-commerce order system processing hundreds of transactions per second, this overhead is negligible. For a telemetry system ingesting millions of sensor readings per second, it’s a bottleneck.
Or consider data structure. Relational databases force data into rows with fixed columns. A user profile with varying attributes requires either nullable columns (wasting space and complicating queries) or multiple tables with joins (adding complexity and latency). Some users have phone numbers while others don’t; some have multiple addresses while others have none.
Specialized databases emerged because different problems have different optimal solutions. The database that excels at finding the shortest path between two nodes in a social graph is architecturally incapable of efficiently aggregating time-series metrics.
The CAP Theorem
Every distributed database confronts an impossible choice. The CAP theorem, conjectured by Eric Brewer in 2000 and formally proven in 2002, states that a distributed system can provide at most two of three guarantees:
Consistency means every read returns the most recent write. If you update your profile, everyone immediately sees the update.
Availability means every request receives a response. The system never refuses to answer.
Partition tolerance means the system continues operating when network failures prevent some nodes from communicating with others.
Networks fail and partitions happen, so every distributed database must choose partition tolerance, leaving a choice between consistency and availability during failures.
CP Systems (Consistency + Partition Tolerance)
These systems choose consistency over availability. During a network partition, they’ll refuse to serve requests rather than risk returning stale data.
Examples include etcd, ZooKeeper, HBase, and traditional relational databases when deployed with synchronous replication.
Trade-off: During network issues, some requests will fail. Better for financial systems where stale data causes real problems.
AP Systems (Availability + Partition Tolerance)
These systems choose availability over consistency. During a network partition, they’ll continue serving requests but may return stale data.
Examples include Cassandra, DynamoDB, CouchDB, and most eventually consistent systems.
Trade-off: Reads might return outdated data. Better for systems where availability matters more than perfect accuracy (social media feeds, product catalogs).
Understanding the Practical Implications
CAP doesn’t mean you have exactly two properties all the time. During normal operation, when no partitions exist, you can have all three. The theorem constrains behavior during failures.
Modern databases often let you tune this trade-off per operation. Cassandra lets you specify consistency levels: write to one node (fast, less durable), write to a quorum (slower, more durable), or write to all nodes (slowest, most durable).
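For instance, here is a minimal sketch of per-statement consistency tuning with the DataStax Python driver (cassandra-driver); the contact points, keyspace, and table are placeholder assumptions, not a real deployment:

```python
# Minimal sketch: per-statement consistency levels with cassandra-driver.
# Cluster addresses, keyspace, and table are placeholder assumptions.
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("telemetry")  # hypothetical keyspace

insert = "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, %s, %s)"

# Fast, least durable: a single replica acknowledges the write.
fast = SimpleStatement(insert, consistency_level=ConsistencyLevel.ONE)
# Slower, more durable: a majority of replicas must acknowledge.
safe = SimpleStatement(insert, consistency_level=ConsistencyLevel.QUORUM)

now = datetime.now(timezone.utc)
session.execute(fast, ("sensor-42", now, 21.5))
session.execute(safe, ("sensor-42", now, 21.6))
```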
ACID Properties
ACID defines the guarantees that traditional relational databases provide for transactions. Each letter represents a specific guarantee:
Atomicity
A transaction is an atomic unit of work. All operations within a transaction either complete successfully or have no effect. There’s no partial state.
If you transfer $100 from account A to account B, atomicity guarantees that either both the debit from A and credit to B happen, or neither happens. You won’t end up with money debited but not credited.
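As a concrete sketch, here is that transfer using Python's built-in sqlite3 module; the database file, accounts table, and balances are illustrative:

```python
# Minimal sketch of an atomic transfer with sqlite3 (illustrative schema).
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('A', 500), ('B', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'B'")
except sqlite3.Error:
    # The rollback has already happened: neither the debit nor the credit persisted.
    raise
```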
Consistency
The database moves from one valid state to another valid state. Constraints like foreign keys, unique constraints, and check constraints are always enforced.
If you have a constraint that account balances must be non-negative, the database will reject any transaction that would leave a balance below zero, even if every individual statement in that transaction would otherwise execute without error.
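A small sqlite3 sketch of that behavior, with an illustrative CHECK constraint:

```python
# Sketch: a constraint rejects a transaction that would violate it (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('A', 50)")

try:
    with conn:
        # Would leave A at -50, so the database raises and the whole
        # transaction rolls back, keeping the previous valid state.
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A'")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```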
Isolation
Concurrent transactions don’t interfere with each other. Each transaction behaves as if it’s the only one running, even when thousands run simultaneously.
Isolation levels control how much transactions can see of each other’s uncommitted changes:
- Read Uncommitted: Can see uncommitted changes (dirty reads)
- Read Committed: Only sees committed changes
- Repeatable Read: Same rows return same values throughout transaction
- Serializable: Transactions appear to run one at a time
Higher isolation levels provide stronger guarantees but reduce concurrency.
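As one way to picture this, here is a sketch of choosing an isolation level per transaction with psycopg2 against PostgreSQL; the connection string and accounts table are assumptions:

```python
# Sketch: selecting an isolation level per session with psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app")   # placeholder connection string
conn.set_session(isolation_level="REPEATABLE READ")  # or "SERIALIZABLE"

with conn:  # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # Both reads see the same snapshot, even if another transaction
        # commits changes between them.
        cur.execute("SELECT balance FROM accounts WHERE id = %s", ("A",))
        first = cur.fetchone()
        cur.execute("SELECT balance FROM accounts WHERE id = %s", ("A",))
        second = cur.fetchone()
        assert first == second
```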
Durability
Once a transaction commits, the changes survive any subsequent failure including power loss, crashes, and hardware failures. The database uses write-ahead logging to ensure committed data can be recovered.
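As a small illustration of the idea, SQLite exposes its write-ahead log directly; the file and table names below are placeholders:

```python
# Illustration: write-ahead logging in sqlite3. Committed changes reach the
# WAL (and disk) before the commit call returns.
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("PRAGMA journal_mode=WAL")   # switch to write-ahead logging
conn.execute("PRAGMA synchronous=FULL")   # fsync on commit for durability
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, total REAL)")

with conn:
    conn.execute("INSERT INTO orders (total) VALUES (?)", (42.0,))
# Once the commit returns, the change is on disk in the WAL and survives a crash.
```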
The Cost of ACID
ACID guarantees aren’t free. They require:
- Locking or multi-version concurrency control for isolation
- Write-ahead logging for durability
- Constraint checking for consistency
- Coordination for distributed transactions
This overhead limits throughput and adds latency. Many NoSQL databases trade ACID guarantees for performance.
BASE: The Alternative to ACID
BASE describes the properties of many distributed NoSQL systems. It stands for:
Basically Available: The system guarantees availability (in the CAP sense). Requests receive responses, though responses may be stale.
Soft state: The system’s state may change over time even without input, as updates propagate through replicas.
Eventual consistency: Given enough time without new updates, all replicas will converge to the same state. Reads will eventually return the most recent write.
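A toy, in-memory model of that convergence (no real database involved):

```python
# Toy model of eventual consistency: writes land on one replica and
# propagate asynchronously; reads may be stale until replication catches up.
replica_a = {}
replica_b = {}
pending = []  # replication queue from A to B

def write(key, value):
    replica_a[key] = value
    pending.append((key, value))   # ship to B later

def replicate():
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("followers:alice", 101)
print(replica_b.get("followers:alice"))  # None: a stale read from replica B
replicate()                              # "given enough time..."
print(replica_b.get("followers:alice"))  # 101: the replicas have converged
```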
ACID vs BASE
| Property | ACID | BASE |
|---|---|---|
| Consistency | Strong, immediate | Eventual |
| Availability during failures | May be unavailable | Highly available |
| Latency | Higher (coordination overhead) | Lower |
| Developer complexity | Lower (the database handles coordination) | Higher (the application must handle stale data) |
| Use cases | Financial, healthcare, inventory | Social, content, analytics |
When BASE Is Acceptable
Eventual consistency works when:
- Stale reads have low cost (showing a slightly outdated follower count)
- The application can handle or hide inconsistency (UI optimistic updates)
- Availability matters more than accuracy (better to show old data than nothing)
Eventual consistency fails when:
- Decisions based on reads affect real resources (inventory, money)
- Users directly compare values across requests (two users seeing different prices)
- Regulatory requirements mandate consistency
Data Modeling Fundamentals
Normalization
Normalization organizes relational data to reduce redundancy. Instead of storing a customer’s address with every order, you store it once in a customers table and reference it by ID.
Benefits:
- Updates happen in one place
- Less storage waste
- Consistent data
Costs:
- Joins required to reassemble data
- More complex queries
- Potential performance overhead
Normal forms (1NF through 5NF) define increasing levels of normalization. Most applications use third normal form (3NF) as a practical balance.
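To make the trade-off concrete, here is a small normalized customer/order schema sketched with sqlite3; every table and column name is illustrative:

```python
# Sketch: a normalized customer/order schema (illustrative columns).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    address TEXT NOT NULL          -- stored once, updated in one place
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total       REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '12 Engine St')")
conn.execute("INSERT INTO orders VALUES (100, 1, 42.0)")

# Reassembling the data requires a join -- the cost side of normalization.
row = conn.execute("""
    SELECT o.id, c.name, c.address, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(row)  # (100, 'Ada', '12 Engine St', 42.0)
```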
Denormalization
Denormalization intentionally introduces redundancy for read performance. Store the customer’s name directly in the order record so you don’t need to join.
When to denormalize:
- Read-heavy workloads with known access patterns
- Performance requirements that joins can’t meet
- Distributed systems where joins are expensive or impossible
Costs:
- Updates must happen in multiple places
- Risk of inconsistent data
- More storage required
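The denormalized counterpart of the earlier sketch, again with illustrative sqlite3 tables:

```python
# Sketch: customer fields copied into each order, so reads need no join
# but updates must touch many rows (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE orders (
    id               INTEGER PRIMARY KEY,
    customer_name    TEXT NOT NULL,   -- duplicated per order
    customer_address TEXT NOT NULL,   -- duplicated per order
    total            REAL NOT NULL
)
""")
conn.execute("INSERT INTO orders VALUES (100, 'Ada', '12 Engine St', 42.0)")
conn.execute("INSERT INTO orders VALUES (101, 'Ada', '12 Engine St', 19.5)")

# Fast read: no join needed.
print(conn.execute("SELECT customer_name, total FROM orders").fetchall())

# But a change of address now touches every one of Ada's orders.
conn.execute("UPDATE orders SET customer_address = '1 New Rd' WHERE customer_name = 'Ada'")
```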
Schema-on-Write vs Schema-on-Read
Schema-on-write (relational databases): Define the schema before inserting data. The database enforces structure.
Schema-on-read (document databases, data lakes): Store data without enforcing structure. The application interprets structure when reading.
Schema-on-write catches errors early but requires migrations. Schema-on-read provides flexibility but can lead to data quality issues.
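A small sqlite3 sketch of the contrast; the tables and fields are illustrative:

```python
# Sketch: schema enforced on write vs. structure interpreted on read.
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-on-write: structure is declared up front and enforced on insert.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
# conn.execute("INSERT INTO users (id) VALUES (1)")  # would fail: email is required

# Schema-on-read: store an opaque document; interpret it when reading.
conn.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute("INSERT INTO profiles VALUES (1, ?)", (json.dumps({"email": "a@example.com"}),))
conn.execute("INSERT INTO profiles VALUES (2, ?)", (json.dumps({"phone": "555-0100"}),))

for _, doc in conn.execute("SELECT id, doc FROM profiles"):
    profile = json.loads(doc)
    # The application decides what "email" means -- and must handle its absence.
    print(profile.get("email", "<no email on file>"))
```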
Database Selection Framework
Start With Access Patterns
Database selection should flow from access patterns, not the other way around. Ask:
- How will data be written? High-volume streams? Transactional batches? User-driven updates?
- How will data be read? By primary key? By arbitrary attributes? By time range? By relationship traversal?
- What are the consistency requirements? Financial accuracy? Eventual consistency acceptable?
- What’s the expected scale? Hundreds of gigabytes? Petabytes? Millions of reads per second?
Decision Matrix
| Primary Access Pattern | Database Category |
|---|---|
| Transactions across related entities | Relational or NewSQL |
| Simple key-based lookups at massive scale | Key-Value |
| Flexible documents with varied schemas | Document |
| Time-range queries on metrics/events | Time-Series |
| Relationship traversal (friends-of-friends) | Graph |
| Semantic similarity search | Vector |
| Full-text search with relevance ranking | Search Engine |
| Sub-millisecond caching | In-Memory |
| Massive write throughput, sparse columns | Wide-Column |
The PostgreSQL Default
Start with relational unless you have a specific reason not to. PostgreSQL handles more use cases than most teams realize:
- JSON columns provide document flexibility
- The pgvector extension enables vector search
- Full-text search is built in
- Extensions like TimescaleDB add time-series capabilities
If you’re not sure what you need, PostgreSQL is the safe default. You can always add specialized databases later when specific needs emerge.
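A quick sketch of how far that default stretches, issued through psycopg2; the connection string, table names, and the embedding column are assumptions, and the last query presumes the pgvector extension is installed:

```python
# Sketch: document, full-text, and vector queries against stock PostgreSQL.
# Connection string, tables, and columns are placeholder assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Document-style flexibility with a JSONB column.
cur.execute("SELECT id FROM users WHERE profile->>'country' = %s", ("NZ",))

# Built-in full-text search with relevance ranking.
cur.execute("""
    SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
    FROM articles, plainto_tsquery('english', %s) AS query
    WHERE to_tsvector('english', body) @@ query
    ORDER BY rank DESC LIMIT 10
""", ("database fundamentals",))

# Vector similarity search, assuming pgvector and a 3-dimensional embedding column.
cur.execute(
    "SELECT id FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
    ("[0.12, 0.34, 0.56]",),
)
```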
Operational Considerations
A database you can operate well beats a theoretically superior database you operate poorly. Consider:
- Team expertise: Do you have experience with this technology?
- Managed services: Is a managed option available to reduce operational burden?
- Tooling: Are there good backup, monitoring, and debugging tools?
- Community: Can you find help when things go wrong?
Polyglot Persistence
Most non-trivial applications use multiple databases, each optimized for its specific use case. This pattern is called polyglot persistence.
Common Combinations
Web application: PostgreSQL (primary data) + Redis (sessions, caching) + Elasticsearch (search)
IoT platform: TimescaleDB (metrics) + PostgreSQL (device metadata) + Redis (real-time state)
E-commerce: PostgreSQL (orders, customers) + Elasticsearch (product search) + Redis (cart, sessions)
AI application: PostgreSQL (application data) + Pinecone or pgvector (embeddings) + Redis (caching)
Social platform: PostgreSQL (user data) + Neo4j (social graph) + Redis (feeds, caching)
Managing Data Synchronization
The challenge with polyglot persistence is keeping data synchronized across systems. Common approaches:
Change data capture (CDC): Capture changes from the primary database and apply them to secondary systems. Tools like Debezium stream changes from database transaction logs.
Dual writes: Write to multiple systems from the application. Simple but risks inconsistency if one write fails.
Event sourcing: Use an event log as the source of truth. Each database builds its view from events.
Scheduled sync: Periodically copy data from primary to secondary systems. Simple but introduces latency.
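A toy sketch of the dual-write failure mode, using in-memory stand-ins rather than real database clients:

```python
# Toy sketch: naive dual writes drift apart when the second write fails,
# and nothing reconciles the two stores automatically.
class FlakyStore:
    def __init__(self, fail=False):
        self.rows, self.fail = [], fail

    def write(self, record):
        if self.fail:
            raise ConnectionError("store unreachable")
        self.rows.append(record)

primary = FlakyStore()
search_index = FlakyStore(fail=True)    # simulate an outage on the second write

product = {"id": 1, "name": "keyboard"}
primary.write(product)                  # write 1 succeeds
try:
    search_index.write(product)         # write 2 fails mid-flight
except ConnectionError:
    pass                                # the application moves on...

print(len(primary.rows), len(search_index.rows))  # 1 0 -> the stores now disagree
```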
Migration Considerations
Don’t Migrate Prematurely
The operational cost of running a second database type often exceeds the performance benefit until you’ve truly hit limits. PostgreSQL can handle more than most teams expect.
Signs you might actually need a specialized database:
- You’ve optimized queries and still can’t meet requirements
- You’re spending significant effort working around the current database’s limitations
- The access pattern clearly doesn’t match the database model
Consider Hybrid Approaches First
Before a full migration, consider:
- PostgreSQL with TimescaleDB extension for time-series
- PostgreSQL with pgvector for vector search
- Redis alongside PostgreSQL for caching
- Elasticsearch alongside PostgreSQL for search
These hybrid approaches often provide 80% of the benefit with 20% of the migration effort.
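For example, the Redis-alongside-PostgreSQL option is often just the cache-aside pattern; in this sketch the connection details, key format, and TTL are assumptions:

```python
# Sketch: cache-aside reads with Redis in front of PostgreSQL.
import json

import psycopg2
import redis

pg = psycopg2.connect("dbname=app user=app")        # placeholder connection
cache = redis.Redis(host="localhost", port=6379)    # placeholder connection

def get_product(product_id):
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit: no database round trip

    with pg.cursor() as cur:                         # cache miss: read from PostgreSQL
        cur.execute("SELECT id, name, price FROM products WHERE id = %s", (product_id,))
        row = cur.fetchone()
    if row is None:
        return None

    product = {"id": row[0], "name": row[1], "price": float(row[2])}
    cache.setex(key, 300, json.dumps(product))       # cache for 5 minutes
    return product
```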
Migration Is Never Just Moving Data
When you do migrate, budget for more than data transfer. Query patterns differ between database types:
- Application code that worked with implicit joins needs restructuring for document databases
- Code that relied on transactions needs compensation logic for eventually consistent stores
- ORMs may not support the new database, requiring significant code changes
Plan for:
- Schema redesign for the new data model
- Application code changes
- Testing for behavioral differences
- Rollback strategy if issues emerge