Document Databases - Architecture Insights

What They Are

Document databases store data as semi-structured documents, typically JSON or BSON (binary JSON). Unlike key-value stores, they understand document structure and can index and query by fields within documents. Unlike relational databases, they don’t require predefined schemas, so documents in the same collection can have different fields.

Document databases emerged from the object-relational impedance mismatch. Applications work with objects that have nested structures, optional fields, and arrays. Translating these to normalized relational tables and back requires mapping code that introduces bugs and performance overhead. What if the database just stored objects directly?

Data Structure

┌─────────────────────────────────────────────────────────────┐
│  USERS COLLECTION                                           │
├─────────────────────────────────────────────────────────────┤
│ {                                                           │
│   "_id": "user_1",                                          │
│   "name": "Alice Smith",                                    │
│   "email": "alice@example.com",                             │
│   "addresses": [                      ← Nested array        │
│     {"type": "home", "city": "Seattle"},                    │
│     {"type": "work", "city": "Portland"}                    │
│   ],                                                        │
│   "preferences": {                    ← Nested object       │
│     "theme": "dark",                                        │
│     "notifications": true                                   │
│   }                                                         │
│ }                                                           │
├─────────────────────────────────────────────────────────────┤
│ {                                                           │
│   "_id": "user_2",                                          │
│   "name": "Bob Jones",                                      │
│   "email": "bob@example.com",                               │
│   "phone": "+1-555-0123"              ← Field not in user_1 │
│   // No addresses field               ← Schema flexibility  │
│ }                                                           │
└─────────────────────────────────────────────────────────────┘

Query: db.users.find({ "addresses.city": "Seattle" })
       → Returns user_1 (queries nested fields)

Each document is self-contained. Related data can be embedded (addresses inside user) or referenced (storing IDs to other documents). Documents in the same collection can have different structures.

How They Work

Collections and Documents

Documents group into collections (analogous to tables). Each document has a unique ID and contains fields with values. Values can be primitives (strings, numbers, booleans), arrays, or nested documents.

Schema Flexibility

The database doesn’t enforce document structure. One user document might have a phone number field; another might not. One might have an array of addresses; another might have a single address string. This flexibility accelerates development when requirements are evolving.

Indexing

Document databases build indexes on fields within documents, including nested fields and array elements. An index on “addresses.city” allows efficient queries for users in specific cities.

Query Languages

Unlike key-value stores, document databases support rich queries. MongoDB’s query language (and similar ones in other document stores) can filter by field values, use comparison operators, match elements in arrays, combine conditions with boolean logic, and perform aggregations.

Embedding vs. Referencing

Document databases support two approaches to related data:

Embedding stores related data within a single document (a blog post document contains its comments array). Embedding optimizes read performance at the cost of update complexity.

Referencing stores an ID that points to another document (comments are separate documents with a post_id field). Referencing mirrors relational design.

Why They Excel

Developer Productivity

Storing application objects directly eliminates mapping code. What you save to the database looks like what you get back.

Schema Evolution

Adding fields doesn’t require migrations. Old documents without the field coexist with new documents that have it. Application code handles the difference.

Read Performance for Embedded Data

When related data is embedded, retrieval is a single operation. No joins, no multiple round-trips.

Horizontal Scaling

Document databases were designed for distribution. Sharding by document ID distributes data across servers with minimal coordination.

Why They Struggle

Many-to-Many Relationships

Relational databases handle many-to-many relationships elegantly with join tables. Document databases require either duplicating data (embedding) or multiple queries (referencing).

Cross-Document Transactions

Early document databases had no multi-document transactions. Modern ones like MongoDB now support them, but with performance costs and operational complexity.

Query Flexibility

Without a fixed schema, the query optimizer has less information. Index design requires careful consideration of access patterns.

Data Consistency

Without schema enforcement, nothing prevents invalid data from entering the system except application code.

When to Use Them

Document databases work well for:

Content management systems: Articles, products, user profiles with varying attributes
Event logging: Each event is a self-contained document
Applications with rapidly evolving data models
Read-heavy workloads where embedding eliminates joins

When to Look Elsewhere

If your data is highly relational with many-to-many relationships, if you need complex transactions spanning many documents, if you need strong schema enforcement to prevent data corruption, or if your workload is write-heavy with frequent updates to embedded data, relational databases are likely a better fit.

Examples

MongoDB is the dominant document database, with mature tooling, a fully-managed Atlas service, and features that have expanded to include multi-document ACID transactions.

Couchbase combines document storage with key-value performance and built-in caching, plus a SQL-like query language (N1QL).

Firebase Firestore provides real-time synchronization for mobile and web applications, with offline support and automatic scaling.

Amazon DocumentDB offers MongoDB API compatibility as a managed service.