Search Engines
What They Are
Search engines index unstructured text for full-text search, returning results ranked by relevance. They solve a different problem than databases: not “retrieve the record with ID 123” but “find the most relevant documents containing ‘database performance tuning.’”
Full-text search requires specialized data structures. Traditional database indexes map exact values to records. Search engines build inverted indexes that map terms to documents, enable relevance scoring, and handle linguistic variations (stemming “running” to “run,” expanding synonyms, handling misspellings).
Data Structure
DOCUMENTS (what you store):
┌────────┬──────────────────────────────────────────────────┐
│ id     │ content                                          │
├────────┼──────────────────────────────────────────────────┤
│ 1      │ "The quick brown fox jumps over the lazy dog"    │
│ 2      │ "Quick database queries improve performance"     │
│ 3      │ "The database jumped to a new performance level" │
└────────┴──────────────────────────────────────────────────┘
                             │
                             ▼ Analysis (tokenize, stem, lowercase)
INVERTED INDEX (what the search engine builds):
┌────────────┬────────────────────────────────────────┐
│ TERM       │ DOCUMENT IDs (with positions)          │
├────────────┼────────────────────────────────────────┤
│ brown      │ [1]                                    │
│ databas*   │ [2, 3]   ← Stemmed form                │
│ dog        │ [1]                                    │
│ fox        │ [1]                                    │
│ jump*      │ [1, 3]   ← "jumps" and "jumped" match  │
│ lazi*      │ [1]                                    │
│ level      │ [3]                                    │
│ perform*   │ [2, 3]   ← Stemmed form                │
│ queri*     │ [2]                                    │
│ quick      │ [1, 2]                                 │
└────────────┴────────────────────────────────────────┘
Query: "database performance"
  → Look up "databas*" → [2, 3]
  → Look up "perform*" → [2, 3]
  → Intersection + relevance scoring → doc 2 (score: 0.89), doc 3 (score: 0.76)
The inverted index maps terms to documents, making lookups fast. Analysis pipelines normalize text so that variations like “jumps” and “jumped” match the same term.
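To make the structure concrete, here is a minimal Python sketch that builds an inverted index with positions over the three documents above and answers an AND query by intersecting posting lists. The names are illustrative, not any particular engine’s API, and the analyzer is deliberately crude (lowercase and split only).

```python
from collections import defaultdict

# Toy analysis: lowercase + whitespace split. A real analyzer would also
# stem terms (see the pipeline sketch in the next section).
def analyze(text):
    return text.lower().split()

documents = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Quick database queries improve performance",
    3: "The database jumped to a new performance level",
}

# Inverted index: term -> {doc_id: [positions within the document]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, content in documents.items():
    for position, term in enumerate(analyze(content)):
        index[term][doc_id].append(position)

def search(query):
    """AND query: return the documents containing every query term."""
    postings = [set(index[term]) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

print(dict(index["database"]))          # {2: [1], 3: [1]}
print(search("database performance"))   # {2, 3}
```

Note that with only lowercasing, “jumps” and “jumped” remain separate terms; the analysis pipeline described next is what collapses them to a shared stem, as the jump* entry in the diagram shows.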
How They Work
Inverted Indexes
The core data structure. For each term that appears in any document, the index stores which documents contain that term and where. Searching for “database” jumps directly to the list of documents containing that word.
Analysis Pipeline
Before indexing, text passes through analyzers that:
- Tokenize: Split into terms
- Normalize: Lowercase, remove accents
- Stem: Reduce to root forms
- Expand synonyms (optional): Map equivalent terms to one another
The same pipeline processes search queries, ensuring “Databases” matches documents containing “database.”
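A sketch of such a pipeline, assuming the third-party NLTK package for Porter stemming (any stemmer would do); the synonyms argument is a naive, hypothetical stand-in for an engine’s synonym filter:

```python
import re
from nltk.stem import PorterStemmer   # third-party; pip install nltk

stemmer = PorterStemmer()

def analyze_full(text, synonyms=None):
    """Tokenize, lowercase, stem, and (optionally) expand synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())     # tokenize + normalize
    terms = [stemmer.stem(t) for t in tokens]           # reduce to root forms
    if synonyms:                                        # naive synonym expansion
        terms += [synonyms[t] for t in terms if t in synonyms]
    return terms

print(analyze_full("The database jumped to a new performance level"))
# ['the', 'databas', 'jump', 'to', 'a', 'new', 'perform', 'level']
print(analyze_full("Databases"))   # ['databas'] -- same stem as the indexed documents
```

Because documents and queries pass through the same function, “Databases” in a query and “database” in a document both end up as the stem databas and therefore match.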
Relevance Scoring
Not all matches are equal. TF-IDF (term frequency-inverse document frequency) scores a document higher when it contains query terms that are rare across the corpus but frequent within that document. BM25 refines TF-IDF with term-frequency saturation and document-length normalization, and is the default scorer in most modern engines. Most engines also allow custom scoring with boosts and decay functions.
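Relevance scoring can be sketched over the same toy structures. The function below implements the standard BM25 formula from scratch (k1 and b shown with common default values) and plugs into the documents, index, and analyze defined in the inverted-index sketch earlier; it illustrates the formula, not any engine’s exact scorer.

```python
import math

def bm25(query, index, documents, analyze, k1=1.2, b=0.75):
    """Score every document matching any query term; higher is more relevant."""
    N = len(documents)
    doc_lens = {doc_id: len(analyze(text)) for doc_id, text in documents.items()}
    avgdl = sum(doc_lens.values()) / N
    scores = {}
    for term in analyze(query):
        postings = index.get(term, {})
        n_t = len(postings)                      # documents containing this term
        if n_t == 0:
            continue
        # Rare terms get a larger inverse-document-frequency weight.
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        for doc_id, positions in postings.items():
            tf = len(positions)                  # term frequency in this document
            # Saturating TF, normalized by document length relative to the average.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl))
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(bm25("database performance", index, documents, analyze))
# doc 2 ranks above doc 3: length normalization favors the shorter document
```

Both documents contain every query term, yet doc 2 outranks doc 3 because it is shorter, which mirrors the ordering in the diagram above.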
Faceting and Aggregation
Beyond finding documents, search engines can compute aggregations like counting documents by category, finding the price range across matching products, and bucketing by date. This powers the filtering UI on e-commerce sites.
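Conceptually, faceting is just aggregation over the set of matching documents. A toy illustration with hypothetical product hits (a real engine computes these facets server-side, in the same request that returns the ranked results):

```python
from collections import Counter

# Hypothetical product hits returned by a query.
hits = [
    {"name": "USB-C cable",         "category": "cables",    "price": 9.0},
    {"name": "HDMI cable",          "category": "cables",    "price": 14.0},
    {"name": "Mechanical keyboard", "category": "keyboards", "price": 89.0},
]

def facets(hits):
    by_category = Counter(h["category"] for h in hits)    # doc count per category
    prices = [h["price"] for h in hits]
    price_range = (min(prices), max(prices)) if prices else None
    return {"by_category": dict(by_category), "price_range": price_range}

print(facets(hits))
# {'by_category': {'cables': 2, 'keyboards': 1}, 'price_range': (9.0, 89.0)}
```

An e-commerce UI renders by_category as filter checkboxes with counts and price_range as a slider.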
Near-Real-Time Indexing
Documents become searchable shortly after indexing, typically within a second or two rather than immediately: the engine buffers recent writes and periodically refreshes the searchable view, trading index freshness against indexing and query throughput.
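A toy model of that trade-off: writes land in a buffer and only become visible to queries after a periodic refresh. The refresh_interval parameter is a made-up stand-in for the refresh setting a real engine exposes; lengthening it cuts refresh overhead at the cost of staler results.

```python
import time

class NearRealTimeIndex:
    """Toy model: documents become searchable only after a refresh."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self.buffer = []            # written but not yet searchable
        self.searchable = []        # visible to queries
        self.last_refresh = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)     # fast write path; nothing published yet

    def _maybe_refresh(self):
        if time.monotonic() - self.last_refresh >= self.refresh_interval:
            self.searchable.extend(self.buffer)   # publish buffered documents
            self.buffer.clear()
            self.last_refresh = time.monotonic()

    def search(self, term):
        self._maybe_refresh()
        return [d for d in self.searchable if term in d.lower()]

idx = NearRealTimeIndex(refresh_interval=1.0)
idx.add("database performance tuning")
print(idx.search("database"))   # likely [] -- not refreshed yet
time.sleep(1.1)
print(idx.search("database"))   # ['database performance tuning']
```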
Why They Excel
Relevance-Ranked Results
Search engines return results sorted by relevance rather than simply returning every document that contains a keyword.
Linguistic Intelligence
Stemming, synonyms, and language-specific analysis make search feel natural to users.
Performance at Scale
Inverted indexes make searching millions of documents fast.
Faceted Navigation
The aggregation capabilities power the filters and counts that users expect in search interfaces.
Why They Struggle
Not a System of Record
Search engines are designed for search, not primary storage. Data should live in a database of record with the search engine as a derived index.
Eventual Consistency
Indexing isn’t instantaneous; there is a delay between writing a document and its appearing in search results.
Operational Complexity
Elasticsearch clusters require capacity planning, shard management, and monitoring.
When to Use Them
Search engines are appropriate for:
- Product search: E-commerce with filtering, sorting, relevance
- Content discovery: Documentation, knowledge bases, help centers
- Log analysis: Searching through millions of log lines
- Any application where users type natural-language queries expecting ranked results
When to Look Elsewhere
Don’t use a search engine as your primary database. Don’t use it for simple key-based lookups. Don’t use it where exact matching matters more than relevance.
Examples
Elasticsearch dominates enterprise search with a rich feature set, distributed architecture, and the ELK stack (Elasticsearch, Logstash, Kibana) for log analysis.
OpenSearch is Amazon’s fork of Elasticsearch, fully open source under the Apache 2.0 license.
Apache Solr is mature and feature-rich but has a steeper learning curve than Elasticsearch.
Typesense and Meilisearch focus on developer experience and ease of use, often described as “search that just works” compared to Elasticsearch’s complexity.