Modern Data Architecture on AWS


What Problems Modern Data Architecture Solves

Modern data architecture eliminates the silos, rigidity, and cost inefficiencies of traditional data warehousing while enabling analytics at scale across structured and unstructured data.

Traditional data architecture challenges:

  • Data warehouses require schema-on-write, making it expensive to store exploratory or semi-structured data (JSON, logs, events)
  • ETL pipelines break when source schemas evolve, requiring constant maintenance
  • Separate systems for batch analytics (data warehouse), real-time analytics (streaming), and machine learning (feature stores) create data silos
  • Scaling data warehouses vertically costs $50,000-$500,000/year for enterprise capacity
  • Analysts wait days for data engineering to load new datasets into the warehouse

Concrete scenario: Your e-commerce company runs analytics on PostgreSQL, storing orders, customers, and products in normalized tables. The analytics team wants to analyze clickstream data (20 GB/day of JSON logs), customer support tickets (unstructured text), and product images (binary data). Loading this data into PostgreSQL requires complex ETL: flatten JSON into tables, parse text into structured fields, extract image metadata. The ETL pipeline takes 2 weeks to build per new data source and breaks monthly when schemas change. PostgreSQL storage costs $15,000/month and query performance degrades as data grows. The data science team can’t access raw clickstream data for ML because it’s been aggregated away in the ETL process.

What modern data architecture provides: A data lake on S3 stores all data in raw form (JSON, Parquet, CSV, images) for $23/TB/month. Glue catalogs schemas automatically. Athena queries structured data with SQL. SageMaker reads raw data for ML. QuickSight visualizes aggregates. Redshift Spectrum queries S3 directly when needed. Each team accesses the same source data using the right tool for their use case—no ETL required unless transforming data adds value.

Real-world impact: After implementing a data lake, storage costs dropped from $15,000/month (PostgreSQL) to $500/month (S3 for 20 TB). New data sources go live in hours instead of weeks—upload to S3, run Glue crawler, query with Athena. Schema evolution is handled automatically by Glue crawlers. Data scientists access raw clickstream data directly for ML models. Analysts query aggregated views via Athena or Redshift Spectrum. Total analytics cost: $2,000/month (S3 + Athena + QuickSight) vs $30,000/month (PostgreSQL cluster + ETL infrastructure + BI licenses).
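
A minimal sketch of that flow with boto3 (bucket, crawler, database, and table names are illustrative assumptions): land a file in the raw prefix, run a pre-created Glue crawler to register the schema, then query the new table with Athena.

import time
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Land a raw file in the data lake
s3.upload_file("events-20241115.json.gz", "my-datalake",
               "raw/clickstream/year=2024/month=11/day=15/events-20241115.json.gz")

# 2. Run a pre-created crawler that targets s3://my-datalake/raw/clickstream/
#    (in practice, wait for the crawler to finish before querying the new table)
glue.start_crawler(Name="clickstream-raw-crawler")

# 3. Query the cataloged table with Athena; cost is per TB scanned
qid = athena.start_query_execution(
    QueryString="SELECT count(*) FROM clickstream WHERE year = 2024 AND month = 11",
    QueryExecutionContext={"Database": "raw_db"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"},
)["QueryExecutionId"]

while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] \
        in ("QUEUED", "RUNNING"):
    time.sleep(2)

print(athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"][1])  # row 0 is the header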

Core Architectural Patterns

Modern data architecture on AWS follows three primary patterns: Data Lake, Data Warehouse, and Lake House (hybrid). The choice depends on data types, query patterns, and team skills.

Pattern 1: Data Lake (S3 + Glue + Athena)

What it is: Store all data in S3 in open formats (Parquet, JSON, CSV), catalog with Glue, query with Athena or other compute engines.

Key characteristics:

  • Schema-on-read: Store raw data first, apply schema when querying
  • Open formats: Parquet, ORC, Avro, JSON—no proprietary storage
  • Decoupled storage and compute: S3 for storage, Athena/EMR/Redshift Spectrum for compute
  • Pay per query: No idle infrastructure costs

Architecture components:

Data Sources → S3 Data Lake → Glue Data Catalog → Query Engines
                   ↓                                    ↓
           (Parquet files)                    Athena, Redshift Spectrum,
           Organized by zone:                  EMR, SageMaker, QuickSight
           - Raw
           - Processed
           - Curated

Zone organization:

  1. Raw zone: Original data in original format
    • Location: s3://datalake/raw/
    • Format: JSON, CSV, logs (as received)
    • Retention: Indefinite (compliance, reprocessing)
    • Who writes: Ingestion pipelines, Kinesis Firehose, IoT
  2. Processed zone: Cleaned, deduplicated, converted to columnar format
    • Location: s3://datalake/processed/
    • Format: Parquet with Snappy compression
    • Partitioning: By date (year/month/day)
    • Who writes: Glue ETL jobs
  3. Curated zone: Business-level aggregates, joined with dimensions
    • Location: s3://datalake/curated/
    • Format: Parquet optimized for BI queries
    • Partitioning: By business dimensions (region, product category, date)
    • Who writes: Glue ETL jobs, dbt transformations

Example folder structure:

s3://my-datalake/
  raw/
    clickstream/
      year=2024/month=11/day=15/events-20241115-143022.json.gz
    orders/
      year=2024/month=11/day=15/orders.csv
  processed/
    clickstream/
      year=2024/month=11/day=15/events.parquet
    orders/
      year=2024/month=11/day=15/orders.parquet
  curated/
    daily_revenue/
      year=2024/month=11/day=15/revenue.parquet
    user_segments/
      snapshot_date=2024-11-15/segments.parquet

When to use data lake pattern:

  • ✅ Diverse data types (structured, semi-structured, unstructured)
  • ✅ Schema evolution frequent or unpredictable
  • ✅ ML/AI workloads requiring raw data access
  • ✅ Cost-sensitive (S3 storage cheapest option)
  • ✅ Infrequent or ad-hoc queries (pay per query via Athena)

Limitations:

  • ❌ Query performance slower than data warehouse for complex joins
  • ❌ No transactional support (ACID) without Delta Lake/Iceberg
  • ❌ Requires data engineering discipline to maintain quality

Pattern 2: Data Warehouse (Redshift)

What it is: Centralized columnar database optimized for OLAP queries with fast joins, aggregations, and concurrent access.

Key characteristics:

  • Schema-on-write: Define schema before loading data
  • Proprietary storage: Redshift columnar format optimized for queries
  • Provisioned capacity: Pay for cluster (nodes) 24/7 or pause when idle
  • ACID transactions: Support for INSERT, UPDATE, DELETE, MERGE

Architecture components:

Data Sources → Staging (S3) → COPY to Redshift → BI Tools
                                     ↓
                              Materialized Views
                              Stored Procedures
                              Concurrency Scaling

Redshift cluster types:

Cluster Type     | Use Case                | Node Type              | Cost (approx)
Development      | Testing, small datasets | dc2.large (2 nodes)    | 2 × $0.25/hour (~$365/month)
Production Small | 1-10 TB, <50 users      | ra3.4xlarge (2 nodes)  | ~$6.80/hour (~$4,964/month)
Production Large | 10-100 TB, 100+ users   | ra3.16xlarge (4 nodes) | ~$52/hour (~$37,960/month)

Redshift optimization techniques:

  1. Distribution styles: Control how data distributes across nodes
    • KEY: Distribute by join key (co-locate related rows)
    • ALL: Replicate small dimension tables to all nodes
    • EVEN: Round-robin distribution (default)
    • AUTO: Redshift chooses based on table size
  2. Sort keys: Order data on disk for faster range scans
    • Compound: Optimal when queries filter on multiple columns in order
    • Interleaved: Optimal when queries filter on different column combinations
  3. Materialized views: Pre-compute expensive aggregations
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT
      DATE_TRUNC('day', order_date) as day,
      product_category,
      SUM(revenue) as total_revenue
    FROM orders
    GROUP BY DATE_TRUNC('day', order_date), product_category;
    
  4. Concurrency scaling: Auto-add query capacity during peak usage
    • Free for 60 minutes/day
    • $6.80/hour beyond free tier
    • Scales to 10× base cluster capacity

When to use data warehouse pattern:

  • ✅ Primarily structured data (tables with known schemas)
  • ✅ High-frequency queries on same datasets (amortize cluster cost)
  • ✅ Complex joins across many tables (100+ table star schemas)
  • ✅ Concurrent users (100+ analysts querying simultaneously)
  • ✅ Sub-second query latency required for dashboards

Limitations:

  • ❌ Expensive for infrequent queries (pay for idle cluster)
  • ❌ Schema changes require ALTER TABLE, data reload
  • ❌ Not suitable for raw unstructured data (JSON, logs, images)

Pattern 3: Lake House (S3 + Glue + Redshift Spectrum + Athena)

What it is: Combine data lake flexibility with data warehouse performance by querying S3 data from Redshift and using Redshift for hot data.

Key characteristics:

  • Hybrid storage: Hot data in Redshift, cold data in S3
  • Unified query interface: SQL queries across both Redshift and S3
  • Federated queries: Join Redshift tables with S3 tables in single query
  • Cost optimization: Pay for Redshift cluster (hot data) + $5/TB scanned (S3 data)

Architecture components:

S3 Data Lake (Cold Data)
    ↓
Redshift Spectrum (External Tables)
    ↓
Redshift Cluster (Hot Data) ←→ BI Tools, Applications
    ↓
Materialized Views (Pre-aggregated S3 data)

Example use cases:

Use case 1: Hot/cold data separation

Store last 90 days of transaction data in Redshift (frequently queried, fast access). Store older data in S3 (rarely queried, cheap storage). Query across both seamlessly.

-- Redshift table (hot data, last 90 days)
CREATE TABLE orders_recent (
  order_id BIGINT,
  order_date DATE,
  customer_id BIGINT,
  revenue DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);

-- Spectrum external table (cold data, older than 90 days)
CREATE EXTERNAL TABLE orders_historical (
  order_id BIGINT,
  order_date DATE,
  customer_id BIGINT,
  revenue DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-datalake/curated/orders/';

-- Union query across hot and cold data
SELECT
  DATE_TRUNC('month', order_date) as month,
  SUM(revenue) as total_revenue
FROM (
  SELECT order_date, revenue FROM orders_recent
  UNION ALL
  SELECT order_date, revenue FROM orders_historical
) AS all_orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

Use case 2: Enrich data warehouse with data lake data

Join Redshift customer table with S3 clickstream events for comprehensive analysis.

-- Redshift table (customer master data)
SELECT c.customer_id, c.customer_name, c.region
FROM customers c;

-- Spectrum external table (clickstream events in S3)
CREATE EXTERNAL TABLE clickstream_events (
  event_id STRING,
  customer_id BIGINT,
  event_type STRING,
  event_timestamp TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://my-datalake/processed/clickstream/';

-- Federated query joining Redshift and S3 data
SELECT
  c.customer_name,
  c.region,
  COUNT(e.event_id) as total_events,
  COUNT(DISTINCT e.event_type) as unique_event_types
FROM customers c
INNER JOIN clickstream_events e
  ON c.customer_id = e.customer_id
WHERE e.year = 2024 AND e.month = 11
GROUP BY c.customer_name, c.region;

Cost optimization with lake house:

Scenario                          | Data Warehouse Only      | Lake House
10 TB hot data (last 90 days)     | 10 TB in Redshift        | 10 TB in Redshift
90 TB cold data (older)           | 90 TB in Redshift        | 90 TB in S3
Storage cost/month                | 100 TB × $24/TB = $2,400 | 10 TB × $24/TB + 90 TB × $23/TB = $2,310
Query cost (100 TB/month scanned) | Included in cluster      | $5/TB × 100 TB = $500
Total cost/month                  | $2,400 + cluster         | $2,810 + cluster

At S3 Standard pricing (about $23/TB/month), S3 storage costs nearly the same as RA3 managed storage ($24/TB/month), so offloading alone saves little and Spectrum scan charges can outweigh the difference. The lake house pays off when cold data is compacted into compressed Parquet (reducing scanned bytes and Spectrum charges), transitioned to cheaper S3 tiers through lifecycle policies, and queried infrequently, while the Redshift cluster stays sized for hot data only.

When to use lake house pattern:

  • ✅ Mix of hot data (queried daily) and cold data (queried rarely)
  • ✅ Need to join structured data warehouse data with semi-structured S3 data
  • ✅ Want Redshift performance for critical dashboards while reducing storage costs
  • ✅ Existing Redshift cluster can be extended with S3 data

Limitations:

  • ❌ Spectrum query performance slower than native Redshift tables (network I/O to S3)
  • ❌ Still requires Redshift cluster (ongoing cost)
  • ❌ Spectrum charges $5/TB scanned (optimize with partitioning and Parquet)

Service Selection Framework

Choose services based on data characteristics, query patterns, and team capabilities.

Decision Matrix: Storage Layer

Data Type                     | Volume    | Change Frequency | Best Storage
Structured                    | <1 TB     | Low              | RDS/Aurora (OLTP), Redshift (OLAP)
Structured                    | 1-100 TB  | Low-Medium       | Redshift, S3 + Athena
Structured                    | >100 TB   | Any              | S3 + Athena, Redshift (hot data only)
Semi-structured (JSON, logs)  | Any       | Any              | S3 (convert to Parquet for analytics)
Unstructured (images, videos) | Any       | Any              | S3

Decision Matrix: Query Engine

Query Pattern               | Concurrency      | Latency Requirement | Best Engine
Ad-hoc exploration          | Low (1-10 users) | 5-30 seconds OK     | Athena
Scheduled reports           | Low              | Minutes OK          | Athena, Glue
Interactive dashboards      | High (50+ users) | <3 seconds          | Redshift, QuickSight with SPICE
Real-time analytics         | High             | <1 second           | ElastiCache, DynamoDB, OpenSearch
Complex joins (10+ tables)  | Medium           | <10 seconds         | Redshift
ML feature extraction       | Low              | Minutes-hours OK    | SageMaker Processing, EMR

Decision Matrix: Transformation Layer

Transformation Complexity             | Data Volume | Skill Set    | Best Tool
Simple (aggregations, filters)        | Any         | SQL          | Athena CTAS, Redshift views
Moderate (joins, deduplication)       | <100 GB/day | SQL          | Glue ETL (PySpark), dbt
Complex (ML features, custom logic)   | Any         | Python/Scala | Glue ETL, EMR, SageMaker
Real-time (streaming transformations) | Any         | SQL/Python   | Kinesis Data Analytics, Lambda

Cost-Based Decision Framework

Example scenario: 50 TB of transaction data, 20 queries/day averaging 500 GB scanned each.

Option 1: Athena only

  • Storage: 50 TB × $0.023/GB = $1,150/month
  • Queries: 20 × 500 GB × 30 days × $5/TB = $1,500/month
  • Total: $2,650/month

Option 2: Redshift only

  • Cluster: 50 TB on 4 × ra3.16xlarge nodes = $52/hour = $37,960/month
  • Queries: Included
  • Total: $37,960/month

Option 3: Lake house (Athena + Redshift)

  • Storage S3: 45 TB × $0.023/GB = $1,035/month
  • Redshift cluster (5 TB hot data): 2 × ra3.4xlarge = $6.80/hour = $4,964/month
  • Queries (Athena): 10 × 500 GB × 30 days × $5/TB = $750/month (half queries on Redshift)
  • Total: $6,749/month

Option 4: QuickSight SPICE + Athena

  • Storage S3: 50 TB × $0.023/GB = $1,150/month
  • Athena (SPICE refresh): 1 × 500 GB × 30 days × $5/TB = $75/month
  • QuickSight: 10 authors × $24 + 40 readers × $5 = $440/month
  • Total: $1,665/month

Winner: Option 4 (QuickSight SPICE + Athena) for dashboard-centric workloads. Option 1 (Athena only) for ad-hoc analytics.
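
The arithmetic behind these options fits in a small cost-model sketch; the constants mirror the prices quoted above and should be checked against current AWS pricing before relying on them.

# Rough monthly cost model for the 50 TB / 20-queries-per-day scenario above.
S3_STANDARD_PER_GB = 0.023      # USD per GB-month
ATHENA_PER_TB_SCANNED = 5.0     # USD per TB scanned

def athena_only(storage_tb, queries_per_day, gb_scanned_per_query, days=30):
    storage = storage_tb * 1000 * S3_STANDARD_PER_GB
    scans = queries_per_day * days * (gb_scanned_per_query / 1000) * ATHENA_PER_TB_SCANNED
    return storage + scans

def spice_plus_athena(storage_tb, refreshes_per_day, gb_scanned_per_refresh,
                      authors, readers, days=30):
    storage = storage_tb * 1000 * S3_STANDARD_PER_GB
    refresh = refreshes_per_day * days * (gb_scanned_per_refresh / 1000) * ATHENA_PER_TB_SCANNED
    quicksight = authors * 24 + readers * 5
    return storage + refresh + quicksight

print(f"Athena only:    ${athena_only(50, 20, 500):,.0f}/month")              # ~ $2,650
print(f"SPICE + Athena: ${spice_plus_athena(50, 1, 500, 10, 40):,.0f}/month") # ~ $1,665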

Integration Patterns

Modern data architecture integrates multiple services to handle different workloads efficiently.

Pattern 1: Batch Ingestion Pipeline

Use case: Daily batch files from external systems (partners, SaaS tools, databases).

Architecture:

External System → S3 (Landing) → Lambda (Trigger) → Glue Job (Transform) → S3 (Curated) → Glue Crawler → Athena

Implementation:

  1. Ingestion: Upload CSV/JSON files to s3://datalake/raw/source-system/
  2. Trigger: S3 event triggers Lambda function
  3. Transform: Lambda starts Glue ETL job (a sketch follows this list)
  4. Glue job: Read raw files, clean, convert to Parquet, write to s3://datalake/curated/
  5. Catalog: Glue crawler updates Data Catalog with new partitions
  6. Query: Athena queries curated data, QuickSight dashboards refresh
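
A minimal Lambda handler for steps 2-3 might look like the following sketch; the Glue job name and the --INPUT_PATH argument are assumptions, and the function's role needs glue:StartJobRun.

import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Triggered by S3 ObjectCreated events on the raw/ prefix; starts the Glue ETL job."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the newly arrived object to the job as a job argument
        glue.start_job_run(
            JobName="orders-raw-to-curated",          # example job name
            Arguments={"--INPUT_PATH": f"s3://{bucket}/{key}"},
        )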

Example Glue job (PySpark):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read raw CSV from S3
raw_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://datalake/raw/orders/"]},
    format="csv",
    format_options={"withHeader": True}
)

# Transform: clean, deduplicate, add columns
from pyspark.sql.functions import current_timestamp
cleaned_df = raw_df.toDF() \
    .dropDuplicates(["order_id"]) \
    .withColumn("processed_at", current_timestamp())

# Write to curated zone as Parquet with partitioning
cleaned_df.write \
    .partitionBy("order_date") \
    .mode("overwrite") \
    .parquet("s3://datalake/curated/orders/")

Pattern 2: Real-Time Streaming Pipeline

Use case: Clickstream events, IoT sensor data, application logs requiring near-real-time analytics.

Architecture:

Application → Kinesis Data Stream → Kinesis Firehose → S3 (Parquet) → Glue Catalog → Athena
                                        ↓
                                   Lambda (Real-time processing)
                                        ↓
                                   DynamoDB (Real-time dashboards)

Implementation:

  1. Ingest: Application sends events to Kinesis Data Stream
  2. Real-time processing: Lambda function consumes stream, writes aggregates to DynamoDB (a sketch follows this list)
  3. Batch storage: Kinesis Firehose buffers events, writes to S3 in Parquet format every 5 minutes
  4. Catalog: Glue crawler runs hourly to discover new S3 partitions
  5. Query: Athena for historical analysis, DynamoDB for real-time dashboards
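
The real-time branch (step 2) can be a small Kinesis-triggered Lambda that folds events into per-minute counters in DynamoDB; the table name, key schema, and payload fields below are assumptions.

import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clickstream_counters")   # example table: pk = event_type, sk = minute

def handler(event, context):
    """Kinesis-triggered Lambda: aggregate events into per-minute counters."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        minute = payload["event_timestamp"][:16]   # e.g. '2024-11-15T14:30'
        table.update_item(
            Key={"event_type": payload["event_type"], "minute": minute},
            UpdateExpression="ADD event_count :one",
            ExpressionAttributeValues={":one": 1},
        )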

Example Firehose configuration (Parquet conversion). Format conversion requires the extended S3 destination, a Glue table that supplies the schema, and a delivery IAM role; the role ARN below is a placeholder:

{
  "DeliveryStreamName": "clickstream-to-s3",
  "ExtendedS3DestinationConfiguration": {
    "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
    "BucketARN": "arn:aws:s3:::my-datalake",
    "Prefix": "raw/clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    "BufferingHints": {
      "SizeInMBs": 128,
      "IntervalInSeconds": 300
    },
    "CompressionFormat": "UNCOMPRESSED",
    "DataFormatConversionConfiguration": {
      "SchemaConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "DatabaseName": "clickstream_db",
        "TableName": "events",
        "Region": "us-east-1"
      },
      "InputFormatConfiguration": {
        "Deserializer": { "OpenXJsonSerDe": {} }
      },
      "OutputFormatConfiguration": {
        "Serializer": {
          "ParquetSerDe": {
            "Compression": "SNAPPY"
          }
        }
      }
    }
  }
}

Pattern 3: Data Warehouse Offload

Use case: Move historical data from expensive Redshift to cheap S3 while maintaining query compatibility.

Architecture:

Redshift (Last 90 days) ←→ Redshift Spectrum ←→ S3 (Historical data)

Implementation:

  1. Current state: All historical data (10 years) in Redshift = 500 TB × $24/TB = $12,000/month
  2. Target state: 90 days in Redshift (12 TB), rest in S3 (488 TB)

Migration process:

-- 1. Unload historical data to S3 in Parquet format
UNLOAD ('SELECT *, EXTRACT(year FROM order_date) AS year, EXTRACT(month FROM order_date) AS month FROM orders WHERE order_date < CURRENT_DATE - 90')
TO 's3://my-datalake/curated/orders_historical/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
PARQUET
PARTITION BY (year, month);

-- 2. Create Spectrum external table pointing to S3
CREATE EXTERNAL TABLE orders_historical (
  order_id BIGINT,
  customer_id BIGINT,
  order_date DATE,
  revenue DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://my-datalake/curated/orders_historical/';

-- 3. Add partitions
ALTER TABLE orders_historical ADD
  PARTITION (year=2014, month=1) LOCATION 's3://my-datalake/curated/orders_historical/year=2014/month=1/'
  PARTITION (year=2014, month=2) LOCATION 's3://my-datalake/curated/orders_historical/year=2014/month=2/';
  -- ... repeat for all partitions

-- 4. Delete historical data from Redshift
DELETE FROM orders WHERE order_date < CURRENT_DATE - 90;
VACUUM DELETE ONLY orders;

-- 5. Create view unifying recent and historical data
CREATE VIEW orders_all AS
SELECT order_id, customer_id, order_date, revenue FROM orders             -- recent data in Redshift
UNION ALL
SELECT order_id, customer_id, order_date, revenue FROM orders_historical  -- historical data in S3
WITH NO SCHEMA BINDING;  -- required for views that reference external (Spectrum) tables
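
Adding years of partitions by hand in step 3 is tedious; a small script can generate and submit the ALTER TABLE statements through the Redshift Data API (cluster identifier, database, and user below are placeholders).

import boto3

rsd = boto3.client("redshift-data")

BASE = "s3://my-datalake/curated/orders_historical"

def add_partitions(start_year=2014, end_year=2024):
    """Generate and run ALTER TABLE ... ADD PARTITION for every year/month."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            sql = (
                f"ALTER TABLE orders_historical ADD IF NOT EXISTS "
                f"PARTITION (year={year}, month={month}) "
                f"LOCATION '{BASE}/year={year}/month={month}/';"
            )
            rsd.execute_statement(
                ClusterIdentifier="analytics-cluster",   # placeholder
                Database="analytics",                    # placeholder
                DbUser="etl_user",                       # placeholder
                Sql=sql,
            )

add_partitions()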

Cost savings:

  • Before: 500 TB of RA3 managed storage × $24/TB = $12,000/month
  • After: 12 TB × $24/TB + 488 TB × $23/TB (S3 Standard) ≈ $288 + $11,224 = $11,512/month
  • The saving on S3 Standard alone is modest (~$490/month); the bigger reductions come from transitioning rarely-queried historical prefixes to cheaper storage classes with S3 lifecycle policies (covered later in this guide) and from keeping the cluster sized for 12 TB of hot data rather than 500 TB.

Pattern 4: Multi-Account Data Sharing

Use case: Centralized data lake accessed by multiple AWS accounts (business units, customers, partners).

Architecture (using Lake Formation):

Central Data Lake Account (S3 + Glue Catalog)
    ↓ (Lake Formation Grants)
Consumer Account 1 → Athena queries via shared catalog
Consumer Account 2 → Redshift Spectrum queries via shared catalog
Consumer Account 3 → QuickSight dashboards via shared catalog

Implementation:

In producer account (data lake owner):

  1. Enable Lake Formation on S3 bucket and Glue database
  2. Grant permissions to consumer accounts
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::222222222222:root \
  --resource '{"Database": {"Name": "my_database"}}' \
  --permissions DESCRIBE
  3. Grant table-level access:
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::222222222222:root \
  --resource '{"Table": {"DatabaseName": "my_database", "Name": "orders"}}' \
  --permissions SELECT

In consumer account:

  1. Create a resource link to the shared database in the local Glue catalog (via the Lake Formation console or aws glue create-database with a TargetDatabase pointing at account 111111111111), then map it into Redshift with an external schema; the IAM role below is a placeholder:
CREATE EXTERNAL SCHEMA shared_db
FROM DATA CATALOG
DATABASE 'my_database'  -- name of the local resource link
IAM_ROLE 'arn:aws:iam::222222222222:role/SpectrumRole'
REGION 'us-east-1';
  2. Query shared data via Athena or Redshift Spectrum
SELECT * FROM shared_db.orders
WHERE order_date >= DATE '2024-11-01';

Security: Lake Formation enforces permissions at table and column level. Consumer accounts see only granted tables/columns.

Data Governance and Quality

Data Catalog Standards

Naming conventions:

  • Databases: {env}_{domain}_{purpose} (e.g., prod_sales_analytics, dev_marketing_raw)
  • Tables: {source}_{entity} (e.g., shopify_orders, ga4_events)
  • Columns: Snake_case, descriptive (e.g., customer_id, order_total_usd)

Metadata standards:

  • Table descriptions: Business purpose, data owner, refresh frequency
  • Column descriptions: Data type, sample values, business meaning
  • Tags: Environment (prod/dev), PII status (contains_pii/no_pii), cost center

Example well-documented table:

CREATE EXTERNAL TABLE shopify_orders (
  order_id BIGINT COMMENT 'Unique order identifier from Shopify',
  customer_id BIGINT COMMENT 'Foreign key to customers table',
  order_date DATE COMMENT 'Date order was placed (UTC)',
  total_price_usd DECIMAL(10,2) COMMENT 'Total order value in USD after discounts',
  order_status STRING COMMENT 'Current status: pending, fulfilled, cancelled, refunded'
)
COMMENT 'Daily snapshot of Shopify orders. Refreshes nightly at 2 AM UTC. Owner: sales-analytics@company.com'
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://my-datalake/curated/shopify_orders/'
TBLPROPERTIES (
  'classification'='parquet',
  'data_owner'='sales-analytics@company.com',
  'refresh_frequency'='daily',
  'contains_pii'='yes'
);
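
To keep these standards from drifting, a small audit script can walk the Glue Data Catalog and flag tables missing a description or the expected properties; the database name and required keys below are assumptions based on the conventions above.

import boto3

glue = boto3.client("glue")
REQUIRED_PARAMS = {"data_owner", "refresh_frequency", "contains_pii"}  # example standard

def audit_catalog(database="prod_sales_analytics"):
    """Report tables that violate the documentation standards above."""
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            problems = []
            if not table.get("Description"):
                problems.append("missing table description")
            missing = REQUIRED_PARAMS - set(table.get("Parameters", {}))
            if missing:
                problems.append(f"missing TBLPROPERTIES: {sorted(missing)}")
            if problems:
                print(f"{database}.{table['Name']}: {'; '.join(problems)}")

audit_catalog()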

Data Quality Framework

Implement data quality checks at multiple stages:

1. Ingestion validation (Lambda/Glue):

from pyspark.sql.functions import col, count, when

def validate_schema(df, required_columns):
    """Ensure required columns exist with correct types."""
    missing = set(required_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

def validate_data_quality(df):
    """Check for nulls, duplicates, invalid values."""
    # Check for nulls in critical columns
    null_counts = df.select([
        count(when(col(c).isNull(), c)).alias(c)
        for c in df.columns
    ]).collect()[0].asDict()

    critical_nulls = {k: v for k, v in null_counts.items() if k in ['order_id', 'customer_id'] and v > 0}
    if critical_nulls:
        raise ValueError(f"Nulls found in critical columns: {critical_nulls}")

    # Check for duplicates on primary key
    duplicate_count = df.groupBy("order_id").count().filter("count > 1").count()
    if duplicate_count > 0:
        raise ValueError(f"Found {duplicate_count} duplicate order_ids")

    return True

2. Athena queries for ongoing monitoring:

-- Detect schema drift (unexpected new columns)
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name NOT IN ('order_id', 'customer_id', 'order_date', 'total_price_usd');

-- Detect data freshness issues (no data in last 24 hours)
SELECT MAX(order_date) as latest_order_date
FROM orders
HAVING MAX(order_date) < CURRENT_DATE - INTERVAL '1' DAY;

-- Detect data volume anomalies (daily row count varies >20%)
WITH daily_counts AS (
  SELECT
    order_date,
    COUNT(*) as row_count,
    AVG(COUNT(*)) OVER (ORDER BY order_date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) as avg_count
  FROM orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '30' DAY
  GROUP BY order_date
)
SELECT order_date, row_count, avg_count
FROM daily_counts
WHERE ABS(row_count - avg_count) / avg_count > 0.20;
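
These checks can run on a schedule (for example, a Lambda on an EventBridge rule) so that failures page someone instead of waiting to be noticed. A sketch for the freshness query, where the database name and results location are placeholders and the SNS topic reuses the alerting example later in this section:

import time
import boto3

athena = boto3.client("athena")
sns = boto3.client("sns")

FRESHNESS_SQL = """
SELECT MAX(order_date) AS latest_order_date
FROM orders
HAVING MAX(order_date) < CURRENT_DATE - INTERVAL '1' DAY
"""

def handler(event, context):
    """Run the freshness check; notify if the query returns any data row."""
    qid = athena.start_query_execution(
        QueryString=FRESHNESS_SQL,
        QueryExecutionContext={"Database": "curated_db"},   # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"},
    )["QueryExecutionId"]

    while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] \
            in ("QUEUED", "RUNNING"):
        time.sleep(2)

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    if len(rows) > 1:  # first row is the header; any data row means stale data
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:data-quality-alerts",
            Subject="Data freshness check failed",
            Message="orders has no data newer than 24 hours",
        )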

3. Glue Data Quality (built-in service):

# Define data quality ruleset
ruleset = """
Rules = [
  IsComplete "order_id",
  IsUnique "order_id",
  ColumnValues "order_status" in ["pending", "fulfilled", "cancelled", "refunded"],
  ColumnLength "customer_email" <= 255,
  Completeness "customer_id" > 0.95
]
"""

# Evaluate ruleset
quality_result = glueContext.evaluate_data_quality(
    frame=orders_df,
    ruleset=ruleset,
    publishing_options={
        "cloudwatch_metrics_enabled": True,
        "results_s3_prefix": "s3://my-bucket/dq-results/"
    }
)

# Fail job if quality thresholds not met
if quality_result.overall_status == "FAIL":
    raise Exception("Data quality checks failed")

Access Control with Lake Formation

Implement fine-grained permissions:

Row-level filtering (restrict data by region). Lake Formation implements this as a data cells filter, created from the console or the create-data-cells-filter API rather than SQL DDL; for example, a filter that returns only US rows of sales_data (account ID is a placeholder):

# sales_us_only.json
{
  "TableCatalogId": "123456789012",
  "DatabaseName": "sales_db",
  "TableName": "sales_data",
  "Name": "sales_us_only",
  "RowFilter": { "FilterExpression": "region = 'US'" },
  "ColumnWildcard": {}
}

aws lakeformation create-data-cells-filter --table-data file://sales_us_only.json

Column-level filtering (hide PII from analysts):

# Grant access to orders table but exclude PII columns
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/AnalystRole \
  --resource '{
    "TableWithColumns": {
      "DatabaseName": "sales_db",
      "Name": "orders",
      "ColumnNames": ["order_id", "order_date", "total_price_usd", "order_status"]
    }
  }' \
  --permissions SELECT

Analysts with AnalystRole cannot see customer_email, customer_phone, or other PII columns.

Common Pitfalls

Storing Raw Data in Wrong Format

Symptom: Athena queries cost 10× more than expected, performance slow.

Root cause: Raw JSON stored uncompressed instead of Parquet with Snappy compression.

Example cost impact:

  • 1 TB raw JSON, query selects 5 columns of 50 = 1 TB scanned = $5.00/query
  • 1 TB converted to Parquet (10:1 compression) = 100 GB, query selects 5 columns = 10 GB scanned = $0.05/query

Solution: Always convert raw data to Parquet in processed/curated zones.
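
For a single modest file, the conversion can be as small as the pandas sketch below (paths follow the folder layout shown earlier; reading and writing s3:// paths assumes the s3fs and pyarrow packages are installed). At data lake scale, the Glue PySpark job shown earlier does the same job.

import pandas as pd

# Read newline-delimited JSON from the raw zone and rewrite it as
# Snappy-compressed Parquet in the processed zone (paths are examples).
df = pd.read_json(
    "s3://my-datalake/raw/clickstream/year=2024/month=11/day=15/events-20241115-143022.json.gz",
    lines=True,
)
df.to_parquet(
    "s3://my-datalake/processed/clickstream/year=2024/month=11/day=15/events.parquet",
    compression="snappy",
    index=False,
)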

Missing Partition Filters

Symptom: Queries slow and expensive despite partitioned tables.

Root cause: Queries don’t filter on partition keys, Athena scans all partitions.

Example:

-- Bad: scans all 365 days
SELECT SUM(revenue) FROM orders WHERE order_status = 'completed';

-- Good: scans only November 2024
SELECT SUM(revenue)
FROM orders
WHERE year = 2024 AND month = 11 AND order_status = 'completed';

Solution: Always include partition key predicates (year, month, day) in WHERE clause.

Over-Provisioning Redshift Cluster

Symptom: Redshift cluster costs $40,000/month but CPU utilization averages 15%.

Root cause: Cluster sized for peak load (Black Friday) but runs 24/7 at low utilization rest of year.

Solution: Use Redshift Serverless or implement auto-pause/resume.

Redshift Serverless:

  • Pay per RPU-hour (Redshift Processing Unit)
  • Auto-scales from 8 RPUs to 512 RPUs based on load
  • Cost: $0.375/RPU-hour
  • Example: Average 32 RPUs × 730 hours/month = $8,760/month vs $40,000 for over-provisioned cluster

Alternative: Pause cluster during off-hours (nights, weekends).

  • Cluster runs 12 hours/day × 5 days/week = 260 hours/month (vs 730 hours 24/7)
  • Cost reduction: 65% ($40,000 → $14,250)
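
One way to implement the pause/resume alternative is a pair of scheduled Lambda functions (or EventBridge Scheduler targets) calling the Redshift API; the cluster identifier is a placeholder, and note that a paused cluster still accrues storage charges.

import boto3

redshift = boto3.client("redshift")
CLUSTER_ID = "analytics-cluster"   # placeholder

def pause_handler(event, context):
    """Scheduled for 20:00 on weekdays and all weekend."""
    redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)

def resume_handler(event, context):
    """Scheduled for 08:00 on weekdays."""
    redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)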

Not Implementing Data Lifecycle Policies

Symptom: S3 storage costs grow unbounded, storing 10 years of raw logs that are never queried after 90 days.

Root cause: No lifecycle policy to transition old data to cheaper storage or delete it.

Solution: Implement S3 lifecycle policies.

Example lifecycle policy (expire objects after 2,555 days, roughly 7 years):

{
  "Rules": [{
    "ID": "archive-raw-logs",
    "Filter": {"Prefix": "raw/logs/"},
    "Status": "Enabled",
    "Transitions": [
      {
        "Days": 90,
        "StorageClass": "INTELLIGENT_TIERING"
      },
      {
        "Days": 365,
        "StorageClass": "DEEP_ARCHIVE"
      }
    ],
    "Expiration": {
      "Days": 2555
    }
  }]
}
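
The same policy can be applied programmatically; a boto3 sketch, with the bucket name as an example:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake",   # example bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},  # roughly 7 years
        }]
    },
)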

Cost impact:

  • S3 Standard: $0.023/GB/month
  • S3 Intelligent-Tiering (infrequent access): $0.0125/GB/month
  • S3 Glacier Deep Archive: $0.00099/GB/month

1 TB data stored 7 years:

  • All S3 Standard: $0.023 × 1,000 GB × 84 months = $1,932
  • Lifecycle policy: 3 months Standard + 9 months Intelligent-Tiering + 72 months Glacier = $69 + $113 + $71 = $253
  • Savings: $1,679 (87%)

Mixing Concerns in Data Lake Zones

Symptom: Processed zone contains mix of clean Parquet files and incomplete/corrupted files from failed jobs.

Root cause: Writing directly to final location instead of staging-then-promotion pattern.

Solution: Write to a staging prefix, validate, then promote to the final location.

Bad pattern:

# Write directly to final location
df.write.parquet("s3://datalake/processed/orders/year=2024/month=11/day=15/")
# If job fails mid-write, partial data left in final location

Good pattern:

# Write to staging location
staging_path = "s3://datalake/staging/orders-20241115-143022/"
df.write.parquet(staging_path)

# Validate data quality
validate_data_quality(spark.read.parquet(staging_path))

# Promote after validation. S3 has no atomic rename, so copy the validated
# objects to the final prefix and delete the staging prefix; register the
# partition in the catalog only after all copies complete.
import boto3
s3 = boto3.client('s3')

bucket = "datalake"
staging_prefix = "staging/orders-20241115-143022/"
final_prefix = "processed/orders/year=2024/month=11/day=15/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=staging_prefix):
    for obj in page.get("Contents", []):
        dest_key = final_prefix + obj["Key"][len(staging_prefix):]
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": obj["Key"]},
                       Key=dest_key)
        s3.delete_object(Bucket=bucket, Key=obj["Key"])
# For very large datasets, use S3 Batch Operations or `aws s3 sync` instead.

Not Monitoring Data Quality

Symptom: Dashboards show incorrect metrics for weeks before discovery.

Root cause: No automated data quality checks, rely on users reporting issues.

Solution: Implement automated quality checks with alerts.

Example CloudWatch alarm:

# Glue job publishes data quality metrics to CloudWatch
cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_data(
    Namespace='DataQuality',
    MetricData=[{
        'MetricName': 'NullPercentage',
        'Dimensions': [
            {'Name': 'TableName', 'Value': 'orders'},
            {'Name': 'ColumnName', 'Value': 'customer_id'}
        ],
        'Value': null_percentage,  # computed earlier in the job: null rows / total rows * 100
        'Unit': 'Percent'
    }]
)

# Create alarm: alert if null percentage > 5%
cloudwatch.put_metric_alarm(
    AlarmName='orders-customer-id-nulls',
    MetricName='NullPercentage',
    Namespace='DataQuality',
    Dimensions=[
        {'Name': 'TableName', 'Value': 'orders'},
        {'Name': 'ColumnName', 'Value': 'customer_id'}
    ],
    Statistic='Average',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=5.0,
    ComparisonOperator='GreaterThanThreshold',
    ActionsEnabled=True,
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:data-quality-alerts']
)

Key Takeaways

Modern data architecture decouples storage from compute using S3 as the central data lake. This eliminates the rigid schema-on-write requirement of traditional warehouses, reduces storage costs by 90-95%, and enables diverse tools (Athena, Redshift, SageMaker, QuickSight) to access the same data without duplication.

Choose architecture pattern based on workload characteristics. Use pure data lake (S3 + Athena) for cost-sensitive ad-hoc analytics on diverse data types. Use data warehouse (Redshift) for high-frequency, complex queries on structured data with sub-second latency requirements. Use lake house (Redshift + Spectrum) to combine hot data performance with cold data cost efficiency.

Organize data lake into three zones: Raw (original format, immutable), Processed (cleaned, Parquet with partitioning), Curated (business-level aggregates optimized for queries). This separation enables reprocessing, clear ownership, and performance optimization without compromising data lineage.

Cost optimization comes from format conversion and lifecycle policies. Converting JSON to Parquet reduces query costs by 90-98% through columnar compression and column pruning. S3 lifecycle policies transition infrequently-accessed data to Glacier, reducing storage costs by 95% (from $0.023/GB to $0.001/GB).

Implement data governance from day one using Glue Data Catalog naming conventions, Lake Formation access controls, and automated data quality checks. Row-level and column-level security enable multi-tenant architectures where different teams query the same catalog with appropriate filtering.

Common pitfalls involve format choices and lifecycle management. Storing raw JSON instead of Parquet increases costs 10-50×. Not implementing partition pruning causes full-table scans. Over-provisioning Redshift clusters wastes 50-80% of capacity. Missing lifecycle policies cause unbounded storage growth.

Integration patterns connect batch and streaming pipelines to the data lake. Use Glue for batch ETL, Kinesis Firehose for streaming ingestion, and Lambda for event-driven processing. Redshift Spectrum enables federated queries across data warehouse and data lake without data movement.

The lake house pattern is the pragmatic default for most organizations. Keep 90 days of hot data in Redshift for fast dashboards, archive historical data to S3, and query both seamlessly with Redshift Spectrum. This provides Redshift performance where it matters while cutting cold-data storage costs sharply once lifecycle policies move the historical data to cheaper S3 tiers.
