Azure Synapse Analytics for System Architects
What Is Azure Synapse Analytics
Azure Synapse Analytics is Microsoft's answer to distributed analytics platforms. Unlike single-purpose database systems, Synapse integrates multiple computation engines and storage layers into one workspace, allowing data engineers to conduct exploratory analysis, load data at scale, and run complex queries on massive datasets.
The platform combines three core components: dedicated SQL pools (formerly Azure SQL Data Warehouse) for traditional MPP data warehousing, Apache Spark pools for distributed data processing and machine learning, and built-in data integration through pipelines and linked services. This convergence of technologies eliminates the need to stitch together separate services.
What Problems Synapse Solves
Without Synapse (separate services):
- Multiple services to manage (data warehouse, Spark cluster, data factory, storage accounts)
- Complex authentication and authorization across services
- Difficult data movement between analytics engines
- Separate performance tuning requirements per service
- Higher operational complexity and cost
With Synapse:
- Single workspace integrating SQL, Spark, and integration capabilities
- Unified authentication through workspace identity
- Direct data access between engines (minimal movement)
- Unified performance tuning through workload management
- Simplified operations and cost optimization
How Synapse Differs from AWS Redshift
Architects familiar with AWS should understand these key differences:
| Aspect | AWS Redshift | Azure Synapse |
|---|---|---|
| Architecture | Single MPP cluster (all nodes same type) | Flexible pools: dedicated SQL, Spark, Data Explorer |
| Scaling | Resize cluster (downtime or Redshift RA3 with managed storage) | Scale DWUs independently; Spark auto-scales |
| Storage | Attached SSD for dc2/ra3; object storage for Spectrum | Separate (Data Lake Storage Gen2) with compute pools |
| Compute engines | SQL only (Spark requires separate EMR) | SQL + Spark + Data Explorer in one workspace |
| Data integration | AWS Glue, Lambda, or custom ETL | Synapse Pipelines (Azure Data Factory) built-in |
| Authentication | IAM roles | Entra ID + RBAC + SQL authentication (legacy) |
| Cost model | Per-node-hour | DWU-hour (SQL) + Spark cluster-hour |
| External data access | Redshift Spectrum (external tables over S3) | PolyBase/external tables over Azure Storage and Data Lake Storage Gen2 (Parquet, ORC, delimited text) |
| Query optimization | Automatic query optimization | Workload management + result set caching + materialized views |
Dedicated SQL Pools (MPP Data Warehouse)
What Dedicated SQL Pools Are
Dedicated SQL pools are the core data warehousing engine in Synapse. They implement massive parallel processing (MPP), distributing data across multiple nodes and executing queries in parallel.
Key Characteristics:
- MPP architecture: data is always spread across 60 distributions, with the number of compute nodes determined by the DWU level (up to 60 nodes)
- Data Warehouse Units (DWUs) as the capacity metric (DW100c to DW30000c)
- Reserved compute (you pay for capacity whether used or not)
- Scales in discrete service levels (DW100c, DW200c, DW500c, DW1000c, and so on up to DW30000c) rather than arbitrary 100-DWU increments
- Pause/resume capability to reduce costs when not in use
- Built-in redundancy and backup
Data Warehouse Units (DWUs)
DWUs represent compute capacity and are the primary scaling metric. A DWU encompasses CPU, memory, and I/O bundled together.
DWU scaling tiers:
| DWU Level | Compute Nodes | Distributions per Node | Memory (Pool) | Use Case |
|---|---|---|---|---|
| DW100c | 1 | 60 | 60 GB | Development, testing, small workloads |
| DW500c | 1 | 60 | 300 GB | Small production workloads |
| DW1000c | 2 | 30 | 600 GB | Medium workloads with standard concurrency |
| DW2000c | 4 | 15 | 1,200 GB | Medium-large workloads |
| DW5000c | 10 | 6 | 3,000 GB | Large workloads with high concurrency |
| DW10000c | 20 | 3 | 6,000 GB | Very large workloads |
| DW30000c | 60 | 1 | 18,000 GB | Extreme scale (enterprise workloads) |
Scaling rules of thumb:
- Start with DW500c or DW1000c for initial workloads
- Scale up by 2x when query performance degrades
- Scale down when sustained CPU utilization is below 20%
- Doubling the DWU level roughly doubles the compute, memory, and concurrency capacity of the pool
- Scaling takes roughly 10-15 minutes; open connections drop and in-flight queries are cancelled, so schedule scale operations outside query windows (a T-SQL sketch follows this list)
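Scaling can be scripted as well as performed in the portal. A minimal sketch, run against the master database of the workspace's logical server; SalesPool is an assumed pool name:
ALTER DATABASE SalesPool MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Check the current service level (also run against master)
SELECT DATABASEPROPERTYEX('SalesPool', 'ServiceObjective');
The ALTER DATABASE statement returns quickly while the scale operation continues in the background.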
Gen1 vs Gen2
Synapse only supports Gen2 (newest architecture). Gen1 is deprecated. Gen2 provides significantly better query performance and is the only option for new pools. Do not consider Gen1 for new designs.
Distribution Strategies
Data in a dedicated SQL pool is distributed across nodes according to a distribution key. This key determines which rows reside on which nodes and critically affects query performance.
Three distribution types:
1. Hash Distribution (most common)
- Rows are distributed based on a hash of the distribution column
- Distributes data evenly across nodes when the column has high cardinality
- Best for: Fact tables and columns that are frequently joined
- Avoid: low-cardinality columns (booleans, status flags), date columns (all rows for a given date land in one distribution, creating hot spots), and columns with heavy skew or many NULL/default values
Example:
CREATE TABLE Orders (
OrderID INT,
CustomerID INT,
Amount DECIMAL,
OrderDate DATE
)
WITH (
DISTRIBUTION = HASH(CustomerID)
);
For a query joining Orders to Customers on CustomerID, both tables should be hash-distributed on CustomerID. This collocates the join data on the same nodes, eliminating network shuffles.
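To pair with the Orders table above, a Customers table distributed on the same key might look like this (a sketch with an assumed schema):
CREATE TABLE Customers (
    CustomerID INT,
    CustomerName NVARCHAR(200),
    Region NVARCHAR(50)
)
WITH (
    DISTRIBUTION = HASH(CustomerID)
);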
2. Round-Robin Distribution (default)
- Rows are distributed sequentially across nodes in round-robin fashion
- Provides even distribution regardless of data patterns
- Best for: Staging tables, temporary work tables, or tables without obvious join columns
- Avoid: Tables that are frequently joined on specific columns (causes shuffle)
When you create a table without specifying distribution, Synapse defaults to round-robin.
3. Replicated Distribution (smallest tables only)
- The entire table is replicated to every node
- Eliminates shuffle when the table is joined on any column
- Best for: Small reference/dimension tables (< 2 GB recommended)
- Storage overhead: Replicated table occupies (number of nodes) × (table size) storage
Use case: A Customers table (100 MB) replicated across 10 nodes occupies 1 GB total.
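For comparison, a minimal sketch of the other two distribution types on hypothetical tables:
-- Small dimension table copied to every compute node
CREATE TABLE DimProduct (
    ProductID INT,
    ProductName NVARCHAR(200)
)
WITH (
    DISTRIBUTION = REPLICATE
);

-- Staging table with no obvious join column
CREATE TABLE StagingOrders (
    OrderID INT,
    CustomerID INT,
    Amount DECIMAL(18,2)
)
WITH (
    DISTRIBUTION = ROUND_ROBIN
);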
Distribution design pattern:
- Identify your largest fact table
- Hash-distribute on the primary join column
- Hash-distribute other large tables on compatible join columns
- Replicate small dimension tables
- Use round-robin for staging/temporary tables
Workload Management
Workload management controls resource allocation across concurrent queries, preventing one user or query from monopolizing the pool.
Resource Classes (legacy approach, still used):
Resource classes allocate memory and concurrency slots:
| Resource Class | Memory Allocation | Use Case |
|---|---|---|
| smallrc (default) | ~3% of pool memory per query | Lightweight queries, reports |
| mediumrc | ~10% per query | Standard analytical queries |
| largerc | ~22% per query | Complex queries, index builds |
| xlargerc | ~70% per query | Very large data processing jobs |
| staticrc10 through staticrc80 | Fixed memory per query regardless of DWU | Workloads that should gain concurrency, not per-query memory, as the pool scales |
Dynamic classes (smallrc through xlargerc) allocate a percentage of pool memory, so each query gets more memory as you scale up; static classes allocate a fixed amount, so scaling up buys additional concurrency instead. In both models, larger memory grants mean fewer queries can run concurrently.
Workload Groups (newer, recommended):
Workload groups provide finer-grained control through classification and resource allocation policies:
CREATE WORKLOAD GROUP AnalyticsGroup
WITH (
MIN_PERCENTAGE_RESOURCE = 20,
CAP_PERCENTAGE_RESOURCE = 50,
REQUEST_MIN_RESOURCE_GRANT_PERCENT = 1.0
);
This defines a workload group that guarantees 20% of pool resources, caps usage at 50%, and allocates minimum 1% per request.
Best practice: Use workload groups for production workloads. Resource classes are simpler but less flexible. Most organizations blend both approaches: resource classes for interactive queries, workload groups for batch jobs.
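Workload groups take effect through classifiers that route incoming requests to a group. A minimal sketch, assuming the AnalyticsGroup defined above and a hypothetical database user named reporting_app_user:
CREATE WORKLOAD CLASSIFIER ReportingClassifier
WITH (
    WORKLOAD_GROUP = 'AnalyticsGroup',
    MEMBERNAME = 'reporting_app_user',
    IMPORTANCE = HIGH
);
Requests from that user then run under AnalyticsGroup's resource guarantees, and IMPORTANCE influences ordering when requests queue.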
Performance Optimization Techniques
1. Materialized Views
Materialized views pre-aggregate and cache query results, reducing computation for repeated queries.
CREATE MATERIALIZED VIEW DailySalesSummary
WITH (DISTRIBUTION = HASH(ProductID))
AS
SELECT
OrderDate,
ProductID,
SUM(Amount) as DailyTotal,
COUNT(*) as OrderCount
FROM Orders
GROUP BY OrderDate, ProductID;
Queries referencing the underlying tables may automatically use the materialized view if the result matches. This is called automatic query rewrite and does not require changing application code.
When to use: High-cost aggregations that are queried repeatedly (dashboards, reports). Maintenance is automatic: the view is updated incrementally as base-table data changes, though a periodic rebuild restores performance once the accumulated change-tracking overhead grows.
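A minimal rebuild sketch, using the view defined above:
ALTER MATERIALIZED VIEW DailySalesSummary REBUILD;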
2. Result Set Caching
Result set caching stores complete query results in the pool and serves identical queries from the cache without recomputation. It is enabled per database and can be toggled per session:
-- Enable for the pool; run against the master database (pool name is an assumed example)
ALTER DATABASE SalesPool SET RESULT_SET_CACHING ON;

-- Bypass the cache for the current session when needed
SET RESULT_SET_CACHING OFF;
Cache is invalidated when underlying tables are modified. This is particularly effective for dashboards where the same queries run repeatedly within short time windows.
Trade-off: Cache takes pool memory. For very large result sets, the overhead may not be worth it.
3. Table Design
- Use clustered columnstore indexes (the default). They compress data by 10x on average
- Partition large tables by date or region to enable partition elimination (see the sketch after this list)
- Keep tables in columnar format; avoid heaps
- Use appropriate data types (BIGINT when INT suffices wastes storage)
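A sketch combining these options — clustered columnstore, hash distribution, and date partitioning — on a hypothetical fact table:
CREATE TABLE FactSales (
    SaleID BIGINT NOT NULL,
    CustomerID INT NOT NULL,
    Amount DECIMAL(18,2),
    SaleDate DATE NOT NULL
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(CustomerID),
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2025-01-01', '2025-04-01', '2025-07-01', '2025-10-01'))
);
A filter such as WHERE SaleDate >= '2025-07-01' then touches only the partitions that can contain matching rows.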
PolyBase for External Data
PolyBase allows queries against external data sources without loading data into the pool. You define external tables pointing to data in Azure Data Lake Storage, Azure Blob Storage, SQL Server, or other sources.
CREATE EXTERNAL DATA SOURCE AzureLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@datalake.dfs.core.windows.net',
    CREDENTIAL = WorkspaceIdentity  -- assumes a database-scoped credential with this name already exists
);

-- The external table below references a file format, which must be defined once
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
CREATE EXTERNAL TABLE ExternalOrders (
OrderID INT,
CustomerID INT,
Amount DECIMAL
)
WITH (
LOCATION = '/orders/2025/',
DATA_SOURCE = AzureLake,
FILE_FORMAT = ParquetFormat
);
SELECT * FROM ExternalOrders WHERE Amount > 1000;
Queries push filters to the external source when possible, reducing data movement.
When to use PolyBase:
- Exploratory queries on data lake data before loading to the warehouse
- Real-time access to frequently changing external data
- Archiving historical data to cheaper storage while maintaining queryability
- Joining external and warehouse data
Trade-off: External queries are slower than internal table queries (network latency, no local optimization). Use for ad-hoc queries or infrequent access patterns.
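To illustrate the last use case in the list above, external and internal tables can be joined directly. A sketch using the ExternalOrders table defined earlier and the Customers table sketched in the distribution section:
SELECT c.Region, SUM(e.Amount) AS TotalAmount
FROM ExternalOrders e
JOIN Customers c ON e.CustomerID = c.CustomerID
GROUP BY c.Region;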
COPY Command
The COPY statement is the modern way to load data into dedicated SQL pools. It is simpler than setting up PolyBase external tables for ingestion and supersedes older bulk-loading approaches such as bcp.
COPY INTO Orders
FROM 'https://datalake.blob.core.windows.net/data/orders.parquet'
WITH (
CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = 'SAS_TOKEN'),
FILE_TYPE = 'PARQUET'
);
COPY is typically faster than loading through INSERT ... SELECT and handles several formats (CSV, Parquet, ORC).
When to use: Loading raw data from data lakes or external storage during ETL processes. Use COPY as your default ingestion method.
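For delimited files, COPY exposes options for headers, delimiters, and error tolerance. A sketch with assumed paths and the hypothetical StagingOrders table from earlier:
COPY INTO StagingOrders
FROM 'https://datalake.blob.core.windows.net/data/orders/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,                -- skip the header row
    FIELDTERMINATOR = ',',
    MAXERRORS = 100,             -- tolerate up to 100 rejected rows
    ERRORFILE = '/rejected/',    -- rejected rows are written here
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = 'SAS_TOKEN')
);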
Synapse Workspace and Studio
Workspace Architecture
A Synapse workspace is the container for all analytics resources: dedicated SQL pools, Spark pools, pipelines, notebooks, and configurations.
Workspace components:
- Dedicated SQL pools: Data warehouse compute engines
- Spark pools: Apache Spark clusters for big data processing
- Data Explorer pools: Kusto Query Language (KQL) for log and telemetry analytics
- Synapse Pipelines: ETL orchestration (Azure Data Factory)
- Linked Services: Connections to external data sources
- Integration Runtimes: Compute engines for pipelines and Spark
Workspace-level settings:
- Managed VNet: Optional network isolation for workspace resources
- Storage account: Primary data lake (Data Lake Storage Gen2)
- SQL Server firewall: Controls client access to SQL pools
- Authentication: Entra ID by default, SQL authentication available but not recommended
Synapse Studio
Synapse Studio is the web-based IDE for Synapse development. It provides:
- SQL query editor for writing T-SQL against dedicated pools
- Notebook environment for interactive Spark and Python development
- Pipeline builder for visual ETL workflow design
- Linked service configuration and management
- Data integration and publishing workflows
Studio is where data engineers and analysts spend most of their time. It reduces the need for separate tools such as SQL Server Management Studio or standalone Jupyter notebooks.
Data Lake Storage Integration
Synapse tightly integrates with Azure Data Lake Storage Gen2. Every workspace has a default storage account where raw data is stored before loading into SQL pools.
Typical data flow:
- Raw data lands in Data Lake Storage (via ingestion pipelines, ADLS SDKs, or ADF Copy activity)
- Spark jobs transform and clean data (written as Parquet)
- Cleaned data is loaded into dedicated SQL pools via COPY
- Analytics queries run against the warehouse
- Results are cached or exported
This separation (storage separate from compute) allows independent scaling: you can have a massive data lake with small compute pools (pay only for processing time) or vice versa.
Spark Pools
What Spark Pools Are
Apache Spark pools provide distributed big data processing within Synapse. Unlike dedicated SQL pools which require structured, schema-enforced data, Spark handles unstructured data (images, logs, raw text) and complex transformations.
Key characteristics:
- Auto-scaling clusters (configurable minimum and maximum node counts) with automatic pause when idle
- Support for PySpark (Python), Spark SQL, Scala, and .NET Spark
- Integration with notebooks for interactive development
- Access to the same data lake as SQL pools
- Can write results back to data lake or SQL pools
Spark vs Dedicated SQL Pools
| Aspect | Spark Pools | Dedicated SQL Pools |
|---|---|---|
| Data format | Unstructured, semi-structured, structured | Structured, schema-enforced |
| Language | Python, Scala, SQL, .NET | T-SQL only |
| Processing model | DAG-based distributed execution | MPP query execution |
| State | On-demand; auto-pauses when idle | Reserved capacity (pause manually to stop billing) |
| Latency | Minutes to hours | Seconds to minutes |
| Use case | ETL, ML feature engineering, data exploration | Analytics queries, reports |
| Cost | Per core-hour of usage | Per DWU-hour |
When to use Spark:
- Transforming raw data before loading to warehouse
- Building machine learning features from raw data
- Processing unstructured data (logs, images, documents)
- Interactive data exploration (notebooks)
When to use Dedicated SQL pools:
- Running analytical queries on structured data
- Building dashboards and reports
- Complex OLAP operations with joins and aggregations
- Serving data via BI tools
Data Explorer Pools
Data Explorer pools provide Kusto Query Language (KQL) for analyzing logs, metrics, and telemetry data. They are optimized for time-series queries with rapid ingestion and interactive exploration.
When to use Data Explorer:
- Analyzing application logs and diagnostics
- Real-time monitoring and alerting on metrics
- Time-series analysis (system performance over time)
- Ad-hoc exploration of operational data
- Not recommended for data warehouse workloads (use dedicated SQL pools instead)
Synapse Pipelines (Data Integration)
Synapse Pipelines (built on Azure Data Factory) orchestrate data movement and transformation workflows.
Common pipeline patterns:
- Copy Activity: Move data from source (Blob Storage, SQL Server, Salesforce) to Data Lake Storage Gen2
- Data Flow Activity: Visual transformation of data (join, filter, aggregate)
- Notebook Activity: Execute Spark notebooks as part of the pipeline
- SQL Script Activity: Execute T-SQL stored procedures or queries
- Wait Activity: Pause execution for a fixed period (condition-based waiting uses Until or Validation activities)
Example pipeline flow:
- Copy Activity: Download data from external API to Data Lake
- Data Flow: Clean and transform data
- Notebook Activity: Run Spark ML feature engineering
- Copy Activity: Load results into dedicated SQL pool
- SQL Script: Execute stored procedure to update aggregate tables
Pipelines are triggered on schedule, via webhooks, or manually.
Common Pitfalls
Pitfall 1: Hash Distribution Without Understanding Join Cardinality
Problem: Hash-distributing a table on a column that is not the one used in its most frequent joins.
Result: Every query involving those joins shuffles data across the nodes, causing severe performance degradation (queries can run orders of magnitude slower).
Solution: Hash-distribute on the column used in the most common joins. For fact tables, this is typically the foreign key to the largest dimension. Analyze your query patterns before choosing distribution keys.
Pitfall 2: Replicating Large Tables
Problem: Replicating a table that grows to 10+ GB.
Result: Storage multiplies across compute nodes (10 GB × 60 nodes = 600 GB). Replication also adds latency after updates, because each node's copy must be refreshed.
Solution: Replicate only small reference tables (< 2 GB). For larger tables, use hash distribution on join columns instead.
Pitfall 3: Leaving Dedicated SQL Pools Running During Off-Hours
Problem: Pausing the pool only during scheduled maintenance, not during nights/weekends.
Result: Unnecessary compute costs accumulate. A DW2000c running 24/7 costs roughly twice as much per month as one paused overnight and on weekends.
Solution: Automate pause/resume on a schedule. Use Synapse Pipelines, Azure Logic Apps, or Automation Accounts to pause the pool when no queries are running (e.g., 6 PM to 8 AM).
Pitfall 4: No Workload Management, All Queries Same Priority
Problem: Running ad-hoc developer queries and critical business reports with the same resource allocation.
Result: A long-running exploration query from a developer blocks important executive reports. SLAs are violated.
Solution: Implement workload groups. Assign critical reports high priority and more resources. Assign exploration queries lower priority and limit their concurrent sessions. Use query classification to automatically route queries to appropriate groups.
Pitfall 5: Rebuilding Materialized Views During Peak Hours
Problem: Running ALTER MATERIALIZED VIEW ... REBUILD (or other heavy maintenance) during peak usage hours when the pool should be reserved for user queries.
Result: Rebuilds consume resources, slowing user queries and violating query SLAs.
Solution: Schedule rebuilds and other maintenance during off-peak hours (late night, early morning), and run them under a workload group or resource class that caps how much of the pool they can consume.
Pitfall 6: PolyBase Queries on Massive External Datasets
Problem: Using PolyBase to query 100+ GB of Parquet files in the data lake for exploratory analysis.
Result: Query takes 30+ minutes even with filter pushdown. Network and compute become bottlenecks.
Solution: For exploratory queries, load a sample of external data into the warehouse first. Use Spark pools for transformations on the full dataset, then load results to SQL for final analysis. PolyBase is best for smaller datasets or joining external data with warehouse data.
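One way to pull a sample into the warehouse is CTAS over the external table. A sketch, assuming the ExternalOrders definition from the PolyBase section and an illustrative sampling predicate:
CREATE TABLE dbo.OrdersSample
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT OrderID, CustomerID, Amount
FROM ExternalOrders
WHERE OrderID % 100 = 0;   -- crude 1% sample; replace with whatever sampling predicate fits the data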
Pitfall 7: Ignoring Data Skew in Distribution
Problem: Hash-distributing on a column with highly skewed values (e.g., 90% of rows have the same value).
Result: One node receives 90% of the data, becoming a bottleneck. Queries run at the speed of the slowest node.
Solution: Before distributing, query the system to check for data skew across distributions. If one distribution has > 5% more rows than others, choose a different column or use round-robin. Uneven distribution compounds over time as data grows.
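A quick way to check row counts per distribution is DBCC PDW_SHOWSPACEUSED; a sketch against the Orders table from earlier:
DBCC PDW_SHOWSPACEUSED('dbo.Orders');
The output includes one row per distribution; large differences in the ROWS column across distributions indicate skew.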
Architectural Patterns
Pattern 1: Standard Data Warehouse
Use case: Structured analytics with multiple data sources, moderate scale (1-10 billion rows).
Data Sources (CRM, ERP, Web Logs)
        ↓
Data Lake Storage Gen2 (Raw Zone)
        ↓
Synapse Pipelines (Copy Activity)
        ↓
Spark Pool (Cleaning & Transformation)
        ↓
Data Lake Storage (Prepared Zone)
        ↓
Dedicated SQL Pool (Analytics)
        ↓
Power BI / Reporting Tools
Components:
- One or two small Spark pools (32-128 cores) for nightly ETL
- One dedicated SQL pool (DW1000c-DW2000c) for day-time queries
- Data Lake organized as raw → prepared → analytics zones
- Materialized views for common aggregations
- Workload groups separating user queries from ETL jobs
Pattern 2: High-Concurrency Analytics (Multi-tenant SaaS)
Use case: SaaS analytics platform serving many tenants, 100+ concurrent users, strict query latency SLAs.
Application Data
        ↓
Dedicated SQL Pool (DW5000c+)
        ↓
Result Set Caching + Materialized Views
        ↓
Power BI Embedded or Custom Portal
Design considerations:
- Higher DWU level (DW5000c or above) for concurrency
- Workload groups with priority: critical reports (higher), ad-hoc (lower)
- Aggressive caching: result set cache + materialized views
- Row-level security (RLS) to isolate tenant data in queries
- Replicated dimensions for quick joins
- Connection pooling at the application level
Pattern 3: Data Lake + Ad-hoc Analytics
Use case: Data science and exploration, large datasets (100+ GB), interactive queries via notebooks.
Raw Data (Data Lake)
        ↓
Spark Pools (Notebook-based exploration)
        ↓
Subset → Dedicated SQL Pool (for final analysis)
        ↓
Visualization
Design considerations:
- Larger Spark pools (256+ cores) for interactive workloads
- Smaller dedicated SQL pool (or none) if primary work is in Spark
- Delta format in data lake for ACID transactions
- Notebooks as the primary development environment
- Use PolyBase for joining external data with warehouse data
Key Takeaways
- Synapse is a unified analytics platform combining SQL, Spark, and integration. This convergence reduces operational complexity compared to managing separate data warehouse, Spark clusters, and ETL services.
- Data distribution strategy is the most important design decision for dedicated SQL pools. Hash-distribute on join columns, replicate small tables, and use round-robin for staging. Bad distribution keys cause orders-of-magnitude performance degradation.
- DWUs are your primary scaling lever, but pause/resume is your primary cost control. Pausing during off-hours can cut compute costs roughly in half. Scaling up improves concurrency and query speed.
- Workload management prevents query interference. Without it, a single slow query can block all others. Resource classes or workload groups are essential for any production environment.
- Spark pools are for transformation, SQL pools are for analytics. Use Spark for ETL and exploratory work; use dedicated SQL pools for low-latency queries that serve many concurrent users.
- Materialized views cache expensive aggregations without code changes. They automatically rewrite queries that match their definition. Build them on high-cost, frequently accessed aggregations.
- Result set caching provides easy wins for repeated queries. Dashboards and reports that run the same queries multiple times per day benefit significantly.
- PolyBase is for exploration, not for large-scale joins. Use it to query sample data or reference small external tables. For large external datasets, transform with Spark first, then load the results into SQL.
- Integration with Data Lake Storage is a core strength. Synapse reads/writes Parquet efficiently. This storage-compute separation allows flexible scaling of each independently.
- Data exploration should happen in Spark notebooks before loading to the warehouse. Spark handles unstructured and semi-structured data naturally. SQL pools are optimized for structured, pre-validated data.