Azure Monitor for System Architects
What Is Azure Monitor
Azure Monitor is the central platform for collecting, analyzing, and alerting on observability data across your Azure estate. It captures three fundamental types of data: metrics (numerical measurements at specific points in time), logs (detailed events and traces), and application performance data.
Unlike point solutions that monitor individual services, Azure Monitor is deeply integrated into the Azure platform. Resource metrics flow automatically to Azure Monitor. Logs are centralized in Log Analytics workspaces. Alerts, dashboards, and analysis tools operate on this unified data.
What Problems Azure Monitor Solves
Without Azure Monitor:
- Resource metrics exist in isolation; no central view of system health
- No structured place to store application and resource logs
- Alerting must be built separately for each service or resource
- Root cause analysis requires hopping between multiple tools
- No consistent way to visualize infrastructure and application state
- Long-term trend analysis and capacity planning become manual
With Azure Monitor:
- Centralized collection of metrics from all Azure resources
- Log Analytics workspace as the central destination for all logs and traces
- Unified alerting with action groups and notification routing
- Workbooks and dashboards for custom visualizations
- Structured querying with KQL (Kusto Query Language) for analysis
- Historical data retention for trend analysis and forecasting
- Integration with Azure Advisor for best practice recommendations
- Application Insights for deep application performance monitoring
How Azure Monitor Differs from AWS CloudWatch
Architects moving from AWS to Azure should understand these key differences:
| Concept | AWS CloudWatch | Azure Monitor |
|---|---|---|
| Core data types | Metrics + Logs (CloudWatch Logs) | Metrics + Logs (Log Analytics) + Application Insights |
| Logs destination | CloudWatch Logs (separate service) | Log Analytics workspace (integrated) |
| Application monitoring | X-Ray or CloudWatch Insights | Application Insights (built-in) |
| Log query language | CloudWatch Logs Insights (limited) | KQL (Kusto Query Language, richer pipe-based query syntax) |
| Alerts | CloudWatch Alarms | Azure Monitor Alerts (metric, log, activity) + Smart Detection |
| Dashboards | CloudWatch Dashboards | Azure Dashboards or Workbooks (more flexible) |
| Logs ingestion cost | Per GB ingested | Per GB ingested and retained |
| Resource metrics | Auto-collected | Auto-collected |
| Agent-based monitoring | CloudWatch Agent | Azure Monitor Agent (AMA) with data collection rules |
Core Azure Monitor Components
Metrics
Metrics are numerical measurements at specific points in time, collected at regular intervals (typically every 1-5 minutes for platform metrics). Every Azure resource automatically emits metrics to Azure Monitor.
Metric characteristics:
- Time-series data: Values associated with timestamps
- Dimensional: Metrics can be filtered by dimensions (e.g., VM metric filtered by VM name, storage account metric filtered by container)
- Real-time: Low latency between measurement and availability
- Retention: Platform metrics retained for 93 days
- Cost: No ingestion cost; you only pay for metric queries and alerts
- Resolution: 1-minute granularity for most platform metrics
Common Azure resource metrics:
| Resource | Key Metrics |
|---|---|
| Virtual Machine | CPU Percentage, Network In/Out, Disk Read/Write Bytes, Available Memory |
| App Service | CPU Time, Memory Working Set, Request Count, Response Time |
| Azure SQL Database | CPU Percentage, Data Space Used, Active Connections, Deadlock Count |
| Storage Account | Transactions, Ingress/Egress, Availability |
| Azure Cosmos DB | Request Units, Latency, Replication Lag |
| Load Balancer | Data Path Availability, Health Probe Status, Backend Health |
When to use metrics:
- Detect resource state changes in real-time
- Trigger alerts on performance thresholds
- Identify trends over hours, days, or months
- Monitor resource capacity and utilization
Logs and Log Analytics
Log Analytics workspaces are the central repositories for all logs and traces. Resources, applications, and services send structured log data to workspaces where it can be queried, analyzed, and retained for long periods.
Log Analytics characteristics:
- Centralized storage: All logs from VMs, applications, and services in one place
- Structured schema: Logs stored as tables with typed columns (e.g., AzureActivity, SecurityEvent, AppTraces)
- Long retention: Configurable from 30 days to 2 years, set per table
- KQL querying: SQL-like language for querying and aggregating log data
- Cost model: Per GB of data ingested and per GB retained (retention charges applied daily)
- Ingestion rate: Default 6 GB per minute per workspace (higher with support request)
Common log sources:
| Source | Log Table | Use Case |
|---|---|---|
| Azure Activity Log | AzureActivity | Track Azure control plane operations (resource creation, deletions, RBAC changes) |
| Resource Diagnostics | Service-specific (e.g., AzureDiagnostics) | VM logs, app logs, database logs, firewall logs |
| Application Insights | AppTraces, AppExceptions, AppRequests | Application performance, errors, dependencies |
| Security Events | SecurityEvent | Windows event logs from VMs |
| Syslog | Syslog | Linux system logs |
| Custom Logs | User-defined tables | Application custom telemetry |
When to use Log Analytics:
- Store structured log data from applications and services
- Query logs with SQL-like syntax to find patterns
- Correlate events across systems
- Investigate incidents and debug failures
- Build long-term audits and compliance records
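As a sketch of the kind of analysis this enables, the query below counts failed Windows sign-in events collected from VMs. It assumes the SecurityEvent table is being populated (for example via the Windows security events data source) and uses event ID 4625, the failed-logon event:

```kusto
// Failed Windows sign-in events (event ID 4625) from the last 24 hours
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
| summarize FailedLogons = count() by Computer, Account
| order by FailedLogons desc
```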
Application Insights
Application Insights is a specialized monitoring service for application performance. It automatically collects request rates, response times, failure rates, dependency tracking, and exceptions from your application code.
Application Insights characteristics:
- Automatic instrumentation: SDKs collect data with minimal code changes
- Request tracking: Each request and its downstream calls are traced
- Dependency tracking: Automatic tracking of calls to databases, external APIs, and other services
- Correlation IDs: Requests and related operations are linked across services
- Sampling: Automatic adaptive sampling reduces cost at high volumes
- Maps and live metrics: Real-time view of application health and dependency relationships
- Logs sink: App Insights data flows to Log Analytics for long-term querying
When to use Application Insights:
- Monitor application request volume, response time, and failure rates
- Track performance regressions or anomalies
- Understand application dependencies and call chains
- Diagnose slow requests or exceptions in production
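For example, assuming workspace-based Application Insights (so request telemetry lands in the AppRequests table), a query like the following surfaces the slowest operations by 95th-percentile duration:

```kusto
// Slowest operations by 95th-percentile duration over the past day
AppRequests
| where TimeGenerated > ago(1d)
| summarize P95DurationMs = percentile(DurationMs, 95), Requests = count() by Name
| top 10 by P95DurationMs desc
```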
Metrics vs Logs: When to Use Which
The choice between metrics and logs depends on the data pattern and analysis goal:
| Aspect | Use Metrics | Use Logs |
|---|---|---|
| Data type | Numerical time-series | Structured events with many fields |
| Query speed | Seconds (optimized for time-series) | Seconds to minutes (depending on volume) |
| Cost at scale | Lower (no ingestion cost) | Higher (ingestion + retention cost) |
| Retention | 93 days (platform metrics) | Configurable (30 days to 2 years) |
| What to track | Resource utilization, performance (CPU, memory, requests) | Detailed events, errors, audit trails |
| Real-time alerting | Best choice for simple thresholds | For complex multi-condition alerts |
| Time-series visualization | Natural fit (dashboards, trending) | Possible but requires aggregation |
| Drill-down investigation | Limited (only metric dimensions) | Excellent (rich event context) |
In practice: Use metrics for alerting and dashboards. Use logs for investigation, troubleshooting, and building a historical audit trail.
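The two worlds can also meet: if a resource's diagnostic settings route platform metrics to a Log Analytics workspace, they land in the AzureMetrics table and can be trended with KQL. A minimal sketch, assuming a VM's Percentage CPU metric is being routed:

```kusto
// Trend routed platform metrics with KQL (AzureMetrics is populated by diagnostic settings)
AzureMetrics
| where TimeGenerated > ago(6h)
| where MetricName == "Percentage CPU"
| summarize AvgCpu = avg(Average) by Resource, bin(TimeGenerated, 15m)
```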
Azure Monitor Agent and Data Collection Rules
The Azure Monitor Agent (AMA) is the modern replacement for the legacy agents (Diagnostics Extension, Log Analytics Agent). It collects performance data and logs from VMs and sends them to Log Analytics workspaces and/or Azure Monitor Metrics.
How AMA Works
AMA is deployed to a VM and configured via Data Collection Rules (DCRs). A DCR specifies which data to collect (performance counters, event logs, syslog) and where to send it (Log Analytics workspace, metrics).
DCR advantages over legacy agents:
- Centralized configuration: Rules apply to multiple VMs at once
- Per-VM customization: Override rules for specific VMs
- Lower overhead: More efficient data collection and filtering
- Transformations: Filter, parse, and enrich data before ingestion
- Cost optimization: Collect only the data you need
Common DCR configurations:
| Configuration | Collects | Sends To |
|---|---|---|
| VM Insights | Performance counters (CPU, memory, disk, network) | Metrics + Log Analytics |
| Windows Event Log | Application, Security, System event logs | Log Analytics |
| Syslog | Linux system logs | Log Analytics |
| Custom Logs | Application-generated files with structured format | Log Analytics |
Deployment Patterns
Per-VM deployment: Assign AMA and a DCR to each VM individually. Use when VMs have unique monitoring needs.
At-scale deployment: Use VM extensions or Azure Policy to deploy AMA and associate DCRs to groups of VMs based on tags or resource groups. This is the recommended production approach.
Alert Types
Azure Monitor supports four types of alerts, each suited for different scenarios:
Metric Alerts
Metric alerts trigger when a metric exceeds a threshold or crosses a boundary condition. They are low-latency (evaluated every 1-5 minutes) and ideal for resource utilization thresholds.
Trigger conditions:
- Static threshold (CPU > 80%)
- Dynamic threshold (anomaly detection based on historical patterns)
- Multiple conditions (all must be true to trigger)
Characteristics:
- Evaluated frequently (near real-time)
- Scoped to a single resource or dimension
- Lower cost (evaluated at the Azure platform level)
When to use: Alert on VM CPU, memory, disk space, network saturation, or application response time thresholds.
Log Alerts
Log alerts execute KQL queries on data in Log Analytics workspaces. They can aggregate data across multiple resources and perform complex analysis before triggering.
Example triggers:
- Count of failed requests > 10 in the last 5 minutes
- Specific error pattern in application logs
- Security event that matches a detection rule
Characteristics:
- Evaluated less frequently (every 5-60 minutes, configurable)
- Can aggregate data across resources
- Higher latency than metric alerts
- Can include complex business logic
When to use: Detect specific error patterns, security events, or correlate data across multiple log sources.
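A log alert is essentially a scheduled KQL query plus a threshold applied to its result. A minimal sketch of the first example trigger, assuming workspace-based Application Insights:

```kusto
// Candidate log-alert query: failed requests in the evaluation window
// (the threshold, e.g. FailedCount > 10, is configured on the alert rule, not in KQL)
AppRequests
| where TimeGenerated > ago(5m)
| where Success == false
| summarize FailedCount = count()
```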
Activity Log Alerts
Activity log alerts trigger on Azure control plane events (resource creation, deletion, role assignments, policy changes). They monitor what happened to your Azure infrastructure itself.
Common triggers:
- Resource deleted
- RBAC role assigned to a user
- Virtual Machine stopped
- Azure Policy assignment changed
When to use: Track infrastructure changes, monitor for unauthorized operations, or notify when key resources are modified.
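If the activity log is also routed to a Log Analytics workspace, the same control-plane events can be explored in the AzureActivity table before (or alongside) configuring an alert. A sketch for recent role assignments:

```kusto
// Recent RBAC role assignments recorded in the activity log
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue == "Microsoft.Authorization/roleAssignments/write"
| project TimeGenerated, Caller, ResourceGroup, ActivityStatusValue
| order by TimeGenerated desc
```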
Smart Detection
Smart Detection (part of Application Insights) automatically detects anomalies in application telemetry using machine learning.
Detects:
- Abnormal increase in failure rates
- Performance degradation compared to historical baselines
- Unusual patterns in exception counts or response times
- Abnormal memory usage patterns
Characteristics:
- No configuration required
- Runs in the background on Application Insights data
- Identifies anomalies that might not trigger threshold-based alerts
When to use: Catch unexpected application behavior changes that wouldn’t trigger traditional threshold alerts.
Action Groups and Notification Routing
Action groups define how alerts are routed and who gets notified. A single action group can trigger multiple actions (email, SMS, webhook, runbook execution, incident creation).
Notification types:
| Action Type | Use Case |
|---|---|
| Email | Immediate notification to team members |
| SMS | Critical alerts requiring immediate action |
| Push notification | Mobile app notification |
| Webhook | Integrate with external systems (PagerDuty, Slack, custom tools) |
| Azure Function/Runbook | Automated remediation (restart service, scale resources) |
| Logic App | Complex workflows (approval, notification, escalation) |
| ITSM Connector | Create incidents in ServiceNow, Jira, or other ITSM tools |
Action group design pattern:
Alert triggered
↓
Action Group evaluates
↓
├─ Email: on-call team
├─ Slack: team-alerts channel
├─ Webhook: PagerDuty (page engineer if critical)
└─ Logic App: auto-restart service and notify
Cost considerations: Action groups themselves are free. Costs come from the underlying services (Logic Apps, Runbooks, ITSM connectors).
Workbooks for Visualization and Dashboards
Workbooks are interactive dashboards that combine charts, tables, KQL queries, and markdown text. They are more flexible than Azure Dashboards and are the modern way to visualize Azure Monitor data.
Workbook capabilities:
- Interactive parameters: Dropdown selectors for time range, resource, environment
- Multiple visualizations: Line charts, bar charts, tables, heatmaps, status indicators
- KQL queries: Custom analysis with full query power
- Conditional formatting: Highlight values based on thresholds
- Sharing: Share with teams, embed in documentation
- Versioning: Save versions and rollback if needed
Common workbook patterns:
| Workbook Type | Purpose |
|---|---|
| Resource Health Dashboard | Status of VM, databases, and services with key metrics |
| Application Performance | Request count, response time, failure rate, top slow requests |
| Security Monitoring | Failed authentication attempts, suspicious activity, compliance status |
| Cost Analysis | Resource costs over time, cost anomalies |
| Incident Response | Timeline of events, correlated logs, actions taken |
When to use Workbooks: Use them for visual monitoring, investigation, and sharing analysis with non-technical stakeholders. Use Azure Dashboards when you only need a few simple pinned charts.
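Under the hood, most workbook visualizations are backed by a KQL query. A sketch of a typical request-volume chart, assuming workspace-based Application Insights (in a real workbook the time range usually comes from a parameter rather than a hard-coded ago()):

```kusto
// Request volume per hour, rendered as a time chart in a workbook
AppRequests
| where TimeGenerated > ago(7d)
| summarize Requests = count() by bin(TimeGenerated, 1h)
| render timechart
```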
Azure Service Health and Resource Health
Service Health
Service Health provides information about Azure service incidents and planned maintenance that may affect your environment.
Service Health tracks:
- Active incidents (service degradation affecting resources)
- Planned maintenance (scheduled downtime)
- Health advisories (informational updates)
- Upcoming changes to Azure services
Use cases:
- Understand why your resources are slow or unavailable
- Plan maintenance windows around Azure platform changes
- Receive notifications when your subscriptions are affected
Resource Health
Resource Health shows the health status of individual Azure resources. It indicates whether a resource is available, degraded, or unavailable and provides context for the issue.
Resource Health status values:
- Available: Resource is healthy and functioning
- Degraded: Resource is experiencing performance issues
- Unavailable: Resource is not responding
- Unknown: No data about resource status
Integration: Both Service Health and Resource Health integrate with Azure Monitor alerts. You can alert on resource health changes and receive notifications.
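Service Health and Resource Health events are also surfaced through the activity log, so if activity logs are routed to a workspace they can be reviewed with KQL. A sketch:

```kusto
// Service Health and Resource Health events from the activity log
AzureActivity
| where TimeGenerated > ago(30d)
| where CategoryValue in ("ServiceHealth", "ResourceHealth")
| project TimeGenerated, CategoryValue, OperationNameValue, ActivityStatusValue
| order by TimeGenerated desc
```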
Azure Advisor Integration
Azure Advisor analyzes your Azure resources and provides recommendations across reliability, security, performance, cost, and operational excellence.
Advisor provides recommendations for:
- Resizing oversized VMs
- Purchasing reserved instances to reduce the cost of steady-state workloads
- Identifying unused resources
- Enabling backups and disaster recovery
- Configuring missing security baselines
Integration with Azure Monitor: Advisor insights can be surfaced in workbooks or queried via the Advisor API. Use Advisor recommendations to drive capacity planning and cost optimization initiatives.
Diagnostic Settings and Resource Logs
Diagnostic settings control where Azure resources send their logs and metrics. They route resource diagnostic logs (different from platform metrics) to Log Analytics, storage, or event hubs.
What gets sent:
- Resource-specific logs (e.g., firewall flow logs, web app logs, database query logs)
- Metrics (if enabled)
- Activity logs (if configured at subscription level)
Configuration:
- Destination: Log Analytics workspace, Azure Storage, Event Hubs
- Log category: Select which types of logs to collect
- Retention: How long to retain logs in the chosen destination
Common diagnostic settings:
| Resource Type | Key Log Categories |
|---|---|
| Azure SQL Database | SQLInsights (slow queries), Errors, Deadlocks |
| Application Gateway | ApplicationGatewayAccessLog, ApplicationGatewayPerformanceLog |
| Azure Firewall | AzureFirewallApplicationRule, AzureFirewallNetworkRule |
| Key Vault | AuditEvent, AzurePolicyEvaluationDetails |
| Network Security Group | NetworkSecurityGroupEvent, NetworkSecurityGroupRuleCounter |
Cost consideration: Sending diagnostic logs to Log Analytics incurs ingestion costs. Be selective about which log categories you enable.
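One way to see which categories actually generate volume is to query the workspace itself. Older services write to the shared AzureDiagnostics table (newer services use dedicated, resource-specific tables), so a sketch like this shows where the data is coming from:

```kusto
// Resource log volume by resource type and category (shared AzureDiagnostics table)
AzureDiagnostics
| where TimeGenerated > ago(1d)
| summarize Events = count() by ResourceType, Category
| order by Events desc
```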
Multi-Workspace and Cross-Subscription Monitoring
Large organizations often need to monitor resources across multiple subscriptions and regions. Azure Monitor supports this through multi-workspace architectures.
Workspace Design Patterns
Centralized single workspace: All logs from all subscriptions and regions flow to one workspace.
Advantages:
- Simpler queries (no need to union across workspaces)
- Easier to correlate events across teams
- Centralized access control and compliance audit trail
Challenges:
- Single large workspace can reach ingestion/performance limits
- Difficult to isolate sensitive workloads by team or compliance boundary
- Cost attribution across teams becomes complex
Decentralized workspaces by team or application: Each team has their own workspace.
Advantages:
- Teams own their monitoring and alerting
- Easier cost allocation per team
- Sensitive workloads (finance, healthcare) isolated from others
Challenges:
- Cross-team queries require union across multiple workspaces
- Correlation becomes more difficult
- More operational overhead
Hybrid approach (recommended for enterprises): Central workspace for infrastructure (VMs, networks, platform services) + application-specific workspaces for complex services.
Example:
Central Workspace (Operations team)
├── VM metrics and logs
├── Network flows and NSG logs
├── Azure service diagnostics
└── Activity logs
App-A Workspace (App Team A)
├── Application Insights
├── App-specific logs
└── Custom telemetry
App-B Workspace (App Team B)
├── Application Insights
├── App-specific logs
└── Custom telemetry
Cross-Workspace Queries
KQL supports querying multiple workspaces with the workspace() function:
(workspace("workspace1-id").AppTraces | where Timestamp > ago(7d))
union
(workspace("workspace2-id").AppTraces | where Timestamp > ago(7d))
| summarize FailureCount = count() by AppName
This requires appropriate RBAC permissions across workspaces.
Network Monitoring
Azure provides specialized tools for network observability.
Network Watcher
Network Watcher provides diagnostics for network connectivity and performance issues.
Key features:
- NSG flow logs: Track which traffic is allowed/denied by Network Security Groups
- Connection Monitor: Continuous monitoring of network connectivity between endpoints
- Traffic Analytics: Analyze NSG flow logs to identify traffic patterns and anomalies
- Packet Capture: Capture network packets for deep packet inspection (rare, advanced use)
Network Insights
Network Insights provides pre-built workbooks for common networking scenarios:
| Insight | Purpose |
|---|---|
| Virtual Networks | Traffic flow, subnet utilization, peering health |
| Network Interfaces | NIC health, IP utilization, related resources |
| Load Balancers | Backend health, traffic distribution, performance |
| Application Gateway | Request distribution, response times, error rates |
| ExpressRoute | Circuit health, peering status, BGP status |
| VPN Gateway | Connection status, bandwidth utilization |
Connection Monitor
Connection Monitor continuously monitors connectivity from one endpoint to another (VM to VM, VM to external service, on-premises to cloud).
Metrics tracked:
- Probe success rate
- Round-trip latency
- Packet loss percentage
- Jitter (latency variance)
When to use: Ensure connectivity between application tiers, validate hybrid connectivity, monitor external API availability from Azure.
VM Insights and Container Insights
VM Insights
VM Insights provides deep monitoring of VM health and performance with automatic dependency mapping.
What VM Insights includes:
- Performance metrics (CPU, memory, disk, network from the VM’s perspective)
- Guest OS metrics (Windows/Linux performance counters)
- Installed processes and port listening details
- Dependency map (visualization of what each VM connects to)
- Health state based on guest monitoring
Enablement: Install Azure Monitor Agent and associate the VM with a DCR configured for VM Insights.
Use case: Quickly understand which processes are consuming resources, see inter-VM connections, and diagnose performance issues.
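The guest performance data collected by VM Insights lands in the InsightsMetrics table, so it can be queried directly. A sketch for average CPU per VM:

```kusto
// Average CPU utilization per VM from VM Insights guest performance data
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCpuPercent = avg(Val) by Computer, bin(TimeGenerated, 5m)
| order by Computer asc, TimeGenerated asc
```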
Container Insights
Container Insights monitors AKS clusters and provides pod-level metrics, logs, and performance insights.
Metrics and views:
- Cluster, node, and pod resource utilization
- Container logs from all pods
- Kubelet metrics and node health
- Deployment and pod status
- Custom Kubernetes metrics
Integration: Container Insights data flows to Log Analytics, enabling rich querying of Kubernetes events and metrics.
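For example, pod inventory data lands in the KubePodInventory table. A sketch that summarizes current pod status across the cluster:

```kusto
// Latest status per pod, then count of pods by status
KubePodInventory
| where TimeGenerated > ago(1h)
| summarize arg_max(TimeGenerated, PodStatus) by Name, Namespace
| summarize Pods = count() by PodStatus
```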
Cost Considerations
Azure Monitor has two main cost drivers: data ingestion and data retention.
Log Analytics Ingestion Cost
Costs are per GB of data ingested into Log Analytics workspaces. Different log types have different ingestion rates and retention policies.
Cost-saving strategies:
- Filter at source: Use DCRs to exclude unnecessary logs before ingestion
- Sample data: Use adaptive sampling for high-volume sources
- Archive to Storage: Move older logs to Azure Storage (cold path) for long-term archival at lower cost
- Retention per table: Reduce retention period for high-volume, low-value logs (e.g., verbose application logs)
- Dedicated clusters: For very high volumes (> 500 GB/day), dedicated Log Analytics clusters offer discounted rates
Typical costs:
- Logs ingested: ~$2.50 per GB (pay-as-you-go; varies by region)
- Logs retained beyond the included retention period: ~$0.10 per GB per month
- Archiving cold data to Storage or long-term retention tiers is more cost-effective than keeping all data hot
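To see where ingestion cost actually comes from, the built-in Usage table breaks billable volume down by table. A sketch (Usage reports quantities in MB):

```kusto
// Billable ingestion by table over the last 30 days
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024.0 by DataType
| order by IngestedGB desc
```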
Metric Ingestion Cost
Platform metrics from Azure resources are collected at no cost. Custom metrics from Application Insights or custom sources are charged per metric.
Alert Evaluation Cost
Metric alert rules are evaluated by the Azure platform and billed at a small flat monthly rate per monitored time series. Log search alert rules incur a per-rule charge that scales with how frequently the query is evaluated.
Common Pitfalls
Pitfall 1: Sending Too Much Data to Log Analytics
Problem: Enabling all diagnostic logs and collecting verbose application logs without filtering, resulting in high ingestion costs.
Result: Log Analytics bill becomes unexpectedly high; cost controls become reactive.
Solution: Use diagnostic settings and DCRs to filter at the source. Collect only logs you actually analyze. Use retention periods appropriate to your use case (e.g., 30 days for verbose application logs, 2 years for audit logs).
Pitfall 2: Metric Alerts on Metrics Without Baseline
Problem: Setting arbitrary CPU or memory thresholds without understanding normal operating patterns.
Result: Alerts trigger frequently for non-critical conditions, or fail to detect real issues.
Solution: Use dynamic threshold alerts (anomaly detection) to learn patterns, or establish baselines by observing the resource for 1-2 weeks before enabling alerts.
Pitfall 3: Multiple Workspaces Without Cross-Workspace Query Plan
Problem: Creating separate workspaces for different teams without designing how to correlate events across workspaces.
Result: Cannot investigate incidents that span multiple teams; queries become fragmented.
Solution: Use a hybrid model: central workspace for shared infrastructure, team workspaces for application-specific telemetry. Define a correlation strategy (e.g., correlation IDs, timestamps) for incident investigation.
Pitfall 4: No Retention Strategy for Compliance
Problem: Keeping all logs forever, or deleting logs too quickly to meet audit requirements.
Result: Either unsustainable costs or regulatory non-compliance.
Solution: Align retention policies with compliance requirements. Use Log Analytics for hot data (recent queries), archive to Storage for cold data (compliance archival). Document retention policy per log type.
Pitfall 5: Alerting on Symptoms Rather Than Root Causes
Problem: Creating alerts for high CPU without understanding what application component is consuming it, or alerting on increased error rate without ability to identify which requests failed.
Result: Alerts trigger but engineers cannot efficiently investigate.
Solution: Layer metrics (resource-level) with logs (application-level) and Application Insights (request-level). Ensure alerts include context (what to check) and link to workbooks or queries for faster investigation.
Pitfall 6: Not Monitoring Monitoring
Problem: Not tracking the health of Azure Monitor itself (Log Analytics ingestion failures, alert evaluation delays).
Result: Silent failures in monitoring go unnoticed.
Solution: Monitor Log Analytics ingestion rate and health. Set up alerts for failed data ingestion. Periodically validate that key alerts are evaluating correctly.
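A simple health check along these lines, assuming agent-based collection (so connected machines emit the Heartbeat table):

```kusto
// Machines whose last heartbeat is older than 15 minutes (possible collection failure)
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)
```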
Key Takeaways
- Azure Monitor is the unified observability platform for Azure. All three data types (metrics, logs, and application telemetry) should flow to Azure Monitor for centralized analysis and alerting.
- Metrics are for real-time dashboards and simple thresholds. They are low-cost, low-latency, and ideal for resource utilization monitoring. Logs are for detailed investigation, complex analysis, and audit trails.
- Log Analytics workspaces are the central repository for all logs. Send diagnostic logs, application logs, and event logs to a Log Analytics workspace where they can be queried, retained, and analyzed with KQL.
- Use Azure Monitor Agent and Data Collection Rules for consistent, scalable log collection. They replace legacy agents and provide centralized configuration, filtering, and cost control.
- Layer different alert types for comprehensive coverage. Metric alerts detect resource thresholds, log alerts detect patterns, activity log alerts track infrastructure changes, and smart detection finds anomalies.
- Action groups route alerts to the right people and systems. Design action groups with escalation (email for non-critical, SMS/page for critical) and automation (webhook to remediate, Logic App to escalate).
- Workbooks are the modern way to visualize and share Azure Monitor data. They are more flexible than dashboards and support interactive investigation.
- Cost management is critical for Log Analytics. Filter data at source, use appropriate retention, and archive cold data to storage. Monitor your ingestion rate to catch surprises.
- Multi-workspace strategies require planning. Centralized workspaces simplify queries; distributed workspaces allow team autonomy. Most enterprises use a hybrid approach.
- Network monitoring complements resource monitoring. Use Network Watcher and Connection Monitor to understand inter-resource connectivity, verify hybrid connectivity, and diagnose network issues that affect applications.