Analytics and Statistics - Architecture Insights

Introduction to Analytics
Variable Types
Descriptive Statistics
Statistical Distributions
Data Analysis Methods
Data Visualization
Modern Analytics Trends (2025)
Best Practices
Free Datasets for Practice
Quick Reference Formulas

Analytics is the process of examining data to draw conclusions, make predictions, and support decision-making.

Four Types of Analytics

Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should we do?

Variable Types

Categorical Variables

Nominal

Categories with no inherent order
Examples: Colors (red, blue, green), car types (sedan, SUV, truck), gender
Used with: Bar charts, pie charts

Ordinal

Categories with meaningful order/ranking
Examples: Education level (high school < bachelor’s < master’s), satisfaction ratings, size (small < medium < large)
Used with: Bar charts, stacked charts

Quantitative Variables

Discrete

Countable whole numbers
Examples: Number of students, cars in parking lot, goals scored
Used with: Histograms, bar charts

Continuous

Any value within a range (can have decimals)
Examples: Height, weight, temperature, time
Used with: Histograms, scatter plots, line charts

Research Variables

Independent Variable

What you change/manipulate in an experiment
The “cause” in cause-and-effect

Dependent Variable

What you measure/observe
The “effect” that changes based on the independent variable

Confounding Variable

Hidden variable that affects both independent and dependent variables
Can create false relationships if not controlled

Descriptive Statistics

Measures of Central Tendency

Mean (Average)

Add all values ÷ count of values
Sensitive to outliers
Best for: Normally distributed data

Median

Middle value when data is sorted
Resistant to outliers
Best for: Skewed distributions

Mode

Most frequently occurring value
Can have multiple modes
Best for: Categorical data, finding peaks

Measures of Spread

Range

Highest value - lowest value
Simple but affected by outliers

Standard Deviation

Average distance from the mean
Shows how spread out data points are
Low = clustered around mean, High = widely spread

Interquartile Range (IQR)

Q3 - Q1 (range of middle 50% of data)
Robust to outliers
Used to identify outliers: values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR

Mean Absolute Deviation (MAD)

Average distance of all points from the center
More intuitive than standard deviation

Robust Statistics

Statistics that aren’t heavily influenced by outliers:

Median (vs. mean)
IQR (vs. range)
MAD (vs. standard deviation)

Use robust measures when you have outliers or skewed data.

Statistical Distributions

Think of a distribution as the "shape" your data makes when you plot it. Understanding this shape helps you choose the right statistics and spot unusual patterns.

A distribution shows you how your data is spread out: where most values cluster and how they trail off.

Normal Distribution (The “Bell Curve”)

What it looks like: A symmetric hill where most data clusters in the middle, with fewer values at the extremes.

Why it matters: Many real-world phenomena follow this pattern (heights, test scores, measurement errors), and many statistical tests assume your data looks like this.

Key characteristics:

Mean = median = mode (they all sit at the peak)
68% of data falls within 1 standard deviation of the mean
95% of data falls within 2 standard deviations

Real example: If you measured the heights of 1000 adults, most would cluster around average height (5’6”), with fewer very tall or very short people.

Skewed Distributions (Lopsided Data)

What causes skewness: When you have a natural boundary on one side but not the other.

Right-skewed (Positive skew)

What it looks like: Most data bunches up on the left, with a long tail stretching right
Key signal: Mean > median (the tail pulls the average higher)
Example: Household income, where most families earn modest amounts, but a few millionaires pull the average way up
Why it happens: Income can’t go below $0, but there’s no upper limit

Left-skewed (Negative skew)

What it looks like: Most data bunches up on the right, with a long tail stretching left
Key signal: Mean < median (the tail pulls the average lower)
Example: Test scores when most students do well, where scores cluster near 90-100, but a few low scores drag the average down
Why it happens: Scores can’t go above 100%, but can drop to 0%

Why This Matters for Analysis

Normal distribution: Mean and median both work well
Skewed distribution: Use median instead of mean (the mean gets "fooled" by the tail)
Choosing tests: Many statistical tests assume normal distribution, so if your data is skewed, you may need different approaches

Data Analysis Methods

Different types of analysis answer different questions. Think of them as different tools; you wouldn’t use a hammer when you need a screwdriver.

Descriptive Analysis

Purpose: Summarize and describe what happened in your data Question it answers: “What does my data look like?”

What you’re doing: Taking a pile of raw numbers and turning them into understandable summaries, like calculating averages, making charts, or counting how often things happen.

Tools you use: Mean, median, mode, standard deviation, histograms, tables Real example: A store owner calculating that their average daily sales is $2,500, their busiest day is Friday, and 60% of customers pay with credit cards.

When to use: This is always your first step. You need to understand your data before doing anything else.

Exploratory Data Analysis (EDA)

Purpose: Investigate your data to discover hidden patterns and generate ideas Question it answers: “What interesting patterns or relationships might be hiding in here?”

What you’re doing: Playing detective with your data, looking for unexpected connections, unusual patterns, or groups that naturally form. You’re not trying to prove anything yet, just exploring.

Tools you use: Scatter plots, correlation analysis, clustering, PCA Real example: A marketing team discovers that customers who buy coffee also tend to buy pastries, and that there seem to be three distinct customer groups with different buying patterns.

When to use: After descriptive analysis, when you want to generate hypotheses or understand your data’s structure before formal testing.

Key EDA Techniques Explained:

Principal Component Analysis (PCA)

What it does: Takes complex data with many variables and finds the most important patterns
Why useful: Like creating a highlights reel that keeps the important stuff and drops the noise
Example: Instead of tracking 20 different customer behaviors, PCA might show that just 3 underlying patterns explain most of the differences

K-means Clustering

What it does: Automatically groups similar data points together
How it works: You tell it “find me 3 groups” and it sorts your data into 3 clusters based on similarity
Example: Grouping customers into “big spenders,” “bargain hunters,” and “occasional shoppers” based on their purchasing behavior

Inferential Analysis

Purpose: Make educated guesses about a large group based on a smaller sample Question it answers: “If this is true for my sample, is it likely true for everyone?”

What you’re doing: Like conducting a poll where you survey 1,000 voters to predict how 1 million will vote. You’re using statistics to bridge the gap between what you observed and what you can conclude.

Tools you use: Hypothesis testing, confidence intervals, statistical significance tests Real example: A pharmaceutical company tests a drug on 500 patients and uses statistics to determine if it would work for the broader population.

Critical Requirements for Inferential Analysis

Your sample must be large enough (≥ 10% of population)
Your sample must be randomly selected (no cherry-picking)
Test one hypothesis at a time (avoid "fishing" for results)

When to use: When you want to make claims about a population but can only study a sample.

Causal Analysis

Purpose: Determine if one thing actually causes another (not just correlation) Question it answers: “Does X really cause Y, or do they just happen together?”

What you’re doing: The hardest type of analysis, proving cause and effect. Just because two things happen together doesn’t mean one causes the other. Ice cream sales and drowning both increase in summer, but ice cream doesn’t cause drowning; heat causes both.

Tools you use: Controlled experiments, randomized trials, advanced statistical techniques Real example: Testing whether a new teaching method actually improves test scores, not just whether schools using it happen to have higher scores.

Why it’s difficult: You need to control for confounding variables, meaning other factors that might explain the relationship.

When to use: When you need to make decisions based on cause-and-effect (like “will this marketing campaign increase sales?” not just “are they correlated?”).

Predictive Analysis

Purpose: Forecast what will happen in the future based on historical patterns Question it answers: “Based on past data, what should we expect next?”

What you’re doing: Building a model that learns from historical data to make educated guesses about future outcomes. Like teaching a computer to recognize patterns you’ve seen before.

Tools you use: Machine learning algorithms, regression models, neural networks Real example: Netflix predicting which movies you’ll like based on what you’ve watched before, or banks predicting loan defaults.

Key insight: The model can only be as good as your data. “Garbage in, garbage out.”

Difference from inferential: Inferential analysis generalizes about populations; predictive analysis forecasts future events.

When to use: When you need to plan for the future or make decisions based on likely outcomes.

Data Visualization

Univariate Charts (One Variable)

Histogram: Distribution of continuous data
Bar chart: Frequency of categories
Box plot: Shows median, quartiles, outliers
Density plot: Smooth version of histogram

Bivariate Charts (Two Variables)

Scatter plot: Relationship between two continuous variables
Line chart: Change over time
Grouped bar chart: Compare categories across groups

Multivariate Charts (Three+ Variables)

Multi-line chart: Multiple trends over time
Heat map: Show relationships using color intensity
3D scatter plot: Three continuous variables

Correlation Visualization

Correlation Coefficient ranges from -1 to +1:

+1: Perfect positive correlation
0: No linear relationship
-1: Perfect negative correlation

Remember: Correlation ≠ Causation

Modern Analytics Trends (2025)

AI Integration

65% of organizations have adopted or are actively investigating AI technologies for data and analytics
AI agents automate complex analysis tasks
Natural Language Processing enables asking questions in plain English

Real-Time Analytics

Edge computing reduces delays for instant insights
Critical for personalized user experiences
More than 70% of healthcare institutions use cloud computing to facilitate real-time data sharing

Synthetic Data

By 2026, 75% of businesses are expected to create synthetic customer data using Gen AI
Protects privacy while enabling analysis
Cost-effective way to supplement real data

Self-Service Analytics

The global self-service BI market is projected to grow from $6.73 billion in 2024 to $27.32 billion by 2032
Tools that non-technical users can operate
Democratizes data access across organizations

Best Practices

Data Quality First

Clean your data: Remove duplicates, handle missing values
Validate sources: Ensure data accuracy and reliability
Document everything: Track data sources, transformations, assumptions

Analysis Approach

Start with business questions: What decisions will this analysis inform?
Choose appropriate methods: Match analysis type to question type
Consider context: Industry standards, seasonal patterns, external factors
Validate results: Do findings make business sense?

Visualization Guidelines

Choose the right chart: Match chart type to data type and message
Keep it simple: Avoid 3D effects, too many colors, cluttered layouts
Tell a story: Use titles, annotations, and logical flow
Consider your audience: Technical vs. executive presentation styles

Ethics and Governance

Protect privacy: Follow GDPR, CCPA, and other regulations
Avoid bias: Check for discrimination in models and data
Be transparent: Document methodology and limitations
Secure data: Implement proper access controls and encryption

Free Datasets for Practice

Government and Health

World Health Organization (WHO): Mental health, COVID-19, air pollution data
Data.gov: US government data on agriculture, climate, energy
Storm Events Database: Weather and natural disaster data
US Census and EU Census: Population demographics

News and Analysis

FiveThirtyEight: Sports, politics, culture datasets
Political polls: Election and opinion data
NBA player ratings: Sports analytics

UNICEF Data: Global data on children’s welfare
Biodiversity at US National Parks: Species identification data

Business and Economics

Netherlands Orchid Trade: €12 billion flower export market
US Cosmetics Industry Revenue: $49.2 billion market analysis

Finding More Datasets

Google Dataset Search: Search engine specifically for datasets

Quick Reference Formulas

Basic Statistics

Mean: (Sum of all values) ÷ (Count of values)
Standard Deviation: √[(Σ(x - mean)²) ÷ (n-1)]
Correlation: Ranges from -1 to +1

Identifying Outliers

IQR Method: Beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
Standard Deviation: Beyond mean ± 2 or 3 standard deviations

Data Distribution Check

Right-skewed: Mean > Median (use median)
Left-skewed: Mean < Median (use median)
Normal: Mean ≈ Median (either works)

Sample Size Guidelines

Minimum for inference: Sample ≥ 10% of population
For normal approximation: n ≥ 30 (Central Limit Theorem)
For proportions: np ≥ 5 and n(1-p) ≥ 5