Analytics and Statistics
Table of Contents
- Introduction to Analytics
- Variable Types
- Descriptive Statistics
- Statistical Distributions
- Data Analysis Methods
- Data Visualization
- Modern Analytics Trends (2025)
- Best Practices
- Free Datasets for Practice
- Quick Reference Formulas
Analytics is the process of examining data to draw conclusions, make predictions, and support decision-making.
Four Types of Analytics
- Descriptive: What happened?
- Diagnostic: Why did it happen?
- Predictive: What will happen?
- Prescriptive: What should we do?
Variable Types
Categorical Variables
Nominal
- Categories with no inherent order
- Examples: Colors (red, blue, green), car types (sedan, SUV, truck), gender
- Used with: Bar charts, pie charts
Ordinal
- Categories with meaningful order/ranking
- Examples: Education level (high school < bachelor’s < master’s), satisfaction ratings, size (small < medium < large)
- Used with: Bar charts, stacked charts
Quantitative Variables
Discrete
- Countable whole numbers
- Examples: Number of students, cars in parking lot, goals scored
- Used with: Histograms, bar charts
Continuous
- Any value within a range (can have decimals)
- Examples: Height, weight, temperature, time
- Used with: Histograms, scatter plots, line charts
Research Variables
Independent Variable
- What you change/manipulate in an experiment
- The “cause” in cause-and-effect
Dependent Variable
- What you measure/observe
- The “effect” that changes based on the independent variable
Confounding Variable
- Hidden variable that affects both independent and dependent variables
- Can create false relationships if not controlled
Descriptive Statistics
Measures of Central Tendency
Mean (Average)
- Add all values ÷ count of values
- Sensitive to outliers
- Best for: Normally distributed data
Median
- Middle value when data is sorted
- Resistant to outliers
- Best for: Skewed distributions
Mode
- Most frequently occurring value
- Can have multiple modes
- Best for: Categorical data, finding peaks
Measures of Spread
Range
- Highest value - lowest value
- Simple but affected by outliers
Standard Deviation
- Average distance from the mean
- Shows how spread out data points are
- Low = clustered around mean, High = widely spread
Interquartile Range (IQR)
- Q3 - Q1 (range of middle 50% of data)
- Robust to outliers
- Used to identify outliers: values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
Mean Absolute Deviation (MAD)
- Average distance of all points from the center
- More intuitive than standard deviation
Robust Statistics
Statistics that aren’t heavily influenced by outliers:
- Median (vs. mean)
- IQR (vs. range)
- MAD (vs. standard deviation)
Use robust measures when you have outliers or skewed data.
Statistical Distributions
A distribution shows you how your data is spread out - where most values cluster and how they trail off.
Think of a distribution as the “shape” your data makes when you plot it. Understanding this shape helps you choose the right statistics and spot unusual patterns.
Normal Distribution (The “Bell Curve”)
What it looks like: A symmetric hill - most data clusters in the middle, fewer values at the extremes.
Why it matters: Many real-world phenomena follow this pattern (heights, test scores, measurement errors), and many statistical tests assume your data looks like this.
Key characteristics:
- Mean = median = mode (they all sit at the peak)
- 68% of data falls within 1 standard deviation of the mean
- 95% of data falls within 2 standard deviations
Real example: If you measured the heights of 1000 adults, most would cluster around average height (5’6”), with fewer very tall or very short people.
Skewed Distributions (Lopsided Data)
What causes skewness: When you have a natural boundary on one side but not the other.
Right-skewed (Positive skew)
- What it looks like: Most data bunches up on the left, with a long tail stretching right
- Key signal: Mean > median (the tail pulls the average higher)
- Example: Household income - most families earn modest amounts, but a few millionaires pull the average way up
- Why it happens: Income can’t go below $0, but there’s no upper limit
Left-skewed (Negative skew)
- What it looks like: Most data bunches up on the right, with a long tail stretching left
- Key signal: Mean < median (the tail pulls the average lower)
- Example: Test scores when most students do well - scores cluster near 90-100, but a few low scores drag the average down
- Why it happens: Scores can’t go above 100%, but can drop to 0%
Why This Matters for Analysis
- Normal distribution: Mean and median both work well
- Skewed distribution: Use median instead of mean (the mean gets “fooled” by the tail)
- Choosing tests: Many statistical tests assume normal distribution - if your data is skewed, you may need different approaches
Data Analysis Methods
Different types of analysis answer different questions. Think of them as different tools - you wouldn’t use a hammer when you need a screwdriver.
Descriptive Analysis
Purpose: Summarize and describe what happened in your data Question it answers: “What does my data look like?”
What you’re doing: Taking a pile of raw numbers and turning them into understandable summaries - like calculating averages, making charts, or counting how often things happen.
Tools you use: Mean, median, mode, standard deviation, histograms, tables Real example: A store owner calculating that their average daily sales is $2,500, their busiest day is Friday, and 60% of customers pay with credit cards.
When to use: This is always your first step - you need to understand your data before doing anything else.
Exploratory Data Analysis (EDA)
Purpose: Investigate your data to discover hidden patterns and generate ideas Question it answers: “What interesting patterns or relationships might be hiding in here?”
What you’re doing: Playing detective with your data - looking for unexpected connections, unusual patterns, or groups that naturally form. You’re not trying to prove anything yet, just exploring.
Tools you use: Scatter plots, correlation analysis, clustering, PCA Real example: A marketing team discovers that customers who buy coffee also tend to buy pastries, and that there seem to be three distinct customer groups with different buying patterns.
When to use: After descriptive analysis, when you want to generate hypotheses or understand your data’s structure before formal testing.
Key EDA Techniques Explained:
Principal Component Analysis (PCA)
- What it does: Takes complex data with many variables and finds the most important patterns
- Why useful: Like creating a highlights reel - keeps the important stuff, drops the noise
- Example: Instead of tracking 20 different customer behaviors, PCA might show that just 3 underlying patterns explain most of the differences
K-means Clustering
- What it does: Automatically groups similar data points together
- How it works: You tell it “find me 3 groups” and it sorts your data into 3 clusters based on similarity
- Example: Grouping customers into “big spenders,” “bargain hunters,” and “occasional shoppers” based on their purchasing behavior
Inferential Analysis
Purpose: Make educated guesses about a large group based on a smaller sample Question it answers: “If this is true for my sample, is it likely true for everyone?”
What you’re doing: Like conducting a poll - you survey 1,000 voters to predict how 1 million will vote. You’re using statistics to bridge the gap between what you observed and what you can conclude.
Tools you use: Hypothesis testing, confidence intervals, statistical significance tests Real example: A pharmaceutical company tests a drug on 500 patients and uses statistics to determine if it would work for the broader population.
Critical requirements:
- Your sample must be large enough (≥ 10% of population)
- Your sample must be randomly selected (no cherry-picking)
- Test one hypothesis at a time (avoid “fishing” for results)
When to use: When you want to make claims about a population but can only study a sample.
Causal Analysis
Purpose: Determine if one thing actually causes another (not just correlation) Question it answers: “Does X really cause Y, or do they just happen together?”
What you’re doing: The hardest type of analysis - proving cause and effect. Just because two things happen together doesn’t mean one causes the other (ice cream sales and drowning both increase in summer, but ice cream doesn’t cause drowning - heat causes both).
Tools you use: Controlled experiments, randomized trials, advanced statistical techniques Real example: Testing whether a new teaching method actually improves test scores, not just whether schools using it happen to have higher scores.
Why it’s difficult: You need to control for confounding variables - other factors that might explain the relationship.
When to use: When you need to make decisions based on cause-and-effect (like “will this marketing campaign increase sales?” not just “are they correlated?”).
Predictive Analysis
Purpose: Forecast what will happen in the future based on historical patterns Question it answers: “Based on past data, what should we expect next?”
What you’re doing: Building a model that learns from historical data to make educated guesses about future outcomes. Like teaching a computer to recognize patterns you’ve seen before.
Tools you use: Machine learning algorithms, regression models, neural networks Real example: Netflix predicting which movies you’ll like based on what you’ve watched before, or banks predicting loan defaults.
Key insight: The model can only be as good as your data - “garbage in, garbage out.”
Difference from inferential: Inferential analysis generalizes about populations; predictive analysis forecasts future events.
When to use: When you need to plan for the future or make decisions based on likely outcomes.
Data Visualization
Univariate Charts (One Variable)
- Histogram: Distribution of continuous data
- Bar chart: Frequency of categories
- Box plot: Shows median, quartiles, outliers
- Density plot: Smooth version of histogram
Bivariate Charts (Two Variables)
- Scatter plot: Relationship between two continuous variables
- Line chart: Change over time
- Grouped bar chart: Compare categories across groups
Multivariate Charts (Three+ Variables)
- Multi-line chart: Multiple trends over time
- Heat map: Show relationships using color intensity
- 3D scatter plot: Three continuous variables
Correlation Visualization
Correlation Coefficient ranges from -1 to +1:
- +1: Perfect positive correlation
- 0: No linear relationship
- -1: Perfect negative correlation
Remember: Correlation ≠ Causation
Modern Analytics Trends (2025)
AI Integration
- 65% of organizations have adopted or are actively investigating AI technologies for data and analytics
- AI agents automate complex analysis tasks
- Natural Language Processing enables asking questions in plain English
Real-Time Analytics
- Edge computing reduces delays for instant insights
- Critical for personalized user experiences
- More than 70% of healthcare institutions use cloud computing to facilitate real-time data sharing
Synthetic Data
- By 2026, 75% of businesses are expected to create synthetic customer data using Gen AI
- Protects privacy while enabling analysis
- Cost-effective way to supplement real data
Self-Service Analytics
- The global self-service BI market is projected to grow from $6.73 billion in 2024 to $27.32 billion by 2032
- Tools that non-technical users can operate
- Democratizes data access across organizations
Best Practices
Data Quality First
- Clean your data: Remove duplicates, handle missing values
- Validate sources: Ensure data accuracy and reliability
- Document everything: Track data sources, transformations, assumptions
Analysis Approach
- Start with business questions: What decisions will this analysis inform?
- Choose appropriate methods: Match analysis type to question type
- Consider context: Industry standards, seasonal patterns, external factors
- Validate results: Do findings make business sense?
Visualization Guidelines
- Choose the right chart: Match chart type to data type and message
- Keep it simple: Avoid 3D effects, too many colors, cluttered layouts
- Tell a story: Use titles, annotations, and logical flow
- Consider your audience: Technical vs. executive presentation styles
Ethics and Governance
- Protect privacy: Follow GDPR, CCPA, and other regulations
- Avoid bias: Check for discrimination in models and data
- Be transparent: Document methodology and limitations
- Secure data: Implement proper access controls and encryption
Free Datasets for Practice
Government and Health
- World Health Organization (WHO): Mental health, COVID-19, air pollution data
- Data.gov: US government data on agriculture, climate, energy
- Storm Events Database: Weather and natural disaster data
- US Census and EU Census: Population demographics
News and Analysis
- FiveThirtyEight: Sports, politics, culture datasets
- Political polls: Election and opinion data
- NBA player ratings: Sports analytics
Social Good
- UNICEF Data: Global data on children’s welfare
- Biodiversity at US National Parks: Species identification data
Business and Economics
- Netherlands Orchid Trade: €12 billion flower export market
- US Cosmetics Industry Revenue: $49.2 billion market analysis
Finding More Datasets
Google Dataset Search - Search engine specifically for datasets
Quick Reference Formulas
Basic Statistics
- Mean: (Sum of all values) ÷ (Count of values)
- Standard Deviation: √[(Σ(x - mean)²) ÷ (n-1)]
- Correlation: Ranges from -1 to +1
Identifying Outliers
- IQR Method: Beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
- Standard Deviation: Beyond mean ± 2 or 3 standard deviations
Data Distribution Check
- Right-skewed: Mean > Median (use median)
- Left-skewed: Mean < Median (use median)
- Normal: Mean ≈ Median (either works)
Sample Size Guidelines
- Minimum for inference: Sample ≥ 10% of population
- For normal approximation: n ≥ 30 (Central Limit Theorem)
- For proportions: np ≥ 5 and n(1-p) ≥ 5
Found this guide helpful? Share it with your team:
Share on LinkedIn