Analytics and Statistics

Data & Analytics

Table of Contents

  1. Introduction to Analytics
  2. Variable Types
  3. Descriptive Statistics
  4. Statistical Distributions
  5. Data Analysis Methods
  6. Data Visualization
  7. Modern Analytics Trends (2025)
  8. Best Practices
  9. Free Datasets for Practice
  10. Quick Reference Formulas

Analytics is the process of examining data to draw conclusions, make predictions, and support decision-making.

Four Types of Analytics

  1. Descriptive: What happened?
  2. Diagnostic: Why did it happen?
  3. Predictive: What will happen?
  4. Prescriptive: What should we do?

Variable Types

Categorical Variables

Nominal

  • Categories with no inherent order
  • Examples: Colors (red, blue, green), car types (sedan, SUV, truck), gender
  • Used with: Bar charts, pie charts

Ordinal

  • Categories with meaningful order/ranking
  • Examples: Education level (high school < bachelor’s < master’s), satisfaction ratings, size (small < medium < large)
  • Used with: Bar charts, stacked charts

Quantitative Variables

Discrete

  • Countable whole numbers
  • Examples: Number of students, cars in parking lot, goals scored
  • Used with: Histograms, bar charts

Continuous

  • Any value within a range (can have decimals)
  • Examples: Height, weight, temperature, time
  • Used with: Histograms, scatter plots, line charts

Research Variables

Independent Variable

  • What you change/manipulate in an experiment
  • The “cause” in cause-and-effect

Dependent Variable

  • What you measure/observe
  • The “effect” that changes based on the independent variable

Confounding Variable

  • Hidden variable that affects both independent and dependent variables
  • Can create false relationships if not controlled

Descriptive Statistics

Measures of Central Tendency

Mean (Average)

  • Add all values ÷ count of values
  • Sensitive to outliers
  • Best for: Normally distributed data

Median

  • Middle value when data is sorted
  • Resistant to outliers
  • Best for: Skewed distributions

Mode

  • Most frequently occurring value
  • Can have multiple modes
  • Best for: Categorical data, finding peaks

Measures of Spread

Range

  • Highest value - lowest value
  • Simple but affected by outliers

Standard Deviation

  • Average distance from the mean
  • Shows how spread out data points are
  • Low = clustered around mean, High = widely spread

Interquartile Range (IQR)

  • Q3 - Q1 (range of middle 50% of data)
  • Robust to outliers
  • Used to identify outliers: values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR

Mean Absolute Deviation (MAD)

  • Average distance of all points from the center
  • More intuitive than standard deviation

Robust Statistics

Statistics that aren’t heavily influenced by outliers:

  • Median (vs. mean)
  • IQR (vs. range)
  • MAD (vs. standard deviation)

Use robust measures when you have outliers or skewed data.


Statistical Distributions

A distribution shows you how your data is spread out - where most values cluster and how they trail off.

Think of a distribution as the “shape” your data makes when you plot it. Understanding this shape helps you choose the right statistics and spot unusual patterns.

Normal Distribution (The “Bell Curve”)

What it looks like: A symmetric hill - most data clusters in the middle, fewer values at the extremes.

Why it matters: Many real-world phenomena follow this pattern (heights, test scores, measurement errors), and many statistical tests assume your data looks like this.

Key characteristics:

  • Mean = median = mode (they all sit at the peak)
  • 68% of data falls within 1 standard deviation of the mean
  • 95% of data falls within 2 standard deviations

Real example: If you measured the heights of 1000 adults, most would cluster around average height (5’6”), with fewer very tall or very short people.

Skewed Distributions (Lopsided Data)

What causes skewness: When you have a natural boundary on one side but not the other.

Right-skewed (Positive skew)

  • What it looks like: Most data bunches up on the left, with a long tail stretching right
  • Key signal: Mean > median (the tail pulls the average higher)
  • Example: Household income - most families earn modest amounts, but a few millionaires pull the average way up
  • Why it happens: Income can’t go below $0, but there’s no upper limit

Left-skewed (Negative skew)

  • What it looks like: Most data bunches up on the right, with a long tail stretching left
  • Key signal: Mean < median (the tail pulls the average lower)
  • Example: Test scores when most students do well - scores cluster near 90-100, but a few low scores drag the average down
  • Why it happens: Scores can’t go above 100%, but can drop to 0%

Why This Matters for Analysis

  • Normal distribution: Mean and median both work well
  • Skewed distribution: Use median instead of mean (the mean gets “fooled” by the tail)
  • Choosing tests: Many statistical tests assume normal distribution - if your data is skewed, you may need different approaches

Data Analysis Methods

Different types of analysis answer different questions. Think of them as different tools - you wouldn’t use a hammer when you need a screwdriver.

Descriptive Analysis

Purpose: Summarize and describe what happened in your data Question it answers: “What does my data look like?”

What you’re doing: Taking a pile of raw numbers and turning them into understandable summaries - like calculating averages, making charts, or counting how often things happen.

Tools you use: Mean, median, mode, standard deviation, histograms, tables Real example: A store owner calculating that their average daily sales is $2,500, their busiest day is Friday, and 60% of customers pay with credit cards.

When to use: This is always your first step - you need to understand your data before doing anything else.

Exploratory Data Analysis (EDA)

Purpose: Investigate your data to discover hidden patterns and generate ideas Question it answers: “What interesting patterns or relationships might be hiding in here?”

What you’re doing: Playing detective with your data - looking for unexpected connections, unusual patterns, or groups that naturally form. You’re not trying to prove anything yet, just exploring.

Tools you use: Scatter plots, correlation analysis, clustering, PCA Real example: A marketing team discovers that customers who buy coffee also tend to buy pastries, and that there seem to be three distinct customer groups with different buying patterns.

When to use: After descriptive analysis, when you want to generate hypotheses or understand your data’s structure before formal testing.

Key EDA Techniques Explained:

Principal Component Analysis (PCA)

  • What it does: Takes complex data with many variables and finds the most important patterns
  • Why useful: Like creating a highlights reel - keeps the important stuff, drops the noise
  • Example: Instead of tracking 20 different customer behaviors, PCA might show that just 3 underlying patterns explain most of the differences

K-means Clustering

  • What it does: Automatically groups similar data points together
  • How it works: You tell it “find me 3 groups” and it sorts your data into 3 clusters based on similarity
  • Example: Grouping customers into “big spenders,” “bargain hunters,” and “occasional shoppers” based on their purchasing behavior

Inferential Analysis

Purpose: Make educated guesses about a large group based on a smaller sample Question it answers: “If this is true for my sample, is it likely true for everyone?”

What you’re doing: Like conducting a poll - you survey 1,000 voters to predict how 1 million will vote. You’re using statistics to bridge the gap between what you observed and what you can conclude.

Tools you use: Hypothesis testing, confidence intervals, statistical significance tests Real example: A pharmaceutical company tests a drug on 500 patients and uses statistics to determine if it would work for the broader population.

Critical requirements:

  • Your sample must be large enough (≥ 10% of population)
  • Your sample must be randomly selected (no cherry-picking)
  • Test one hypothesis at a time (avoid “fishing” for results)

When to use: When you want to make claims about a population but can only study a sample.

Causal Analysis

Purpose: Determine if one thing actually causes another (not just correlation) Question it answers: “Does X really cause Y, or do they just happen together?”

What you’re doing: The hardest type of analysis - proving cause and effect. Just because two things happen together doesn’t mean one causes the other (ice cream sales and drowning both increase in summer, but ice cream doesn’t cause drowning - heat causes both).

Tools you use: Controlled experiments, randomized trials, advanced statistical techniques Real example: Testing whether a new teaching method actually improves test scores, not just whether schools using it happen to have higher scores.

Why it’s difficult: You need to control for confounding variables - other factors that might explain the relationship.

When to use: When you need to make decisions based on cause-and-effect (like “will this marketing campaign increase sales?” not just “are they correlated?”).

Predictive Analysis

Purpose: Forecast what will happen in the future based on historical patterns Question it answers: “Based on past data, what should we expect next?”

What you’re doing: Building a model that learns from historical data to make educated guesses about future outcomes. Like teaching a computer to recognize patterns you’ve seen before.

Tools you use: Machine learning algorithms, regression models, neural networks Real example: Netflix predicting which movies you’ll like based on what you’ve watched before, or banks predicting loan defaults.

Key insight: The model can only be as good as your data - “garbage in, garbage out.”

Difference from inferential: Inferential analysis generalizes about populations; predictive analysis forecasts future events.

When to use: When you need to plan for the future or make decisions based on likely outcomes.


Data Visualization

Univariate Charts (One Variable)

  • Histogram: Distribution of continuous data
  • Bar chart: Frequency of categories
  • Box plot: Shows median, quartiles, outliers
  • Density plot: Smooth version of histogram

Bivariate Charts (Two Variables)

  • Scatter plot: Relationship between two continuous variables
  • Line chart: Change over time
  • Grouped bar chart: Compare categories across groups

Multivariate Charts (Three+ Variables)

  • Multi-line chart: Multiple trends over time
  • Heat map: Show relationships using color intensity
  • 3D scatter plot: Three continuous variables

Correlation Visualization

Correlation Coefficient ranges from -1 to +1:

  • +1: Perfect positive correlation
  • 0: No linear relationship
  • -1: Perfect negative correlation

Remember: Correlation ≠ Causation


AI Integration

  • 65% of organizations have adopted or are actively investigating AI technologies for data and analytics
  • AI agents automate complex analysis tasks
  • Natural Language Processing enables asking questions in plain English

Real-Time Analytics

  • Edge computing reduces delays for instant insights
  • Critical for personalized user experiences
  • More than 70% of healthcare institutions use cloud computing to facilitate real-time data sharing

Synthetic Data

  • By 2026, 75% of businesses are expected to create synthetic customer data using Gen AI
  • Protects privacy while enabling analysis
  • Cost-effective way to supplement real data

Self-Service Analytics

  • The global self-service BI market is projected to grow from $6.73 billion in 2024 to $27.32 billion by 2032
  • Tools that non-technical users can operate
  • Democratizes data access across organizations

Best Practices

Data Quality First

  1. Clean your data: Remove duplicates, handle missing values
  2. Validate sources: Ensure data accuracy and reliability
  3. Document everything: Track data sources, transformations, assumptions

Analysis Approach

  1. Start with business questions: What decisions will this analysis inform?
  2. Choose appropriate methods: Match analysis type to question type
  3. Consider context: Industry standards, seasonal patterns, external factors
  4. Validate results: Do findings make business sense?

Visualization Guidelines

  1. Choose the right chart: Match chart type to data type and message
  2. Keep it simple: Avoid 3D effects, too many colors, cluttered layouts
  3. Tell a story: Use titles, annotations, and logical flow
  4. Consider your audience: Technical vs. executive presentation styles

Ethics and Governance

  1. Protect privacy: Follow GDPR, CCPA, and other regulations
  2. Avoid bias: Check for discrimination in models and data
  3. Be transparent: Document methodology and limitations
  4. Secure data: Implement proper access controls and encryption

Free Datasets for Practice

Government and Health

News and Analysis

Social Good

Business and Economics

Finding More Datasets

Google Dataset Search - Search engine specifically for datasets


Quick Reference Formulas

Basic Statistics

  • Mean: (Sum of all values) ÷ (Count of values)
  • Standard Deviation: √[(Σ(x - mean)²) ÷ (n-1)]
  • Correlation: Ranges from -1 to +1

Identifying Outliers

  • IQR Method: Beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
  • Standard Deviation: Beyond mean ± 2 or 3 standard deviations

Data Distribution Check

  • Right-skewed: Mean > Median (use median)
  • Left-skewed: Mean < Median (use median)
  • Normal: Mean ≈ Median (either works)

Sample Size Guidelines

  • Minimum for inference: Sample ≥ 10% of population
  • For normal approximation: n ≥ 30 (Central Limit Theorem)
  • For proportions: np ≥ 5 and n(1-p) ≥ 5

Found this guide helpful? Share it with your team:

Share on LinkedIn