
ECON 0150 | Economic Data Analysis

The economist’s data analysis skillset.

Part 2.2 | Numerical Variables by Category

Behavioral Response to Incentives

How do buyers respond to different discount structures?

Starbucks sent different promotional offers to different buyers
Each offer has a different structure (BOGO, $2 off $10, $5 off $20, etc.)

Question: Which incentive structure affects buying behavior the most?

The Data

Let’s load the data and take a look

	Event	Revenue	Offer ID
0	transaction	34.56	2off10
1	transaction	18.97	2off10
2	transaction	33.90	Bogo 5
3	transaction	18.01	Bogo 10
4	transaction	19.11	Bogo 10
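The slides do not show the loading step or name the file, so the sketch below rebuilds only the five displayed rows by hand as a stand-in. The variable name `data` matches the later exercises; with the real file you would presumably call `pd.read_csv` on whatever path the course provides.

```python
import pandas as pd

# Stand-in for the real dataset: just the five rows shown on the slide.
data = pd.DataFrame({
    'Event': ['transaction'] * 5,
    'Revenue': [34.56, 18.97, 33.90, 18.01, 19.11],
    'Offer ID': ['2off10', '2off10', 'Bogo 5', 'Bogo 10', 'Bogo 10'],
})

# Peek at the first rows and check which columns are numerical vs categorical.
print(data.head())
print(data.dtypes)
```

Checking `dtypes` early is a cheap way to confirm that Revenue was read as a number and Offer ID as text.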

> which offer would we expect customers to respond to most: Bogo 5 or Bogo 10?

Exercise 2.2 | Revenue by Offer Type

Visualize the data to answer whether Bogo 5 or Bogo 10 has higher average spending.

Use a boxplot to show the distribution of a numerical variable by category.

# Boxplot
sns.boxplot(data, x='Offer ID', y='Revenue', whis=(0,100))

Revenue by Offer Type: Boxplot

The distribution of revenue by offer type.

> hard to see — why are so many values compressed at zero?

Log Transformation: Skewed Data

Each unit on the log2 scale = a doubling of (1 + spending)

	Revenue	log2_Revenue
0	34.56	5.152183
1	18.97	4.319762
2	33.90	5.125155
3	18.01	4.248687
4	19.11	4.329841

> log2(1+$7) = 3, log2(1+$15) = 4, log2(1+$31) = 5
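The arithmetic in the note above can be checked directly. This is a minimal sketch; the dollar amounts are chosen so that 1 + dollars is an exact power of two, and the names `dollars` and `log2_values` are only illustrative.

```python
import numpy as np

# Dollar amounts chosen so that 1 + dollars is a power of two.
dollars = [0, 1, 3, 7, 15, 31]

# Same transformation as in the slides: log2(1 + x).
log2_values = [float(np.log2(1 + d)) for d in dollars]

for d, v in zip(dollars, log2_values):
    print(d, v)
```

Each step of 1 on the log2 scale corresponds to doubling 1 + dollars, which is why $7, $15, and $31 land one unit apart.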

Log Transformation: Visualized

The transformation spreads out skewed data

> x-axis is compressed at low values; y-axis spreads them out evenly

Exercise 2.2 | Log Revenue by Offer Type

Create a boxplot with the log-transformed variable to better see the distribution.

Log transform Revenue.

data['log2_Revenue'] = np.log2(1 + data['Revenue'])
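One reason to add 1 before taking the log, sketched on a small made-up Series (the values below are illustrative, not the real data): rows with zero revenue would otherwise map to negative infinity. The `np.log1p` line is an equivalent way to write the same transformation.

```python
import numpy as np
import pandas as pd

# Illustrative values only: two zero-revenue rows and two purchases.
revenue = pd.Series([0.0, 0.0, 18.97, 34.56])

# The transformation used in the slides: zeros map to 0.
shifted = np.log2(1 + revenue)

# Equivalent form: log1p is the natural log of (1 + x), rescaled to base 2.
same = np.log1p(revenue) / np.log(2)

# Without the +1, zeros become -inf (the warning is silenced here on purpose).
with np.errstate(divide='ignore'):
    unshifted = np.log2(revenue)

print(pd.DataFrame({'Revenue': revenue, 'log2(1+x)': shifted, 'log2(x)': unshifted}))
```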

Create a boxplot of log revenue (log2_Revenue).

sns.boxplot(data, x='Offer ID', y='log2_Revenue', whis=(0,100))

Add a stripplot.

sns.stripplot(data, x='Offer ID', y='log2_Revenue', alpha=0.3, color='black')

Log Revenue by Offer Type: Boxplot

Now we can see the data better.

> why are there so many zeros?

Exercise 2.2 | Investigate the Data

What’s in the Event column?

Count the unique values in Event.

data['Event'].value_counts()
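A small sketch of what `value_counts` returns, using a made-up Series. The labels 'offer received' and 'offer completed' are assumptions for illustration; only 'transaction' appears in the rows shown earlier.

```python
import pandas as pd

# Toy Event column; the two offer labels are hypothetical placeholders.
events = pd.Series(
    ['offer received', 'offer received', 'offer received',
     'offer completed', 'transaction', 'transaction'],
    name='Event',
)

# Raw counts, sorted from most to least frequent.
counts = events.value_counts()

# Shares instead of counts: each category's fraction of all rows.
shares = events.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Passing `normalize=True` is often easier to read when the question is what fraction of rows are actual purchases.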

Summarize counts using a countplot.

sns.countplot(data, x='Event')

Three Event Types

Not all rows are purchases

> most rows are offers, not transactions

Exercise 2.2 | Revenue by Event

Where is revenue coming from?

Create a boxplot of Revenue by Event.

sns.boxplot(data, x='Event', y='Revenue', whis=(0,100))

Revenue by Event Type

Only transactions have revenue

> offers and completions have zero revenue — that’s why we see so many zeros
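The same check can be done numerically rather than visually. The sketch below uses a toy table with assumed event labels ('offer received', 'offer completed') and computes, per event type, how much revenue there is and what share of rows are zero.

```python
import pandas as pd

# Toy data; event labels other than 'transaction' are hypothetical.
toy = pd.DataFrame({
    'Event': ['offer received', 'offer received', 'offer completed',
              'transaction', 'transaction'],
    'Revenue': [0.0, 0.0, 0.0, 18.97, 34.56],
})

# Row count, total revenue, and largest value within each event type.
by_event = toy.groupby('Event')['Revenue'].agg(['count', 'sum', 'max'])

# Share of rows with exactly zero revenue, by event type.
zero_share = (toy['Revenue'] == 0).groupby(toy['Event']).mean()

print(by_event)
print(zero_share)
```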

Exercise 2.2 | Summarize Transactions

Keep only rows where Event equals transaction.

transactions = data[data['Event'] == 'transaction']
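A hedged sketch of the same filter on toy data (the 'offer received' label is an assumption). Adding `.copy()` is optional, but if you create new columns after filtering, it avoids pandas' SettingWithCopyWarning and leaves the original table untouched.

```python
import numpy as np
import pandas as pd

# Toy data; 'offer received' is a hypothetical label.
toy = pd.DataFrame({
    'Event': ['offer received', 'transaction', 'offer received', 'transaction'],
    'Revenue': [0.0, 18.97, 0.0, 34.56],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows.
mask = toy['Event'] == 'transaction'
transactions = toy[mask].copy()

# Safe to add columns to the copy without touching the original table.
transactions['log2_Revenue'] = np.log2(1 + transactions['Revenue'])

# Compare row counts before and after to see how much was filtered out.
print(len(toy), len(transactions))
```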

Create a boxplot of log revenue by offer type using only transactions.

sns.boxplot(transactions, x='Offer ID', y='log2_Revenue', whis=(0,100))
sns.stripplot(transactions, x='Offer ID', y='log2_Revenue', alpha=0.3, color='black')

Summarize Transactions

Every row is a real purchase.

> which offer type has higher spending?

Exercise 2.2 | Grouped Statistics

Calculate the mean, standard deviation, and count of log revenue by offer type.

transactions.groupby('Offer ID')['log2_Revenue'].agg(['mean', 'std', 'count'])
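To see what the grouped call produces, here it is applied to only the five sample rows shown earlier (so the numbers will not match the full-data table). The name `sample` is illustrative.

```python
import numpy as np
import pandas as pd

# The five transaction rows shown at the start of the section.
sample = pd.DataFrame({
    'Revenue': [34.56, 18.97, 33.90, 18.01, 19.11],
    'Offer ID': ['2off10', '2off10', 'Bogo 5', 'Bogo 10', 'Bogo 10'],
})
sample['log2_Revenue'] = np.log2(1 + sample['Revenue'])

# One row per offer type, one column per statistic.
stats = sample.groupby('Offer ID')['log2_Revenue'].agg(['mean', 'std', 'count'])

# Note: a group with a single row has an undefined (NaN) standard deviation.
print(stats.round(2))
```

The `count` column is worth reading alongside the mean: a mean based on one or two rows says very little.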

Grouped Statistics

Average log spending by offer type

	mean	std	count
Offer ID
2off10	3.89	1.03	8569
3off7	3.75	1.06	4698
5off20	4.31	0.89	3239
Bogo 10	4.33	0.65	6308
Bogo 5	3.95	0.81	7803

> Bogo 10 has the highest mean (4.33), just ahead of 5off20 (4.31)

> Bogo 10 has a higher mean than Bogo 5

> but is this the whole story?
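Log means are hard to read in dollars, so it can help to undo the transformation. The sketch below back-transforms the means from the table with 2**mean - 1. Note the assumption in this reading: the result is a typical value in the geometric-mean sense for (1 + Revenue), not the ordinary average of Revenue.

```python
# Mean log2_Revenue by offer type, copied from the table above.
means = {'2off10': 3.89, '3off7': 3.75, '5off20': 4.31, 'Bogo 10': 4.33, 'Bogo 5': 3.95}

# Undo log2(1 + x): raise 2 to the mean, then subtract the 1 that was added.
typical_dollars = {offer: 2 ** m - 1 for offer, m in means.items()}

# A gap on the log2 scale is a ratio on the (1 + dollars) scale.
ratio = 2 ** (means['Bogo 10'] - means['Bogo 5'])

for offer, dollars in typical_dollars.items():
    print(offer, round(dollars, 2))
print('Bogo 10 vs Bogo 5 ratio:', round(ratio, 2))
```

Read this way, the 0.38 gap between Bogo 10 and Bogo 5 corresponds to roughly 30% higher typical spending.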

The Workflow

Filter → Transform → Group → Summarize

Filter — keep only relevant rows
Transform — log scale for skewed data
Group — organize by a categorical variable
Summarize — compare distributions across groups

> in practice the workflow is not a straight line: we plotted first, then went back to filter
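The four steps above can be sketched as one small function. The function name `summarize_by_offer` and the toy rows (including the 'offer received' label) are assumptions for illustration; the column names follow the slides.

```python
import numpy as np
import pandas as pd

def summarize_by_offer(df):
    """Filter, transform, group, and summarize in one pass."""
    # Filter: keep only real purchases.
    tx = df[df['Event'] == 'transaction'].copy()
    # Transform: log scale for skewed revenue.
    tx['log2_Revenue'] = np.log2(1 + tx['Revenue'])
    # Group and summarize: one row of statistics per offer type.
    return tx.groupby('Offer ID')['log2_Revenue'].agg(['mean', 'std', 'count'])

# Toy input; the non-transaction label is hypothetical.
toy = pd.DataFrame({
    'Event': ['offer received', 'transaction', 'transaction', 'transaction', 'transaction'],
    'Revenue': [0.0, 18.01, 19.11, 33.90, 15.00],
    'Offer ID': ['Bogo 10', 'Bogo 10', 'Bogo 10', 'Bogo 5', 'Bogo 5'],
})

summary = summarize_by_offer(toy)
print(summary.round(2))
```

Wrapping the steps in a function makes it easy to rerun the whole chain after changing the filter or the transformation.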

Distributions by Offer Type

Each point is one transaction

> substantial variation within each offer type

> why are there small purchases in 5off20?

Exercise 2.2 | Compare Two Offers

Filter for just Bogo 5 and Bogo 10, then create a boxplot to compare them.

two_offers = transactions[transactions['Offer ID'].isin(['Bogo 5', 'Bogo 10'])]
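A short sketch of how `isin` behaves, on a made-up table (the revenue values are illustrative). It tests each row's Offer ID for membership in the list, which a plain `==` comparison against a list cannot do.

```python
import pandas as pd

# Toy transactions table with four offer types.
toy = pd.DataFrame({
    'Offer ID': ['2off10', 'Bogo 5', 'Bogo 10', '5off20', 'Bogo 5'],
    'Revenue': [18.97, 33.90, 18.01, 25.00, 12.50],
})

# True where Offer ID is one of the listed values, False otherwise.
keep = toy['Offer ID'].isin(['Bogo 5', 'Bogo 10'])
two_offers = toy[keep]

# Confirm that only the two offers of interest remain.
print(two_offers['Offer ID'].value_counts())
```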
sns.boxplot(two_offers, x='Offer ID', y='log2_Revenue', whis=(0,100))

Comparing Two Offers

Bogo 5 vs Bogo 10: Do buyers respond differently?

> Bogo 10 has higher average spending, but look at the overlap

The Overlap Problem

Many Bogo 5 buyers spent more than many Bogo 10 buyers

> when distributions overlap this much, is the difference meaningful?
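One rough way to put a number on the overlap uses only the means and standard deviations from the grouped statistics table. The sketch rests on a strong assumption that is not checked here: that log2_Revenue is approximately normal within each offer and that buyers are independent. Under that assumption it estimates the chance that a randomly chosen Bogo 5 buyer outspends a randomly chosen Bogo 10 buyer.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Mean and standard deviation of log2_Revenue, from the table above.
mean_bogo5, sd_bogo5 = 3.95, 0.81
mean_bogo10, sd_bogo10 = 4.33, 0.65

# Under the normal assumption, the difference of two independent draws
# is normal with this mean gap and this combined spread.
z = (mean_bogo5 - mean_bogo10) / sqrt(sd_bogo5 ** 2 + sd_bogo10 ** 2)

# Probability that the Bogo 5 draw is larger than the Bogo 10 draw.
p_bogo5_higher = normal_cdf(z)
print(round(p_bogo5_higher, 2))
```

If the groups were identical this probability would be 0.5; a value around 0.36 says the groups differ on average while still overlapping heavily.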

The Key Question

Is the difference real or just noise?

Average spending differs across offer types
But there’s substantial variation within each group
Some “lower” offer buyers outspent “higher” offer buyers

Question: Is the difference we observe actually meaningful?

Part 2.2 | Summary

Summary statistics can hide problems — always visualize
Filter your data — make sure you’re analyzing what you think
Log transformation helps with skewed data
Boxplots by category show distributions, not just means
Overlapping distributions raise inference questions

Building Blocks

What this unit adds to your toolkit

Block	Part 2.2
Variables	Numerical + Categorical
Structures	Cross-section
Operations	Filter, Log transform, Groupby
Visualizations	Bar chart, Boxplot, Stripplot by category