ECON 0150 | Economic Data Analysis

The economist’s data analysis pipeline.


Part 2.3 | Filtering Data

A New US Coffee Shop

Let’s use Starbucks_Location_Hours.csv to inform a new shop’s hours.



  • The coffee shop is opening in upstate New York, near the border with Canada.
  • You’re asked to help make some decisions about how to run the shop when it opens.
  • The dataset Starbucks_Location_Hours.csv contains information about Starbucks coffee shops globally.

A New Coffee Shop

Q. When might be a good time for the coffee shop to open?

    storeNumber  COUNTRY_CODE  open  close  duration_hr
0   34638-85784            HK     8     22           14
1  32141-267986            HK     7     22           14
2  15035-155445            HK     8     22           14
3  49646-268445            HK     8     22           14
4  31944-224544            HK     8     20           12


> we should use a figure

A New Coffee Shop

Q. When might be a good time for the coffee shop to open?

> so it seems best to open sometime in the morning… makes sense

> but what if there’s something specific about US coffee drinkers?

Opening Time: filter by category

Q. When might be a good time for the coffee shop to open?

> here we’ve filtered for US locations

> so it seems US Starbucks open earlier

Opening Time: filter by category

Q. When might be a good time for the coffee shop to open?

> but maybe we should look at Canadian shops too…

> let’s filter for BOTH countries

Opening Time: combining filters

Let’s use some Boolean logic :)

> is there something different between the US and Canada?

Opening Time: combining filters

Q. When might be a good time for the coffee shop to open?

> so not much difference between when shops in the US and Canada open

Opening Time: combining filters

Q. When might be a good time for the coffee shop to open?

> no data! no coffee shop can be in the US AND Canada!

Opening Time: combining filters

Q. When might be a good time for the coffee shop to open?

> so coffee shops in the US and Canada open much earlier than the rest of the world

Opening Time: conclusions

Q. When might be a good time for the coffee shop to open?


  • So let’s open earlier than 7 AM.

Exercise 2.3 | Filter by Category

Use Starbucks_Location_Hours.csv to inform a new shop’s hours.



  • The coffee shop is opening in upstate New York, near the border with Canada.
  • You’re asked to help make some decisions about how to run the shop when it opens.
  • The dataset Starbucks_Location_Hours.csv contains information about Starbucks coffee shops globally.

Coffee Shop Hours: filter by category

Q. When might be a good time for the coffee shop to open?

Coffee Shop Hours: plot all opening times

Q. When might be a good time for the coffee shop to open?

# Histogram
sns.histplot(hours, x='open', bins=range(0,25,1))
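The plotting call above relies on a DataFrame named hours that is never defined on the slides. A minimal sketch of how it might be created, assuming the file is comma-separated and uses the column names from the preview table. So that the example runs without the file, the five preview rows are parsed from an inline string instead.

```python
import io
import pandas as pd

# With the file in the working directory, the load would presumably be:
#   hours = pd.read_csv('Starbucks_Location_Hours.csv')
# Here the five preview rows from the slides stand in for the file.
sample = io.StringIO(
    "storeNumber,COUNTRY_CODE,open,close,duration_hr\n"
    "34638-85784,HK,8,22,14\n"
    "32141-267986,HK,7,22,14\n"
    "15035-155445,HK,8,22,14\n"
    "49646-268445,HK,8,22,14\n"
    "31944-224544,HK,8,20,12\n"
)
hours = pd.read_csv(sample)

print(hours.shape)   # (5, 5)
print(hours.dtypes)  # open, close and duration_hr are read as integers
```

Checking the shape and dtypes right after loading is a cheap way to confirm that the hour columns are numeric before plotting or filtering them.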

Filtering Data by Category

Filtering categorical data requires logical operations.

Logic    Python   Example
Equals   ==       df[df['shop'] == 'A']
Unequal  !=       df[df['shop'] != 'A']
NOT      ~        df[~(df['shop'] == 'A')]
In list  .isin()  df[df['shop'].isin(['A', 'B'])]
AND      &        df[(df['shop'] == 'A') & (df['open'] < 7)]
OR       |        df[(df['shop'] == 'A') | (df['open'] < 7)]
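To see what each operator in the table selects, here is a small sketch on a toy DataFrame. The four rows are invented for illustration and do not come from the Starbucks file.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    'shop': ['A', 'A', 'B', 'C'],
    'open': [6, 8, 5, 7],
})

equals = df[df['shop'] == 'A']                         # rows 0, 1
unequal = df[df['shop'] != 'A']                        # rows 2, 3
negated = df[~(df['shop'] == 'A')]                     # same rows as `unequal`
in_list = df[df['shop'].isin(['A', 'B'])]              # rows 0, 1, 2
both = df[(df['shop'] == 'A') & (df['open'] < 7)]      # row 0
either = df[(df['shop'] == 'A') | (df['open'] < 7)]    # rows 0, 1, 2

print(both)
print(either)
```

The parentheses around each comparison matter: in Python, & and | bind more tightly than == and <, so leaving them out changes how the expression is grouped and typically raises an error.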

Coffee Shop Hours: filter for US locations

Q. When might be a good time for the coffee shop to open?

# Decide whether each row's country code is 'US'
us_codes = (hours['COUNTRY_CODE'] == 'US')

Coffee Shop Hours: filter for US locations

Q. When might be a good time for the coffee shop to open?

# Decide whether each row's country code is 'US'
us_codes = (hours['COUNTRY_CODE'] == 'US')

# Select the rows with True
us_hours = hours[us_codes]
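It can help to look at the intermediate object before using it. The sketch below, on four toy rows invented for illustration, shows that the comparison produces one True/False value per row and that indexing with it keeps only the True rows.

```python
import pandas as pd

# Toy rows, invented for illustration only
hours = pd.DataFrame({
    'COUNTRY_CODE': ['US', 'HK', 'CA', 'US'],
    'open': [5, 8, 6, 6],
})

# One True/False per row
us_codes = (hours['COUNTRY_CODE'] == 'US')
print(us_codes.tolist())  # [True, False, False, True]
print(us_codes.sum())     # 2, because True counts as 1

# Keep only the rows where the mask is True
us_hours = hours[us_codes]
print(len(us_hours))      # 2
```

Summing the mask is a quick way to check how many rows a filter will keep before applying it.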

Coffee Shop Hours: filter for US locations

Q. When might be a good time for the coffee shop to open?

# Select the rows with country code equal to 'US'
us_hours = hours[hours['COUNTRY_CODE'] == 'US']

Coffee Shop Hours: filter for US locations

Q. When might be a good time for the coffee shop to open?

# Histogram of US locations
sns.histplot(us_hours, x='open', bins=range(0,25,1))

Coffee Shop Hours: shops in either US or CA

Q. When might be a good time for a US coffee shop to open?

# Find the data in either the US or in Canada (CA)
# Method 1: Using OR operator |
us_ca = hours[(hours['COUNTRY_CODE'] == 'US') | (hours['COUNTRY_CODE'] == 'CA')]

Coffee Shop Hours: shops in either US or CA

Q. When might be a good time for a US coffee shop to open?

# Find the data in either the US or in Canada (CA)
# Method 2: Using isin()
us_ca = hours[hours['COUNTRY_CODE'].isin(['US', 'CA'])]
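The two methods should select exactly the same rows. A quick check on toy data (rows invented for illustration):

```python
import pandas as pd

# Toy rows, invented for illustration only
hours = pd.DataFrame({
    'COUNTRY_CODE': ['US', 'HK', 'CA', 'US', 'GB'],
    'open': [5, 8, 6, 6, 7],
})

# Method 1: OR operator
with_or = hours[(hours['COUNTRY_CODE'] == 'US') | (hours['COUNTRY_CODE'] == 'CA')]

# Method 2: isin()
with_isin = hours[hours['COUNTRY_CODE'].isin(['US', 'CA'])]

print(with_or.equals(with_isin))  # True
```

With only two countries either form reads fine; isin() tends to stay readable as the list of categories grows.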

Coffee Shop Hours: shops in either US or CA

Q. When might be a good time for a US coffee shop to open?

# Create histogram
sns.histplot(us_ca, x='open', bins=range(0,25,1))

Coffee Shop Hours: shops in both US and CA

Q. When might be a good time for a US coffee shop to open?

What would this dataset look like?

hours[(hours['COUNTRY_CODE'] == 'US') & (hours['COUNTRY_CODE'] == 'CA')]

> it would contain no data!
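A short sketch of why the result is empty, using toy rows invented for illustration: each row holds a single country code, so the two conditions can never both be True for the same row.

```python
import pandas as pd

# Toy rows, invented for illustration only
hours = pd.DataFrame({
    'COUNTRY_CODE': ['US', 'HK', 'CA', 'US'],
    'open': [5, 8, 6, 6],
})

is_us = hours['COUNTRY_CODE'] == 'US'
is_ca = hours['COUNTRY_CODE'] == 'CA'

in_both = hours[is_us & is_ca]    # AND: both conditions on the same row
in_either = hours[is_us | is_ca]  # OR: at least one condition

print(len(in_both))    # 0
print(in_both.empty)   # True
print(len(in_either))  # 3
```

In everyday speech "shops in the US and Canada" usually means the OR filter; the AND filter asks for rows that satisfy both conditions at once.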

Opening Time

Q. How long might be good for the coffee shop to stay open?


  • So let’s open earlier than 7 AM.
  • How long should we stay open?

    storeNumber  COUNTRY_CODE  open  close  duration_hr
0   34638-85784            HK     8     22           14
1  32141-267986            HK     7     22           14
2  15035-155445            HK     8     22           14
3  49646-268445            HK     8     22           14
4  31944-224544            HK     8     20           12

Duration

Q. How long might be good for the coffee shop to stay open?

> so most shops stay open for around 15 hours

> does that mean we should stay open for 15 hours?

Duration: filter by inequality

Q. How long might be good for the coffee shop to stay open?

Symbol  Meaning
=       Equal to
≠       Not equal to
<       Less than
>       Greater than
≤       Less than or equal to
≥       Greater than or equal to

Duration: filter by inequality

Q. How long might be good for the coffee shop to stay open?

> let’s filter for coffee shops that open before 7 AM

Duration: filter by inequality

Q. How long might be good for the coffee shop to stay open?

How would we filter for coffee shops that open before 7 AM?


    storeNumber  COUNTRY_CODE  open  close  duration_hr
0   34638-85784            HK     8     22           14
1  32141-267986            HK     7     22           14
2  15035-155445            HK     8     22           14
3  49646-268445            HK     8     22           14
4  31944-224544            HK     8     20           12

> use “open < 7”

Duration: filter by inequality

Q. How long might be good for the coffee shop to stay open?

> so shops that open early stay open longer

Duration: filter by inequality

Q. How long might be good for the coffee shop to stay open?

> but here we’re looking at all shops globally!

> our shop is opening in the US near Canada, so let’s filter by country too

Duration: filter on category and inequality

Q. How long might be good for the coffee shop to stay open?

> in the US and Canada, shops that open early also stay open longer

> this is hard to see: maybe there’s a more systematic way of showing differences

Duration: conclusions

Q. How long might be good for the coffee shop to stay open?

Coffee Shop Hours: recommendation

Q. How long might be good for the coffee shop to stay open?

  • Opening time: Before 7 AM (around 5-6 AM)
  • Duration: About 16-17 hours (based on early-opening US/CA shops)
  • Closing time: Around 9-11 PM

> this mirrors what early-opening Starbucks shops in the US and Canada already do

Exercise 2.3 | Filter by Inequality

Use Starbucks_Location_Hours.csv to inform a new shop’s hours.

Filtering Data by Inequality

Filtering numerical data requires inequalities.

Symbol  Python  Example
=       ==      df[df['open'] == 7]
≠       !=      df[df['open'] != 7]
<       <       df[df['open'] < 7]
>       >       df[df['open'] > 7]
≤       <=      df[df['open'] <= 7]
≥       >=      df[df['open'] >= 7]
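The difference between the strict and non-strict comparisons is easiest to see at the boundary value. A small sketch on toy opening hours (invented for illustration):

```python
import pandas as pd

# Toy opening hours, invented for illustration only
df = pd.DataFrame({'open': [5, 6, 7, 7, 8]})

before_7 = df[df['open'] < 7]          # 5, 6
at_or_before_7 = df[df['open'] <= 7]   # 5, 6, 7, 7
at_or_after_7 = df[df['open'] >= 7]    # 7, 7, 8

print(before_7['open'].tolist())        # [5, 6]
print(at_or_before_7['open'].tolist())  # [5, 6, 7, 7]

# `< 7` and `>= 7` split the rows with no overlap and nothing left out
print(len(before_7) + len(at_or_after_7) == len(df))  # True
```

Shops that open at exactly 7 are excluded by < 7 and included by <= 7, so "before 7 AM" and "by 7 AM" call for different operators.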

Coffee Shop Hours: filter for early opening shops

Q. How long might be good for the coffee shop to stay open?

# Filter for shops that open before 7 AM
early_hours = hours[hours['open'] < 7]

Coffee Shop Hours: filter for early opening shops

Q. How long might be good for the coffee shop to stay open?

# Create histogram of duration for early-opening shops
sns.histplot(early_hours, x='duration_hr', bins=range(0,25,1))

Coffee Shop Hours: combine filters

Q. How long might be good for the coffee shop to stay open?

# Filter for shops that open early AND are in US or Canada
early_us_ca = hours[(hours['open'] < 7) & (hours['COUNTRY_CODE'].isin(['US', 'CA']))]
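When a filter combines several conditions, giving each mask a name can make the logic easier to read and check. A sketch of the same filter written that way, on toy rows invented for illustration (the names opens_early and in_us_or_ca are hypothetical, not from the slides):

```python
import pandas as pd

# Toy rows, invented for illustration only
hours = pd.DataFrame({
    'COUNTRY_CODE': ['US', 'US', 'CA', 'CA', 'HK', 'HK'],
    'open':         [5, 8, 6, 7, 6, 8],
    'duration_hr':  [17, 13, 16, 14, 15, 14],
})

opens_early = hours['open'] < 7                         # rows 0, 2, 4
in_us_or_ca = hours['COUNTRY_CODE'].isin(['US', 'CA'])  # rows 0, 1, 2, 3

# A row is kept only if BOTH masks are True
early_us_ca = hours[opens_early & in_us_or_ca]
print(early_us_ca.index.tolist())  # [0, 2]
```

Row 4 opens early but is in HK, and rows 1 and 3 are in the US or Canada but do not open before 7, so only rows 0 and 2 pass both conditions.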

Coffee Shop Hours: combine filters

Q. How long might be good for the coffee shop to stay open?

# Create histogram of duration for early-opening US/CA shops
sns.histplot(early_us_ca, x='duration_hr', bins=range(0,25,1))

Coffee Shop Hours: compare durations

Q. How long might be good for the coffee shop to stay open?

# Compare durations: all US/CA shops vs early-opening US/CA shops
import matplotlib.pyplot as plt  # needed for plt.legend()
sns.histplot(us_ca, x='duration_hr', bins=range(0,25,1), label='All US/CA')
sns.histplot(early_us_ca, x='duration_hr', bins=range(0,25,1), label='Early US/CA')
plt.legend()