| | storeNumber | COUNTRY_CODE | open | close | duration_hr |
|---|---|---|---|---|---|
| 0 | 34638-85784 | HK | 8 | 22 | 14 |
| 1 | 32141-267986 | HK | 7 | 22 | 14 |
| 2 | 15035-155445 | HK | 8 | 22 | 14 |
| 3 | 49646-268445 | HK | 8 | 22 | 14 |
| 4 | 31944-224544 | HK | 8 | 20 | 12 |
The economist’s data analysis pipeline.
Let’s use Starbucks_Location_Hours.csv to inform a new shop’s hours.
Starbucks_Location_Hours.csv contains information about Starbucks coffee shops globally.
Q. When might be a good time for the coffee shop to open?
> we should use a figure
> so it seems best to open sometime in the morning… makes sense
> but what if there’s something specific about US coffee drinkers though?
> here we’ve filtered for US locations
> so it seems US Starbucks open earlier
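A minimal sketch of the country filter described in these notes. The rows below are toy values made up for illustration, and the country code 'US' is an assumption about how the file labels US shops; the real data would be loaded with pd.read_csv('Starbucks_Location_Hours.csv').

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'HK', 'US', 'US', 'CA'],
    'open': [8, 7, 5, 6, 6],
})

# Build a True/False mask, then keep only the rows where it is True
is_us = df['COUNTRY_CODE'] == 'US'
us = df[is_us]

print(us)
```

The comparison produces one Boolean per row, and indexing with that mask drops every row marked False.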
> but maybe we should look at Canadian shops too…
> let’s filter for BOTH countries
Let’s use some Boolean logic :)
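One way to keep shops from both countries is an OR between two equality tests; `.isin()` is an equivalent shorthand. This is a sketch on toy rows, and the codes 'US' and 'CA' are assumed labels.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'HK', 'US', 'US', 'CA'],
    'open': [8, 7, 5, 6, 6],
})

# OR: a row is kept if it matches either country
us_or_ca = df[(df['COUNTRY_CODE'] == 'US') | (df['COUNTRY_CODE'] == 'CA')]

# Same selection written as a membership test
same_rows = df[df['COUNTRY_CODE'].isin(['US', 'CA'])]

print(us_or_ca)
```

Each comparison sits in its own parentheses because `|` would otherwise be evaluated before `==`.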
> is there something different between the US and Canada?
> so not much difference between when shops in the US and Canada open
> no data! no coffee shop can be in the US AND Canada!
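The empty result can be reproduced with a small sketch. AND keeps a row only when both tests are True, and a single row has only one country code. The rows and the codes 'US' and 'CA' are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'US', 'US', 'CA'],
    'open': [8, 5, 6, 6],
})

# AND: both conditions must hold for the same row, which never happens here
us_and_ca = df[(df['COUNTRY_CODE'] == 'US') & (df['COUNTRY_CODE'] == 'CA')]

print(len(us_and_ca))
```

In everyday speech "US and Canada" means the union of the two groups, which in code is `|`, not `&`.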
> so coffee shops in US and Canada open much earlier than the rest of the world
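A comparison like this could be sketched by reusing one mask twice: once as-is for the US and Canada, and once negated with `~` for everywhere else. The rows are toy values and the median is just one possible summary, not necessarily the one behind the original figure.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'HK', 'US', 'US', 'CA'],
    'open': [8, 7, 5, 6, 6],
})

in_us_ca = df['COUNTRY_CODE'].isin(['US', 'CA'])

# Typical opening hour inside and outside the two countries
median_us_ca = df[in_us_ca]['open'].median()
median_rest = df[~in_us_ca]['open'].median()

print(median_us_ca, median_rest)
```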

Filtering categorical data requires logical operations.
| Logic | Python | Example |
|---|---|---|
| Equals | == | df[df['shop'] == 'A'] |
| Not equal | != | df[df['shop'] != 'A'] |
| NOT | ~ | df[~(df['shop'] == 'A')] |
| In list | .isin() | df[df['shop'].isin(['A', 'B'])] |
| AND | & | (df['shop'] == 'A') & (df['open'] < 7) |
| OR | \| | (df['shop'] == 'A') \| (df['open'] < 7) |
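The table's operators can be tried on a tiny frame. This sketch uses the 'shop' and 'open' column names from the table; the four rows are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only
df = pd.DataFrame({
    'shop': ['A', 'A', 'B', 'C'],
    'open': [6, 8, 6, 9],
})

equals = df[df['shop'] == 'A']              # rows 0, 1
not_equal = df[df['shop'] != 'A']           # rows 2, 3
negated = df[~(df['shop'] == 'A')]          # same rows as not_equal
in_list = df[df['shop'].isin(['A', 'B'])]   # rows 0, 1, 2

# AND and OR combine two masks; wrap each comparison in parentheses
both = df[(df['shop'] == 'A') & (df['open'] < 7)]    # row 0
either = df[(df['shop'] == 'A') | (df['open'] < 7)]  # rows 0, 1, 2

print(len(both), len(either))
```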
Q. When might be a good time for a US coffee shop to open?
What would this dataset look like?
> it would contain no data!

Q. How long might be good for the coffee shop to stay open?
> so most shops stay open for around 15 hours
> does that mean we should stay open for 15 hours?
| Symbol | Meaning |
|---|---|
| = | Equal to |
| ≠ | Not equal to |
| < | Less than |
| > | Greater than |
| ≤ | Less than or equal to |
| ≥ | Greater than or equal to |
> let’s filter for coffee shops that open before 7 AM
How would we filter for coffee shops that open before 7 AM?
> use “open < 7”
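In pandas, the "open < 7" condition becomes a mask on the open column. The sketch below uses toy values and also shows how the strict `<` differs from `<=`.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'HK', 'US', 'US', 'CA'],
    'open': [8, 7, 5, 6, 6],
})

# Strictly before 7 AM: a shop opening at exactly 7 is excluded
before_7 = df[df['open'] < 7]

# At or before 7 AM: the 7 AM shop is now included
at_or_before_7 = df[df['open'] <= 7]

print(before_7['open'].tolist(), at_or_before_7['open'].tolist())
```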
> so shops that open early stay open longer
> but here we’re looking at all shops globally!
> our shop is opening in the US near Canada, so let’s filter by country too
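Combining the numeric and country filters might look like the sketch below. The variable names us_ca and early_us_ca mirror the ones used in the plotting code, but their definitions here are assumptions, as are the toy rows and the codes 'US' and 'CA'.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['HK', 'HK', 'US', 'US', 'CA'],
    'open': [8, 6, 5, 7, 6],
    'duration_hr': [14, 16, 17, 14, 15],
})

in_us_ca = df['COUNTRY_CODE'].isin(['US', 'CA'])
opens_early = df['open'] < 7

# All shops in the two countries
us_ca = df[in_us_ca]

# Shops that are in the two countries AND open before 7 AM
early_us_ca = df[in_us_ca & opens_early]

print(early_us_ca)
```

Naming each mask first keeps the combined condition readable and avoids the parentheses needed when the comparisons are written inline.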
> shops that open early will stay open longer in the US or Canada
> this is hard to see: maybe there’s a more systematic way of showing differences
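One possible way to make the difference easier to read is a summary table instead of a figure: label each shop as early or not, then summarise duration per group. This is a sketch on made-up rows and is not necessarily the approach the slides go on to use.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real file)
df = pd.DataFrame({
    'COUNTRY_CODE': ['US', 'US', 'US', 'CA', 'CA', 'HK'],
    'open': [5, 6, 8, 6, 9, 8],
    'duration_hr': [17, 16, 13, 15, 12, 14],
})

us_ca = df[df['COUNTRY_CODE'].isin(['US', 'CA'])]

# Turn the Boolean condition into readable group labels
group = (us_ca['open'] < 7).map({True: 'opens before 7', False: 'opens 7 or later'})

# Count, mean and median duration for each group
summary = us_ca.groupby(group)['duration_hr'].agg(['count', 'mean', 'median'])

print(summary)
```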
> this matches what successful coffee shops do in similar markets
Filtering numerical data requires inequalities.
| Symbol | Python | Example |
|---|---|---|
| = | == | df[df['open'] == 7] |
| ≠ | != | df[df['open'] != 7] |
| < | < | df[df['open'] < 7] |
| > | > | df[df['open'] > 7] |
| ≤ | <= | df[df['open'] <= 7] |
| ≥ | >= | df[df['open'] >= 7] |
import matplotlib.pyplot as plt
import seaborn as sns

# Create histogram of duration for early-opening shops
sns.histplot(early_hours, x='duration_hr', bins=range(0,25,1))

# Create histogram of duration for early-opening US/CA shops
sns.histplot(early_us_ca, x='duration_hr', bins=range(0,25,1))
# Compare early vs all shops in US/CA
sns.histplot(us_ca, x='duration_hr', bins=range(0,25,1), label='All US/CA')
sns.histplot(early_us_ca, x='duration_hr', bins=range(0,25,1), label='Early US/CA')
plt.legend()
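To see what bins=range(0,25,1) counts, the same edges can be passed to NumPy's histogram function. This swaps seaborn for np.histogram purely to show the numbers behind the bars, and it assumes histplot treats the range as bin edges. The durations are the five values from the sample rows shown earlier.

```python
import numpy as np

# duration_hr values from the five sample rows shown earlier
durations = [14, 14, 14, 14, 12]

# Edges 0, 1, ..., 24 give 24 one-hour bins
edges = list(range(0, 25, 1))
counts, bin_edges = np.histogram(durations, bins=edges)

# Report only the non-empty bins
for left, count in zip(bin_edges[:-1], counts):
    if count > 0:
        print(f"{left} to {left + 1} hours: {count} shops")
```

Each bin covers one hour, so a shop open for 14 hours lands in the bin that starts at 14.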