Sharks and ice cream¶
A confounding variable can induce an association between variables that are not causally linked. To tackle this problem, you have to control for all confounders, that is, include them in your model.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from plot_style import line_params, scatter_params
Define the simulation mechanism¶
The variables here are: ice cream sales, shark attacks, and temperature.
We assume the causal structure: temperature affects ice cream sales (TEMP → ICE) and temperature affects shark attacks (TEMP → SHARK). There is no direct causal link between ice cream and sharks.
np.random.seed(42)
n = 300
# Temperature is independent and influences both variables
temp = np.random.uniform(10, 30, size=n) # Temperature in degrees Celsius
ice_sales = 10 * temp + np.random.normal(0, 10, size=n)
shark_attacks = 0.5 * temp + np.random.normal(0, 2, size=n)
df = pd.DataFrame({"temp": temp, "ice_sales": ice_sales, "shark_attacks": shark_attacks})
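As a quick sanity check, we can confirm that both variables track temperature, their common cause:
# Both ice_sales and shark_attacks should correlate strongly with temp,
# while their mutual correlation is purely induced by it
print(df.corr().round(3))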
# Scatter plot: ice_sales vs. shark_attacks
plt.scatter(df["ice_sales"], df["shark_attacks"], **scatter_params)
plt.xlabel("ice cream sales")
plt.ylabel("shark attacks")
# Naive correlation between ice cream sales and shark attacks
corr = df["ice_sales"].corr(df["shark_attacks"])
print(f"Correlation between variables = {corr:.3f}. Very high!")
Correlation between variables = 0.801. Very high!
Train model 1¶
The first model predicts shark_attacks from ice_sales alone.
# Naive regression
X1 = sm.add_constant(df['ice_sales'])
model1 = sm.OLS(df['shark_attacks'], X1).fit()
print("\nNaive regression (shark_attacks ~ ice_sales):")
print(model1.summary())
x_vals = np.linspace(df["ice_sales"].min(), df["ice_sales"].max(), 100)
y_vals = model1.params['const'] + model1.params["ice_sales"] * x_vals
plt.plot(x_vals, y_vals, **line_params)
plt.scatter(df["ice_sales"], df["shark_attacks"], **scatter_params)
plt.xlabel("ice cream sales")
plt.ylabel("shark attacks")
Naive regression (shark_attacks ~ ice_sales):
                            OLS Regression Results
==============================================================================
Dep. Variable:          shark_attacks   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.641
Method:                 Least Squares   F-statistic:                     534.0
Date:                Tue, 26 Aug 2025   Prob (F-statistic):           2.09e-68
Time:                        07:17:17   Log-Likelihood:                -637.45
No. Observations:                 300   AIC:                             1279.
Df Residuals:                     298   BIC:                             1286.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0371      0.404      2.566      0.011       0.242       1.832
ice_sales      0.0449      0.002     23.108      0.000       0.041       0.049
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   2.087
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.299
Skew:                           0.069   Prob(JB):                        0.861
Kurtosis:                       2.929   Cond. No.                         716.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Train model 2¶
The second model predicts shark_attacks from both ice_sales and temperature, i.e., it controls for the confounder.
# Fit linear model controlling for the confounder (temperature)
X2 = sm.add_constant(df[["ice_sales", "temp"]])
model2 = sm.OLS(df["shark_attacks"], X2).fit()
print("\nRegression (shark_attacks ~ ice_sales, temp):")
print(model2.summary())
# Prepare grid for regression plane
ice_grid, temp_grid = np.meshgrid(
    np.linspace(df["ice_sales"].min(), df["ice_sales"].max(), 30),
    np.linspace(df["temp"].min(), df["temp"].max(), 30),
)
shark_pred = (
    model2.params["const"]
    + model2.params["ice_sales"] * ice_grid
    + model2.params["temp"] * temp_grid
)
# 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(
    df["ice_sales"],
    df["temp"],
    df["shark_attacks"],
    **scatter_params
)
ax.plot_surface(ice_grid, temp_grid, shark_pred, color="gray", alpha=0.5)
ax.set_xlabel("ice cream sales")
ax.set_ylabel("temperature")
ax.set_zlabel("shark attacks")
Regression (shark_attacks ~ ice_sales, temp):
                            OLS Regression Results
==============================================================================
Dep. Variable:          shark_attacks   R-squared:                       0.662
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     291.0
Date:                Tue, 26 Aug 2025   Prob (F-statistic):           1.06e-70
Time:                        07:32:23   Log-Likelihood:                -628.71
No. Observations:                 300   AIC:                             1263.
Df Residuals:                     297   BIC:                             1275.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6378      0.404      1.577      0.116      -0.158       1.434
ice_sales     -0.0033      0.012     -0.287      0.775      -0.026       0.019
temp           0.5022      0.119      4.222      0.000       0.268       0.736
==============================================================================
Omnibus:                        0.106   Durbin-Watson:                   2.077
Prob(Omnibus):                  0.948   Jarque-Bera (JB):                0.182
Skew:                           0.041   Prob(JB):                        0.913
Kurtosis:                       2.911   Cond. No.                         742.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
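Once temperature enters the model, the ice_sales coefficient collapses to roughly zero (p ≈ 0.78), while temp recovers the true simulated effect of about 0.5. A side-by-side look at the two fitted models makes the contrast explicit:
# Compare the ice_sales coefficient with and without the confounder
print(f"model 1 (no temp):   ice_sales coef = {model1.params['ice_sales']:.4f}")
print(f"model 2 (with temp): ice_sales coef = {model2.params['ice_sales']:.4f}")
print(f"model 2 temp coef:   {model2.params['temp']:.4f} (simulated value: 0.5)")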
Visualize fit in 3D¶
(The interactive figure may not render in a static export of this notebook, but it's a cool visual.)
import plotly.graph_objs as go
# Scatter points
scatter = go.Scatter3d(
    x=df["ice_sales"],
    y=df["temp"],
    z=df["shark_attacks"],
    mode="markers",
    marker=dict(size=4, color="lightblue", line=dict(width=0.5, color="gray")),
    name="Data",
)
# Regression plane
surface = go.Surface(
    x=ice_grid,
    y=temp_grid,
    z=shark_pred,
    opacity=0.5,
    colorscale="Greys",
    name="Regression Plane",
    showscale=False,
)
fig = go.Figure(data=[scatter, surface])
fig.update_layout(
    scene=dict(
        xaxis_title="ice cream sales", yaxis_title="temperature", zaxis_title="shark attacks"
    ),
    margin=dict(l=0, r=0, b=0, t=0),
)
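In a notebook the figure object renders on its own; if you run this as a plain Python script, display it explicitly:
fig.show()  # opens the interactive figure in a browser tab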
Check residual model¶
Regress temperature out of both variables, then train a model on the residuals.
# Partial regression plot: residuals after removing the temperature effect
ice_model = sm.OLS(df['ice_sales'], sm.add_constant(df['temp'])).fit()
shark_model = sm.OLS(df['shark_attacks'], sm.add_constant(df['temp'])).fit()
res_ice = pd.Series(ice_model.resid, name='res_ice')
res_shark = pd.Series(shark_model.resid, name='res_shark')
res_df = pd.DataFrame({'res_ice': res_ice, 'res_shark': res_shark})
# Fit regression on residuals
partial_model = sm.OLS(res_df['res_shark'], sm.add_constant(res_df['res_ice'])).fit()
print(partial_model.summary())
plt.figure()
plt.scatter(res_df['res_ice'], res_df['res_shark'], **scatter_params)
x_vals2 = np.linspace(res_df['res_ice'].min(), res_df['res_ice'].max(), 100)
y_vals2 = partial_model.params['const'] + partial_model.params['res_ice'] * x_vals2
plt.plot(x_vals2, y_vals2, **line_params)
plt.xlabel('residual ice cream sales')
plt.ylabel('residual shark attacks')
plt.title('including temperature')
                            OLS Regression Results
==============================================================================
Dep. Variable:              res_shark   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                   0.08246
Date:                Tue, 26 Aug 2025   Prob (F-statistic):              0.774
Time:                        07:36:36   Log-Likelihood:                -628.71
No. Observations:                 300   AIC:                             1261.
Df Residuals:                     298   BIC:                             1269.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -7.536e-15      0.114  -6.61e-14      1.000      -0.224       0.224
res_ice       -0.0033      0.012     -0.287      0.774      -0.026       0.019
==============================================================================
Omnibus:                        0.106   Durbin-Watson:                   2.077
Prob(Omnibus):                  0.948   Jarque-Bera (JB):                0.182
Skew:                           0.041   Prob(JB):                        0.913
Kurtosis:                       2.911   Cond. No.                         9.86
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
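Note that the slope on the residuals (-0.0033) matches the ice_sales coefficient from model 2 exactly. This is the Frisch–Waugh–Lovell theorem at work: regressing residuals on residuals reproduces the corresponding multiple-regression coefficient. A quick check using the models fitted above:
# Frisch-Waugh-Lovell: the partial-regression slope equals the
# ice_sales coefficient from the multiple regression
print(f"partial slope: {partial_model.params['res_ice']:.6f}")
print(f"model 2 slope: {model2.params['ice_sales']:.6f}")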
Conclusion¶
We know ice cream sales have nothing to do with shark attack rates... but a model trained on these two variables alone will suggest they do! 🦈🍨
Next time you train a model and want to use it to make better decisions, be careful: if your goal is to change X in order to affect Y, make sure you haven't forgotten any confounders in your analysis.
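If you're skeptical, here is a minimal intervention sketch, assuming the same data-generating process as above: force ice cream sales to an arbitrary high value and shark attacks don't budge, because the true mechanism links sharks only to temperature.
# do(ice_sales = 300): intervene on sales without touching temperature
temp_new = np.random.uniform(10, 30, size=n)
ice_forced = np.full(n, 300.0)  # arbitrary intervened value; deliberately unused below
shark_new = 0.5 * temp_new + np.random.normal(0, 2, size=n)  # same mechanism: no ice term
print(f"mean shark attacks after the intervention: {shark_new.mean():.2f}")
print(f"mean shark attacks in the original data:   {df['shark_attacks'].mean():.2f}")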