Sharks and ice cream¶
A confounding variable can induce an association between variables that are not causally linked. To tackle this problem, you have to control for all confounders, that is, include them in your model.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from plot_style import line_params, scatter_params
Define the simulation mechanism¶
The variables here are: ice cream sales, shark attacks, and temperature.
We assume the causal structure: temperature affects ice cream sales (TEMP → ICE) and temperature affects shark attacks (TEMP → SHARK). There is no direct causal link between ice cream and sharks.
np.random.seed(42)
n = 300
# Temperature is independent and influences both variables
temp = np.random.uniform(10, 30, size=n) # Temperature in degrees Celsius
ice_sales = 10 * temp + np.random.normal(0, 10, size=n)
shark_attacks = 0.5 * temp + np.random.normal(0, 2, size=n)
df = pd.DataFrame({"temp": temp, "ice_sales": ice_sales, "shark_attacks": shark_attacks})
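As a quick sanity check, we can confirm that both variables track temperature, their common cause:
# Both ice_sales and shark_attacks should correlate strongly with temp,
# while their mutual correlation is purely induced by it
print(df.corr().round(3))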
# Scatter plot: ice_sales vs. shark_attacks
plt.scatter(df["ice_sales"], df["shark_attacks"], **scatter_params)
plt.xlabel("ice cream sales")
plt.ylabel("shark attacks")
# Naive correlation between ice cream sales and shark attacks
corr = df["ice_sales"].corr(df["shark_attacks"])
print(f"Correlation between variables = {corr:.3f}. Very high!")
Correlation between variables = 0.801. Very high!
Train model 1¶
The first model predicts shark_attacks from ice_sales alone.
# Naive regression
X1 = sm.add_constant(df['ice_sales'])
model1 = sm.OLS(df['shark_attacks'], X1).fit()
print("\nNaive regression (shark_attacks ~ ice_sales):")
print(model1.summary())
x_vals = np.linspace(df["ice_sales"].min(), df["ice_sales"].max(), 100)
y_vals = model1.params['const'] + model1.params["ice_sales"] * x_vals
plt.plot(x_vals, y_vals, **line_params)
plt.scatter(df["ice_sales"], df["shark_attacks"], **scatter_params)
plt.xlabel("ice cream sales")
plt.ylabel("shark attacks")
Naive regression (shark_attacks ~ ice_sales):
                            OLS Regression Results
==============================================================================
Dep. Variable:          shark_attacks   R-squared:                       0.642
Model:                            OLS   Adj. R-squared:                  0.641
Method:                 Least Squares   F-statistic:                     534.0
Date:                Tue, 26 Aug 2025   Prob (F-statistic):           2.09e-68
Time:                        07:17:17   Log-Likelihood:                -637.45
No. Observations:                 300   AIC:                             1279.
Df Residuals:                     298   BIC:                             1286.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0371      0.404      2.566      0.011       0.242       1.832
ice_sales      0.0449      0.002     23.108      0.000       0.041       0.049
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   2.087
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.299
Skew:                           0.069   Prob(JB):                        0.861
Kurtosis:                       2.929   Cond. No.                         716.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Train model 2¶
The second model predicts shark_attacks from both ice_sales and temperature, i.e., it controls for the confounder.
# Fit linear model controlling for the confounder (temperature)
X2 = sm.add_constant(df[["ice_sales", "temp"]])
model2 = sm.OLS(df["shark_attacks"], X2).fit()
print("\nRegression (shark_attacks ~ ice_sales, temp):")
print(model2.summary())
# Prepare grid for regression plane
ice_grid, temp_grid = np.meshgrid(
    np.linspace(df["ice_sales"].min(), df["ice_sales"].max(), 30),
    np.linspace(df["temp"].min(), df["temp"].max(), 30),
)
shark_pred = (
    model2.params["const"]
    + model2.params["ice_sales"] * ice_grid
    + model2.params["temp"] * temp_grid
)
# 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(
    df["ice_sales"],
    df["temp"],
    df["shark_attacks"],
    **scatter_params
)
ax.plot_surface(ice_grid, temp_grid, shark_pred, color="gray", alpha=0.5)
ax.set_xlabel("ice cream sales")
ax.set_ylabel("temperature")
ax.set_zlabel("shark attacks")
Regression (shark_attacks ~ ice_sales, temp):
                            OLS Regression Results
==============================================================================
Dep. Variable:          shark_attacks   R-squared:                       0.662
Model:                            OLS   Adj. R-squared:                  0.660
Method:                 Least Squares   F-statistic:                     291.0
Date:                Tue, 26 Aug 2025   Prob (F-statistic):           1.06e-70
Time:                        07:32:23   Log-Likelihood:                -628.71
No. Observations:                 300   AIC:                             1263.
Df Residuals:                     297   BIC:                             1275.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6378      0.404      1.577      0.116      -0.158       1.434
ice_sales     -0.0033      0.012     -0.287      0.775      -0.026       0.019
temp           0.5022      0.119      4.222      0.000       0.268       0.736
==============================================================================
Omnibus:                        0.106   Durbin-Watson:                   2.077
Prob(Omnibus):                  0.948   Jarque-Bera (JB):                0.182
Skew:                           0.041   Prob(JB):                        0.913
Kurtosis:                       2.911   Cond. No.                         742.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
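Once temperature enters the model, the ice_sales coefficient collapses to roughly zero (p ≈ 0.78), while temp recovers the true simulated effect of about 0.5. A side-by-side look at the two fitted models makes the contrast explicit:
# Compare the ice_sales coefficient with and without the confounder
print(f"model 1 (no temp):   ice_sales coef = {model1.params['ice_sales']:.4f}")
print(f"model 2 (with temp): ice_sales coef = {model2.params['ice_sales']:.4f}")
print(f"model 2 temp coef:   {model2.params['temp']:.4f} (simulated value: 0.5)")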
Visualize fit in 3D¶
(The interactive figure may not render in a static export of this notebook, but it's a cool visual.)
import plotly.graph_objs as go
# Scatter points
scatter = go.Scatter3d(
    x=df["ice_sales"],
    y=df["temp"],
    z=df["shark_attacks"],
    mode="markers",
    marker=dict(size=4, color="lightblue", line=dict(width=0.5, color="gray")),
    name="Data",
)
# Regression plane
surface = go.Surface(
    x=ice_grid,
    y=temp_grid,
    z=shark_pred,
    opacity=0.5,
    colorscale="Greys",
    name="Regression Plane",
    showscale=False,
)
fig = go.Figure(data=[scatter, surface])
fig.update_layout(
    scene=dict(
        xaxis_title="ice cream sales", yaxis_title="temperature", zaxis_title="shark attacks"
    ),
    margin=dict(l=0, r=0, b=0, t=0),
)
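In a notebook the figure object renders on its own; if you run this as a plain Python script, display it explicitly:
fig.show()  # opens the interactive figure in a browser tab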
Check residual model¶
Regress temperature out of both variables, then train a model on the residuals.
# Partial regression plot: residuals after removing the temperature effect
ice_model = sm.OLS(df['ice_sales'], sm.add_constant(df['temp'])).fit()
shark_model = sm.OLS(df['shark_attacks'], sm.add_constant(df['temp'])).fit()
res_ice = pd.Series(ice_model.resid, name='res_ice')
res_shark = pd.Series(shark_model.resid, name='res_shark')
res_df = pd.DataFrame({'res_ice': res_ice, 'res_shark': res_shark})
# Fit regression on residuals
partial_model = sm.OLS(res_df['res_shark'], sm.add_constant(res_df['res_ice'])).fit()
print(partial_model.summary())
plt.figure()
plt.scatter(res_df['res_ice'], res_df['res_shark'], **scatter_params)
x_vals2 = np.linspace(res_df['res_ice'].min(), res_df['res_ice'].max(), 100)
y_vals2 = partial_model.params['const'] + partial_model.params['res_ice'] * x_vals2
plt.plot(x_vals2, y_vals2, **line_params)
plt.xlabel('residual ice cream sales')
plt.ylabel('residual shark attacks')
plt.title('including temperature')
                            OLS Regression Results
==============================================================================
Dep. Variable:              res_shark   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                   0.08246
Date:                Tue, 26 Aug 2025   Prob (F-statistic):              0.774
Time:                        07:36:36   Log-Likelihood:                -628.71
No. Observations:                 300   AIC:                             1261.
Df Residuals:                     298   BIC:                             1269.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -7.536e-15      0.114  -6.61e-14      1.000      -0.224       0.224
res_ice       -0.0033      0.012     -0.287      0.774      -0.026       0.019
==============================================================================
Omnibus:                        0.106   Durbin-Watson:                   2.077
Prob(Omnibus):                  0.948   Jarque-Bera (JB):                0.182
Skew:                           0.041   Prob(JB):                        0.913
Kurtosis:                       2.911   Cond. No.                         9.86
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
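Note that the slope on the residuals (-0.0033) matches the ice_sales coefficient from model 2 exactly. This is the Frisch–Waugh–Lovell theorem at work: regressing residuals on residuals reproduces the corresponding multiple-regression coefficient. A quick check using the models fitted above:
# Frisch-Waugh-Lovell: the partial-regression slope equals the
# ice_sales coefficient from the multiple regression
print(f"partial slope: {partial_model.params['res_ice']:.6f}")
print(f"model 2 slope: {model2.params['ice_sales']:.6f}")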
Conclusion¶
We know ice cream sales have nothing to do with shark attack rates... but a model trained on these two variables alone will suggest they do! 🦈🍨
Next time you train a model and want to use it to make better decisions, be careful: if your goal is to change X in order to affect Y, make sure you haven't forgotten any confounders in your analysis.
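If you're skeptical, here is a minimal intervention sketch, assuming the same data-generating process as above: force ice cream sales to an arbitrary high value and shark attacks don't budge, because the true mechanism links sharks only to temperature.
# do(ice_sales = 300): intervene on sales without touching temperature
temp_new = np.random.uniform(10, 30, size=n)
ice_forced = np.full(n, 300.0)  # arbitrary intervened value; deliberately unused below
shark_new = 0.5 * temp_new + np.random.normal(0, 2, size=n)  # same mechanism: no ice term
print(f"mean shark attacks after the intervention: {shark_new.mean():.2f}")
print(f"mean shark attacks in the original data:   {df['shark_attacks'].mean():.2f}")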