Uplift modelling¶
An example of how heterogeneous treatment effects are used to optimize ecommerce platforms. 🚀🚀🚀
The business problem:
We must decide whether or not to give users a 5% discount voucher.
If we give it to everyone, we'll likely get more bookings, but we'll also hurt our margins. The best strategy is therefore to send coupons only to customers who are on the fence and will convert *because* of the voucher, and not to customers who were going to book no matter what! For example, a customer who would book with probability 0.6 with the coupon but only 0.2 without it (uplift 0.4) is worth targeting, while one at 0.9 versus 0.88 (uplift 0.02) is not.
The causal problem:
Our treatment T is offering customers this coupon, a binary variable.
Our outcome Y is the customer finalizing their booking or not, also a binary variable.
Additionally, we have access to a number of other variables X characterizing users: the channel that brought them to the platform, properties of the ongoing session, and so on... We believe some of them could be valuable in predicting the effectiveness of the intervention. In other words, the effect isn't uniform but varies depending on some of these factors.
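Formally, the uplift we are after is the conditional average treatment effect (CATE): the expected difference between a user's potential outcomes with and without the coupon, given their features,

$$\tau(x) = \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big],$$

where $Y(1)$ and $Y(0)$ denote the booking outcome if the coupon is and isn't offered, respectively.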
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from sklearn.model_selection import train_test_split
from uplift_modelling_utils import binned_validation, data_generating_process, uplift_curve
np.random.seed(111)
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("talk", font_scale=1.2)
Collect data and check basic properties¶
The data_generating_process is a complex function defined here. In essence, it returns data, which can be used to train models, and metadata, which is NOT available in practice but which we can explore here to assess model performance exactly.
Let's draw samples and look at some basic metrics...
data, metadata = data_generating_process(20000, setting="observational")
data.head()
| | client_id | Y | T | true_uplift | channel | device | lead_time | funnel_depth | price_sort_used | past_coupon_user | tenure_days | dest_tier | hour_local | recent_ad_exposure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.039828 | email | mobile_web | 45 | 0 | 0 | 1 | 449 | tier1 | 11 | 1 |
| 1 | 1 | 0 | 0 | 0.196222 | paid_search | desktop | 3 | 2 | 0 | 1 | 1178 | tier2 | 13 | 0 |
| 2 | 2 | 1 | 1 | 0.420412 | paid_search | mobile_web | 45 | 2 | 1 | 0 | 20 | tier1 | 18 | 1 |
| 3 | 3 | 0 | 1 | 0.338939 | paid_search | mobile_web | 21 | 1 | 1 | 0 | 575 | tier2 | 19 | 0 |
| 4 | 4 | 0 | 0 | 0.436834 | paid_search | mobile_web | 45 | 2 | 1 | 0 | 932 | tier1 | 15 | 0 |
# Quick sanity checks
print("Dataset:")
print(" n =", len(data))
print(" treat_rate =", data["T"].mean().round(4))
print(" conv_rate_T1 =", data.loc[data["T"] == 1, "Y"].mean().round(4))
print(" conv_rate_T0 =", data.loc[data["T"] == 0, "Y"].mean().round(4))
print(" ATE (naive) =", (data.loc[data["T"] == 1, "Y"].mean() - data.loc[data["T"] == 0, "Y"].mean()).round(4))
Dataset:
 n = 20000
 treat_rate = 0.1898
 conv_rate_T1 = 0.29
 conv_rate_T0 = 0.108
 ATE (naive) = 0.182
We see here that fewer than 20% of users got a voucher, and that the conversion rate is 18.2 percentage points higher in the group that actually got the discount (29.0% vs 10.8%).
How much of that gap was actually due to users receiving the voucher?
Could the voucher have been more effective for some specific user groups?
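Keep in mind that this dataset is observational (the coupon was not assigned at random), so the naive difference in means mixes the causal effect with selection bias:

$$\mathbb{E}[Y \mid T{=}1] - \mathbb{E}[Y \mid T{=}0] = \underbrace{\mathbb{E}[Y(1) - Y(0) \mid T{=}1]}_{\text{effect on the treated}} + \underbrace{\mathbb{E}[Y(0) \mid T{=}1] - \mathbb{E}[Y(0) \mid T{=}0]}_{\text{selection bias}}.$$

If, say, coupons were preferentially shown to highly engaged users, the 0.182 above would overstate their true effect.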
To get more insights into what's really going on, let's train some causal models.
Pre-process the data¶
Categorical variables have to be encoded.
channel and device will be one-hot encoded, whereas ordinal encoding is used to transform dest_tier.
Next, our dataset is split into a train and a test set. Note that we extract the true per-user uplift values u_test from the dataset for validation purposes, but these are NOT available in practice (they would require access to counterfactuals...).
Causal validation is not the main goal of this notebook, so it'll be discussed somewhere else 😈
# One hot encode "channel" and "device"
data = pd.get_dummies(data, columns=["channel", "device"], drop_first=True, dtype=int)
# Ordinal encode "dest_tier"
data.loc[:, "dest_tier"] = data["dest_tier"].map({"tier1": 1, "tier2": 2, "tier3": 3})
data
| | client_id | Y | T | true_uplift | lead_time | funnel_depth | price_sort_used | past_coupon_user | tenure_days | dest_tier | hour_local | recent_ad_exposure | channel_direct | channel_email | channel_organic | channel_paid_search | device_desktop | device_mobile_web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.039828 | 45 | 0 | 0 | 1 | 449 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0.196222 | 3 | 2 | 0 | 1 | 1178 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 2 | 1 | 1 | 0.420412 | 45 | 2 | 1 | 0 | 20 | 1 | 18 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | 0 | 1 | 0.338939 | 21 | 1 | 1 | 0 | 575 | 2 | 19 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 4 | 0 | 0 | 0.436834 | 45 | 2 | 1 | 0 | 932 | 1 | 15 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 19995 | 1 | 0 | 0.390284 | 0 | 2 | 1 | 0 | 348 | 1 | 17 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 19996 | 19996 | 0 | 0 | 0.070274 | 7 | 1 | 1 | 1 | 1592 | 1 | 10 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 19997 | 19997 | 0 | 1 | 0.302809 | 30 | 1 | 0 | 0 | 1242 | 1 | 22 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 19998 | 19998 | 0 | 0 | 0.021376 | 90 | 0 | 0 | 0 | 474 | 3 | 15 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 19999 | 19999 | 0 | 0 | 0.135958 | 45 | 1 | 1 | 0 | 686 | 1 | 17 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
20000 rows × 18 columns
def train_test_split_df(df: pd.DataFrame, features: list, test_size=0.25, seed=123):
x = df[features].values
y = df["Y"].astype(float).values
t = df["T"].astype(float).values
u = df["true_uplift"].astype(float).values
return train_test_split(x, y, t, u, test_size=test_size, random_state=seed, stratify=t)
X_train, X_test, y_train, y_test, t_train, t_test, _, u_test = train_test_split_df(
data,
features=[col for col in data.columns if col not in ["client_id", "Y", "T", "true_uplift"]],
test_size=0.25,
)
Define metalearners¶
All learners map $\mathbb{R}^d$ to $\mathbb{R}$: a feature vector $x$ to the predicted uplift $\hat{\tau}(x)$, a continuous scalar value.
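The T-learner is the simplest option: fit one outcome model per treatment arm, each on that arm's samples only, and subtract their predictions,

$$\hat{\tau}_{\mathrm{T}}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x), \qquad \hat{\mu}_t(x) \approx \mathbb{E}[Y \mid X = x, T = t].$$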
class TLearner:
"""A simple T-Learner implementation."""
def __init__(self, model_factory=lambda: GradientBoostingClassifier(random_state=7)) -> None:
"""Specify the model class (factory) to be used for the two models."""
self.m1 = model_factory()
self.m0 = model_factory()
def fit(self, x: np.ndarray, t: np.ndarray, y: np.ndarray) -> None:
"""Fit the control and treatment models."""
self.m1.fit(x[t == 1], y[t == 1])
self.m0.fit(x[t == 0], y[t == 0])
def predict_uplift(self, x: np.ndarray) -> np.ndarray:
"""Predict the uplift (treatment effect) for each sample in x."""
p1 = self.m1.predict_proba(x)[:, 1]
p0 = self.m0.predict_proba(x)[:, 1]
return p1 - p0
# T-learner
t_learner = TLearner()
t_learner.fit(X_train, t_train, y_train)
class XLearner:
"""A simple X-Learner implementation."""
def __init__(
self,
clf_factory=lambda: GradientBoostingClassifier(random_state=7),
reg_factory=lambda: GradientBoostingRegressor(random_state=7),
prop_factory=lambda: LogisticRegression(max_iter=300),
) -> None:
"""Specify the model classes (factories) to be used."""
self.m1 = clf_factory()
self.m0 = clf_factory()
self.g1 = reg_factory()
self.g0 = reg_factory()
self.prop = prop_factory()
def fit(self, x: np.ndarray, t: np.ndarray, y: np.ndarray) -> None:
"""Fit the two outcome models, the two treatment effect models, and the propensity model."""
self.m1.fit(x[t == 1], y[t == 1])
self.m0.fit(x[t == 0], y[t == 0])
# Estimate unobserved counterfactuals
y1_t1 = y[t == 1]
y0_t1 = self.m0.predict_proba(x[t == 1])[:, 1]
tau_t1 = y1_t1 - y0_t1
y0_t0 = y[t == 0]
y1_t0 = self.m1.predict_proba(x[t == 0])[:, 1]
tau_t0 = y1_t0 - y0_t0
self.g1.fit(x[t == 1], tau_t1)
self.g0.fit(x[t == 0], tau_t0)
self.prop.fit(x, t)
def predict_uplift(self, x: np.ndarray) -> np.ndarray:
"""Predict the uplift for a given set of features."""
e = np.clip(self.prop.predict_proba(x)[:, 1], 1e-3, 1 - 1e-3)
tau1 = self.g1.predict(x)
tau0 = self.g0.predict(x)
return e * tau0 + (1.0 - e) * tau1
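In formulas, the three stages implemented above are: (1) fit the per-arm outcome models $\hat{\mu}_0$ and $\hat{\mu}_1$; (2) impute individual treatment effects, $\tilde{\tau}^{(1)}_i = Y_i - \hat{\mu}_0(x_i)$ for treated users and $\tilde{\tau}^{(0)}_i = \hat{\mu}_1(x_i) - Y_i$ for control users, and regress them on $x$ to obtain $\hat{\tau}_1$ and $\hat{\tau}_0$; (3) blend the two estimates using the propensity score $e(x) = P(T{=}1 \mid X{=}x)$:

$$\hat{\tau}_{\mathrm{X}}(x) = e(x)\,\hat{\tau}_0(x) + \big(1 - e(x)\big)\,\hat{\tau}_1(x).$$

The weighting favours, in each region of feature space, the estimate whose imputation step relied on the outcome model trained with more data there.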
x_learner = XLearner()
x_learner.fit(X_train, t_train, y_train)
# Predict uplift on test
tau_T = t_learner.predict_uplift(X_test)
tau_X = x_learner.predict_uplift(X_test)
uplift = pd.DataFrame(
{
"u_true": u_test,
"tau_T": tau_T,
"tau_X": tau_X,
}
)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Scatter plot
axes[0].scatter(
uplift["u_true"],
uplift["tau_T"],
alpha=0.3,
s=12,
label="T-learner",
)
axes[0].scatter(
uplift["u_true"],
uplift["tau_X"],
alpha=0.3,
s=12,
label="X-learner",
)
line = np.linspace(-0.6, 0.8, 100)
axes[0].plot(line, line, ls="--", c="gray", lw=1.5, zorder=-2)
axes[0].set_xlabel("True uplift (pointwise)")
axes[0].set_ylabel("Predicted uplift (pointwise)")
axes[0].set_title("Pointwise testing")
axes[0].legend(loc="best")
axes[0].set_xlim([-0.4, 0.8])
axes[0].set_ylim([-0.4, 0.8])
# Plot with shared bins (same groups across models)
cal = binned_validation(
y_true=uplift["u_true"].values,
scores={
"T-learner": uplift["tau_T"].values,
"X-learner": uplift["tau_X"].values,
},
n_bins=40,
bin_var="true"
)
for name, g in cal.groupby("model"):
axes[1].plot(g["mean_true"], g["mean_pred"], "o-", label=name)
axes[1].plot(line, line, ls="--", c="gray", lw=1.5, zorder=-2)
axes[1].set_ylabel("Predicted uplift (bin mean)")
axes[1].set_xlabel("True uplift (bin mean)")
axes[1].set_title("Binned testing")
axes[1].legend(loc="best")
axes[1].set_xlim([-0.1, 0.5])
axes[1].set_ylim([-0.1, 0.5])
plt.tight_layout()
plt.show()
# Compute metrics
rmse_T = root_mean_squared_error(uplift["u_true"], uplift["tau_T"])
rmse_X = root_mean_squared_error(uplift["u_true"], uplift["tau_X"])
mae_T = mean_absolute_error(uplift["u_true"], uplift["tau_T"])
mae_X = mean_absolute_error(uplift["u_true"], uplift["tau_X"])
print("RMSE:", f"\nT-learner {rmse_T:.4f}, X-learner: {rmse_X:.4f}")
print("MAE:", f"\nT-learner {mae_T:.4f}, X-learner: {mae_X:.4f}")
RMSE:
T-learner 0.0797, X-learner: 0.0671
MAE:
T-learner 0.0539, X-learner: 0.0473
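The final check is ranking-based: sort users by predicted uplift and ask what share of the total true uplift we'd capture by targeting only the top fraction, summarized by the area under the uplift curve (AUUC). The uplift_curve util computes this below; as a rough, illustrative sketch of the convention suggested by its outputs and the plot's axes (the real implementation may differ, e.g. in tie handling or normalization):

# Illustrative sketch only -- NOT the actual uplift_curve util from
# uplift_modelling_utils. Assumes: rank users by model score (descending),
# accumulate their true uplift, and normalise by the total uplift.
def uplift_curve_sketch(u_true: np.ndarray, score: np.ndarray):
    order = np.argsort(score)[::-1]                     # best candidates first
    frac = np.arange(1, len(u_true) + 1) / len(u_true)  # fraction of users targeted
    cum = np.cumsum(u_true[order]) / u_true.sum()       # share of total uplift captured
    auuc = np.trapezoid(cum, frac)                      # area under the curve
    return frac, cum, auuc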
# Curves for each model
f_T, y_T, auuc_T = uplift_curve(u_test, tau_T)
f_X, y_X, auuc_X = uplift_curve(u_test, tau_X)
# Oracle (best possible) and random baselines
f_oracle, y_oracle, auuc_oracle = uplift_curve(u_test, u_test)
y_rand = np.linspace(0, 1, 200)
auuc_rand = np.trapezoid(y_rand, y_rand)
plt.figure(figsize=(9, 6))
plt.plot(f_T, y_T, label=f"T-learner (AUUC={auuc_T:.3f})", color="C0")
plt.plot(f_X, y_X, label=f"X-learner (AUUC={auuc_X:.3f})", color="C1")
plt.plot(f_oracle, y_oracle, label=f"Oracle (AUUC={auuc_oracle:.3f})", color="k")
plt.plot(y_rand, y_rand, "--", color="gray", label=f"Random (AUUC={auuc_rand:.3f})")
# Add a star marker at x=0.4, and y where the X learner curve is at x=0.4
x_star = 0.4
y_star = np.interp(x_star, f_X, y_X)
plt.plot(
x_star,
y_star,
marker="*",
markersize=25,
markerfacecolor="k",
markeredgecolor="k",
markeredgewidth=2,
alpha=0.8,
)
plt.xlabel("Targeted fraction of users")
plt.ylabel("Cumulative share of total uplift captured")
plt.title("Uplift curves (test set) and AUUC")
plt.legend(loc="lower right")
plt.ylim([min(0, y_T.min(), y_X.min(), y_oracle.min()), 1.05])
plt.xlim([0, 1])
plt.grid(True, alpha=0.3)
plt.show()
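To make the star marker concrete, here's how the curve translates into a targeting policy. The 40% cutoff below is just the illustrative value marked above, not an optimized threshold, and the ground-truth u_test used to score it is only available because the data are simulated.

# Hypothetical policy: send the coupon to the top 40% of test users as ranked
# by the X-learner's predicted uplift, then use the simulator's ground truth
# to check the share of total uplift captured (assumes total uplift > 0).
k = int(0.4 * len(tau_X))
targeted = np.argsort(tau_X)[::-1][:k]  # indices of the highest-scored users
share = u_test[targeted].sum() / u_test.sum()
print(f"Targeting 40% of users captures {share:.1%} of the total true uplift")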
Conclusions¶
On this simulated observational dataset, both metalearners recover useful per-user uplift estimates, with the X-learner beating the T-learner on pointwise error here (RMSE 0.0671 vs 0.0797, MAE 0.0473 vs 0.0539). The uplift curves show why this matters for the business: ranking users by predicted uplift lets us capture a disproportionate share of the total uplift while sending vouchers to only a fraction of users. The caveat is that all validation above leaned on the true uplift from the simulator, which is never available in practice; proper causal validation is the topic teased earlier 😈