Good Enough: Satisficing in Production Machine Learning
Herbert Simon observed that managers rarely chase the global optimum. Instead, they set an aspiration level, a “good enough” performance, and stop searching once they find an option that meets it. Simon called this satisficing.
That habit makes sense because every decision is a trade‑off. In modeling, the benefit side of the ledger is higher predictive accuracy (or some task‑specific utility). The cost side shows up on three fronts: search cost (extra tuning runs, longer training, larger data collection), inference cost (slower, more power‑hungry models), and interpretability cost (complex artifacts that are harder to trust and maintain). Continuing the search is sensible only while the expected marginal gain in accuracy outweighs these combined marginal costs.
Determining the aspiration level is, therefore, the crucial step. Decision theory offers three lenses:
- Linear expected value. Assume costs are linear and the future distribution is stable; stop when the next unit of search no longer raises expected utility.
- Concave expected utility. Risk aversion emerges naturally from diminishing marginal utility—large losses hurt disproportionately more than equivalent gains help; stop when the utility-weighted improvement falls below search cost.
- Distributional robustness (ε, δ). When the distribution is uncertain, add a safety margin: accept any model whose probability of trailing the empirical best by more than ε stays below δ.
Whichever lens you choose, list every rising cost—extra tuning runs, slower inference, tougher audits—convert them to the same units as the benefit, and halt at the point where marginal gain meets marginal cost.
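To make the first and third lenses concrete, here is a minimal sketch. The dollar figures (COST_PER_RUN, VALUE_PER_POINT), the trend window, and the (ε, δ) values are hypothetical placeholders you would replace with numbers from your own cost‑benefit analysis.

```python
import numpy as np

# Lens 1: linear expected value. The dollar figures are hypothetical.
COST_PER_RUN = 5.0       # dollars of compute per additional tuning run
VALUE_PER_POINT = 400.0  # dollars of benefit per +1 point of validation accuracy

def should_continue(accs, window=3):
    """Keep searching only while the expected marginal gain beats the marginal cost."""
    if len(accs) < window + 1:
        return True  # not enough history to estimate the trend yet
    gains = [b - a for a, b in zip(accs[-window - 1:-1], accs[-window:])]
    expected_gain = sum(gains) / window  # average recent improvement, in points
    return expected_gain * VALUE_PER_POINT > COST_PER_RUN

# Lens 3: (eps, delta) robustness via a paired bootstrap over per-example scores.
def robust_accept(cand_scores, best_scores, eps=0.02, delta=0.05, n_boot=2000):
    """Accept the candidate if P(it trails the empirical best by > eps) < delta."""
    cand, best = np.asarray(cand_scores), np.asarray(best_scores)
    idx = np.random.randint(0, len(cand), size=(n_boot, len(cand)))
    gap = best[idx].mean(axis=1) - cand[idx].mean(axis=1)  # resampled accuracy gaps
    return (gap > eps).mean() < delta

accs = [71.0, 74.0, 75.5, 76.1, 76.3]  # validation accuracy after each tuning run
print(should_continue(accs))           # True: recent gains still pay for another run
```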
Machine learning practice is full of stop rules, and most follow delta‑improvement logic: halt when marginal gains fall below some threshold. Early stopping is the canonical example: halt training once validation loss has failed to improve by more than ε for m consecutive epochs. Hyperparameter search, ensembling, and similar procedures typically run for a fixed number of iterations or until performance plateaus, rather than stopping at a target performance level.
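The ε/m patience rule is short enough to sketch generically; train_one_epoch and val_loss below are stand‑ins for your own training step and validation pass, and the constants are illustrative.

```python
EPSILON, PATIENCE = 1e-3, 5  # minimum meaningful improvement; allowed stale epochs

def fit_with_patience(train_one_epoch, val_loss, max_epochs=100):
    """Delta-improvement logic: stop on stagnation, not at a performance target."""
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = val_loss()
        if best - loss > EPSILON:  # meaningful improvement: reset the counter
            best, stale = loss, 0
        else:                      # marginal improvement at best
            stale += 1
            if stale >= PATIENCE:
                break              # stagnated for m = PATIENCE consecutive epochs
    return best
```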
Fewer techniques target absolute performance. Early‑exit Transformers (BERxiT, FastBERT, DeeBERT) train intermediate classifiers at internal layers and exit as soon as a confidence score crosses a pre‑set threshold; they stop when performance is “good enough,” not when improvement stagnates.
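The mechanic is easy to sketch independently of any particular model. The layers and exit_heads callables and the 0.9 threshold below are illustrative stand‑ins, not the actual BERxiT or FastBERT interface; the point is that the stop condition is an absolute confidence target.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_predict(x, layers, exit_heads, threshold=0.9):
    """layers: callables computing each block's hidden state for one example;
    exit_heads: callables mapping a hidden state to class logits."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:  # absolute confidence target met: exit now
            return probs, i           # prediction plus the layer it exited at
    return probs, len(layers) - 1     # no early exit; paid for the full stack
```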
The question is: when should we target absolute performance instead of marginal improvement? The answer depends on your cost structure. Absolute targets win when search costs dominate: when the search is repeated frequently, runs under tight resource constraints, or yields marginal gains worth little relative to the ongoing cost of finding them.
This happens in real‑time inference, where latency budgets are strict, and in online learning, where models are retrained constantly and you must decide how much compute to spend on each update. In contrast, one‑time expensive searches such as hyperparameter tuning or neural architecture search are typically worth the cost, since the benefits compound over the model's lifetime.
However, many inference procedures optimize to completion rather than exiting early: some beam search implementations run until exhaustion rather than stopping once a good‑enough sequence is found, and ensemble prediction usually averages every model rather than stopping once confidence is sufficient.
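For contrast, an early‑exit ensemble is simple to sketch. Assuming scikit‑learn‑style members exposing predict_proba and an illustrative 0.95 confidence threshold, it could look like this:

```python
import numpy as np

def ensemble_predict_early_exit(x, models, threshold=0.95):
    """Average members one at a time; stop querying the rest once the
    running mean's top-class probability crosses the threshold."""
    running = None
    for k, model in enumerate(models, start=1):
        p = model.predict_proba(x)[0]  # x: one sample, shape (1, n_features)
        # incremental mean of the first k members' probability vectors
        running = p if running is None else running + (p - running) / k
        if running.max() >= threshold:  # confident enough: skip the remaining members
            break
    return int(running.argmax()), k     # predicted class and members actually queried
```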
These missing early-exit options aren't limited to inference. Training algorithms show the same pattern. While training typically justifies paying search costs since it's a one-time investment, there are scenarios where you've already determined your target performance through business requirements or cost-benefit analysis and simply want to minimize search costs to reach it. This may not be the most common training scenario, but it's a reasonable one.
Consider decision trees: you could reach a target validation accuracy through two different routes. The conventional approach grows the full tree and then prunes it back. An alternative would be to stop growing the moment validation accuracy hits your target. Both paths might yield similar final performance, but early stopping saves the search cost of growing nodes that will just be pruned away later.
Yet decision tree algorithms like CART don't offer this second option. scikit‑learn's implementation provides `min_impurity_decrease` (stop when the marginal impurity reduction falls below a threshold) but no `min_val_accuracy` parameter (stop when validation performance hits a target). This reflects a broader pattern: algorithms target marginal improvement rather than absolute performance thresholds, even when the latter would be more economically rational.
Good‑Enough CART: Search Costs vs. Grow‑and‑Prune
Goal: validation accuracy ≥ 0.85.
```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np, time

# 20 000 samples, 20 features
X, y = make_classification(n_samples=20000, n_features=20,
                           n_informative=15, n_redundant=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=1)
TARGET = 0.85

# ---------------- Early‑stop loop ----------------
# Grow one level at a time and stop as soon as the validation
# target is hit (or a depth cap of 30 is reached).
depth = 1
start = time.time()
while True:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_val, clf.predict(X_val))
    if acc >= TARGET or depth >= 30:
        break
    depth += 1
stop_time = time.time() - start
print(f"Early‑stop: depth {depth}, acc {acc:.3f}, {stop_time:.2f}s")

# -------------- Grow full tree then prune --------------
start = time.time()
full = DecisionTreeClassifier(random_state=0)
full.fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)

# sample ≤25 alphas for speed, keeping the best validation score
best = 0
for alpha in np.linspace(path.ccp_alphas.min(), path.ccp_alphas.max(), 25):
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_tr, y_tr)
    score = accuracy_score(y_val, pruned.predict(X_val))
    best = max(best, score)
prune_time = time.time() - start
print(f"Grow+prune: full depth {full.get_depth()}, best val acc {best:.3f}, {prune_time:.2f}s")
```
Note: You can also implement a pruning routine that sweeps every alpha on the pruning path, or one that prunes while monitoring validation loss; those variants are even more expensive.
One representative run:
| Strategy | Tree depth | Wall‑time | Val. accuracy |
|---|---|---|---|
| Early‑stop | 8 | 1.7 s | 0.852 |
| Grow → prune | 23 | 14.8 s | 0.872 |
The early‑stopped tree meets the spec roughly eight times faster and at about one‑third the depth.