Why Models Lose Even When They Are Right

One of the most counterintuitive aspects of prediction is that correct models still lose frequently. Understanding why this happens, and why it does not indicate failure, is essential for anyone evaluating forecasting approaches. This page explains the relationship between process and outcome.

The Distinction Between Right and Lucky

In everyday life, we often conflate being right with having good outcomes. If a decision works out, we call it a good decision. If it fails, we call it a bad decision. This intuition is natural but deeply flawed when applied to probabilistic predictions.

A prediction can be correct in its assessment of probability and still lose. A prediction can be incorrect in its assessment and still win. The relationship between process and outcome is probabilistic, not deterministic.

A Simple Example

Imagine a model that correctly identifies Team A as having a 70% chance to win. That model is right. It accurately reflects the underlying probabilities. But Team B still wins 30% of the time. When Team B wins, was the model wrong?

No. The model was correct about the probability. The less likely outcome simply occurred, as it should roughly three times in every ten. The loss does not invalidate the assessment. It confirms that uncertainty exists.
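To see how this plays out over volume, here is a minimal simulation sketch (the 10,000-game sample is an illustrative assumption, not a real schedule): a forecast that is exactly right at 70% still ends up on the losing side of roughly three games in ten.

```python
import random

random.seed(42)

TRUE_WIN_PROB = 0.70   # the model's 70% assessment is exactly correct
N_GAMES = 10_000       # illustrative volume, far more than one season

# Count games where the 70% favorite loses anyway.
losses = sum(random.random() >= TRUE_WIN_PROB for _ in range(N_GAMES))

print(f"The correct model still lost {losses / N_GAMES:.1%} of its picks")
```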

Now imagine a different model that incorrectly assesses Team A at 90% when the true probability is 70%. If Team A wins, this model looks good but was actually wrong. Its calibration was off. It got lucky that the outcome matched its prediction despite the flawed probability estimate.

Expected Value Versus Results

Expected value is the average outcome you would expect over many repetitions of the same situation. It is calculated by weighting each possible outcome by its probability. Results are what actually happens in any single instance.

Why Expected Value Matters

Good decision-making focuses on expected value, not results. If a decision has positive expected value, it is a good decision regardless of whether it works out in any single case. Over many repetitions, positive expected value decisions produce positive outcomes on average. Negative expected value decisions produce negative outcomes on average.

This is why casinos are profitable despite losing individual bets regularly. Each bet has negative expected value for the player and positive expected value for the house. In the short run, players win sometimes. In the long run, the house always comes out ahead because the math is on its side.
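The arithmetic behind that edge is just the expected value definition from above: weight each payoff by its probability and add them up. The sketch below uses an even-money bet on American roulette (18 winning pockets out of 38) as the concrete casino case; those odds are standard roulette numbers, not figures taken from this page.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over all possible outcomes."""
    return sum(prob * payoff for prob, payoff in outcomes)

# Even-money bet on red in American roulette: 18 red pockets, 20 others.
roulette_red = [
    (18 / 38, +1.0),   # win: gain one unit
    (20 / 38, -1.0),   # lose: lose one unit
]

ev = expected_value(roulette_red)
print(f"Player's expected value per unit wagered: {ev:+.4f}")  # about -0.0526
```

The player wins 18 times out of 38, yet every unit wagered costs about five cents on average, which is exactly why the short run and the long run diverge.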

Short-Term Results Are Noisy

The problem is that short-term results are dominated by variance. In a small sample, luck overwhelms skill. A positive expected value process can easily produce negative results over 10, 20, or even 50 repetitions. Only over hundreds or thousands of instances does the underlying expected value reliably manifest in results.
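A quick simulation makes the scale of that noise concrete. Assume, purely for illustration, a process whose true long-run win rate is 55%, and count how often it still finishes at .500 or worse over stretches of various lengths.

```python
import numpy as np

rng = np.random.default_rng(7)

TRUE_SKILL = 0.55   # assumed long-run win rate of a genuinely good process
TRIALS = 100_000    # simulated stretches per sample size

for n in (10, 20, 50, 500):
    wins = rng.binomial(n, TRUE_SKILL, size=TRIALS)
    # How often does a true 55% process look like a loser over n games?
    at_or_below_500 = np.mean(wins <= n / 2)
    print(f"n={n:>3}: finishes at or below .500 in {at_or_below_500:.1%} of stretches")
```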

This creates a painful reality for anyone evaluating predictions: you cannot reliably judge quality from short-term outcomes. A week of losses does not mean the model is broken. A week of wins does not mean the model is excellent. Both could be noise.

The smaller the sample, the more results are dominated by variance rather than skill. Judging prediction quality from a handful of outcomes is essentially reading tea leaves. Meaningful evaluation requires patience and proper methodology.

Sample Size and Statistical Significance

Sample size is the number of observations you have. Statistical significance is a measure of how unlikely an observed pattern would be if random variation alone were at work. The harder it is for chance to explain a pattern, the more confident you can be that it reflects a real difference.

How Many Predictions Are Enough

For baseball predictions, most statisticians would say you need at least several hundred games to have reasonable confidence in accuracy assessments. Even then, distinguishing a 55% accurate model from a 53% accurate model would require thousands of games. The margins are thin and the variance is high.
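A back-of-the-envelope error calculation, sketched below, shows why the margins are so punishing: the uncertainty in an observed accuracy shrinks only with the square root of the number of games, so a two-point gap between a 55% and a 53% model stays inside the noise until the sample reaches the thousands. The sample sizes here are illustrative.

```python
import math

def accuracy_std_error(p, n):
    """Standard error of an observed win rate near p over n games."""
    return math.sqrt(p * (1 - p) / n)

for n in (100, 500, 2_430, 10_000):
    se = accuracy_std_error(0.55, n)
    # A 95% confidence interval spans roughly two standard errors each way.
    print(f"n={n:>6}: observed accuracy is roughly 55% +/- {2 * se:.1%}")
```

Even at 2,430 games, a full season, the interval is about plus or minus two percentage points, the same size as the gap being measured.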

This means that seasonal evaluation of prediction models is inherently limited. A full MLB season has about 2,430 games. If a model covers all of them, that provides a reasonable sample. If it covers only a subset, confidence shrinks, because the margin of error grows as the sample gets smaller.

The Perils of Premature Judgment

Most people evaluate prediction quality far too quickly. After a few days or weeks, they form strong opinions about whether a model works. These opinions are almost entirely based on noise. A model that went 15-5 in its first 20 predictions is not necessarily better than one that went 10-10. The difference could easily be luck.
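A simulation sketch makes the point. Give two hypothetical models identical true accuracy (55% here, an assumed figure) and grade each on 20 predictions; the code counts how often luck alone opens a gap of five or more wins, the same gap that separates 15-5 from 10-10.

```python
import numpy as np

rng = np.random.default_rng(11)

TRUE_SKILL = 0.55    # assumed identical true accuracy for both models
N_PREDICTIONS = 20
TRIALS = 100_000

wins_a = rng.binomial(N_PREDICTIONS, TRUE_SKILL, size=TRIALS)
wins_b = rng.binomial(N_PREDICTIONS, TRUE_SKILL, size=TRIALS)

# Fraction of 20-game stretches where equally skilled models
# end up five or more wins apart purely by chance.
gap_rate = np.mean(np.abs(wins_a - wins_b) >= 5)
print(f"Luck alone produces a 5+ win gap {gap_rate:.1%} of the time")
```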

The discipline required is to suspend judgment until sufficient data accumulates. This is psychologically difficult because we want to know now whether something works. But premature judgment leads to abandoning good processes during inevitable cold streaks and embracing bad processes during lucky hot streaks.

Process Versus Outcome Thinking

The alternative to outcome-based evaluation is process-based evaluation. Rather than asking whether predictions won or lost, ask whether the process that generated them was sound.

What Makes a Good Process

A good prediction process uses the right inputs. It focuses on factors that actually predict outcomes rather than those that merely describe the past. For baseball, this means emphasizing predictive metrics like FIP, wOBA, and strikeout-to-walk ratios rather than volatile results like ERA and batting average. For more on this, see What Metrics Matter Most in MLB Predictions.

A good process weighs factors appropriately. It gives more weight to starting pitching than to middle relief. It accounts for park factors and weather. It does not overweight recent results or underweight long-term skill indicators. For the full framework, see How MLB Games Are Predicted.

A good process is transparent about its methodology and limitations. It acknowledges uncertainty. It does not claim certainty where none exists. It produces calibrated probabilities that match actual outcomes over time.
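Calibration is directly checkable: group predictions by the probability they assigned and compare each group's stated probability with its actual win rate. The sketch below shows the bookkeeping on a handful of invented forecasts; the numbers are hypothetical, and a real check needs hundreds of games per band.

```python
from collections import defaultdict

# Hypothetical (forecast probability, outcome) pairs; 1 = the predicted team won.
forecasts = [
    (0.72, 1), (0.68, 1), (0.71, 0), (0.55, 1), (0.53, 0),
    (0.58, 0), (0.65, 1), (0.62, 1), (0.69, 0), (0.56, 1),
]

# Group forecasts into 10-point probability bands: 5 -> 50-59%, 6 -> 60-69%, ...
bands = defaultdict(list)
for prob, outcome in forecasts:
    bands[int(prob * 10)].append((prob, outcome))

for band in sorted(bands):
    pairs = bands[band]
    avg_forecast = sum(p for p, _ in pairs) / len(pairs)
    win_rate = sum(o for _, o in pairs) / len(pairs)
    print(f"{band * 10}-{band * 10 + 9}% band: forecast {avg_forecast:.0%}, "
          f"actual {win_rate:.0%} over {len(pairs)} games")
```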

Evaluating Process Quality

You can evaluate process quality without waiting for long-term results. Ask whether the methodology makes sense. Ask whether the inputs are appropriate. Ask whether the reasoning is logical. A flawed process can get lucky in the short term, but it will not hold up over time. A sound process may have cold streaks, but it will produce value eventually.

This does not mean ignoring results entirely. If a process produces obviously terrible results over a meaningful sample, something may be wrong. But the threshold for "meaningful" should be much higher than most people assume. Do not abandon ship after a few losses.

Why Transparency Matters

Transparency is essential for building trust in prediction systems. When the methodology is hidden, there is no way to evaluate process quality. You are reduced to judging by results, which is unreliable.

What Transparency Looks Like

A transparent prediction approach explains what inputs it uses. It describes how those inputs are weighted. It acknowledges limitations and uncertainties. It provides a framework for understanding why specific predictions were made, not just what they were.

This entire educational cluster is an exercise in transparency. By explaining how predictions are generated, what metrics matter, how pitchers and bullpens are evaluated, and why weather affects games, we provide the tools to evaluate the underlying logic rather than relying solely on outcomes.

The Alternative to Transparency

The alternative is black-box systems that promise results without explanation. These systems ask you to trust them based on short-term track records that could easily be noise. They provide no way to evaluate whether a cold streak represents bad luck or a fundamental flaw. They offer no insight into whether a hot streak is skill or fortune.

Transparency is harder because it invites scrutiny. But it is more honest and ultimately more valuable. A transparent process can be improved when flaws are identified. An opaque process just continues doing whatever it does, right or wrong.

The Mature Approach to Prediction

The mature approach to prediction accepts several uncomfortable truths.

First, losses are inevitable. Even excellent predictions lose regularly. This is not a bug. It is a feature of uncertainty.

Second, short-term results are unreliable indicators of quality. Patience is required. Meaningful evaluation takes time and data.

Third, process matters more than outcome. A sound methodology is more valuable than a lucky streak. Focus on getting the reasoning right, and let the results follow over time.

Fourth, transparency enables better evaluation. Systems that explain themselves invite productive criticism and improvement. Systems that hide behind results alone provide no path forward when things go wrong.

This is not a comfortable framework for those who want quick answers and guaranteed success. But it is an honest framework that reflects how prediction actually works. For more on the realistic limits of accuracy, see Can You Actually Predict Baseball Games Accurately.