Weibull MLE vs Median-Rank Regression: Which to Use, and When Each Fails

For small, complete datasets, median-rank regression is often the safer estimator — maximum likelihood overestimates the Weibull shape parameter when samples are small. But once your data are meaningfully censored, with units still running when you pull the records, maximum likelihood is the better choice, and its advantage grows as censoring rises. The decision is driven by censoring, not by which method sounds more advanced. And before either: understand the failure mechanism you're modelling, because neither estimator can rescue a Weibull fit applied to data that aren't Weibull.

That last point is where most analyses go wrong, so it's worth stating plainly up front.

Weibull is the default for the wrong reason

In practice, teams reach for Weibull because it's easy to justify. It maps cleanly onto reliability standards, an auditor expects to see it, and a two-parameter fit is defensible in a report. Those are good reasons to present a Weibull fit. They are not good reasons to believe one. The better discipline is to let the real failure data tell you whether a two-parameter Weibull actually describes the population before you lean on its shape and scale. Plot the empirical distribution first; if it's bimodal or stepped, no amount of estimator sophistication will fix the fact that you're fitting one curve to two mechanisms.
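One way to look at the empirical distribution when some units are still running is a Kaplan-Meier estimate. The sketch below is a minimal, standard-library version, not a production implementation. The function name `kaplan_meier` is illustrative, and the data are the pump-fleet records from the worked example later in this article.

```python
def kaplan_meier(failures, suspensions):
    """Return [(time, survival)] at each failure time (Kaplan-Meier estimate)."""
    # Tag each record: True = failure, False = suspension (unit still running).
    records = [(t, True) for t in failures] + [(t, False) for t in suspensions]
    # Sort by time; at tied times, handle failures before suspensions.
    records.sort(key=lambda r: (r[0], not r[1]))

    at_risk = len(records)
    survival = 1.0
    steps = []
    for t, failed in records:
        if failed:
            # One failure among the units still at risk just before time t.
            survival *= (at_risk - 1) / at_risk
            steps.append((t, survival))
        at_risk -= 1  # failed or suspended, the unit leaves the risk set
    return steps


failures = [168, 252, 295, 340, 433, 511]
suspensions = [120, 150, 205, 240, 300, 318, 360, 392, 410, 455, 470, 500, 525, 560]

steps = kaplan_meier(failures, suspensions)
for t, s in steps:
    print(f"{t:>4} d   S(t) = {s:.3f}")
```

Plotting these steps against time, or against the fitted Weibull survival curve, is a quick check for plateaus or a second mode before trusting any parameter estimate.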

With that caveat in place, assume you've confirmed Weibull is a reasonable model. The question becomes how to estimate its parameters, and there are two standard routes.

The two methods

Median-rank regression (MRR). Rank the failure times, adjust the rank order for any suspended units using Johnson's method, convert each adjusted rank to a median rank with Bernard's approximation, and fit a straight line through the points on a Weibull probability plot. It's intuitive, it's visual, and it's the basis of the classic probability-paper approach generations of engineers learned. Its weakness is structural: it weights points by plotting position rather than by information content, and suspensions enter only through a rank adjustment rather than on their own terms.
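The steps above can be sketched in a few lines of standard-library Python. This is one plausible reading of the procedure, not a reference implementation: the function names are illustrative, the line is fitted by regression on X (ln t regressed on the transformed median rank), and the data are the pump-fleet records from the worked example later in this article.

```python
import math


def johnson_adjusted_ranks(failures, suspensions):
    """Return [(failure_time, adjusted_rank)] using Johnson's rank adjustment."""
    records = [(t, True) for t in failures] + [(t, False) for t in suspensions]
    records.sort(key=lambda r: (r[0], not r[1]))
    n = len(records)
    rank = 0.0
    ranked = []
    for i, (t, failed) in enumerate(records):
        if failed:
            remaining = n - i  # units not yet passed, including this one
            rank += (n + 1 - rank) / (1 + remaining)
            ranked.append((t, rank))
    return ranked


def fit_mrr(failures, suspensions):
    """Median-rank regression (regression on X). Returns (beta, eta)."""
    n = len(failures) + len(suspensions)
    xs, ys = [], []
    for t, rank in johnson_adjusted_ranks(failures, suspensions):
        f = (rank - 0.3) / (n + 0.4)             # Bernard's approximation
        xs.append(math.log(t))                   # x = ln t
        ys.append(math.log(-math.log(1.0 - f)))  # y = ln(-ln(1 - F))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Weibull line: x = ln(eta) + y / beta, so regress x on y.
    slope = sxy / syy
    intercept = x_bar - slope * y_bar
    return 1.0 / slope, math.exp(intercept)


failures = [168, 252, 295, 340, 433, 511]
suspensions = [120, 150, 205, 240, 300, 318, 360, 392, 410, 455, 470, 500, 525, 560]

beta, eta = fit_mrr(failures, suspensions)
print(f"MRR: beta = {beta:.2f}, eta = {eta:.0f} d")
```

On this dataset the sketch should land close to the MRR figures quoted in the worked example (β near 2.55, η near 594 d). Regressing y on x instead gives a slightly different slope, so it is worth knowing which direction a given tool uses.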

Maximum likelihood estimation (MLE). Choose the shape and scale that maximise the probability of exactly the data you observed, with every suspended unit entering the likelihood directly through the survival term, exp(−(t/η)^β). Because each running unit contributes its actual information rather than a rank correction, MLE dominates when censoring is heavy. Its cost is a known upward bias in the shape parameter at small sample sizes.
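As a rough illustration of how suspensions enter the likelihood, here is a minimal standard-library sketch. It assumes right-censored data only, and it solves the usual profile-likelihood equation for β by bisection rather than calling an optimiser; that is a choice made here for self-containment, not necessarily what any particular tool does. The function name `fit_mle` and the bracketing interval are illustrative. The data are the pump-fleet records from the worked example later in this article.

```python
import math


def fit_mle(failures, suspensions, lo=0.01, hi=100.0, iters=80):
    """Two-parameter Weibull MLE with right-censoring. Returns (beta, eta)."""
    times = list(failures) + list(suspensions)
    r = len(failures)
    t_max = max(times)
    mean_log_fail = sum(math.log(t) for t in failures) / r

    def score(beta):
        # Setting d(logL)/d(eta) = 0 gives eta^beta = sum(t^beta) / r over ALL
        # units; substituting into d(logL)/d(beta) = 0 leaves this equation.
        # Times are scaled by t_max only to keep the powers well-behaved.
        w = [(t / t_max) ** beta for t in times]
        weighted_log = sum(wi * math.log(t) for wi, t in zip(w, times))
        return weighted_log / sum(w) - 1.0 / beta - mean_log_fail

    # score() is increasing in beta, so bisect for its root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    # Suspensions appear in this sum but not in r: that is their contribution.
    eta = t_max * (sum((t / t_max) ** beta for t in times) / r) ** (1.0 / beta)
    return beta, eta


failures = [168, 252, 295, 340, 433, 511]
suspensions = [120, 150, 205, 240, 300, 318, 360, 392, 410, 455, 470, 500, 525, 560]

beta, eta = fit_mle(failures, suspensions)
print(f"MLE: beta = {beta:.2f}, eta = {eta:.0f} d")
```

On this dataset the sketch should come out close to the MLE figures in the worked example (β near 2.88, η near 593 d). Note that every running unit adds to the sum of t^β while the divisor stays at the failure count, which is how accumulated survival time pushes η upward.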

How censoring decides it

Here is the comparison that actually matters. The table below comes from a simulated fleet with a known true Weibull (β = 2.3, η = 540 days, 20 units), fit by both methods across 4,000 repetitions at each censoring level. Estimator quality is measured as root-mean-square error against the known truth; lower is better.

RMSE against known truth — 20-unit fleet, 4,000 simulation runs per censoring level
Censoring	β RMSE (MLE)	β RMSE (MRR)	η RMSE (MLE)	η RMSE (MRR)
~10%	0.54	0.64	57 d	61 d
~40%	0.76	0.96	83 d	121 d
~70%	1.57	15.7	416 d	1351 d

At light censoring the two methods are practically interchangeable: pick whichever your toolchain defaults to and it won't change a decision. As censoring climbs, MLE pulls ahead on both parameters. Under heavy censoring, median-rank regression becomes unstable: with failures sparse relative to suspensions, the occasional rank-regression fit goes badly wrong. At ~70% censoring its β RMSE is ten times MLE's (15.7 against 1.57) and its η RMSE more than three times (1351 d against 416 d).

This matters because of where field reliability work actually lives. The censoring fraction is typically large — pull ESP, pump, or rod run-life at any data freeze and most of the fleet is still turning. You are almost always in the bottom row of that table, the regime where MLE earns its keep and a probability-plot fit will quietly mislead you.

The honest counterpoint: small, complete samples

Blanket "always use MLE" advice is wrong, and here's the case it gets wrong. With no censoring and few units, MLE overestimates the shape parameter — which inflates apparent wear-out and can push you toward replacing parts earlier than the data justify. Mean estimated β across 8,000 repetitions, no censoring, true β = 2.3:

Small-sample bias — no censoring, 8,000 simulation runs per n
n	mean β (MLE)	mean β (MRR)
8	2.80	2.45
15	2.54	2.35
30	2.43	2.32

At eight units MLE overstates β by more than 20%. Median-rank regression is markedly less biased here, which is exactly the position the New Weibull Handbook takes in favouring rank regression for small samples. For a small, clean test set where every unit failed, MRR is the defensible choice. The bias in MLE is also partly correctable with a reduced-bias adjustment factor, so the fix is a refinement rather than a reason to switch methods — but if you're not applying one, know that your small-sample β is running hot.
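The small-sample bias is easy to see for yourself. The sketch below is a reduced version of that kind of experiment, not the simulation behind the table: it uses 1,000 runs per sample size instead of 8,000, an arbitrary seed, and only the MLE side. The function names are illustrative, so expect the means to differ a little from the table.

```python
import math
import random


def mle_shape_complete(sample, iters=50):
    """MLE of the Weibull shape for a complete (uncensored) sample, by bisection."""
    logs = [math.log(t) for t in sample]
    mean_log = sum(logs) / len(logs)
    t_max = max(sample)
    lo, hi = 0.01, 200.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = [(t / t_max) ** mid for t in sample]
        score = sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / mid - mean_log
        if score > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mean_mle_shape(n, true_beta=2.3, true_eta=540.0, runs=1000, seed=1):
    """Average fitted shape over `runs` complete samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        sample = [rng.weibullvariate(true_eta, true_beta) for _ in range(n)]
        total += mle_shape_complete(sample)
    return total / runs


results = {n: mean_mle_shape(n) for n in (8, 15, 30)}
for n, mean_beta in results.items():
    print(f"n = {n:>2}: mean MLE beta = {mean_beta:.2f} (true 2.30)")
```

The averages should sit above the true 2.3 and shrink toward it as n grows, which is the pattern in the MLE column of the table.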

Worked example you can reproduce

A 20-unit pump fleet at a data freeze: 6 failed, 14 still running — 70% censored.

Failures (days): 168, 252, 295, 340, 433, 511
Suspensions (days): 120, 150, 205, 240, 300, 318, 360, 392, 410, 455, 470, 500, 525, 560

Fitted both ways:

MLE: β = 2.88, η = 593 d, B10 = 271 d
MRR: β = 2.55, η = 594 d, B10 = 246 d

Both methods agree on the characteristic life η, at about 593 days. The disagreement is in the shape parameter, and therefore in B10, the age by which 10% of units are expected to have failed: 271 days by MLE versus 246 by MRR. That's roughly a 10% difference in the number you'd actually build a replacement schedule around.
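The B10 figures follow directly from the fitted parameters by inverting the Weibull CDF, F(t) = 1 − exp(−(t/η)^β), at F = 0.10. A minimal sketch, with `b_life` as an illustrative helper name and the rounded β and η from the fits above as inputs:

```python
import math


def b_life(beta, eta, fraction=0.10):
    """Age by which `fraction` of the population is expected to have failed."""
    return eta * (-math.log(1.0 - fraction)) ** (1.0 / beta)


print(f"MLE B10: {b_life(2.88, 593):.0f} d")  # about 271 d
print(f"MRR B10: {b_life(2.55, 594):.0f} d")  # about 246 d
```

Because −ln(0.9) is well below 1, raising it to the power 1/β gives a larger number when β is larger, so the higher MLE shape pushes B10 later even though η is essentially the same.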

One honest note: in this single dataset both β estimates sit above the true 2.3, because heavy censoring biases the shape upward regardless of method, and here MLE happens to land further out than MRR. A single sample can fall either way; the method-level verdict comes from the repetitions in the table above, not from one fit. That's the difference between anecdote and evidence, and it's worth holding onto.

Paste those two columns into Advanced Failure Intelligence and you'll get the same fit, the Kaplan-Meier survival curve, and the forward failure forecast in your browser.

The practitioner's decision rule

If a junior engineer asks me where to start, the answer isn't a method. It's this: understand the physics first. Know the failure mechanism before you trust a shape parameter. A β above 2 is only "wear-out" if there's a wear-out mechanism to point at; otherwise it's a number that happened to fit. Once you've earned the right to model, the rule is short:

Censored field data, units still running: MLE, ideally with a reduced-bias adjustment.
Small, complete test set, all failed, n under ~20: median-rank regression — or report both and show the spread.
Heavy censoring, few failures among many suspensions: MLE; treat a standalone MRR fit with suspicion.
Large samples: stop worrying — the two converge.

And recognise where two parameters stop being enough. Life rarely depends on age alone. Once you suspect operating conditions are driving failure (and in field data you usually should), single-variable Weibull gives way to covariate methods like proportional hazards and multivariate analysis, and the extra effort is almost always worth it.

FAQ

Is MLE or median-rank regression better for Weibull analysis?: It depends on censoring. For small, complete samples, median-rank regression is less biased. For censored field data, MLE is more accurate and the gap widens as censoring increases.
Why does MLE overestimate the Weibull shape parameter?: It's a small-sample bias inherent to the estimator; the fitted β runs high when there are few failures. A reduced-bias adjustment factor corrects most of it.
How does median-rank regression handle suspended units?: Through Johnson's adjusted-rank method, which shifts the ranks of the observed failures to account for suspensions. It's an approximation — suspensions don't contribute their own plotting positions — which is why the method degrades under heavy censoring.
What is rank regression on X (RRX)?: Regressing time (ln t) on the transformed median rank, rather than the reverse (RRY). It's the convention in most reliability software because failure time carries the measurement error the regression should absorb.

Run your own field data through the Advanced Failure Intelligence tool — paste raw run-life and suspension records, fit Weibull by MLE, and forecast fleet failures with Monte Carlo, in the browser.