Stopping Rules and Clinical Trials
I happened to receive two questions about stopping rules on the same day.

First, from Tom Cunningham: I've been arguing with my colleagues about whether the stopping rule is relevant (a presenter disclosed that he went out to collect more data because the first experiment didn't get significant results), and I believe you have some qualifications to the Bayesian irrelevance argument, but I don't properly understand them.

Then, from Benjamin Kay: I have a question that may be of interest for your blog. I was reading about the early history of AIDS and learned that the trial of AZT was ended early:

The trial, reported in the New England Journal of Medicine, had produced a dramatic result. Before the planned 24-week duration of the study, after a mean period of participation of about 120 days, nineteen participants receiving placebo had died while there was only a single death among those receiving AZT. This appeared to be a momentous breakthrough and accordingly there was no restraint at all in reporting the result; prominent researchers triumphantly proclaimed the drug to be “a ray of hope” and “a light at the end of the tunnel.”
Because of this dramatic effect, the placebo arm of the study was discontinued and all participants were offered 1,500 mg of AZT daily. It is my understanding that this is reasonably common when they do drug studies on humans. If the treatment is much, much better than the control, it is considered unethical to continue the planned study, and they end it early. I certainly understand the sentiment behind that. However, I know that it isn’t kosher to keep adding time or sample to an experiment until you find a result, and isn’t this a bit like that?
Shouldn’t we expect regression to the mean and all that? When two people come to me with a question, I get the impression it’s worth answering. So here goes. First, we discuss stopping rules in section 6.3 (the example on pages 147-148), section 8.5, and exercise 8.15 of BDA3. The short answer is that the stopping rule enters Bayesian data analysis in two places, inference and model checking: 1. For inference, the key is that the stopping rule is ignorable only if time is included in the model.
To put it another way, treatment effects (or whatever it is that you’re measuring) can vary over time, and that possibility should be allowed for in your model if you’re using a data-dependent stopping rule. To put it yet another way, if you use a data-dependent stopping rule and don’t allow for possible time trends in your outcome, then your analysis will not be robust to violations of that assumption. 2. For model checking, the key is that if you’re comparing observed data to hypothetical replications under the model (for example, using a p-value), these hypothetical replications depend on the design of your data collection. If you use a data-dependent stopping rule, this should be included in your data model; otherwise your p-value isn’t what it claims to be. Next, my response to Benjamin Kay’s question about AZT: For the Bayesian analysis, it is actually kosher “to keep adding time or sample to an experiment until you find a result.” As noted above, you do lose some robustness but, hey, there are tradeoffs in life, and robustness isn’t the only important thing out there. Beyond that, I do think there should be ways to monitor treatments that have already been approved, so that if problems show up, somebody becomes aware of them as soon as possible. I know that some people are bothered by the idea that you can keep adding time or sample to an experiment until you find a result.
But, really, it doesn’t bother me one bit. Let me illustrate with a simple example. Suppose you’re studying some treatment that has a tiny effect, say 0.01 on some scale in which an effect of 1.0 would be large. And suppose there’s a lot of variability, so if you do a preregistered study you’re unlikely to get anything approaching certainty.
But if you do a very careful study (so as to minimize variation) or a very large study (to get that magic 1/sqrt(n)), you’ll get a small enough confidence interval to have high certainty about the sign of the effect. So, by going from high sigma and low n to low sigma and high n, you’ve been “adding time or sample to an experiment” and you’ve “found a result.” See what I did there?
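To make the 1/sqrt(n) arithmetic concrete, here is a minimal sketch (my own numbers: the true effect of 0.01 comes from the example above, and sigma = 1 is an assumption about the variability):

# Rough check of the "magic 1/sqrt(n)" point: how big does n have to be
# before a 95% interval is narrow enough to pin down the sign of a 0.01 effect?
import math

effect = 0.01   # tiny true effect, on a scale where 1.0 is large
sigma = 1.0     # lots of person-to-person variability

# The half-width of a 95% CI for a mean is about 1.96 * sigma / sqrt(n).
# For the interval to exclude zero when centered near the true effect,
# we want the half-width comfortably below 0.01, say 0.005.
target_halfwidth = effect / 2
n = math.ceil((1.96 * sigma / target_halfwidth) ** 2)
print(n)  # roughly 154,000 observations

# A "very careful study" that halves sigma cuts the required n by a factor
# of 4, since n scales like (sigma / halfwidth)^2.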
OK, this particular plan (measure carefully and get a huge sample size) is chosen ahead of time; it doesn’t involve waiting until the confidence interval excludes zero. The point is that by manipulating my experimental conditions I can change the probability of getting a conclusive result.
That doesn’t bother me. In any case, when it comes to decision making, I wouldn’t use “Does the 95% interval exclude zero?” as a decision rule. That’s not Bayesian at all. It seems to me that problems with data-based stopping and Bayesian analysis (other than the two issues I noted above) arise only because people are mixing Bayesian inference with non-Bayesian decision making. Which is fair enough—people apply these sorts of mixed methods all the time—but in that case I prefer to see the problem as arising from the non-Bayesian decision rule, not from the stopping rule or the Bayesian inference. By my understanding, the main problem with outcome-based stopping for inference isn’t so much about the difference between the Bayesian and frequentist positions.
Rather, the problem stems from the combination of two factors: first, under their most common, binary/qualitative interpretation, many frequentist tests allow only one outcome, rejection of H0; second, alpha isn’t zero. Combine these two, and optional stopping all but guarantees a “result.” However, a stopping rule that isn’t biased towards one outcome does not suffer from this problem; for example, stopping when the width of a CI drops below some threshold.
Basically, any method that could lead to a stop based on rejecting H1 or on rejecting H0 would do. The Bayesian equivalent would be stopping when the posterior probability (or Bayes factor) crosses a threshold in either direction. Many thanks for the post. I don’t think I understand point 2: suppose we interpret p as “under H0 the probability of this event occurring within N observations is less than p”; then wouldn’t we calculate the same p-value however N was chosen (whether predetermined or by a stopping rule)? (And I came up with another qualification in a Bayesian world: we infer different things from “I expected an effect size of 4 and found an effect size of 4” vs. “I expected an effect size of 16 and found an effect size of 4.”
It is genuinely informative to know the experimenter’s expectations. And their choice of sample size tells us something about their priors. If they use a stopping rule then it can be potentially misleading in that direction, and we learn something when we find out the experimenter had to recruit 4 times as many subjects as he or she originally intended.) Suppose you’re at a basketball game. Think of the game as an experiment to determine which team is better at playing basketball. If the game is called unexpectedly because the electricity went out in the gym, it seems fair to take the score at the time of the power failure as good data.
But if the referee calls the game as soon as his favored team pulls into the lead, that hardly seems fair. How is the basketball story different from the scientist collecting data until he gets the result he wants? You say that you’re trying to determine which team is better, but then you say “once the favored team pulls into the lead”.
But if you only care about the quality of the teams, then you care about the difference in score given the time, and there’s not a discontinuity when someone is in the lead. And so if you’re deciding whether to let them play an extra minute, it’s just as likely that your favored team will get better (relative to their history) as it is that they will get worse. (Or rather, in expectation there will be no systematic movement.) Suppose Xi and Yi are sequences of iid Bernoulli random variables with probability of success pX and pY respectively, with pX = pY. Then the sum of the X’s will be less than the sum of the Y’s infinitely often with probability 1. So with the unfair-ref stopping rule, you can almost certainly conduct an experiment wrongly suggesting that pY > pX. Admittedly this is a frequentist argument, but I don’t see how a Bayesian perspective could salvage this, unless you account for the informative stopping rule in the likelihood.
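Here is a small simulation of that unfair-ref stopping rule (a sketch; the success probability of 0.5 and the cap on game length are arbitrary choices of mine, made only to keep the loop finite):

# Simulate the "unfair ref": X and Y are iid Bernoulli sequences with the
# *same* success probability, and the ref calls the game the first time
# team Y pulls ahead. The running difference sum(Y) - sum(X) is a recurrent
# random walk, so that moment almost surely arrives eventually; max_steps
# just keeps the simulation finite.
import random

def unfair_ref_game(seed, p=0.5, max_steps=100_000):
    rng = random.Random(seed)
    diff = 0  # sum(Y) - sum(X) so far
    for step in range(1, max_steps + 1):
        diff += int(rng.random() < p) - int(rng.random() < p)
        if diff > 0:             # Y is in the lead: the ref stops the game
            return step, True
    return max_steps, False      # hit the cap without Y ever leading

y_ahead = sum(unfair_ref_game(seed)[1] for seed in range(1000))
print(y_ahead / 1000)  # close to 1: Y ends "ahead" in almost every game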
John: There are some differences between the basketball story and the scientist story. In basketball you have a winner; in science you are doing inference.
For example, suppose this were a science example and you had 2 drugs and you kept sampling until drug A wins. This isn’t such a realistic rule, because if A is much worse, it’s quite possible you’ll never (in finite time) get to a point where A wins. Especially if you have a rule that N has to be greater than some minimum value such as 40. Also, if A wins and the difference is clearly noise (e.g., 8/20 successes for A and 7/20 successes for B), that won’t be taken as strong evidence. So to apply your story to science, you’d need to have a minimum sample size, a maximum sample size, and some rule that you only stop if A is statistically significantly better than B.
Even so, yes, you will sometimes see that happen, and a data-dependent stopping rule can increase the probability of stopping at that point—but, yes, this is a frequentist argument and indeed I don’t think it will hurt a full Bayesian analysis if there is no underlying time trend in the probabilities of success. As I said above, though, a data-dependent stopping rule could cause damage if someone is mixing Bayesian inference with non-Bayesian decision rules. And indeed people do this all the time, I’m sure (for example, performing a Bayesian analysis and then making a decision based on whether the 95% posterior interval excludes zero). So in that sense it could create a problem. To go back to the basketball example: in a Bayesian analysis, your posterior probability of which team is better is changing a bit with each score. But sports is about winning, not about inference: a team wins the game if they scored more points, not because there is an inference (Bayesian or otherwise) that they are the better team. Perhaps this last point will be clear if I return to the sample size analogy.
Suppose two players are competing in an individual sport, in this case taking shots from 30 feet out, and the ref gives 1 shot to player A and 1000 shots to player B. The prize goes to the player with the higher success rate.
It’s really hard to make the shot from 30 feet, so player A will almost certainly get 0 successes. But player B gets so many tries, he’ll probably have some success, maybe 10% or 5% or whatever.
The point is, player B will almost certainly win. So you get unfairness, but with no data-dependent stopping rule. The problem is with the decision rule. Having a decision rule that satisfies certain fairness properties is a hard problem. It’s true that by restricting the stopping rules in certain problems, you can get the fairness properties that seem so intuitive, but you lose something too (in the medical example, you might give a less preferred treatment to someone).
Is it worth it, this tradeoff? It depends on how much you care about the fairness property.
It’s hard for me to see the justification for it, really; I think it’s an Arrow’s-theorem-like situation where there are certain properties that intuitively seem desirable but, on second thought, aren’t worth the effort. John: Just to add a bit more about this “fairness” thing (maybe we need to do a joint blog post on it): It seems reasonable for a basketball game to have a symmetry principle, that any stopping or scoring rule has to be symmetric relative to the team labeling; for example, if you stop after team A is up by 20 points, then you have to stop after team B is up by 20 points. For a medical trial, though, I don’t see this, as I’d think it would be rare that an analysis is symmetric in any case. (For example, the existing treatment and the new treatment are typically not treated symmetrically.)
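A quick back-of-the-envelope check of the 30-foot-shot example above (a sketch; the 5% make probability is just one of the numbers floated in the example):

# Player A gets 1 attempt, player B gets 1000, and the prize goes to the
# higher success *rate*. With a 5% make probability for both players, A's
# rate is 0 with probability 0.95, while B's rate is positive essentially
# always, so B wins just about whenever A misses.
p = 0.05
p_a_misses = 1 - p                             # A shoots 0% with probability 0.95
p_b_makes_at_least_one = 1 - (1 - p) ** 1000   # essentially 1.0
print(p_a_misses * p_b_makes_at_least_one)     # about 0.95: B nearly always wins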
Discussion of stopping rules and inference usually begins with the frequentist position that the repeated sampling principle is more important than any other consideration, so I am very pleased to read this post. It is commonplace to view data-dependent stopping rules as problematic because of the increased risk of type I errors, but at the same time as the risk of false positives increases, the risk of false negatives declines. I’ve played around with simulations, and in almost all situations the false negative rate declines much faster with increasing sample size than the false positive rate increases.
Thus even within the inferentially depleted world of frequentists who use dichotomous outcomes there are inferential advantages to data-dependent stopping rules. Does anyone know why such stopping rules are nearly universally assumed to be deleterious?
I’m one of those people who fit Bayesian models and then make a decision based on whether the 95% HPD interval contains zero, or on whether theta > 0 or theta < 0 (using something like the ratio P(theta > 0)/(1 - P(theta > 0))). I have never made a different decision based on whether I used the Bayes factor or the HPD interval. In the kinds of studies I do, and for the amount of data I have per experiment, my decision never differs regardless of whether I use linear mixed models in R (lme4), or Stan or JAGS. Of course, the Bayesian approach allows me to flexibly fit models that I simply cannot fit in the frequentist setting (or don’t yet know how to), so that is a huge advantage. But the decision is the same. I don’t yet understand why the inference is non-Bayesian.
I just got done submitting a homework assignment for my statistics course where I did a Bayesian analysis to come to a decision on whether to give treatment A or B, based on the government’s willingness to pay for a unit increase in net benefit. The decision I made was pretty much based on the probability of a net benefit (i.e., P(theta > 0)). I understand Andrew’s general objection that there may be no theta to estimate out there in nature. But what’s non-Bayesian about such an inference? A frequentist analysis is not going to give me a posterior distribution to estimate such probabilities from; I can only do this because I fit a Bayesian model.
And using an HPD interval yields pretty similar decisions. I wouldn’t even know what the alternative criterion for a decision would be in this very practical setting. Of course, I don’t yet know whether I did the homework right! Hey, I didn’t say it would be easy, I just said that if you do Bayesian inference with non-Bayesian decision making, you can end up with challenges that would not arise in a pure Bayesian setting. I do think that formalizing costs and benefits can be a good idea—there’s no general way that I know to do this; I think that at our current stage of understanding it just needs to be done anew for each problem.
One advantage of formalizing costs and benefits is that it can make you think harder about what you’re really concerned about in your estimation problem. That said, I don’t usually do this sort of formal decision analysis in my own work. So, what has changed since you posted “for now let me just reiterate my current understanding that there is no such thing as a utility function” and “I’m down on the decision-theoretic concept of ‘utility’ because it doesn’t really exist”? Do you have a different view of utility functions in general now? Or is it that you don’t like working backward from, say, choice data to infer something about unobserved utility functions, whereas in full Bayesian decision making, you get to make your own utility function and then use it prescriptively? If it’s the latter, what’s the problem with inferring utility functions from data?
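To make the willingness-to-pay/net-benefit decision described a few comments up concrete, here is a minimal sketch (all numbers are hypothetical, and it assumes posterior draws for the incremental effect and cost are already available from whatever model was fit):

# Bayesian net-benefit decision: given posterior draws of the incremental
# effect and incremental cost of treatment B versus A, compute the
# incremental net benefit INB = lambda * delta_effect - delta_cost at the
# stated willingness to pay lambda, then summarize P(INB > 0) and E[INB].
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for posterior draws (in practice these come from Stan, JAGS, etc.)
delta_effect = rng.normal(0.30, 0.15, size=4000)   # e.g., extra units of benefit with B
delta_cost   = rng.normal(2000, 800,  size=4000)   # extra cost of B

willingness_to_pay = 20_000   # what the payer will spend per unit increase in benefit
inb = willingness_to_pay * delta_effect - delta_cost

print("P(net benefit > 0):", (inb > 0).mean())
print("expected net benefit:", inb.mean())
# A fully Bayesian decision rule picks B if the expected net benefit is
# positive (maximizing expected utility); P(INB > 0) shows how sure we are
# about the sign.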
“It would be pretty foolish to just sit there with your prechosen N, if you think you can learn something useful by increasing your sample.” I assume you are talking about the case where one is doing a Bayesian analysis. In a frequentist setting, that would be foolish, no? I just want to have it out there so that I don’t start getting people telling me Andrew Gelman says it’s OK to keep running an experiment till you hit significance ;) What I started doing very recently in such a situation (where more data would help) is to re-run the experiment and use the previous data as a prior. I hope that’s not too crazy.
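For what it’s worth, with a conjugate model “using the previous data as a prior” works out the same as analyzing the pooled data; here is a minimal beta-binomial sketch with made-up counts:

# Re-running an experiment and using the first experiment's posterior as
# the prior for the second. With a conjugate beta-binomial model this is
# identical to analyzing the pooled data in one go.
from scipy import stats

a0, b0 = 1, 1                    # flat Beta(1, 1) prior before experiment 1
y1, n1 = 12, 40                  # hypothetical successes / trials in experiment 1
a1, b1 = a0 + y1, b0 + n1 - y1   # posterior after experiment 1 = prior for experiment 2

y2, n2 = 20, 50                  # hypothetical results of the follow-up experiment
a2, b2 = a1 + y2, b1 + n2 - y2   # posterior after both experiments

sequential = stats.beta(a2, b2)
pooled = stats.beta(a0 + y1 + y2, b0 + (n1 - y1) + (n2 - y2))
print(sequential.mean(), pooled.mean())   # identical: sequential updating = pooling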
It’s rather more complicated than that, actually. What you have to do is define a test procedure T(·, ·), a function taking two arguments: a null hypothesis and a Type I error rate. A good test procedure needs to be consistent with Egon Pearson’s principle: “We then divide this set of possible results by a system of ordered boundaries such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.” In the case of optional stopping, the set of possible results is any (observed difference, N) combination at which the experiment stops, so the “system of ordered boundaries” needs to be set up for all possible N. Once you’ve defined such a test procedure T, you can observe the experimental result and then back out the Type I error rate that puts the result on the boundary of the rejection region: that’s your p-value. Or you could always just chuck the observed difference, pretend you only observed N, and base your test and p-value on that. That’s pretty much what frequentists who wanted actual results had to do back in the Stone Age (that is, back when computations were chiselled on stone tablets, or written in notebooks in pen, or whatever it is people did back then).
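One crude but concrete way to “back out” such a p-value is to simulate the whole design, stopping rule included, under the null hypothesis (a sketch; the look schedule, stopping threshold, and observed z below are illustrative choices of mine, not anything from the comment above):

# Monte Carlo p-value for an optional-stopping design: simulate the whole
# experiment under H0, including the data-dependent stopping rule, and ask
# how often the simulated result is at least as extreme as the observed one.
import numpy as np

rng = np.random.default_rng(2)

def run_design(rng, looks=(20, 40, 80, 160), z_stop=1.96):
    """Add data in batches; stop early if |z| exceeds z_stop at any look."""
    x = np.array([])
    for n in looks:
        x = np.append(x, rng.normal(0.0, 1.0, n - len(x)))   # H0: mean 0, sd 1
        z = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
        if abs(z) > z_stop:
            return z
    return z

observed_z = 2.10   # pretend this is what the (optionally stopped) study reported
sims = np.array([run_design(rng) for _ in range(5000)])
print("design-based p-value:", np.mean(np.abs(sims) >= abs(observed_z)))
# This comes out noticeably above the naive fixed-n p-value for z = 2.10
# (about 0.036), because the repeated looks give the null many chances to
# produce a large |z| at the moment the rule says to stop.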
The linked AZT story is interesting: it sounds like the affirming-the-consequent fallacy I’ve noticed rampant throughout medical research (if the drug works, people who get it will survive longer; people who got the drug survived longer; therefore the drug works). I’ve seen comments by Kary Mullis (who invented PCR) regarding the early days of HIV testing, saying the method at the time was not capable of detecting virus at the levels claimed. It’s not my area of expertise so I will stop there, but I would not doubt that fields could continue along the wrong path for decades under the current environment of mass confusion over how to interpret evidence along with publication bias. Also, this paper contains a nice discussion of stopping rules: “It is an interesting sub-paradox that the seemingly hard-headed and objective p-value approach leads to something as subjective as the conclusion that the meaning of the data depends not only on the data but also on the number of times the investigator looked at them before he stopped, while the seemingly fuzzy subjective formulation leads to the hard-headed conclusion ‘data are data’.” Cornfield, Jerome (1976).
“Recent Methodological Contributions to Clinical Trials.” American Journal of Epidemiology 104(4): 408-421. “To put it yet another way, if you use a data-dependent stopping rule and don’t allow for possible time trends in your outcome, then your analysis will not be robust to violations of that assumption.” The data and your statistical model can’t tell you whether there are time trends, or, if there are, why they are occurring. You need to go outside statistics, to thinking and hypothesizing. “But if you do a very careful study (so as to minimize variation) or a very large study (to get that magic 1/sqrt(n)), you’ll get a small enough confidence interval to have high certainty about the sign of the effect. So, by going from high sigma and low n to low sigma and high n, you’ve been ‘adding time or sample to an experiment’ and you’ve ‘found a result.’” Doing a “very careful study (so as to minimize variation)” again involves thinking about the problem in a qualitative way and introducing controls based on theory. The “very careful” part is theory-driven, not data-driven.
This is a VERY different and more effective way to reduce uncertainty than increasing sample size, which I don’t believe will generally reduce uncertainty to acceptable levels in dirty data. It is the epidemiological approach vs the experimental approach. Stopping rules tell you where to stop in the former case; that’s not good enough.
Well-conducted clinical investigation, in concert with basic laboratory research, is the cornerstone for progress in medicine. Without it, we are forced to depend solely on experience and bias in the choice of proper treatment, which can sometimes be misleading. Our well-intended predecessors from the time of Hippocrates would envy our analytic approach to clinical investigation, releasing them from the limitations of leeches, bad humors, and evil spirits in the management of patients with life-threatening disease. Given the importance of the clinical trial process, our greatest challenge as clinical investigators is to ensure that trials yield interpretable results while preserving the safety of study participants. For phase III randomized trials, these issues are typically under the purview of an independent Data and Safety Monitoring Board (DSMB), composed of statisticians and clinical investigators not directly involved with the study. The DSMB is responsible for reviewing the data, performing interim analyses when the study reaches its specified number of events, and deciding whether or not to close the study on the basis of predetermined early stopping rules that relate to toxicity or outcome.
If excess harm is observed, or if a statistically significant benefit is observed, the study is stopped early and the patient is informed of the results. If treatment is ongoing, the patient is typically offered the opportunity to receive the regimen that is perceived to be superior. Thus, the early stopping rule has the potential to minimize harm and to maximize benefit for those patients enrolled in a randomized trial. Despite the importance of early stopping rules, the results of two recent studies suggest that there are instances in which such rules might compromise proper interpretation of clinical trial results and thereby jeopardize the very patients that they are designed to protect. Southwest Oncology Group (SWOG) study 9701 (Gynecologic Oncology Group 178) was a phase III randomized trial designed to determine whether a prolonged maintenance phase of single-agent paclitaxel could improve outcome in patients with advanced epithelial ovarian cancer in first clinical remission. In this intergroup study, patients were randomly assigned to receive an experimental arm of single-agent maintenance paclitaxel administered monthly for 12 months, or a control arm of the same drug and dose for 3 months.
The primary objective of SWOG 9701 was to determine whether 12 cycles of paclitaxel resulted in superior progression-free survival (PFS) and overall survival (OS). Anticipated total accrual was 450 patients over 5 years, with the first interim analysis to be performed by the DSMB after 50% accrual was reached.
The interim analysis was “conducted to guard against extreme findings, either excessive toxicity or a substantial improvement in efficacy.” At the time of the first interim analysis, a median PFS advantage of 7 months was observed in favor of the 12-month paclitaxel arm (P = .0023, one-tailed test), resulting in early study termination. At the time of study closure, only 17 deaths had occurred, and there was no evidence of an OS benefit for the 12-month paclitaxel arm. Recommendation 1. Early stopping rules for oncology trials evaluating a chronic treatment intervention should be based on two major criteria: A) development of prohibitive toxicity, and B) improvement in either OS or quality of life (QOL).
Improvement in PFS or disease-free survival (DFS) should not generally be used as an early stopping criterion, unless there are convincing data in the same setting to show that prolongation of PFS is a reliable surrogate for improved OS or QOL. As previously discussed, although an OS advantage is almost always associated with a PFS advantage, the reverse is not always the case. However, it may be reasonable to consider an improvement in PFS as an early stopping criterion for those instances in which standard treatment is so unsatisfactory that any effect could be interpreted as being clinically meaningful, and no effective salvage therapy at relapse is known to exist. Recommendation 3. The consent form should explicitly state that participants and clinical investigators alike will be informed of those results of interim analyses that have direct bearing on early stopping, especially as they relate to a change of therapy. Specifically, this means that the DSMB would be obliged to report the data if either of the two major criteria listed above were satisfied. If they are not satisfied, and thus there is no need for early stopping, it is acceptable for the DSMB to simply report that “none of the predetermined early stopping criteria have been met.” If these criteria are adopted, clinical investigators should be strongly discouraged from reporting the results of PFS (or DFS) in those instances where the study is ongoing and where the early stopping rules have not been met.
This particularly applies to abstract presentations at meetings. This approach will minimize the likelihood that study participants and their physicians will be tempted to switch therapies based upon potentially dangerous and misleading assumptions. Recommendation 4.
The DSMB should invite participation from members of the lay public, such as patient representatives from advocacy groups. We should not underestimate the ability of well-informed patient advocates to understand these concepts and to be our allies in clinical research. The DSMB has the important task of implementing predefined early stopping rules, but it is not the purview of the DSMB to ensure that these early stopping rules are well-suited to the goals of the study. That is the responsibility of clinical investigators and statisticians involved in trial design—to protect patients against undue toxicity, to offer patients superior treatment once benefit is proven, and to ensure that the study will yield interpretable data for future generations of patients. Early stopping rules that do not capture each of these important elements may serve to undermine the clinical trial effort.

References
1. Markman M, Liu PY, Wilczynski S, et al: Phase III randomized trial of 12 versus 3 months of maintenance paclitaxel in patients with advanced ovarian cancer after complete response to platinum and paclitaxel-based chemotherapy: A Southwest Oncology Group and Gynecologic Oncology Group trial. J Clin Oncol 21:2460-2465, 2003.
2. Ozols RF: Maintenance therapy in advanced ovarian cancer: Progression-free survival and clinical benefit. J Clin Oncol 21:2451-2453, 2003.
3. Thigpen T: Maybe more is better. J Clin Oncol 21:2454-2456, 2003.
4. Goss PE, Ingle JN, Martino S, et al: A randomized trial of letrozole in postmenopausal women after five years of tamoxifen therapy for early-stage breast cancer. N Engl J Med 349:1793-1802, 2003.
5. Muss HB, Case LD, Richards F 2nd, et al: Interrupted versus continuous chemotherapy in patients with metastatic breast cancer. The Piedmont Oncology Association. N Engl J Med 325:1342-1348, 1991.
6. Sledge GW, Neuberg D, Bernardo P, et al: Phase III trial of doxorubicin, paclitaxel, and the combination of doxorubicin and paclitaxel as front-line chemotherapy for metastatic breast cancer: An intergroup trial (E1193). J Clin Oncol 21:588-592, 2003.
7. Muggia FM, Braly PS, Brady MF, et al: Phase III randomized study of cisplatin versus paclitaxel versus cisplatin and paclitaxel in patients with suboptimal stage III or IV ovarian cancer: A Gynecologic Oncology Group study. J Clin Oncol 18:106-115, 2000.
8. Omura GA, Brady MF, Look KY, et al: Phase III trial of paclitaxel at two dose levels, the higher dose accompanied by filgrastim at two dose levels in platinum-pretreated epithelial ovarian cancer: An intergroup study. J Clin Oncol 21:2843-2848, 2003.