Probability of success of a clinical trial: general methodology

Author: Kaspar Rufibach

Affiliation: Methodology, Collaboration, and Outreach Group, PD DSS, Roche Basel

Setup

# Load rpact
library(rpact)

What is probability of success?

At Roche, probability of technical success (PTS), sometimes also called probability of success (PoS), is defined as the probability of progressing from one decision point to the next. For details we refer to the pharma PTS handbook, Section 3.1.

PTS, DDCP, assurance, and Bayesian predictive power

Depending on the scenario, PTS can be based on various approaches, one of which is internally called data-driven conditional probability (DDCP) at Roche. DDCP attempts to quantify the probability of meeting a predefined success criterion in Phase 3, given the entire knowledge we have about the effect size prior to Phase 3. The success criterion is typically either the minimal or the target product profile and therefore usually refers to the primary efficacy endpoint. This prior knowledge can come from

early phase trials of the same molecule,
evidence about efficacy from related molecules (internal or external).

Technically, DDCP is nothing other than what has been introduced in O’Hagan et al. (2001) and O’Hagan et al. (2005) as assurance. Note that Bayesian predictive power (BPP) as introduced in Spiegelhalter et al. (1986) is not completely identical to assurance, but assurance is typically a very good approximation to it; see the discussion below.

Throughout successR we will use the term DDCP to mean “assurance”. How that relates to PTS is described above. The purpose of successR is to provide the tools to compute DDCP in various scenarios, i.e. for different priors, endpoints, and trial types.

Note that while the bpp package abbreviates Bayesian predictive power it actually computes assurance.

Target product profile (TPP), minimal detectable difference (MDD), and trial success

For an introduction see pharma PTS handbook, Section 2.2.

According to the TPP guidance (version of September 2021), the TPP is the key strategic document that defines core elements of a therapeutic product in an intended indication, population, and geography at launch. When planning a clinical trial we should make sure that the sample size and MDDs are aligned with the TPP, such that a statistically significant result at the final analysis is also clinically relevant. This can be achieved by making sure that the MDD of our pivotal trial matches the minimal TPP.

A typical scenario looks as follows: assume our minimal TPP specifies a hazard ratio of 0.82 and our target TPP a hazard ratio of 0.73. The question then is: what should a pivotal trial look like to make sure we match this TPP?

Assume we plan a 1:1 randomized trial for a time-to-event endpoint, so the effect measure we are interested in is the hazard ratio. Additionally, we plan for an interim analysis for efficacy after 2 / 3 of information using an O’Brien-Fleming boundary. The parameters of this pivotal trial are:

# proportion of patients randomized to arm A (0.5 corresponds to 1:1 randomization)
pA <- 0.5

# overall two-sided significance level
alpha <- 0.05

# type II error rate, i.e. power = 1 - beta = 80%
beta <- 0.2

# hazard ratio to detect with 80% power
hr <- 0.75

# information fractions at the interim and the final analysis
info <- c(2 / 3, 1)

With these specifications we can now compute the basic quantities of the design using rpact:

design <- getDesignGroupSequential(sided = 2, alpha = alpha, beta = beta,
                                   informationRates = info, typeOfDesign = "asOF")
sampleSize1 <- getSampleSizeSurvival(design, allocationRatioPlanned = pA / (1 - pA),
                                     hazardRatio = hr, maxNumberOfSubjects = 1200)

From this object we can extract the minimal detectable differences at the interim and final analysis:

t(sampleSize1$criticalValuesEffectScaleLower)

         [,1]      [,2]
[1,] 0.730814 0.8159891

t(sampleSize1$criticalValuesPValueScale)

           [,1]       [,2]
[1,] 0.01209678 0.04627413

So at the interim, to reject

\[ H_0 : \text{hazard ratio} = 1 \]

and declare efficacy, the observed two-sided \(p\)-value needs to be \(\le 0.012\) or, on the effect scale, the observed hazard ratio needs to be \(\le 0.73\).

At the final analysis, the bounds are \(0.046\) and \(0.82\), respectively.

We call the two hazard ratio boundaries the minimal detectable differences (MDDs). With an interim analysis after about 2/3 of the information and O’Brien-Fleming-type \(\alpha\)-spending, the MDD at the interim (0.73) is often close to the effect the trial is powered for (0.75 in our case). This may be one of the reasons why it feels as if we stop many trials at interim analyses. These considerations generally apply to trials with one efficacy interim after \(\approx 2/3\) of the information. With more efficacy interims and/or other information fractions, the picture may be different.
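The link between boundary p-values, number of events, and MDD can be sketched in base R. This is only an approximation under stated assumptions, not what rpact does internally: the number of events is taken from Schoenfeld's fixed-design formula for 1:1 allocation (ignoring the small inflation caused by the interim analysis), and the variance of the log hazard ratio estimate is approximated by 4 divided by the number of events.

```r
# Two-sided boundary p-values as reported by rpact above
p_bound <- c(interim = 0.01209678, final = 0.04627413)
info    <- c(2 / 3, 1)

# ASSUMPTION: events from Schoenfeld's fixed-design formula (1:1 allocation),
# ignoring the inflation due to the group-sequential design
alpha <- 0.05
beta  <- 0.2
hr    <- 0.75
d_max <- 4 * (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / log(hr)^2

# Critical values on the z-scale and events available at each analysis
z_crit <- qnorm(1 - p_bound / 2)
events <- info * d_max

# ASSUMPTION: Var(log hazard ratio estimate) is approximately 4 / events
mdd <- exp(-z_crit * 2 / sqrt(events))
round(mdd, 3)
```

Under these approximations the two values should land close to, but not exactly on, the rpact boundaries of 0.73 and 0.82.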

With the above setup, specifically with the choice of number of events (or rather the hazard ratio we power at that implies the number of events) and the interim analysis timepoint, we have indeed aligned our TPP thresholds with the MDDs:

Target TPP and MDD at interim analysis: \(0.73\),
Minimal TPP and MDD at final analysis: \(0.82\).

So our design above is matched to the TPP.

Assurance and Bayesian predictive power for a normally distributed endpoint

What you need to know

Let us assume our pivotal trial delivers at the final analysis a normally distributed estimate of the treatment effect \(\widehat \delta_\text{fin}\), with a mean \(\delta\), representing the true underlying effect, and a variance \(\sigma^2_\text{fin}\) which is determined by the true underlying variability in the patient population and the trial design (randomization ratio, sample size etc.):

\[ \widehat \delta_\text{fin} \ \sim \ N(\delta, \sigma^2_\text{fin}). \]

Our trial is called a success if \(\widehat \delta_\text{fin} \le \delta_\text{suc}\), where \(\delta_\text{suc}\) is a fixed pre-specified number, e.g. the minimal TPP threshold at the final analysis. We use “\(\le\)” for scenarios in which lower values correspond to a better effect, e.g. for the (log) hazard ratio. In an ideal world where we knew the true effect size \(\delta\), we could compute the probability of success simply as the power to reject the null hypothesis \(H_0 : \delta = \delta_0\) at the threshold \(\delta_\text{suc}\):

\[\begin{align*} \pi(\delta, \delta_\text{suc}) \ &= \ P(\widehat \delta_\text{fin} \le \delta_\text{suc}) \\ \ &= \ \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr). \end{align*}\]

Now, of course the true underlying effect \(\delta\) is unknown. When computing power we deal with this issue by simply assuming a value for \(\delta\), i.e. we consider a conditional quantity. However, that has the drawback that you bet on just one value. Alternatively, one could assume a distribution over the true effect size \(\delta\), with \(\Delta\) now being a random variable with density \(q(\delta)\). Then, the power function \(\pi(\Delta, \delta_\text{suc})\) is by itself a random variable of which we can compute the expected value:

\[ \text{assurance}(\delta_\text{suc}) \ = \ E(\pi(\Delta, \delta_\text{suc})) \ = \ \int_{-\infty}^\infty \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) q(\delta) d \delta . \]

This quantity is what we call data-driven conditional probability (DDCP) at Roche.
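As a minimal sketch, the integral can be evaluated numerically in base R for any prior density. The function names `power_fun` and `assurance_num` and all numbers are illustrative assumptions, not taken from successR: we work on the log hazard ratio scale, assume about 380 events with 1:1 allocation, a success threshold at a hazard ratio of 0.82, and a normal prior centred at a hazard ratio of 0.75.

```r
# Power at a fixed true effect: P(estimate <= delta_suc | delta)
power_fun <- function(delta, delta_suc, sigma_fin) {
  pnorm((delta_suc - delta) / sigma_fin)
}

# Assurance: power averaged over a prior density q on the true effect.
# lower / upper are finite limits that should cover essentially all prior mass.
assurance_num <- function(q, delta_suc, sigma_fin, lower, upper) {
  integrand <- function(delta) power_fun(delta, delta_suc, sigma_fin) * q(delta)
  integrate(integrand, lower = lower, upper = upper)$value
}

# ILLUSTRATIVE ASSUMPTIONS (not from the text)
delta_suc  <- log(0.82)        # success threshold on the log hazard ratio scale
sigma_fin  <- 2 / sqrt(380)    # approximate standard error with 380 events
prior_mean <- log(0.75)
prior_sd   <- 0.2
q <- function(delta) dnorm(delta, mean = prior_mean, sd = prior_sd)

power_at_prior_mean <- power_fun(prior_mean, delta_suc, sigma_fin)
ddcp <- assurance_num(q, delta_suc, sigma_fin,
                      lower = prior_mean - 8 * prior_sd,
                      upper = prior_mean + 8 * prior_sd)
c(power = power_at_prior_mean, ddcp = ddcp)
```

With these inputs the DDCP is lower than the power evaluated at the prior mean, because the prior also puts weight on less favourable effects.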

Explicit formula if prior \(q\) is normal

For a normal prior \(q\) one can evaluate the above integral explicitly to give

\[ \text{assurance}(\delta_\text{suc}) \ = \ \int_{-\infty}^\infty \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) N(\delta_0, \sigma^2_0) d \delta \ = \ \Phi\Bigl(\frac{\delta_\text{suc} - \delta_0}{\sqrt{\sigma_\text{fin}^2 + \sigma_0^2}}\Bigr). \]
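One way to see why this formula holds: if the true effect is drawn from \(N(\delta_0, \sigma_0^2)\) and the estimate is then drawn around it with variance \(\sigma^2_\text{fin}\), the estimate is marginally \(N(\delta_0, \sigma_\text{fin}^2 + \sigma_0^2)\). The sketch below checks the closed form against a simple Monte Carlo simulation. The function name `assurance_normal` and all numbers are illustrative assumptions.

```r
# Closed-form assurance for a normal prior N(delta_0, sigma_0^2)
assurance_normal <- function(delta_suc, sigma_fin, delta_0, sigma_0) {
  pnorm((delta_suc - delta_0) / sqrt(sigma_fin^2 + sigma_0^2))
}

# ILLUSTRATIVE ASSUMPTIONS (not from the text), log hazard ratio scale
delta_suc <- log(0.82)
sigma_fin <- 2 / sqrt(380)
delta_0   <- log(0.75)
sigma_0   <- 0.2

closed_form <- assurance_normal(delta_suc, sigma_fin, delta_0, sigma_0)

# Monte Carlo check: draw a true effect from the prior, then a trial estimate
set.seed(1)
n_sim     <- 1e5
delta     <- rnorm(n_sim, mean = delta_0, sd = sigma_0)
delta_hat <- rnorm(n_sim, mean = delta, sd = sigma_fin)
simulated <- mean(delta_hat <= delta_suc)

c(closed_form = closed_form, simulated = simulated)
```

The two numbers should agree up to Monte Carlo error.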

Some further background for those interested

When \(\delta_\text{suc}\) is chosen to correspond to the critical value \(z_{1-\alpha/2}\) of the underlying hypothesis test, so that we compute the probability that the trial is significant at the final analysis, O’Hagan et al. (2001) and O’Hagan et al. (2005) call this quantity assurance, while Kunzmann et al. (2021) propose the term marginal probability of rejecting the null hypothesis. It is interesting to note that this is NOT the “probability of success” initially defined in Spiegelhalter et al. (1986). The difference is that the latter authors do not only consider the probability of a significant result but additionally require the true effect to be relevant, i.e. to be below a minimal clinically important difference \(\delta_{\text{MCID}}\). This is what the literature calls “Bayesian predictive power”, BPP. Assuming of course that \(\delta_{\text{MCID}} \le \delta_\text{suc}\), Kunzmann et al. (2021) show that

\[\begin{align*} \text{assurance}(\delta_\text{suc}) \ &= \int_{-\infty}^{\delta_{\text{MCID}}} \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) q(\delta) d \delta + \int_{\delta_{\text{MCID}}}^{\delta_\text{suc}} \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) q(\delta) d \delta + \int_{\delta_\text{suc}}^\infty \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) q(\delta) d \delta \ \\ &= \text{BPP}(\delta_{\text{MCID}}) + P(\text{reject an irrelevant effect}) + P(\text{type I error}). \end{align*}\]

Let us discuss each summand in turn:

The first summand is the probability to be statistically significant and clinically relevant (i.e. the effect is \(\le \delta_{\text{MCID}}\)).
The second summand is the probability to observe a statistically significant effect when the true effect is not clinically relevant: \[ \int_{\delta_{\text{MCID}}}^{\delta_\text{suc}} \Phi \Bigl( \frac{\delta_\text{suc} - \delta}{\sigma_\text{fin}}\Bigr) q(\delta) d \delta = P(\text{reject an irrelevant effect}). \] In a well-designed trial, \(\delta_\text{suc}\) is chosen to be equal to \(\delta_{\text{MCID}}\), so that the above probability would be 0. The probability can become “large” during the trial in case the treatment landscape, and therefore \(\delta_{\text{MCID}}\), changes.
Finally, the third summand is the probability to observe a statistically significant effect although the true effect is no better than \(\delta_\text{suc}\), so it quantifies the probability of a type I error (in a somewhat loose sense).

So, by integrating over the entire parameter range for \(\delta\) in the definition of \(\text{assurance}\), we are also counting type I errors as “success”. While conceptually questionable, this may make sense in our drug development context, where we typically file every significant trial. In any case, as Spiegelhalter, Abrams, and Myles (2004) and also Kunzmann et al. (2021) conclude, \(\text{assurance}\) can be used as a good approximation to BPP in most practically relevant cases, because in a well-designed trial the latter two summands in the above equation are typically very small.
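The three summands can be computed separately by numerical integration, as in the following sketch. All numbers are illustrative assumptions, including a hypothetical \(\delta_{\text{MCID}}\) that deliberately differs from \(\delta_\text{suc}\) so that the second summand is not zero.

```r
# ILLUSTRATIVE ASSUMPTIONS (not from the text), log hazard ratio scale
delta_suc  <- log(0.82)
delta_mcid <- log(0.78)       # hypothetical MCID, deliberately below delta_suc
sigma_fin  <- 2 / sqrt(380)
delta_0    <- log(0.75)       # mean of the normal prior
sigma_0    <- 0.2             # standard deviation of the normal prior

# Power function times prior density
integrand <- function(delta) {
  pnorm((delta_suc - delta) / sigma_fin) * dnorm(delta, delta_0, sigma_0)
}
piece <- function(lower, upper) integrate(integrand, lower, upper)$value

# Finite limits that cover essentially all prior mass
lo <- delta_0 - 10 * sigma_0
hi <- delta_0 + 10 * sigma_0

parts <- c(
  bpp               = piece(lo, delta_mcid),
  reject_irrelevant = piece(delta_mcid, delta_suc),
  type_1_error      = piece(delta_suc, hi)
)
assurance <- pnorm((delta_suc - delta_0) / sqrt(sigma_fin^2 + sigma_0^2))

round(c(parts, sum = sum(parts), assurance = assurance), 4)
```

The sum of the three parts should reproduce the closed-form assurance up to numerical integration error.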

For the connection of DDCP to other quantities please see this tutorial.

Is DDCP Bayesian?

\(\text{DDCP}(\delta_\text{suc})\) is simply a weighted average of the power function with respect to a weight (or “belief”) function on the underlying true effect size. Bayes' theorem is not involved at all, and I would not call \(\text{DDCP}(\delta_\text{suc})\) Bayesian (as opposed to what Slide 55 in the pharma PTS handbook says). If you want, you can call the weight function a “prior”.

Is DDCP a conditional probability?

Unlike power, which conditions on an assumed effect, neither DDCP nor BPP is a conditional probability. How to interpret DDCP is described (also verbally) in Figure 2 of Kunzmann et al. (2021).

Compute DDCP for a given trial

Examples of DDCP computations will be added over time. So far the following tutorials are available:

Endpoint type	Design	Update after interim
Continuous	tutorial	tutorial
Longitudinal	not yet	not yet
Binary	tutorial	tutorial
Time-to-event	tutorial	tutorial

Update DDCP during a trial

Once a team has computed DDCP and a trial has started to recruit patients, two potential sources of evidence may require an update of DDCP.

Evidence external to the trial

External information on the treatment effect can become available, for example, when

a trial in the same program reads out,
a trial in a program on a related molecule reads out (internal or external to Roche).

This may change our prior belief about the treatment effect, and as a consequence the team should consider updating the prior distribution \(q(\delta)\) that went into the computation of the DDCP at the design stage. The options to do this are

meta-analyze the various sources of evidence,
update the initial prior with the new data that you observe to get a posterior and then use that posterior in the formula for DDCP.
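The second option can be sketched for the simplest case of a normal prior and an approximately normal external estimate, where the posterior is a precision-weighted average. The function names and all numbers are hypothetical, and the sketch assumes that the external trial estimates the same effect as the trial of interest.

```r
# Normal-normal update: combine a normal prior with a new, approximately
# normal effect estimate (precision-weighted average)
update_normal_prior <- function(prior_mean, prior_sd, est, se) {
  w_prior  <- 1 / prior_sd^2
  w_data   <- 1 / se^2
  post_var <- 1 / (w_prior + w_data)
  list(mean = post_var * (w_prior * prior_mean + w_data * est),
       sd   = sqrt(post_var))
}

# DDCP for a normal prior (closed form)
ddcp_normal <- function(delta_suc, sigma_fin, prior_mean, prior_sd) {
  pnorm((delta_suc - prior_mean) / sqrt(sigma_fin^2 + prior_sd^2))
}

# HYPOTHETICAL NUMBERS, log hazard ratio scale
delta_suc <- log(0.82)
sigma_fin <- 2 / sqrt(380)
prior     <- list(mean = log(0.75), sd = 0.2)

# Hypothetical external readout: hazard ratio 0.85, standard error 0.15
post <- update_normal_prior(prior$mean, prior$sd, est = log(0.85), se = 0.15)

ddcp_before <- ddcp_normal(delta_suc, sigma_fin, prior$mean, prior$sd)
ddcp_after  <- ddcp_normal(delta_suc, sigma_fin, post$mean, post$sd)
c(before = ddcp_before, after = ddcp_after)
```

In this hypothetical example the external result is less favourable than the prior mean, so the updated DDCP is lower than the initial one.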

Trial does not stop at an interim analysis for futility and/or efficacy

Assume that the trial of interest has a time-to-event endpoint and that an interim analysis for futility and/or efficacy is performed with the following boundaries for the hazard ratio:

futility: \(\theta_{fut}\) and
efficacy: \(\theta_{eff}\).

These boundaries can also be

\(\theta_{fut} = \infty\): we would only stop for futility if the hazard ratio \(\ge \infty\), so no stopping for futility possible or
\(\theta_{eff} = 0\): we would only stop for efficacy if the hazard ratio \(\le 0\), so no stopping for efficacy possible.

Now, if we do not stop at such an interim analysis in a blinded trial, we can (up to the freedom of the iDMC to deviate from instructions in the iDMC charter, which we ignore here) assume that the effect estimate observed at the interim must have been in the interval \([\theta_{eff}, \theta_{fut}]\). This knowledge can be used to update DDCP, by changing the power part in its formula to conditional power.

If we have a trial that is unblinded and we even learn the precise estimate \(\hat \delta_{int}\) after the interim analysis we can also update DDCP with that knowledge.
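The interval-based update for a trial that continues past a blinded interim can be illustrated by simulation. This is a simulation sketch under stated assumptions and not the analytic approach of Rufibach et al. (2016): the function name, the normal prior, the event numbers, the futility boundary of 1, and the approximation of the variance of the log hazard ratio estimate by 4 divided by the number of events are all assumptions made for illustration.

```r
# DDCP at the final analysis, given that the trial did not stop at the interim
ddcp_given_continuation <- function(theta_eff, theta_fut, mdd_fin,
                                    d_int, d_fin, prior_mean, prior_sd,
                                    n_sim = 2e5) {
  # True log hazard ratios drawn from the prior
  delta <- rnorm(n_sim, prior_mean, prior_sd)
  # ASSUMPTION: Var(log hazard ratio estimate) ~ 4 / events (1:1 allocation)
  est_int <- rnorm(n_sim, delta, 2 / sqrt(d_int))          # interim estimate
  est_inc <- rnorm(n_sim, delta, 2 / sqrt(d_fin - d_int))  # events after interim
  # Final estimate as event-weighted average of the two independent parts
  est_fin <- (d_int * est_int + (d_fin - d_int) * est_inc) / d_fin
  # Keep only trials that stopped neither for efficacy nor for futility
  continued <- est_int > log(theta_eff) & est_int < log(theta_fut)
  mean(est_fin[continued] <= log(mdd_fin))
}

# HYPOTHETICAL INPUTS: boundaries on the hazard ratio scale, event numbers,
# and a normal prior on the log hazard ratio
set.seed(2016)
ddcp_updated <- ddcp_given_continuation(
  theta_eff = 0.73, theta_fut = 1, mdd_fin = 0.82,
  d_int = 253, d_fin = 380, prior_mean = log(0.75), prior_sd = 0.2
)
ddcp_updated
```

Setting `theta_eff = 0` and `theta_fut = Inf` switches off both boundaries, in which case the result should agree with the closed-form DDCP for a normal prior up to Monte Carlo error.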

Examples for all these computations are provided in the tutorials on “Update after interim”, as linked in the table above. The methodology has been developed in Rufibach et al. (2016); see also the slide deck that Kaspar presented at the Bayes Pharma conference 2019 in Lyon for an easily accessible introduction.

References

Kunzmann, Kevin, Michael J. Grayling, Kim May Lee, David S. Robertson, Kaspar Rufibach, and James M. S. Wason. 2021. “A Review of Bayesian Perspectives on Sample Size Derivation for Confirmatory Trials.” The American Statistician 75 (4): 424–32. https://doi.org/10.1080/00031305.2021.1901782.

O’Hagan, Anthony, John W. Stevens, and Michael J. Campbell. 2005. “Assurance in Clinical Trial Design.” Pharm. Stat. 4 (3): 187–201.

O’Hagan, A., J. W. Stevens, and J. Montmartin. 2001. “Bayesian cost-effectiveness analysis from clinical trial data.” Stat Med 20 (5): 733–53.

Rufibach, K., P. Jordan, and M. Abt. 2016. “Sequentially updating the likelihood of success of a Phase 3 pivotal time-to-event trial based on interim analyses or external information.” J Biopharm Stat 26 (2): 191–201. http://dx.doi.org/10.1080/10543406.2014.972508.

Spiegelhalter, D. J., L. S. Reedman, and P. R. Blackburn. 1986. “Monitoring clinical trials - conditional power or predictive power.” Control Clin Trials 7 (1): 8–17.

Spiegelhalter, David J, Keith R Abrams, and Jonathan P Myles. 2004. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Vol. 13. John Wiley & Sons, Hoboken, New Jersey.