Marginalizing versus Profiling of Nuisance Parameters

Notes by Robert D. Cousins, Department of Physics and Astronomy, University of California, Los Angeles, Los Angeles, CA 90095, and Larry Wasserman, Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213 (2024)


Abstract

This is a writeup, with some elaboration, of the talks by the two authors (a physicist and a statistician) at the first PHYSTAT Informal Review on January 24, 2024. We discuss Bayesian and frequentist approaches to dealing with nuisance parameters, in particular, integrated versus profiled likelihood methods. In regular models, with finitely many parameters and large sample sizes, the two approaches are asymptotically equivalent. But, outside this setting, the two methods can lead to different tests and confidence intervals. Assessing which approach is better generally requires comparing the power of the tests or the length of the confidence intervals. This analysis has to be conducted on a case-by-case basis. In the extreme case where the number of nuisance parameters is very large, possibly infinite, neither approach may be useful. Part I provides an informal history of usage in high energy particle physics, including a simple illustrative example. Part II includes an overview of some more recently developed methods in the statistics literature, including methods applicable when the use of the likelihood function is problematic.

Foreword

This document collects reviews written by Bob Cousins, Distinguished Professor Emeritus in the Department of Physics and Astronomy at UCLA, and Larry Wasserman, Professor of Statistics and Data Science at Carnegie Mellon University, on inferential methods for handling nuisance parameters.

These notes expand upon the main points of discussion that arose during the first of a new event series called PHYSTAT–Informal Reviews, which aims to promote dialogue between physicists and statisticians on specific topics. The inaugural event, held virtually on January 24, 2024, focused on the topic of "Hybrid Frequentist and Bayesian approaches". Bob Cousins and Larry Wasserman presented excellent talks, and the meeting was further enriched by a valuable discussion involving the statisticians and physicists in the audience.

In Part I, Bob Cousins provides a comprehensive overview of how high-energy physicists have historically tackled the problem of inferring a parameter of interest in the presence of nuisance parameters. The note offers insights into why methods like profile likelihood are somewhat dominant in practice while others have not been widely adopted. Various statistical solutions are also compared through a classical example with relevance in both physics and astronomy. This memorandum also offers a wealth of key references concerning statistical practice in the physical sciences.

In Part II, Larry Wasserman examines the problem of conducting statistical inference with nuisance parameters from the perspectives of both classical and modern statistics. He notes that standard inferential solutions require a likelihood function, regularity conditions, and large samples. Nonetheless, recent statistical developments may be used to ensure correct coverage while relaxing these requirements. This review offers an accessible self-contained overview of some of these approaches and guides the reader through their strengths and limitations.

Together, the written accounts below provide a cohesive review of the tools currently adopted in the practice of statistical inference in searches for new physics, as well as their historical and technical justification. They also offer a point of reflection on the role that more sophisticated statistical methods and recent discoveries could play in assisting the physics community in tackling both classical and new statistical challenges.

The organizing committee of PHYSTAT activities is grateful to both authors for their engagement and valuable contributions, which we believe will benefit both the statistics and physics communities.

Sara Algeri,

School of Statistics, University of Minnesota, Minneapolis, MN

On behalf of the PHYSTAT Organizing Committee

Part I: Physicist’s View
by Robert D. Cousins

1 Introduction

This is a writeup (with some changes) of the 20-minute talk that I gave at a virtual PhyStat meeting that featured “mini-reviews” of the topic by a physicist (cousinsphystat2023) and a statistician (wassermanphystat2023). Mine was largely based on my talk the previous year at a BIRS workshop at Banff (cousinsbanff2023). This is a short personal perspective through the lens of sometimes-hazy recollections of developments during the last 40 years in high energy physics, and is not meant to be a definitive history. I also refer at times to my lecture notes from the 2018 Hadron Collider Physics Summer School at Fermilab (cousinshcpss2018), which provide a more detailed introduction to some of the fundamentals, such as the duality of frequentist hypothesis testing and confidence intervals (“inverting a test” to obtain intervals).

Parametric statistical inference requires a statistical model, $p(y;\theta)$, which gives the probability (density) for observing data $y$, given parameter(s) $\theta$. In a given context, $\theta$ is partitioned as $(\mu,\beta)$, where $\mu$ is the parameter of interest (taken to be a scalar in this note, e.g., the magnitude of a putative signal), and $\beta$ is typically a vector of “nuisance parameters” with values constrained by subsidiary measurements, usually referred to as “systematic uncertainties” in particle physics (e.g., detector calibration “constants”, magnitudes of background processes, etc.).

This note discusses the common case in which one desires an algorithm for obtaining “frequentist confidence intervals (CI)” for $\mu$ that are valid at a specified confidence level (CL), no matter what the unknown values of $\beta$ are. Ideally, CIs have exact “frequentist coverage” of the unknown true value of $\mu$: in imagined repeated applications of the algorithm for different data $y$ sampled from the model, in the long run a fraction CL of the confidence intervals will contain (cover) the unknown true $\mu$. In practice, this is often only approximately true, and some over-coverage (fraction higher than CL) is typically tolerated (with attendant loss of power), while material under-coverage is considered disqualifying.

The construction of confidence intervals requires that, for every possible true $\mu$, one chooses an ordering of the sample space of $y$. This is done by specifying a “test statistic”, a function of the observed data and parameters, whose values order the sample space. In the absence of nuisance parameters, standard choices have existed since the 1930s. Thus, a key question to be discussed is how to treat (“eliminate”) the nuisance parameters in the test statistic, so that the problem is reduced to the case without nuisance parameters.

A second question to be discussed is how to obtain the sampling distribution of the test statistic under the statistical model(s) being considered. There is a vast literature on approximate analytic methods that become more accurate as the sample size increases. For small sample sizes in particle physics, Monte Carlo simulation is typically used to obtain these distributions. The crucial question is then: which value(s) of the nuisance parameters should be used in the simulations, so that the inferred confidence intervals for the parameter of interest have valid coverage, no matter what the true values of the nuisance parameters are?

Section 2 defines the profile and marginalized likelihoods. Section 3 introduces the two corresponding ways to treat nuisance parameters in simulation, the parametric bootstrap and marginalization (sampling from the posterior distribution of the nuisance parameters). Section 4 is a longer section going into detail regarding five ways to calculate the significance $Z$ in the on/off problem (ratio of Poisson means). Section 5 briefly highlights the talk by Anthony Davison on this subject at the 2023 BIRS workshop at Banff. Section 6 contains some final remarks.

2 Choice of test statistic

The observed data $y$ are plugged into the statistical model $p(y;\theta)$ to obtain the likelihood function ${\cal L}(\mu,\beta)$. The global maximum is ${\cal L}(\widehat{\mu},\widehat{\beta})$. The “profile likelihood function” (of only $\mu$) is

$${\cal L}(\mu,\widehat{\widehat{\beta}}), \qquad (1)$$

where, for a given $\mu$, the restricted estimate $\widehat{\widehat{\beta}}$ maximizes ${\cal L}$ for that $\mu$. It is also written as

$$\sup_{\beta}{\cal L}(\mu,\beta). \qquad (2)$$

(The restricted estimate is denoted by $\widehat{\beta}_{\mu}$ in reidslac2003.) It is common in HEP to use as a test statistic the “profile likelihood ratio” (PLR), with the likelihood at the global maximum in the denominator,

$$\frac{{\cal L}(\mu,\widehat{\widehat{\beta}})}{{\cal L}(\widehat{\mu},\widehat{\beta})}. \qquad (3)$$

The “integrated” (or “marginalized”) likelihood function (also of only $\mu$) is

$$\int{\cal L}(\mu,\beta)\,\pi(\beta)\,d\beta, \qquad (4)$$

where $\pi(\beta)$ is a weight function in the spirit of a Bayesian prior pdf. (More generally, it could be $\pi(\beta|\mu)$.)
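To make the two constructions concrete, here is a minimal numerical sketch (in Python with NumPy/SciPy; the Gaussian model, the flat weight function, and all numbers are illustrative choices for this note, not taken from the text):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Toy model (illustrative): y_i ~ Normal(mu + beta, 1), with a
# subsidiary measurement b_meas ~ Normal(beta, sigma_b) constraining beta.
y = np.array([1.2, 0.8, 1.5])    # main measurements (made-up numbers)
b_meas, sigma_b = 0.3, 0.5       # subsidiary measurement of beta

def log_like(mu, beta):
    # full likelihood, including the subsidiary-measurement term
    return norm.logpdf(y, mu + beta, 1.0).sum() + norm.logpdf(b_meas, beta, sigma_b)

def profile_log_like(mu):
    # maximize over the nuisance parameter beta at fixed mu (Eq. 2)
    return -minimize_scalar(lambda beta: -log_like(mu, beta)).fun

def integrated_like(mu):
    # integrate against a weight function pi(beta) (Eq. 4); a flat
    # weight over a wide range is used here purely for illustration
    return quad(lambda beta: np.exp(log_like(mu, beta)), -10, 10)[0]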

One can then use either of these as if it were a 1D likelihood function ${\cal L}(\mu)$ with the nuisance parameters “eliminated”. The question remains: what are the frequentist properties of intervals constructed for $\mu$? The long tradition in particle physics is to use the profile likelihood function and the asymptotic theorem of wilks to obtain approximate 1D confidence intervals (as well as multi-dimensional confidence regions in cases beyond the scope of this note). For a brief review, see Section 40.3.2.1 and Table 40.2 of pdg2022. For decades, a commonly used software tool has been the classic program (originally in FORTRAN and later in C and C++), “MINUIT: A System for Function Minimization and Analysis of the Parameter Errors and Correlations” (minuit). MINUIT refers to the uncertainties from profile likelihood ratios as “MINOS errors”, which are discussed by minoscpc. I believe that the name “profile likelihood” entered the HEP mainstream in 2000, thanks to a talk by statistician Wolfgang Rolke on work with Angel López at the second meeting of what became the PhyStat series (rolkeclk; rolkelopez2001; Rolke2005). Interestingly, minoscpc and rolkeclk conceptualize the same math in two different ways, as discussed in my lectures (cousinshcpss2018); this may have delayed recognizing the connection (which was not immediate, even after Wolfgang’s talk).

At some point, a few people started integrating out nuisance parameter(s), typically when there was a Gaussian/normal contribution to the likelihood, while treating the parameter of interest in a frequentist manner. One of the earlier examples (citing yet earlier examples) was by myself and Virgil Highland, “Incorporating systematic uncertainties into an upper limit” (cousinshighland). We looked at the case of a physics quantity (cross section) $\sigma=\mu_{\rm b}/L$, where $\mu_{\rm b}$ is an unknown Poisson mean and $L$ is a factor called the integrated luminosity. One observes a sampled value $n$ from the Poisson distribution. From $n$, one can obtain a usual frequentist upper confidence limit on $\mu_{\rm b}$. If $L$ is known exactly, one then simply scales this by $1/L$ to get an upper confidence limit on $\sigma$. We considered the case in which, instead, one has an independent unbiased measurement of $L$ with some uncertainty, yielding a likelihood function ${\cal L}(L)$. Multiplying ${\cal L}(L)$ by a prior pdf for $L$ yields a Bayesian posterior pdf for $L$. We advocated using this as a weight function for what we now call integrating/marginalizing over the nuisance parameter $L$. A power series expansion gave useful approximations.

Our motivation was to ameliorate an effect whereby confidence intervals derived from a discrete observable become shorter when a continuous nuisance parameter is added to the model. In the initially submitted draft, we did not understand that we were effectively grafting a Bayesian pdf onto a frequentist Poisson upper limit. Fred James sorted us out before publication. A more enlightened discussion is in my paper with Jim Linnemann and Jordan Tucker (CLT), referred to as CLT, discussed below. Luc Demortier noted that, in these simple cases, what we did was the same math as Bayesian statistician George Box’s prior predictive $p$-value (box1980; CLT). Box calculated a tail probability after obtaining a Bayesian pdf, and we averaged a frequentist tail probability over a Bayesian pdf. The two calculations simply reverse the order of two integrals.

A key paper was submitted in the first year of LHC data taking, by Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells (CCGV), “Asymptotic formulae for likelihood-based tests of new physics” (ccgv2011). They provided useful asymptotic formulas for the PLR and their preferred variants, which have been widely used at the LHC and beyond. I believe that this had a significant side effect: since asymptotic distributions were not known in HEP for other test statistics, the default became the PLR variants in CCGV, even for small sample sizes, in order to have consistency between small and large sample sizes.

Meanwhile, in the statistics literature, there is a long history of criticism of the simple PLR. At PhyStat meetings, various higher-order approximations have been discussed by Nancy Reid at SLAC in reidslac2003; by Nancy (reidoxford2005; reidoxford2) and me (cousinsoxford2005) at Oxford in 2005; by Anthony Davison in the Banff Challenge of 2006 (davisonbanffchallenge), at PhyStat-nu 2019 (davison2019), and at Banff last year (davisonbanff2023); and by statistician Alessandra Brazzale (brazzalephystat2019) and physicist Igor Volobouev and statistician Alex Trindade at PhyStat-DM in volobouevphystat2019. Virtually none of these developments have been adopted for widespread use in HEP, as far as I know.

Bayesian statisticians at PhyStat meetings have long advocated marginalizing nuisance parameters, even if we stick to frequentist treatment of the parameter of interest; see, e.g., James Berger, Brunero Liseo, and Robert Wolpert, “Integrated Likelihood Methods for Eliminating Nuisance Parameters” (berger1999). Berger et al., in their rejoinder to a comment by Edward Susko, wrote, “Dr. Susko finishes by suggesting that it might be useful to compute both profile and integrated likelihood answers in application, as a type of sensitivity study. It is probably the case that, if the two answers agree, then one can feel relatively assured in the validity of the answer. It is less clear what to think in the case of disagreement, however. If the answers are relatively precise and quite different, we would simply suspect that it is a situation with a ‘bad’ profile likelihood.” I agree with Susko’s suggestion to compute (and report) both profile and integrated likelihood answers, to build up experience and to see if Berger et al. are right in real cases relevant to HEP.

However, in my experience, marginalizing nuisance parameters is relatively rare at the LHC. This is for various reasons, in my opinion. For example, it seems common that the minimization required for profiling is less CPU-intensive than the integration required for marginalization.

Thus, the profile likelihood ratio (including variants with parameters near boundaries, as in CCGV (ccgv2011)) is the most common test statistic in use today at the LHC, and perhaps in all of HEP. Among other uses, it is the basis of the ordering for confidence intervals advocated by Gary Feldman and myself (feldman1998), which turned out to be the no-nuisance-parameter special case of inversion of the “exact” PLR test in the treatise of “Kendall and Stuart” and later versions (kendall1999). I recall a long discussion between Gary Feldman and Kyle Cranmer at a pub at the Oxford PhyStat in 2005 regarding how to interpret Eqs. 22.4-22.7 of kendall1999. For part of that historical context, see Kyle’s PhyStat contributions in cranmerslac2003 (Sec. 9) and cranmeroxford2005 (Sec. 4.3), as well as the contribution by Giovanni Punzi in punzioxford2005.

Various studies have been done regarding the elimination of nuisance parameters in the test statistic (typically a likelihood ratio), many concluding that results are relatively insensitive to profiling vs. marginalization, so that the choice can be made based on CPU time. See, for example, John Conway’s talk and writeup at PhyStat in conwayphystat2011. Jan Conrad and Fredrik Tegenfeldt studied cases with favorable coverage (conradtegenfeldt2005; conradphystat2005). Notably, as Kyle Cranmer showed at Oxford in cranmeroxford2005, for Gaussian uncertainty on the mean $\mu_{\rm b}$, marginalizing the nuisance parameter can be badly anti-conservative (have severe undercoverage) when calculating signal significance (a different situation from the upper limits considered by myself and Highland (cousinshighland)). This was one of the motivations of the studies by CLT.

3 Sampling distributions of the test statistic(s) under various hypotheses

Rather than pursuing higher-order corrections to the PLR, the general practice in HEP has been to stick with the PLR test statistic, and to use Monte Carlo simulation (known as “toy MC”) to obtain the finite-sample-size distribution(s) of the PLR under the null hypothesis, and under alternative(s) as desired. These results are often compared to the relevant asymptotic formulas, typically from CCGV.

So now one has the question: how should nuisance parameters be treated in the simulations? We need to keep in mind that we want correct frequentist coverage of the parameter of interest when nature is sampling the data using the unknown true values of the nuisance parameters. An important constraint (at least until now) is that it is generally thought to be impractical to perform MC simulation for each of many different sets of values of the nuisance parameters (especially in high dimensions). So how should one judiciously choose those values to be used in simulation?

The usual procedure (partially based on a desire to be “fully frequentist”) is to “throw toys” using the profiled values of the nuisance parameters, i.e., their ML estimates conditional on whatever value of the parameter(s) of interest is being tested. At some point, this procedure was identified as the parametric bootstrap. (The first time that I recall was in an email from Luc Demortier in 2011, pointing to his contribution to the PhyStat-LHC 2007 Proceedings (luc2007phystat) and to the book by Bradley Efron and Robert Tibshirani (efrontibshirani).) The hope is that the profiled values of the nuisance parameters are a good enough proxy for the unknown true values, even for small sample size. I have since learned that there is a large literature on the parametric bootstrap and its convergence properties. See, for example, Luc’s talk (luc2012slac), which describes a number of variants, with references, as well as Anthony Davison’s talk at Banff (davisonbanff2023), emphasized in Section 5.
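As a concrete illustration of this toy-throwing, here is a self-contained sketch (Python/SciPy; the counting model, helper names, and numbers are my own illustrative choices, not from the text):

import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import norm, poisson

# Toy model (illustrative): n ~ Poisson(mu + beta), plus a subsidiary
# Gaussian measurement b_meas ~ Normal(beta, sigma_b) of the background.
sigma_b = 1.0
rng = np.random.default_rng(1)

def logL(n, b_meas, mu, beta):
    return poisson.logpmf(n, mu + beta) + norm.logpdf(b_meas, beta, sigma_b)

def profile_beta(n, b_meas, mu):
    # ML estimate of beta conditional on the tested mu
    return minimize_scalar(lambda b: -logL(n, b_meas, mu, b),
                           bounds=(1e-6, 100.0), method="bounded").x

def q_plr(n, b_meas, mu_test):
    # -2 log of the profile likelihood ratio (Eq. 3)
    free = minimize(lambda p: -logL(n, b_meas, p[0], p[1]),
                    x0=[1.0, max(b_meas, 1.0)],
                    bounds=[(0.0, None), (1e-6, None)])
    return -2.0 * (logL(n, b_meas, mu_test, profile_beta(n, b_meas, mu_test))
                   + free.fun)

n_obs, b_obs, mu0 = 12, 5.0, 0.0          # made-up observed data; test mu = 0
beta_prof = profile_beta(n_obs, b_obs, mu0)
q_obs = q_plr(n_obs, b_obs, mu0)

# Throw toys with the nuisance parameter fixed at its profiled value:
q_toys = [q_plr(rng.poisson(mu0 + beta_prof), rng.normal(beta_prof, sigma_b), mu0)
          for _ in range(2000)]
p_boot = np.mean(np.asarray(q_toys) >= q_obs)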

In an alternative procedure, sometimes called the Bayesian-frequentist hybrid, in each pseudo-experiment in the toy-throwing, a new set of values of the unknown nuisance parameters is sampled from their posterior distributions and used for sampling the data. The parameter of interest, however, continues to be treated in a frequentist fashion (confidence intervals, $p$-values, etc.). More details are in Section 4.4 below.

I would like to see more comparisons of these results with alternatives, particularly those from marginalizing the nuisance parameters with some judiciously chosen priors. I find it a bit comforting to take an average over a set of nuisance parameters in the neighborhood of the profiled values. Coverage studies can give guidance on the weight function $\pi(\beta)$ used in the marginalization, presumably considering default priors thought to lead to good coverage.

4 Example: ratio of Poisson means

I “cherry-picked” an example that appears often in the statistical literature; is an important prototype in HEP and astronomy; and is the topic of a paper of which I happen to be a co-author, including an interesting result for marginalization. This is the ratio of Poisson means, which maps onto the “on/off” problem in astronomy and the “signal region plus sideband” problem in HEP. I focus here mostly on the concepts, as the algebra is in the CLT paper (CLT), much of which was inspired by Jim Linnemann’s talk at the SLAC PhyStat in linnemannslac2003.

As in Fig. 1, we have a histogram with two bins; in astronomy, these are counts with the telescope “on” and “off” the putative source, with observed contents $n_{\rm on}$ and $n_{\rm off}$, respectively. In HEP, there is a bin in which one looks for a signal, and a “sideband” bin that is presumed to have background but no signal. So the “on” bin has (potentially) signal and background events with unknown Poisson means $\mu_{\rm s}$ and $\mu_{\rm b}$, respectively, with total Poisson mean $\mu_{\rm s}+\mu_{\rm b}$. The “off” bin has only background, with unknown Poisson mean $\mu_{\rm off}$. The ratio of the total Poisson means in the two bins is denoted by $\lambda$,

$$\lambda=\frac{\mu_{\rm off}}{\mu_{\rm s}+\mu_{\rm b}}. \qquad (5)$$

In the simple version discussed here, we assume that the ratio of the Poisson means of background in the two bins, $\tau=\mu_{\rm off}/\mu_{\rm b}$, is precisely known. In astronomy, this could be the ratio of observing times; in HEP, one might rely on simulation or control regions in the data. (If $\tau$ is determined with non-negligible uncertainty from two more control regions, as in the so-called ABCD method in HEP, more issues arise.) In a search for a signal, we test the null hypothesis $\mathrm{H}_0$: $\mu_{\rm s}=0$, which we can rephrase as $\mathrm{H}_0$: $\lambda=\tau$. One can choose the nuisance parameter to be either $\mu_{\rm off}$ or $\mu_{\rm b}$, and we choose the latter.
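In code, the model is just a product of two Poisson terms; a minimal sketch (Python/SciPy, with function naming of my own choosing) that the recipes below refer back to:

from scipy.stats import poisson

# The on/off model: joint Poisson likelihood for the two bins,
# with tau known and mu_b chosen as the nuisance parameter.
def log_like(n_on, n_off, mu_s, mu_b, tau):
    return (poisson.logpmf(n_on, mu_s + mu_b) +      # "on" bin
            poisson.logpmf(n_off, tau * mu_b))       # "off" bin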

4.1 $Z_{\rm Bi}$: conditioning on $n_{\rm tot}$ and using Clopper-Pearson intervals

The standard frequentist solution, invoking “conditioning” (jamesroos; reid1995; CLT), is based on the fact that the total number of counts, $n_{\rm tot}=n_{\rm on}+n_{\rm off}$, provides information about the uncertainty on the ratio of Poisson means, but no information about the ratio itself. Thus, $n_{\rm tot}$ is known as an ancillary statistic, and most statisticians agree that one should treat $n_{\rm tot}$ as fixed, i.e., calculate (binomial) probabilities of $n_{\rm on}$ and $n_{\rm off}$ conditional on the observed $n_{\rm tot}$. We can again rephrase the null hypothesis in terms of the binomial parameter $\rho$, namely $\mathrm{H}_0$: $\rho=\mu_{\rm b}/(\mu_{\rm b}+\tau\mu_{\rm b})=1/(1+\tau)$. So we do a binomial test of $\rho=1/(1+\tau)$, given data $n_{\rm on}$, $n_{\rm tot}$. The $p$-value $p_{\rm Bi}$ is obtained by inversion of binomial confidence intervals.

There are many available choices for sets of binomial confidence intervals. For $p_{\rm Bi}$, CLT adheres to the standard in HEP and follows the “exact” construction of clopper1934, with the guarantee of no undercoverage. Note, however, that “exact”, as commonly used in situations such as this (going back to Fisher), means that a small-sample construction (as opposed to asymptotic theory) was used, but beware that in a discrete situation such as ours, there will be overcoverage, sometimes severe.

The $p$-value $p_{\rm Bi}$ is converted to a $Z$-value (sometimes called $S$ in HEP) with a one-tailed Gaussian convention to obtain $Z_{\rm Bi}$. (As in CLT, $Z=\sqrt{2}\,{\rm erf}^{-1}(1-2p)$.)
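A minimal sketch of this recipe (Python/SciPy; I use the binomial tail probability directly rather than interval inversion, which gives the same one-tailed $p$-value):

from scipy.stats import binom, norm

def z_bi(n_on, n_off, tau):
    # one-tailed binomial test of rho = 1/(1+tau), conditioning on n_tot
    n_tot = n_on + n_off
    rho0 = 1.0 / (1.0 + tau)
    p_bi = binom.sf(n_on - 1, n_tot, rho0)   # P(X >= n_on) under H0
    return norm.isf(p_bi)                    # same as sqrt(2) erfinv(1 - 2p)

print(z_bi(10, 2, 1.0))   # -> 2.07, with p_Bi = 0.019287 as in Section 4.6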

4.2 $Z_{\rm mid\text{-}P}$: conditioning on $n_{\rm tot}$ and using Lancaster mid-P intervals

After CLT, a paper by myself, Hymes, and Tucker (cht2010) advocated using Lancaster’s mid-P binomial confidence intervals (lancaster61) to obtain confidence intervals for the ratio of Poisson means. For any fixed $n_{\rm tot}$, the mid-P intervals have over-coverage for some values of the binomial parameter and undercoverage for other values. In the unconditional ensemble that includes $n_{\rm tot}$ sampled from the Poisson distribution with the true total mean, the randomness in $n_{\rm tot}$ leads to excellent coverage of the true ratio of Poisson means.
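A sketch of the mid-P variant, changing only the tail definition relative to the previous sketch (half the probability of the observed outcome is included):

from scipy.stats import binom, norm

def z_midp(n_on, n_off, tau):
    # Lancaster's mid-P tail: keep only half of P(X = n_on)
    n_tot = n_on + n_off
    rho0 = 1.0 / (1.0 + tau)
    p_mid = 0.5 * binom.pmf(n_on, n_tot, rho0) + binom.sf(n_on, n_tot, rho0)
    return norm.isf(p_mid)

print(z_midp(10, 2, 1.0))   # -> 2.28, as in Table 1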

4.3 $Z_{\rm PL}$: profile likelihood and asymptotic theory

The asymptotic result from Wilks’s theorem (details in Section 5 of CLT and in pdg2022, Section 5) yields the $Z$-value $Z_{\rm PL}$ from the profile likelihood function.
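For the on/off problem the PLR has a closed form, so $Z_{\rm PL}$ can be written down directly; a sketch (this is the Li & Ma expression cited as lima by CLT, written here with $\alpha=1/\tau$ in their notation, which is an assumption of my translation between conventions):

import numpy as np

def z_pl(n_on, n_off, tau):
    # closed-form PLR for on/off; Wilks gives Z = sqrt(-2 ln PLR)
    alpha = 1.0 / tau          # exposure ratio on/off in Li-Ma's notation
    n_tot = n_on + n_off
    t_on = n_on * np.log((1 + alpha) / alpha * n_on / n_tot) if n_on else 0.0
    t_off = n_off * np.log((1 + alpha) * n_off / n_tot) if n_off else 0.0
    return np.sqrt(2.0 * (t_on + t_off))

print(z_pl(10, 2, 1.0), z_pl(10, 2, 2.0), z_pl(6, 0, 2.0))  # 2.41, 3.58, 3.63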

4.4 $Z_{\rm BF}$: marginalization of nuisance parameter $\mu_{\rm b}$ (Bayesian-frequentist hybrid)

First, consider the case when $\mu_{\rm b}$ is known exactly. The $p$-value (denoted by $p_{\rm P}$) for $\mathrm{H}_0$: $\mu_{\rm s}=0$ is the Poisson probability of obtaining $n_{\rm on}$ or more events in the “on” bin with true mean $\mu_{\rm b}$.

Then, introduce uncertainty in $\mu_{\rm b}$ in a Bayesian-like way, with a uniform prior for $\mu_{\rm b}$ and likelihood ${\cal L}(\mu_{\rm b})$ from the Poisson probability of observing $n_{\rm off}$ in the “off” bin, thus obtaining the posterior probability $P(\mu_{\rm b}|n_{\rm off})$, a Gamma distribution. Finally, take the weighted average of $p_{\rm P}$ over $P(\mu_{\rm b}|n_{\rm off})$ to obtain the Bayesian-frequentist hybrid $p$-value $p_{\rm BF}$:

$$p_{\rm BF}=\int p_{\rm P}\,P(\mu_{\rm b}|n_{\rm off})\,d\mu_{\rm b}, \qquad (6)$$

and map to $Z_{\rm BF}$ (called $Z_{\Gamma}$ by CLT).

Jim Linnemann discovered numerically, and then proved, that $Z_{\rm BF}=Z_{\rm Bi}$ (!). His proof is in Appendix C of CLT. This justifies, from a frequentist point of view, the use of the uniform prior for $\mu_{\rm b}$, which is on shaky ground in real Bayesian theory.

The integral in Eq. 6 can be performed using the Monte Carlo method as follows. One samples $\mu_{\rm b}$ from the posterior pdf $P(\mu_{\rm b}|n_{\rm off})$, and computes the frequentist $p_{\rm P}$ for each sampled value of $\mu_{\rm b}$ (using the same observed $n_{\rm on}$). Then one calculates the arithmetic mean of these values of $p_{\rm P}$ to obtain $p_{\rm BF}$, and converts to $Z_{\rm BF}$ as desired. (Note that $p_{\rm P}$’s are averaged, not $Z_{\rm P}$’s.)
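A minimal sketch of this Monte Carlo average (Python/SciPy; the Gamma form of the posterior follows from the flat prior stated above, and the function name is mine):

import numpy as np
from scipy.stats import gamma, norm, poisson

def z_bf(n_on, n_off, tau, n_samples=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    # flat prior x Poisson(n_off; tau*mu_b) => Gamma(n_off+1, scale=1/tau)
    mu_b = gamma.rvs(n_off + 1, scale=1.0 / tau, size=n_samples,
                     random_state=rng)
    p_p = poisson.sf(n_on - 1, mu_b)    # P(N >= n_on | mu_b) per sample
    return norm.isf(p_p.mean())         # average the p's, not the Z's

print(z_bf(10, 2, 1.0))   # -> ~2.07 = Z_Bi, illustrating Linnemann's identity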

4.5 $Z_{\rm PB}$: parametric bootstrap for nuisance parameter $\mu_{\rm b}$

For testing $\mathrm{H}_0$: $\mu_{\rm s}=0$, we first find the profile likelihood estimate of $\mu_{\rm b}$. See lima, cited by CLT, for the mathematics.

Conceptually, we use the data in both bins by noting, as above, that for $\mu_{\rm s}=0$, $n_{\rm tot}=n_{\rm on}+n_{\rm off}$ is a sample from Poisson mean $\mu_{\rm b}+\mu_{\rm off}=\mu_{\rm b}(1+\tau)$. Solving for $\mu_{\rm b}$, the profiled MLE of $\mu_{\rm b}$ for $\mu_{\rm s}=0$ is $\widehat{\widehat{\mu}}_{\rm b}=(n_{\rm on}+n_{\rm off})/(1+\tau)$. The parametric bootstrap takes this value of $\mu_{\rm b}$ as truth and proceeds to calculate the $p$-value $p_{\rm PB}$ as in $p_{\rm P}$ above. This can be done by simulation or direct calculation, leading to $Z_{\rm PB}$.
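One concrete way to carry out the direct calculation is sketched below. As the ordering of "as extreme or more" I use the signed closed-form PLR (the Li & Ma statistic from Section 4.3); that ordering is an assumption of this sketch, not a statement of CLT's exact recipe:

import numpy as np
from scipy.stats import norm, poisson

def z_li_ma_signed(n_on, n_off, tau):
    # signed Li-Ma significance; negative for deficits
    alpha, n_tot = 1.0 / tau, n_on + n_off
    if n_tot == 0:
        return 0.0
    t_on = n_on * np.log((1 + alpha) / alpha * n_on / n_tot) if n_on else 0.0
    t_off = n_off * np.log((1 + alpha) * n_off / n_tot) if n_off else 0.0
    z = np.sqrt(2.0 * (t_on + t_off))
    return z if n_on > n_tot / (1.0 + tau) else -z

def z_pb(n_on, n_off, tau, n_max=100):
    # fix mu_b at its profiled value under H0 and sum the exact Poisson
    # probabilities of all toy outcomes at least as signal-like as observed
    mu_b = (n_on + n_off) / (1.0 + tau)
    z_obs = z_li_ma_signed(n_on, n_off, tau)
    p = sum(poisson.pmf(m, mu_b) * poisson.pmf(k, tau * mu_b)
            for m in range(n_max) for k in range(n_max)
            if z_li_ma_signed(m, k, tau) >= z_obs)
    return norm.isf(p)

print(z_pb(10, 2, 1.0))   # -> ~2.3, close to Table 1's Z_PB = 2.32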

4.6 Numerical examples

The calculated $Z$-values for the various recipes are shown in Table 1 for three chosen sets of values of $\tau$, $n_{\rm on}$, and $n_{\rm off}$.

Explaining the first set in detail, suppose $\tau=1.0$, $n_{\rm on}=10$, $n_{\rm off}=2$. Following the ROOT commands in Appendix E of CLT yields the “exact” $Z_{\rm Bi}=Z_{\rm BF}=2.07$. Alternatively, one can use the ROOT class TEfficiency (tefficiency) for two-tailed binomial confidence intervals. Recall that we need a one-tailed binomial test of $\rho=1/(1+\tau)$, given data $n_{\rm on}$, $n_{\rm tot}$. Applying the duality between tests and intervals (cousinshcpss2018), we seek the confidence level (CL) of the confidence interval (CI) that just includes the value $\rho=1/(1+1)=0.5$. By manually iterating $p$ and $\mathrm{CL}=1-2p$ in TEfficiency until the endpoint equals 0.5 (at $p=0.019287$), I arrived at the following ROOT commands (converting $p=0.019287$ to $Z$ as in CLT).

double n_on = 10;
double n_off = 2;
double p = 0.019287;
double endpoint = TEfficiency::ClopperPearson(n_on + n_off, n_on, 1. - 2.*p, false);
endpoint

Then I could switch to TEfficiency::MidPInterval to get the Lancaster mid-P interval and similarly obtain $Z_{\rm mid\text{-}P}=2.28$.

The asymptotic result from Wilks’s theorem gives the value from the profile likelihood function, $Z_{\rm PL}=2.41$.

For the parametric bootstrap, for $\mu_{\rm s}=0$ and $n_{\rm tot}=12$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=6$, and so $\mu_{\rm off}=6$. So I generated toys with $\mu=6$ in each bin (!), to find $Z_{\rm PB}=2.32$.

The results for the second set of values ($\tau=2.0$, $n_{\rm on}=10$, and $n_{\rm off}=2$) are found similarly. For the parametric bootstrap, for $\mu_{\rm s}=0$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=4$, and so $\mu_{\rm off}=8$. So I generated toys with $\mu=4,8$ in the bins where we observed 10, 2 (!). $Z_{\rm PB}=3.46$.

The third set of values (added since my talk) has no events in the off sideband, $n_{\rm off}=0$. In the discussion following our talks, I expressed concern about generating toys for the parametric bootstrap in this situation. In fact, it is not a problem. With $\tau=2.0$ and $n_{\rm on}=6$, then for $\mu_{\rm s}=0$ and $n_{\rm tot}=6$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=2$, and so $\mu_{\rm off}=4$. So I generated toys with $\mu=2,4$ in the bins where we observed 6, 0 (!). $Z_{\rm PB}=3.41$. (In more complicated models, zero events observed in multiple control regions might be an issue.)

Table 1: Calculated $Z$-values for the three example sets of $(\tau, n_{\rm on}, n_{\rm off})$.

$\tau$                                                   1.0     2.0     2.0
$n_{\rm on}$                                             10      10      6
$n_{\rm off}$                                            2       2       0
$n_{\rm tot}$                                            12      12      6
$\widehat{\widehat{\mu}}_{\rm b}$ for $\mu_{\rm s}=0$    6       4       2
$Z_{\rm Bi}=Z_{\rm BF}$, Clopper-Pearson                 2.07    3.27    3.00
$Z_{\rm mid\text{-}P}$, Lancaster mid-P                  2.28    3.44    3.20
$Z_{\rm PL}$                                             2.41    3.58    3.63
$Z_{\rm PB}$                                             2.32    3.46    3.41

4.6.1 Discussion

Superficially, given that $Z_{\rm Bi}=Z_{\rm BF}$ with Clopper-Pearson intervals is “exact”, and the parametric bootstrap gives larger $Z$, the latter looks anti-conservative. But recall that the so-called “exact” intervals overcover due to discreteness. Maybe $Z_{\rm PB}$ is actually better. We need a closer look.

4.7 Coverage studies

Following CLT, we consider a pair of true values of $(\mu_{\rm b},\tau)$, and for that pair, consider all values of $(n_{\rm on},n_{\rm off})$, and calculate both the probability of obtaining that data and the computed $Z$ value for each recipe. We consider some threshold value of claimed $Z_{\rm claim}$, say 3 or 5, and compute the probability that $Z\geq Z_{\rm claim}$ according to a chosen recipe; this is the true Type I error rate for that recipe and the significance level corresponding to that value of $Z_{\rm claim}$. One can then convert this true Type I error rate to a $Z$-value using the one-tailed formula and obtain what we call $Z_{\rm true}$. This could be done by simulation, but we do direct calculations. Figure 2 has “heat maps” from CLT showing $Z_{\rm true}-Z_{\rm claim}$ for claims of 5$\sigma$ for the $Z_{\rm Bi}$ and profile likelihood algorithms.
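The direct calculation can be sketched compactly: enumerate all outcomes, sum the Poisson probabilities of those that pass the claim, and convert back. Here z_recipe stands for any of the recipe sketches above (e.g. z_bi); the function name is my own:

from scipy.stats import norm, poisson

def z_true(z_recipe, mu_b, tau, z_claim, n_max=200):
    # exact Type I error rate: sum over all (n_on, n_off) whose
    # claimed Z reaches the threshold, weighted by their probability
    alpha_true = sum(poisson.pmf(n_on, mu_b) * poisson.pmf(n_off, tau * mu_b)
                     for n_on in range(n_max) for n_off in range(n_max)
                     if z_recipe(n_on, n_off, tau) >= z_claim)
    return norm.isf(alpha_true)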

I calculated a couple of points using the parametric bootstrap $Z_{\rm PB}$: For $\tau=1.0$, $\mu_{\rm b}=10$, and $Z_{\rm claim}=3$, I found $Z_{\rm true}=3.0$ (!). For $\tau=1.0$, $\mu_{\rm b}=20$, and $Z_{\rm claim}=5$, I found $Z_{\rm true}=5.0$ (!). Thus, at least for these two points, any anti-conservatism of the parametric bootstrap is exactly canceled out by the over-coverage due to discreteness. That is, I cherry-picked my example problem to be one in which I knew that marginalizing the nuisance parameter (with judicious prior) yielded the standard frequentist $Z$, while doing the parametric bootstrap as commonly done at the LHC would give higher $Z_{\rm PB}$. But in this example, the discrete nature of the data in the problem “saves” the coverage of $Z_{\rm PB}$!

Note also the similarity of $Z_{\rm PB}$ to $Z_{\rm mid\text{-}P}$. As mentioned, cht2010 found that Lancaster mid-P confidence intervals had excellent coverage in the unconditional ensemble, i.e., not conditioning on $n_{\rm tot}$.

4.8 If $\tau$ is not known exactly

An obvious next step is to relax the assumption that $\tau$ is known exactly, so that $\tau$ becomes a nuisance parameter. In analyses in HEP, $\tau$ is often estimated from the observed data in two so-called control regions (bins), which are thought to have the same ratio of background means $\tau$ as the original two bins. The control regions are distinguished from the original two bins by using an additional statistic that is thought to be independent of the statistic used to define the original two bins. This is often referred to as the ABCD method, after the letters commonly designating the four bins. I would strongly encourage coverage studies for the various methods applied to this case. One could also study the case where the estimate of $\tau$ is sampled from a lognormal distribution.

5 Discussion at Banff BIRS 2023

Among the many talks touching on the topic at Banff (davisonbanff2023), statistician Anthony Davison’s was paired with mine, pointing to works including his Banff Challenge paper with Sartori, and describing a parametric bootstrap, writing,

“Lee and Young (2005, Statistics and Probability Letters) and DiCiccio and Young (2008, 2010, Biometrika) show that this parametric bootstrap procedure has third-order relative accuracy in linear exponential families, both conditionally and unconditionally, and is closely linked to objective Bayes approaches.”

6 Final remarks

Overall, I am encouraged by what I learned in the last year, while preparing the talks and this writeup, regarding our widespread use of the parametric bootstrap with the PLR test statistic in HEP. Still, there is room to learn more (which I suspect I must leave to the younger generations). First, how much is HEP losing by not studying more thoroughly the higher-order asymptotic theory in the test statistic? Would that give more power, or at least reduce the need for parametric bootstrapping? Second, can one find relevant cases in HEP where marginalizing out-performs profiling? Is there an example having continuous data but otherwise similar to the example in this note, so that one can study the performance without the confounding complication of discreteness? Or, for those familiar with the ABCD method in HEP (where, as mentioned, $\tau$ is measured from the contents of two more control regions), are there pitfalls for the parametric bootstrap in that context? I would urge more exploration of these issues in real HEP analyses!

Acknowledgments

Many thanks go to the multitudes of physicists and statisticians from whom I have learned, and particularly to Louis Lyons for bringing us together regularly at PhyStat meetings since the year 2000 (and for comments on earlier drafts). I also thank him, Sara Algeri, and Olaf Behnke for organizing this discussion. This work was partially supported by the U.S. Department of Energy under Award Number DE-SC0009937.

Part II: Statistician’s View
by Larry Wasserman

1 Introduction

This paper arose from my presentation at the PhyStat meeting where Bob Cousins and I were asked to give our views on two methods for handling nuisance parameters when conducting statistical inference. This written account expands on my comments and goes into a little more detail.

I’ll follow Bob’s notation, where $\mu$ is the (real-valued) parameter of interest and $\beta$ is the vector of nuisance parameters. In his paper, Bob points out that marginalising nuisance parameters is rare at the LHC. This is true in statistics as well. There are a few cases where marginalising is used, such as random effects models, where the number of parameters increases as the sample size increases. But even profile likelihood is not that common. I would say that the most common method in statistics is the Wald confidence interval $\widehat{\mu}\pm z_{\alpha/2}s_n$, where $\widehat{\mu}$ is the maximum likelihood estimator, $s_n$ is the standard error (usually obtained from the Fisher information matrix), and $z_{\alpha/2}$ is the $\alpha/2$ upper quantile of a standard Normal distribution. (Note that for the Wald interval, one simply inserts point estimates of the nuisance parameters in the formula for $s_n$.)
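In code the Wald interval is a one-liner; a minimal sketch (Python/SciPy; the inverse-information argument is a placeholder for whatever fit machinery supplies it):

import numpy as np
from scipy.stats import norm

def wald_interval(mu_hat, inv_info_mu_mu, alpha=0.05):
    # mu_hat +/- z_{alpha/2} * s_n, with s_n from the (mu, mu) entry
    # of the inverse Fisher information, nuisance parameters plugged in
    s_n = np.sqrt(inv_info_mu_mu)
    z = norm.isf(alpha / 2.0)           # upper alpha/2 quantile
    return mu_hat - z * s_n, mu_hat + z * s_n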

Let $Y_1,\ldots,Y_n$ be a random sample from a density $p_\theta$ which belongs to a model $(p_\theta:\ \theta\in\Theta)$. We decompose the parameter as $\theta=(\mu,\beta)$, where $\mu\in\mathbb{R}$ is the parameter of interest and $\beta\in\mathbb{R}^k$ is a vector of nuisance parameters. We want to construct a confidence set $C_n\equiv C_n(Y_1,\ldots,Y_n)$ such that

$$P_\theta(\mu\in C_n)\geq 1-\alpha\qquad\text{for all }\theta \qquad (7)$$

or, equivalently,

$$\inf_\theta P_\theta(\mu\in C_n)\geq 1-\alpha.$$

We may also want to test

$$H_0:\mu=\mu_0\qquad\text{versus}\qquad\mu\neq\mu_0$$

however, for simplicity, I’ll focus on confidence sets. The present discussion is on likelihood-based methods. In particular, we may use the profile likelihood

$${\cal L}_p(\mu)=\sup_\beta{\cal L}(\mu,\beta)$$

or the marginalised (or integrated) likelihood

$${\cal L}_m(\mu)=\int{\cal L}(\mu,\beta)\,\pi(\beta|\mu)\,d\beta$$

where $\pi(\beta|\mu)$ is a prior for $\beta$. In my view, we cannot say whether a likelihood is good or bad without saying what we will do with the likelihood. I assume that we are using the likelihood to construct confidence intervals and tests. Then judging whether a likelihood is good or bad requires assessing the quality of the confidence sets obtained from the likelihood. The usual central $1-\alpha$ confidence set based on ${\cal L}_p$ is

$$C=\Bigl\{\mu:\ {\cal L}_p(\mu)\geq{\cal L}_p(\widehat{\mu})\,e^{-c/2}\Bigr\}=\Bigl\{\mu:\ \ell_p(\mu)\geq\ell_p(\widehat{\mu})-\tfrac{c}{2}\Bigr\} \qquad (8)$$

where $\widehat{\theta}=(\widehat{\mu},\widehat{\beta})$ is the maximum likelihood estimate, $\ell_p=\log\mathcal{L}_p$, and $c$ is the $1-\alpha$ quantile of a $\chi^2_1$ distribution.
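To make this concrete, here is a minimal sketch of the set (8) in Python, assuming a Gaussian model with the mean as the parameter of interest $\mu$ and the standard deviation as the nuisance parameter; for a Gaussian the supremum over the nuisance parameter has a closed form, which a general model would replace with a numerical fit. The data, grid, and function name are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, 100)   # toy data: true mu = 1, true sigma = 2
n = len(y)

def profile_loglik(mu):
    # Profile out sigma: at fixed mu, the Gaussian likelihood is
    # maximized at sigma^2 = mean((y - mu)^2).
    sig2 = np.mean((y - mu) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sig2) + 1.0)

mu_grid = np.linspace(0.0, 2.0, 401)
lp = np.array([profile_loglik(m) for m in mu_grid])
c = stats.chi2.ppf(0.95, df=1)          # 1 - alpha quantile of chi^2_1
C = mu_grid[lp >= lp.max() - c / 2]     # the set in (8), on a grid
print(C.min(), C.max())
```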

2 Profile or Marginalise?

We want confidence intervals with correct coverage, which means that $P_\theta(\mu\in C)\geq 1-\alpha$ for all $\theta$. In this case we say that the confidence set is valid. Among valid intervals, we may want one that is shortest. A valid interval with shortest length is efficient. (But this is a scale-dependent notion.)

If the sample size is large and certain regularity conditions hold, then the profile-based confidence set in (8) will be valid and efficient. So in these well-behaved situations, I don't see any reason for using $\mathcal{L}_m$. The main regularity conditions are:

(1) $p_\theta(y)$ is a smooth function of $\theta$.

(2) The range of the random variable $Y$ does not depend on $\theta$.

(3) The number of nuisance parameters does not increase as the sample size increases.

In a very interesting paper that Bob mentions, berger1999 provide examples where $\mathcal{L}_p$ apparently does not behave well. This is because, in each case, the sample size is small, the number of nuisance parameters is increasing, or some other regularity condition fails. There may be a few cases where the marginalised likelihood can fix these things if we use a carefully chosen prior, but such cases are rare. We might improve the accuracy of the coverage by using higher-order asymptotics, but this is rarely done in practice. Such methods are not easy to apply and they won't necessarily help if the regularity conditions are violated.

Sometimes we can use specific tricks, like conditioning on a well-chosen statistic, as Bob does in the Poisson ON-OFF example. These methods are useful when they are available, but they tend to be very problem specific. Bootstrapping and subsampling are also useful in some cases. A common misconception is that the bootstrap is a finite-sample method. It is not. These methods are asymptotic and they still rely on regularity conditions.

In the next two sections I'll discuss two approaches for getting valid confidence sets without relying on asymptotic approximations or regularity conditions.

3 Universal Inference

Universal inference (wasserman2020universal) is a method for constructing confidence sets from the likelihood that does not rely on asymptotic approximations or on regularity conditions. The method works as follows.

1. Choose a large number $B$.

2. For $j=1,\ldots,B$:

   (a) Split the data into two disjoint groups $\mathcal{D}_0$ and $\mathcal{D}_1$.

   (b) Let $\widehat{\theta}=(\widehat{\mu},\widehat{\beta})$ be any estimate of $\theta$ using $\mathcal{D}_1$.

   (c) Let
   \[
   T_j(\mu)=\frac{\mathcal{L}_0(\widehat{\mu},\widehat{\beta})}{\sup_\beta \mathcal{L}_0(\mu,\beta)},
   \]
   where $\mathcal{L}_0(\theta)$ is the likelihood constructed from $\mathcal{D}_0$ (both the numerator and the denominator are computed from $\mathcal{D}_0$).

3. Let $T(\mu)=B^{-1}\sum_{j=1}^B T_j(\mu)$.

4. Return $C_n=\{\mu:\ T(\mu)\leq 1/\alpha\}$.

This confidence set is valid, that is,

\[
P_\theta(\mu\in C_n)\geq 1-\alpha\qquad\text{for all }\theta
\]

and this holds for any sample size. It does not rely on large-sample approximations and it does not require any regularity conditions. The number of splits $B$ can be any number; even $B=1$ is allowed. The advantage of using a large $B$ is that it removes the randomness from the procedure. The disadvantage of universal inference is that one needs to compute the likelihood many times, which can be computationally expensive.
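Here is a minimal sketch of the procedure in Python, again assuming a Gaussian model with mean $\mu$ and nuisance standard deviation; the function name, grid, and number of splits are my illustrative choices, and the closed-form profiling step on $\mathcal{D}_0$ would be a numerical fit in a general model.

```python
import numpy as np
from scipy import stats

def universal_ci(y, mu_grid, alpha=0.05, B=10, seed=0):
    """Split-likelihood-ratio (universal inference) confidence set
    for a Gaussian mean; a sketch, not a general implementation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    T = np.zeros(len(mu_grid))
    for _ in range(B):
        idx = rng.permutation(n)
        d1, d0 = y[idx[: n // 2]], y[idx[n // 2:]]
        mu_hat, sig_hat = d1.mean(), d1.std()     # any estimate from D1
        # log L_0 at the D1 estimate (likelihood built from D0):
        num = stats.norm.logpdf(d0, mu_hat, sig_hat).sum()
        for k, mu in enumerate(mu_grid):
            # sup over the nuisance parameter on D0 at fixed mu:
            sig_mu = np.sqrt(np.mean((d0 - mu) ** 2))
            den = stats.norm.logpdf(d0, mu, sig_mu).sum()
            T[k] += np.exp(num - den) / B
    return mu_grid[T <= 1.0 / alpha]

y = np.random.default_rng(1).normal(1.0, 2.0, size=200)
C = universal_ci(y, np.linspace(0.0, 2.0, 201))
print(C.min(), C.max())
```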

4 Simulation Based Inference

As the name implies, simulation based inference (SBI) uses simulations to conduct inference. It is usually used to deal with intractable likelihoods (dalmasso2021likelihood; masserano2023simulator; al2024amortized; cranmer2015; cranmer2020; thomas2022; xie2024; masserano2022simulation; stanley2023), but it can be used as a way to get valid inference for any model. The idea is to first get a confidence set for $\theta$ by inverting a test. Let $\theta_*$ denote the true value of $\theta$. For each $\theta$ we test the hypothesis $H_0:\ \theta_*=\theta$. This test is based on some test statistic $T(\theta,Y_1,\ldots,Y_n)$. The p-value for this test is

\[
p(\theta)=p(\mu,\beta)=P_\theta\Bigl(T(\theta,Y_1^*,\ldots,Y_n^*)\geq T(\theta,Y_1,\ldots,Y_n)\Bigr)
\]

where $Y_1^*,\ldots,Y_n^*\sim p_\theta$. Then $\widetilde{C}=\{\theta:\ p(\theta)\geq\alpha\}$ is an exact $1-\alpha$ confidence set for $\theta$. The confidence set for $\mu$ is the projection

\[
C=\Bigl\{\mu:\ \theta=(\mu,\beta)\in\widetilde{C}\ \text{for some }\beta\Bigr\}=\Bigl\{\mu:\ \sup_\beta p(\mu,\beta)\geq\alpha\Bigr\}.
\]

Such projected confidence sets can be large, but they have correct coverage. In SBI, we sample a set of values $\theta_1,\ldots,\theta_B$ from some distribution. For each $\theta_j$ we draw a sample $Y_1^*,\ldots,Y_n^*\sim p_{\theta_j}$. Then we define

\[
I_j=I\Bigl(T(\theta_j,Y_1^*,\ldots,Y_n^*)\geq T(\theta_j,Y_1,\ldots,Y_n)\Bigr)
\]

which is 1 if $T(\theta_j,Y_1^*,\ldots,Y_n^*)\geq T(\theta_j,Y_1,\ldots,Y_n)$ and 0 otherwise. Note that

\[
p(\theta_j)=P_{\theta_j}(I_j=1)=\mathbb{E}[I_j\,|\,\theta_j].
\]

So $I_j$ is an unbiased estimate of $p(\theta_j)$. But we can't estimate $p(\theta_j)$ from a single observation $I_j$. Instead, we perform a nonparametric regression of the $I_j$'s on the $\theta_j$'s, which lets us borrow information from nearby values. For example, we can use the kernel regression estimator

\[
\widehat{p}(\theta)=\frac{\sum_j I_j K_h(\theta_j-\theta)}{\sum_j K_h(\theta_j-\theta)}
\]

where $K_h$ is a kernel with bandwidth $h$, such as $K_h(z)=e^{-\|z\|^2/(2h^2)}$. Here are the steps in detail.

1. Choose any function $T(\theta,Y_1,\ldots,Y_n)$.

2. Draw $\theta_1,\ldots,\theta_B$ from any distribution $h(\theta)$ with full support. This can be the likelihood, but it can be anything.

3. For $j=1,\ldots,B$: draw $Y_1^*,\ldots,Y_n^*\sim p_{\theta_j}$ and compute $T_j=T(\theta_j,Y_1^*,\ldots,Y_n^*)$.

4. Let $I_j=I(T_j\geq T(\theta_j,Y_1,\ldots,Y_n))$ for $j=1,\ldots,B$; that is, $I_j=1$ if $T_j\geq T(\theta_j,Y_1,\ldots,Y_n)$ and $I_j=0$ otherwise.

5. Estimate $p(\theta_j)=\mathbb{E}[I_j\,|\,\theta_j]$ by regressing $I_1,\ldots,I_B$ on $\theta_1,\ldots,\theta_B$ to get $\widehat{p}(\theta)=\widehat{p}(\mu,\beta)$.

6. Return $C=\{\mu:\ \sup_\beta\widehat{p}(\mu,\beta)\geq\alpha\}$.

We have that $P_\mu(\mu\in C_n)\to 1-\alpha$ as $B\to\infty$. The test statistic $T(\theta,Y_1,\ldots,Y_n)$ can be anything. A natural choice is the likelihood function $\mathcal{L}(\theta)$. In fact, even if the likelihood is intractable, it too can be estimated using the simulations. See (dalmasso2021likelihood; masserano2023simulator; al2024amortized; cranmer2015; cranmer2020; thomas2022) for details. A related method is given in xie2022.
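Below is a minimal sketch of these steps in Python, assuming a Gaussian model with $\theta=(\mu,\sigma)$ and using the likelihood-ratio statistic as $T$, so that large values of $T$ indicate discrepancy, matching the direction of the p-value above. The proposal ranges, bandwidth, and grids are all illustrative choices on my part.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(1.0, 2.0, size=100)   # observed data
n = len(y)

def T(mu, sigma, data):
    # Likelihood-ratio statistic: large when (mu, sigma) fits `data` poorly.
    l_hat = stats.norm.logpdf(data, data.mean(), data.std()).sum()
    return 2.0 * (l_hat - stats.norm.logpdf(data, mu, sigma).sum())

# Steps 2-4: draw parameters from a full-support proposal, simulate a
# dataset at each draw, and record the comparison indicator.
B = 20000
mus = rng.uniform(-1.0, 3.0, B)
sigmas = rng.uniform(0.5, 4.0, B)
I = np.empty(B)
for j in range(B):
    ystar = rng.normal(mus[j], sigmas[j], n)
    I[j] = T(mus[j], sigmas[j], ystar) >= T(mus[j], sigmas[j], y)

# Step 5: kernel regression of the indicators on the parameter draws.
def p_hat(mu, sigma, h=0.15):
    w = np.exp(-((mus - mu) ** 2 + (sigmas - sigma) ** 2) / (2 * h ** 2))
    return (w * I).sum() / w.sum()

# Step 6: project out the nuisance parameter sigma on a grid.
alpha = 0.05
sigma_grid = np.linspace(0.5, 4.0, 36)
C = [m for m in np.linspace(-1.0, 3.0, 81)
     if max(p_hat(m, s) for s in sigma_grid) >= alpha]
print(min(C), max(C))
```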

5 Optimality

Suppose we have more than one valid method. How do we choose between them? We could try to optimize some notion of efficiency. For example, we could try to minimize the expected length of the confidence interval. However, this measure is not transformation invariant, so one has to choose a scale. The marginalised likelihood could potentially lead to shorter intervals if the prior happens to be concentrated near the true parameter value but, in general, this will not be the case. (hoff2022 discusses methods to include prior information while maintaining frequentist coverage guarantees.) For testing, we usually choose the test with the highest power. But generally there is no test with highest power at all alternative values. One has to choose a specific alternative, or put a prior on $\theta$ and maximize the average power.

Again, in the large-sample, low-dimension regime the choice of procedure might not have much practical impact. In small-sample situations, careful simulation studies are probably the best approach to assessing the quality of competing methods.

Of course, there may be other relevant criteria such as simplicity and computational tractability. These need to be factored in as well.

6 Why Likelihood?

There are cases where neither the profile likelihood nor the marginalised likelihood should be used. Here is a simple example. Let $c_1,\ldots,c_n$ be a set of fixed, unknown constants such that $0\leq c_i\leq 1$. Let $\theta=n^{-1}\sum_i c_i$. Let $S_1,\ldots,S_n$ be $n$ independent flips of a coin, where $P(S_i=1)=P(S_i=0)=1/2$. If $S_i=1$ you get to see $c_i$. If $S_i=0$ then $c_i$ is unobserved. Let

\[
\widehat{\theta}=\frac{2}{n}\sum_{i=1}^n c_i S_i.
\]

Then $\mathbb{E}[\widehat{\theta}]=\theta$ and a simple $1-\alpha$ confidence interval for $\theta$ is $\widehat{\theta}\pm z_{\alpha/2}/\sqrt{n}$. The likelihood is the probability of the $n$ coin flips, which is $(1/2)^n$; it does not even depend on $\theta$. In other words, the likelihood function is flat and contains no information. What's going on here? The variables $S_1,\ldots,S_n$ are ancillary, which means their distribution does not depend on the unknown parameter. Ancillary statistics do not appear in the likelihood. But the estimator above does use the ancillaries, which shows that the ancillaries contain valuable information. So likelihood-based inference can fail when there are useful ancillaries. This issue comes up in real problems such as analyzing randomized clinical trials, where the random assignment of subjects to treatment or control is ancillary.
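A quick simulation illustrates the point; the Beta-distributed constants below are an arbitrary stand-in for the fixed $c_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
c = rng.beta(2, 5, n)          # stand-in for the fixed, unknown constants
theta = c.mean()               # the estimand

S = rng.integers(0, 2, n)      # the ancillary fair-coin flips
theta_hat = 2.0 * np.mean(c * S)   # uses c_i only where S_i = 1
half = 1.96 / np.sqrt(n)           # conservative, since 0 <= c_i <= 1
print(theta, theta_hat, (theta_hat - half, theta_hat + half))
```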

Another example is Example 7 from berger1999. Let $X_1,\ldots,X_n\sim N(\theta,\sigma^2)$ and suppose that the parameter of interest is $\mu=\theta/\sigma$. berger1999 show that both the profile and marginalised likelihood functions behave poorly. But there is no need to use either. We can define a confidence set $C=\{\theta/\sigma:\ (\theta,\sigma)\in\widetilde{C}\}$, where $\widetilde{C}$ is a joint confidence set for $(\theta,\sigma)$. This is a valid confidence set which uses neither $\mathcal{L}_p$ nor $\mathcal{L}_m$.

In some cases, we might want to use semiparametric methods instead of likelihood-based methods. Suppose, for example, that the nuisance parameter is infinite dimensional. Consider testing for the presence of a signal in the presence of a background. The data take the form

\[
Y_1,\ldots,Y_n\sim (1-\lambda)b(y)+\lambda s(y)
\]

where $0\leq\lambda\leq 1$, $b$ is a known background distribution and $s$ is the signal. Suppose that $s$ is supported on a signal region $\Omega$ but is otherwise unknown. We want to estimate $\lambda$ or test $H_0:\lambda=0$. The unknown parameters are $(\lambda,s)$. Here, $s$ is any density such that $\int_\Omega s(y)\,dy=1$. So $s$ is an infinite-dimensional nuisance parameter. In our previous notation, the parameter of interest $\mu$ is $\lambda$ and the nuisance parameter $\beta$ is $s$.

Models of this form might be better handled using semiparametric methods (tsiatis2006; kosorok2008). It is hard for me to judge whether such methods are useful in physics, but there may indeed be cases where they are. The main idea can be summarized as follows. Suppose that $\mu$ is the parameter of interest and $\beta$ is the (possibly infinite-dimensional) nuisance parameter. Any well-behaved estimator $\widehat{\mu}$ will have the property that $\sqrt{n}(\widehat{\mu}-\mu)$ converges to a Normal with mean 0 and some variance $\sigma^2$. Generally, $\sigma^2\geq\mathbb{E}[\varphi^2(Y)]$ for a function $\varphi$ known as the efficient influence function. We then try to find $\widehat{\mu}$ so that its corresponding $\sigma^2$ is equal to $\mathbb{E}[\varphi^2(Y)]$. Such an estimator is said to be semiparametric efficient.

In our signal detection example, the efficient influence function is

\[
\varphi(y)=(1-\lambda)I_\Omega(y)-\frac{P(\Omega)\,I_{\Omega^c}(y)}{\int_{\Omega^c}b(y)\,dy}
\]

where $P(\Omega)=P(Y\in\Omega)$, and the efficient estimator is simply

\[
\widehat{\lambda}=1-\frac{n^{-1}\sum_i I_{\Omega^c}(Y_i)}{\int_{\Omega^c}b(y)\,dy}.
\]

The point is that we have taken care of an infinite-dimensional nuisance parameter very easily, and we never mentioned likelihood.
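As a sanity check, here is a small simulation of $\widehat{\lambda}$, assuming a uniform background on $[0,1]$ and a signal region $\Omega=[0.4,0.6]$; both choices, and the signal shape, are mine.

```python
import numpy as np

rng = np.random.default_rng(4)

lam_true = 0.2
n = 50_000
is_signal = rng.random(n) < lam_true
y = np.where(is_signal,
             rng.uniform(0.4, 0.6, n),   # some signal shape on Omega
             rng.random(n))              # uniform background on [0, 1]

# Efficient estimator: compare the observed fraction outside Omega
# with the background probability of falling outside Omega.
b_outside = 1.0 - (0.6 - 0.4)            # integral of b over Omega^c
lam_hat = 1.0 - np.mean((y < 0.4) | (y > 0.6)) / b_outside
print(lam_hat)                           # close to 0.2
```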

Another reason to consider non-likelihood methods is robustness. Both the profile likelihood and the marginalised likelihood can lead to poor inferences if the model is misspecified. What should we do in such cases? One possibility is to use minimum Hellinger estimation. In this case we choose $\widehat{\theta}$ to minimize

\[
h^2(\theta)=\int\Bigl(\sqrt{p_\theta(y)}-\sqrt{\widehat{p}(y)}\Bigr)^2\,dy
\]

where $\widehat{p}(y)$ is some nonparametric density estimator. This estimator is asymptotically equivalent to the maximum likelihood estimator if the model is correct, but it is less sensitive to model misspecification (beran1977; lindsay1994).
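Here is a minimal sketch of minimum Hellinger estimation for a Gaussian model contaminated by outliers, using a kernel density estimate for $\widehat{p}$ and a crude Riemann sum for the integral; all tuning choices are illustrative.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
y = rng.normal(1.0, 1.0, 500)
y[:25] = 10.0                 # 5% gross outliers: the model is misspecified

kde = stats.gaussian_kde(y)   # nonparametric density estimate p-hat
grid = np.linspace(-5.0, 15.0, 2000)
dx = grid[1] - grid[0]
p_hat = kde(grid)

def hellinger(theta):
    mu, log_sig = theta       # parametrize sigma on the log scale
    p_theta = stats.norm.pdf(grid, mu, np.exp(log_sig))
    return np.sum((np.sqrt(p_theta) - np.sqrt(p_hat)) ** 2) * dx

res = optimize.minimize(hellinger, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))   # near (1, 1) despite the outliers
print(y.mean())                     # the MLE of mu is pulled toward 10
```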

Another recent non-likelihood method is the HulC (hulc). We divide the data into $B=\lceil\log_2(2/\alpha)\rceil$ disjoint groups. Let $\widehat{\mu}_1,\ldots,\widehat{\mu}_B$ be estimates from the $B$ groups and let $C=[\min_j\widehat{\mu}_j,\ \max_j\widehat{\mu}_j]$. Define the median bias

\[
B_n=\frac{1}{2}-\min\Bigl\{P(\widehat{\mu}\geq\mu),\ P(\widehat{\mu}\leq\mu)\Bigr\}.
\]

So $0\leq B_n\leq 1/2$. In the extreme situation where $\widehat{\mu}$ is larger than the true value $\mu$ with probability 1, we have $B_n=1/2$. Otherwise $B_n<1/2$. If $B_n\to 0$, as is typically the case, we have $P(\mu\in C_n)\to 1-\alpha$. This is simple and does not require profiling or marginalising a likelihood function.
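The method is short enough to sketch in a few lines; the estimator and data below are placeholders for whatever one would use in practice.

```python
import numpy as np

def hulc_interval(y, estimator, alpha=0.05, seed=0):
    """HulC-style interval: a minimal sketch assuming an
    asymptotically median-unbiased estimator (hulc)."""
    rng = np.random.default_rng(seed)
    B = int(np.ceil(np.log2(2 / alpha)))     # number of disjoint groups
    groups = np.array_split(y[rng.permutation(len(y))], B)
    ests = [estimator(g) for g in groups]
    return min(ests), max(ests)              # convex hull of the estimates

y = np.random.default_rng(6).normal(1.0, 2.0, 600)
print(hulc_interval(y, np.mean))   # hull of B = 6 group means at alpha = 0.05
```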

7 Conclusion

The confidence intervals and tests based on the profile likelihood should have good coverage and reasonable size as long as the sample size is large, the number of nuisance parameters is not too large, and the usual regularity conditions hold. If any of these conditions fails, it might be better to use alternative methods such as universal inference, simulation based inference, semiparametric inference or the HulC. The integrated likelihood may offer benefits in some special cases, but I don't believe it is useful in any generality.

I conclude with remarks about power. Methods like universal inference, SBI and the HulC give satisfying coverage guarantees, but these come with some loss of efficiency. The confidence sets from universal inference typically shrink at the optimal rate but will nonetheless be larger than optimal confidence sets (if such optimal sets exist). As a benchmark, consider the confidence set for $Y_1,\ldots,Y_n\sim N(\mu,I)$ where $Y_i\in\mathbb{R}^d$. Of course we don't need universal inference here, but this case allows for precise results. dunn2023 show that the confidence set has radius of order $\sqrt{d/n}$, which is optimal, but compared to the best confidence set the squared radius is about twice as large. More precisely, the ratio of squared radii is bounded by $(4\log(1/\alpha)+4d)/(2\log(1/\alpha)+d-5/2)$. Similarly, comparing the HulC to the Wald interval (when the latter is valid), we find that the length is expanded by a factor of $\sqrt{\log_2(\log_2(2/\alpha))}$, which is quite modest (both factors are evaluated numerically below). For SBI it is hard to say much, since that method works with any statistic $T$ and the size of the set will depend on the particular choice of $T$. However, it is fair to say that when using generally applicable methods that provide correct coverage under weak conditions, there will be some price to pay. Evaluating these methods in some real physics problems would be very valuable. This would be a great project for statisticians and physicists to collaborate on.
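For concreteness, here is a quick numerical evaluation of the two factors quoted above at $\alpha=0.05$ and $d=1$; the choice of values is mine.

```python
import numpy as np

alpha, d = 0.05, 1
ratio_sq = (4 * np.log(1 / alpha) + 4 * d) / (2 * np.log(1 / alpha) + d - 2.5)
hulc_factor = np.sqrt(np.log2(np.log2(2 / alpha)))
print(ratio_sq)     # ~3.6: bound on the squared-radius ratio at d = 1
print(hulc_factor)  # ~1.55: HulC length expansion relative to Wald
```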

8 Acknowledgments

Thanks to Sara Algeri, Louis Lyons and Olaf Behnke for organizing this discussion. Also, I would like to thank Louis Lyons and Bob Cousins for comments on an earlier draft that led to improvements in the paper.
