Marginalizing versus Profiling of Nuisance Parameters. Notes by Robert D. Cousins, Department of Physics and Astronomy, University of California, Los Angeles, Los Angeles, CA 90095, and Larry Wasserman, Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213 (2024)

Table of Contents
Foreword
Part I
1 Introduction
2 Choice of test statistic
3 Sampling distributions of the test statistic(s) under various hypotheses
4 Example: ratio of Poisson means
4.1 $Z_{\rm Bi}$: conditioning on $n_{\rm tot}$ and using Clopper-Pearson intervals
4.2 $Z_{\mathrm{mid\text{-}P}}$: conditioning on $n_{\rm tot}$ and using Lancaster mid-P intervals
4.3 $Z_{\mathrm{PL}}$: profile likelihood and asymptotic theory
4.4 $Z_{\mathrm{BF}}$: marginalization of nuisance parameter $\mu_{\rm b}$ (Bayesian-frequentist hybrid)
4.5 $Z_{\mathrm{PB}}$: parametric bootstrap for nuisance parameter $\mu_{\rm b}$
4.6 Numerical examples
4.7 Coverage studies
4.8 If $\tau$ is not known exactly
5 Discussion at Banff BIRS 2023
6 Final remarks
Acknowledgments
Part II
1 Introduction
2 Profile or Marginalise?
3 Universal Inference
4 Simulation Based Inference
5 Optimality
6 Why Likelihood?
7 Conclusion
8 Acknowledgments


Abstract

This is a writeup, with some elaboration, of the talks by the two authors (a physicist and a statistician) at the first PHYSTAT Informal review on January 24, 2024. We discuss Bayesian and frequentist approaches to dealing with nuisance parameters, in particular, integrated versus profiled likelihood methods. In regular models, with finitely many parameters and large sample sizes, the two approaches are asymptotically equivalent. But, outside this setting, the two methods can lead to different tests and confidence intervals. Assessing which approach is better generally requires comparing the power of the tests or the length of the confidence intervals. This analysis has to be conducted on a case-by-case basis. In the extreme case where the number of nuisance parameters is very large, possibly infinite, neither approach may be useful. Part I provides an informal history of usage in high energy particle physics, including a simple illustrative example. Part II includes an overview of some more recently developed methods in the statistics literature, including methods applicable when the use of the likelihood function is problematic.

Foreword

This document collects reviews written by Bob Cousins, Distinguished Professor Emeritus in the Department of Physics and Astronomy at UCLA, and Larry Wasserman, Professor of Statistics and Data Science at Carnegie Mellon University, on inferential methods for handling nuisance parameters.

These notes expand upon the main points of discussion that arose during the first of a new event series called PHYSTAT Informal Reviews, which aims to promote dialogue between physicists and statisticians on specific topics. The inaugural event, held virtually on January 24, 2024, focused on the topic of "Hybrid Frequentist and Bayesian approaches". Bob Cousins and Larry Wasserman presented excellent talks, and the meeting was further enriched by a valuable discussion involving the statisticians and physicists in the audience.

In Part I, Bob Cousins provides a comprehensive overview of how high-energy physicists have historically tackled the problem of inferring a parameter of interest in the presence of nuisance parameters. The note offers insights into why methods like profile likelihood have come to dominate in practice while others have not been widely adopted. Various statistical solutions are also compared through a classical example with relevance in both physics and astronomy. The memorandum is also a rich source of key references concerning statistical practice in the physical sciences.

In Part II, Larry Wasserman examines the problem of conducting statistical inference with nuisance parameters from the perspectives of both classical and modern statistics. He notes that standard inferential solutions require a likelihood function, regularity conditions, and large samples. Nonetheless, recent statistical developments may be used to ensure correct coverage while relaxing these requirements. This review offers an accessible self-contained overview of some of these approaches and guides the reader through their strengths and limitations.

Together, the written accounts below provide a cohesive review of the tools currently adopted in the practice of statistical inference in searches for new physics, as well as their historical and technical justification. They also offer a point of reflection on the role that more sophisticated statistical methods and recent discoveries could play in assisting the physics community in tackling both classical and new statistical challenges.

The organizing committee of PHYSTAT activities is grateful to both authors for their engagement and valuable contributions, which we believe will benefit both the statistics and physics communities.

Sara Algeri,

School of Statistics, University of Minnesota, Minneapolis, MN

On behalf of the PHYSTAT Organizing Committee

Part I: Physicist's View
by Robert D. Cousins

1 Introduction

This is a writeup (with some changes) of the 20-minute talk that I gave at a virtual PhyStat meeting that featured "mini-reviews" of the topic by a physicist (cousinsphystat2023) and a statistician (wassermanphystat2023). Mine was largely based on my talk the previous year at a BIRS workshop at Banff (cousinsbanff2023). This is a short personal perspective through the lens of sometimes-hazy recollections of developments during the last 40 years in high energy physics, and is not meant to be a definitive history. I also refer at times to my lecture notes from the 2018 Hadron Collider Physics Summer School at Fermilab (cousinshcpss2018), which provide a more detailed introduction to some of the fundamentals, such as the duality of frequentist hypothesis testing and confidence intervals ("inverting a test" to obtain intervals).

Parametric statistical inference requires a statistical model, $p(y;\theta)$, which gives the probability (density) for observing data $y$, given parameter(s) $\theta$. In a given context, $\theta$ is partitioned as $(\mu,\beta)$, where $\mu$ is the parameter of interest (taken to be a scalar in this note, e.g., the magnitude of a putative signal), and $\beta$ is typically a vector of "nuisance parameters" with values constrained by subsidiary measurements, usually referred to as "systematic uncertainties" in particle physics (e.g., detector calibration "constants", magnitudes of background processes, etc.).

This note discusses the common case in which one desires an algorithm for obtaining frequentist confidence intervals (CI) for $\mu$ that are valid at a specified confidence level (CL), no matter what the unknown values of $\beta$ are. Ideally, CIs have exact "frequentist coverage" of the unknown true value of $\mu$: in imagined repeated applications of the algorithm for different data $y$ sampled from the model, in the long run a fraction CL of the confidence intervals will contain (cover) the unknown true $\mu$. In practice, this is often only approximately true, and some over-coverage (fraction higher than CL) is typically tolerated (with attendant loss of power), while material under-coverage is considered disqualifying.

The construction of confidence intervals requires that for every possible true $\mu$, one chooses an ordering of the sample space $y$. This is done by specifying an ordered "test statistic", a function of the observed data and parameters. In the absence of nuisance parameters, standard choices have existed since the 1930s. Thus, a key question to be discussed is how to treat ("eliminate") the nuisance parameters in the test statistic, so that the problem is reduced to the case without nuisance parameters.

A second question to be discussed is how to obtain the sampling distribution of the test statistic under the statistical model(s) being considered. There is a vast literature on approximate analytic methods that become more accurate as the sample size increases. For small sample sizes in particle physics, Monte Carlo simulation is typically used to obtain these distributions. The crucial question is then: which value(s) of the nuisance parameters should be used in the simulations, so that the inferred confidence intervals for the parameter of interest have valid coverage, no matter what the true values of the nuisance parameters are?

Section 2 defines the profile and marginalized likelihoods. Section 3 introduces the two corresponding ways to treat nuisance parameters in simulation, the parametric bootstrap and marginalization (sampling from the posterior distribution of the nuisance parameters). Section 4 is a longer section going into detail regarding five ways to calculate the significance $Z$ in the on/off problem (ratio of Poisson means). Section 5 briefly highlights the talk by Anthony Davison on this subject at the recent 2023 BIRS workshop at Banff. Section 6 contains some final remarks.

2 Choice of test statistic

The observed data $y$ are plugged into the statistical model $p(y;\theta)$ to obtain the likelihood function ${\cal L}(\mu,\beta)$. The global maximum is at ${\cal L}(\widehat{\mu},\widehat{\beta})$. The "profile likelihood function" (of only $\mu$) is

${\cal L}(\mu,\widehat{\widehat{\beta}}),$   (1)

where, for a given $\mu$, the restricted estimate $\widehat{\widehat{\beta}}$ maximizes ${\cal L}$ for that $\mu$. It is also written as

$\sup_{\beta} {\cal L}(\mu,\beta).$   (2)

(The restricted estimate is denoted by $\widehat{\beta}_{\mu}$ in reidslac2003.) It is common in HEP to use as a test statistic the "profile likelihood ratio" (PLR) with the likelihood at the global maximum in the denominator,

$\dfrac{{\cal L}(\mu,\widehat{\widehat{\beta}})}{{\cal L}(\widehat{\mu},\widehat{\beta})}.$   (3)

The "integrated" (or "marginalized") likelihood function (also of only $\mu$) is

$\int {\cal L}(\mu,\beta)\,\pi(\beta)\,d\beta,$   (4)

where $\pi(\beta)$ is a weight function in the spirit of a Bayesian prior pdf. (More generally, it could be $\pi(\beta|\mu)$.)

One can then use either of these as if it were a 1D likelihood function ${\cal L}(\mu)$ with the nuisance parameters "eliminated". The question remains: what are the frequentist properties of intervals constructed for $\mu$? The long tradition in particle physics is to use the profile likelihood function and the asymptotic theorem of wilks to obtain approximate 1D confidence intervals (as well as multi-dimensional confidence regions in cases beyond the scope of this note). For a brief review, see Section 40.3.2.1 and Table 40.2 of pdg2022. For decades, a commonly used software tool has been the classic program (originally in FORTRAN and later in C and C++), "MINUIT: A System for Function Minimization and Analysis of the Parameter Errors and Correlations" (minuit). MINUIT refers to the uncertainties from profile likelihood ratios as "MINOS errors", which are discussed by minoscpc. I believe that the name "profile likelihood" entered the HEP mainstream in 2000, thanks to a talk by statistician Wolfgang Rolke on work with Angel López at the second meeting of what became the PhyStat series (rolkeclk; rolkelopez2001; Rolke2005). Interestingly, minoscpc and rolkeclk conceptualize the same math in two different ways, as discussed in my lectures (cousinshcpss2018); this may have delayed recognizing the connection (which was not immediate, even after Wolfgang's talk).

At some point, a few people started integrating out nuisance parameter(s), typically when there was a Gaussian/normal contribution to the likelihood, while treating the parameter of interest in a frequentist manner. One of the earlier examples (citing yet earlier examples) was by myself and Virgil Highland, "Incorporating systematic uncertainties into an upper limit" (cousinshighland). We looked at the case of a physics quantity (cross section) $\sigma = \mu_{\rm b}/L$, where $\mu_{\rm b}$ is an unknown Poisson mean and $L$ is a factor called the integrated luminosity. One observes a sampled value $n$ from the Poisson distribution. From $n$, one can obtain a usual frequentist upper confidence limit on $\mu_{\rm b}$. If $L$ is known exactly, one then simply scales this by $1/L$ to get an upper confidence limit on $\sigma$. We considered the case in which, instead, one has an independent unbiased measurement of $L$ with some uncertainty, yielding a likelihood function ${\cal L}(L)$. Multiplying ${\cal L}(L)$ by a prior pdf for $L$ yields a Bayesian posterior pdf for $L$. We advocated using this as a weight function for what we now call integrating/marginalizing over the nuisance parameter $L$. A power series expansion gave useful approximations.

Our motivation was to ameliorate an effect whereby confidence intervals derived from a discrete observable become shorter when a continuous nuisance parameter is added to the model. In the initially submitted draft, we did not understand that we were effectively grafting a Bayesian pdf onto a frequentist Poisson upper limit. Fred James sorted us out before publication. A more enlightened discussion is in my paper with Jim Linnemann and Jordan Tucker (CLT), referred to as CLT, discussed below. Luc Demortier noted that, in these simple cases, what we did was the same math as Bayesian statistician George Box's prior predictive $p$-value (box1980; CLT). Box calculated a tail probability after obtaining a Bayesian pdf, and we averaged a frequentist tail probability over a Bayesian pdf. The two calculations simply reverse the order of two integrals.

A key paper was submitted in the first year of LHC data taking, by Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells (CCGV), "Asymptotic formulae for likelihood-based tests of new physics" (ccgv2011). They provided useful asymptotic formulas for the PLR and their preferred variants, which have been widely used at the LHC and beyond. I believe that this had a significant side effect: since asymptotic distributions were not known in HEP for other test statistics, the default became the PLR variants in CCGV, even for small sample sizes, in order to have consistency between small and large sample sizes.

Meanwhile, in the statistics literature, there is a long history of criticism of the simple PLR. At PhyStat meetings, various higher-order approximations have been discussed: by Nancy Reid at SLAC in reidslac2003; by Nancy (reidoxford2005; reidoxford2) and me (cousinsoxford2005) at Oxford in 2005; by Anthony Davison in the Banff Challenge of 2006 (davisonbanffchallenge), at PhyStat-nu 2019 (davison2019), and at Banff last year (davisonbanff2023); and by statistician Alessandra Brazzale (brazzalephystat2019) and physicist Igor Volobouev and statistician Alex Trindade at PhyStat-DM in volobouevphystat2019. Virtually none of these developments have been adopted for widespread use in HEP, as far as I know.

Bayesian statisticians at PhyStat meetings have long advocated marginalizing nuisance parameters, even if we stick to frequentist treatment of the parameter of interest; see, e.g., James Berger, Brunero Liseo, and Robert Wolpert, "Integrated Likelihood Methods for Eliminating Nuisance Parameters" (berger1999). Berger et al., in their rejoinder to a comment by Edward Susko, wrote, "Dr. Susko finishes by suggesting that it might be useful to compute both profile and integrated likelihood answers in application, as a type of sensitivity study. It is probably the case that, if the two answers agree, then one can feel relatively assured in the validity of the answer. It is less clear what to think in the case of disagreement, however. If the answers are relatively precise and quite different, we would simply suspect that it is a situation with a 'bad' profile likelihood." I agree with Susko's suggestion to compute (and report) both profile and integrated likelihood answers, to build up experience and to see if Berger et al. are right in real cases relevant to HEP.

However, in my experience, marginalizing nuisance parameters is relatively rare at the LHC. This is for various reasons, in my opinion, including that the minimization required for profiling commonly seems less CPU-intensive than the integration required for marginalization.

Thus, the profile likelihood ratio (including variants with parameters near boundaries, as in CCGV (ccgv2011)) is the most common test statistic in use today at the LHC, and perhaps in all of HEP. Among other uses, it is the basis of the ordering for confidence intervals advocated by Gary Feldman and myself (feldman1998), which turned out to be the no-nuisance-parameter special case of inversion of the "exact" PLR test in the treatise "Kendall and Stuart" and later versions (kendall1999). I recall a long discussion between Gary Feldman and Kyle Cranmer at a pub at the Oxford PhyStat in 2005 regarding how to interpret Eqs. 22.4-22.7 of kendall1999. For part of that historical context, see Kyle's PhyStat contributions in cranmerslac2003 (Sec. 9) and cranmeroxford2005 (Sec. 4.3), as well as the contribution by Giovanni Punzi in punzioxford2005.

Various studies have been done regarding the elimination of nuisance parameters in the test statistic (typically a likelihood ratio), many concluding that results are relatively insensitive to profiling vs. marginalization, so that the choice can be made based on CPU time. See, for example, John Conway's talk and writeup at PhyStat in conwayphystat2011. Jan Conrad and Fredrik Tegenfeldt studied cases with favorable coverage (conradtegenfeldt2005; conradphystat2005). Notably, as Kyle Cranmer showed at Oxford in cranmeroxford2005, for Gaussian uncertainty on the mean $\mu_{\rm b}$, marginalizing the nuisance parameter can be badly anti-conservative (have severe undercoverage) when calculating signal significance (a different situation than the upper limits considered by myself and Highland (cousinshighland)). This was one of the motivations of the studies by CLT.

3 Sampling distributions of the test statistic(s) under various hypotheses

Rather than pursuing higher-order corrections to the PLR, the general practice in HEP has been to stick with the PLR test statistic, and to use Monte Carlo simulation (known as "toy MC") to obtain the finite-sample-size distribution(s) of the PLR under the null hypothesis, and under alternative(s) as desired. These results are often compared to the relevant asymptotic formulas, typically from CCGV.

So now one has the question: how should nuisance parameters be treated in the simulations? We need to keep in mind that we want correct frequentist coverage of the parameter of interest when nature is sampling the data using the unknown true values of the nuisance parameters. An important constraint (at least until now) is that it is generally thought to be impractical to perform MC simulation for each of many different sets of values of the nuisance parameters (especially in high dimensions). So how should one judiciously choose those values to be used in simulation?

The usual procedure (partially based on a desire to be "fully frequentist") is to "throw toys" using the profiled values of the nuisance parameters, i.e., their ML estimates conditional on whatever value of the parameter(s) of interest is being tested. At some point, this procedure was identified as the parametric bootstrap. (The first time that I recall was in an email from Luc Demortier in 2011, pointing to his contribution to the PhyStat-LHC 2007 Proceedings (luc2007phystat) and to the book by Bradley Efron and Robert Tibshirani (efrontibshirani).) The hope is that the profiled values of the nuisance parameters are a good enough proxy for the unknown true values, even for small sample size. I have since learned that there is a large literature on the parametric bootstrap and its convergence properties. See, for example, Luc's talk (luc2012slac), which describes a number of variants, with references, as well as Anthony Davison's talk at Banff (davisonbanff2023), emphasized in Section 5.
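As a minimal sketch of this procedure (with invented counts, and a deliberately simple test statistic standing in for the full PLR), consider the on/off problem of Section 4: maximizing the Poisson likelihoods under $\mathrm{H}_0$ ($\mu_{\rm s}=0$) gives the profiled value $\widehat{\widehat{\mu}}_{\rm b} = n_{\rm tot}/(1+\tau)$, a standard result, and toys are thrown at that value:

```python
import numpy as np

rng = np.random.default_rng(1)

# On/off problem with invented counts; tau is known, and we test H0: mu_s = 0.
n_on, n_off, tau = 10, 12, 2.0

# Profiled (conditional ML) estimate of the nuisance parameter under H0:
# maximizing the Poisson likelihoods with mu_s = 0 gives mu_b = n_tot / (1 + tau).
mu_b_hh = (n_on + n_off) / (1.0 + tau)

def stat(on, off):
    # Deliberately simple test statistic (excess over the background estimate),
    # standing in for the full PLR of Eq. (3).
    return on - off / tau

# Throw toys at the profiled value of the nuisance parameter: the parametric bootstrap.
ntoys = 100_000
toy_on = rng.poisson(mu_b_hh, ntoys)
toy_off = rng.poisson(tau * mu_b_hh, ntoys)
p_boot = np.mean(stat(toy_on, toy_off) >= stat(n_on, n_off))
```

The bootstrap $p$-value is then the fraction of toys with a statistic at least as extreme as the observed one.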

In an alternative procedure, sometimes called the Bayesian-frequentist hybrid, in each pseudo-experiment in the toy-throwing, a new set of values of the unknown nuisance parameters is sampled from their posterior distributions and used for sampling the data. The parameter of interest, however, continues to be treated in a frequentist fashion (confidence intervals, $p$-values, etc.). More details are in Section 4.4 below.

I would like to see more comparisons of these results with alternatives, particularly those from marginalizing the nuisance parameters with some judiciously chosen priors. I find it a bit comforting to take an average over a set of nuisance parameters in the neighborhood of the profiled values. Coverage studies can give guidance on the weight function ฯ€โข(ฮฒ)๐œ‹๐›ฝ\pi(\beta)italic_ฯ€ ( italic_ฮฒ ) used in the marginalization, presumably considering default priors thought to lead to good coverage.
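A coverage (size) study of this kind is straightforward to sketch: simulate many pseudo-experiments at a fixed true value of the nuisance parameter and record how often the procedure rejects at level $\alpha$. The example below uses invented true values, and uses the conditional binomial test of Section 4.1 purely as a concrete procedure to study:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

def pvalue_bi(n_on, n_off, tau):
    # Conditional binomial tail probability for H0: mu_s = 0 (the test of Sec. 4.1),
    # used here only as an example of a procedure whose size can be checked.
    return binom.sf(n_on - 1, n_on + n_off, 1.0 / (1.0 + tau))

# Size check under H0 for one invented true value of the nuisance parameter.
mu_b_true, tau, alpha, ntoys = 5.0, 2.0, 0.05, 200_000
n_on = rng.poisson(mu_b_true, ntoys)
n_off = rng.poisson(tau * mu_b_true, ntoys)
reject = pvalue_bi(n_on, n_off, tau) <= alpha
size = reject.mean()  # should stay at or below alpha (discreteness gives overcoverage)
```

Repeating such a scan over a grid of true nuisance-parameter values is how one maps out where a given weight function or procedure under- or over-covers.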

4 Example: ratio of Poisson means

I "cherry-picked" an example that appears often in the statistical literature; is an important prototype in HEP and astronomy; and is the topic of a paper for which I happen to be a co-author, including an interesting result for marginalization. This is the ratio of Poisson means, which maps onto the "on/off" problem in astronomy and the "signal region plus sideband" problem in HEP. I focus here mostly on the concepts, as the algebra is in the CLT paper (CLT), much of which was inspired by Jim Linnemann's talk at the SLAC PhyStat in linnemannslac2003.

As in Fig. 1, we have a histogram with two bins; in astronomy, these are counts with the telescope "on" and "off" the putative source, with observed contents $n_{\rm on}$ and $n_{\rm off}$, respectively. In HEP, there is a bin in which one looks for a signal, and a "sideband" bin that is presumed to have background but no signal. So the "on" bin has (potentially) signal and background events with unknown Poisson means $\mu_{\rm s}$ and $\mu_{\rm b}$, respectively, with total Poisson mean $\mu_{\rm s}+\mu_{\rm b}$. The "off" bin has only background, with unknown Poisson mean $\mu_{\rm off}$. The ratio of the total Poisson means in the two bins is denoted by $\lambda$,

$\lambda = \dfrac{\mu_{\rm off}}{\mu_{\rm s}+\mu_{\rm b}}.$   (5)

In the simple version discussed here, we assume that the ratio of the Poisson means of background in the two bins, $\tau = \mu_{\rm off}/\mu_{\rm b}$, is precisely known. In astronomy, this could be the ratio of observing times; in HEP, one might rely on simulation or control regions in the data. (If $\tau$ is determined with non-negligible uncertainty from two more control regions, as in the so-called ABCD method in HEP, more issues arise.) In a search for a signal, we test the null hypothesis $\mathrm{H}_0$: $\mu_{\rm s}=0$, which we can rephrase as $\mathrm{H}_0$: $\lambda=\tau$. One can choose the nuisance parameter to be either $\mu_{\rm off}$ or $\mu_{\rm b}$, and we choose the latter.

4.1 $Z_{\rm Bi}$: conditioning on $n_{\rm tot}$ and using Clopper-Pearson intervals

The standard frequentist solution, invoking "conditioning" (jamesroos; reid1995; CLT), is based on the fact that the total number of counts, $n_{\rm tot} = n_{\rm on} + n_{\rm off}$, provides information about the uncertainty on the ratio of Poisson means, but no information about the ratio itself. Thus, $n_{\rm tot}$ is known as an ancillary statistic, and most statisticians agree that one should treat $n_{\rm tot}$ as fixed, i.e., calculate (binomial) probabilities of $n_{\rm on}$ and $n_{\rm off}$ conditional on the observed $n_{\rm tot}$. We can again rephrase the null hypothesis in terms of the binomial parameter $\rho$, namely $\mathrm{H}_0$: $\rho = \mu_{\rm b}/(\mu_{\rm b}+\tau\mu_{\rm b}) = 1/(1+\tau)$. So we do a binomial test of $\rho = 1/(1+\tau)$, given data $n_{\rm on}$, $n_{\rm tot}$. The $p$-value $p_{\rm Bi}$ is obtained by inversion of binomial confidence intervals.

There are many available choices for sets of binomial confidence intervals. For $p_{\rm Bi}$, CLT adheres to the standard in HEP and follows the "exact" construction of clopper1934, with the guarantee of no undercoverage. Note, however, that "exact", as commonly used in situations such as this (going back to Fisher), means that a small-sample construction (as opposed to asymptotic theory) was used; beware that in a discrete situation such as ours, there will be overcoverage, sometimes severe.

The p๐‘pitalic_p-value pBisubscript๐‘Bip_{\rm Bi}italic_p start_POSTSUBSCRIPT roman_Bi end_POSTSUBSCRIPT is converted to a Z๐‘Zitalic_Z-value (sometimes called S๐‘†Sitalic_S in HEP) with a 1-tailed Gaussian convention to obtain ZBisubscript๐‘BiZ_{\rm Bi}italic_Z start_POSTSUBSCRIPT roman_Bi end_POSTSUBSCRIPT. (As in CLT, Z=2โขerfโˆ’1โข(1โˆ’2โขp)๐‘2superscripterf112๐‘Z=\sqrt{2}\,\text{erf}^{-1}(1-2p)italic_Z = square-root start_ARG 2 end_ARG erf start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( 1 - 2 italic_p ).

4.2 ๐’๐ฆ๐ข๐โข-โข๐subscript๐’๐ฆ๐ข๐-๐Z_{\mathrm{mid\text{-}P}}bold_italic_Z start_POSTSUBSCRIPT bold_mid - bold_P end_POSTSUBSCRIPT: conditioning on ๐’totsubscript๐’totn_{\textnormal{\scriptsize tot}}bold_italic_n start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT and using Lancaster mid-P intervals

After CLT, a paper by myself, Hymes, and Tucker (cht2010) advocated using Lancaster's mid-P binomial confidence intervals (lancaster61) to obtain confidence intervals for the ratio of Poisson means. For any fixed $n_{\rm tot}$, the mid-P intervals have over-coverage for some values of the binomial parameter and undercoverage for other values. In the unconditional ensemble that includes $n_{\rm tot}$ sampled from the Poisson distribution with the true total mean, the randomness in $n_{\rm tot}$ leads to excellent coverage of the true ratio of Poisson means.
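The mid-P tail probability differs from the Clopper-Pearson one only in counting half of the probability of the observed outcome, so it is always smaller and gives a larger $Z$. A sketch with invented counts:

```python
from scipy.stats import binom, norm

# Invented counts; tau known exactly.
n_on, n_off, tau = 10, 12, 2.0
n_tot = n_on + n_off
rho0 = 1.0 / (1.0 + tau)

# Lancaster mid-P: full probability of outcomes more extreme than n_on,
# plus half of the probability of the observed outcome itself.
p_mid = binom.sf(n_on, n_tot, rho0) + 0.5 * binom.pmf(n_on, n_tot, rho0)
z_mid = norm.isf(p_mid)
```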

4.3 $Z_{\mathrm{PL}}$: profile likelihood and asymptotic theory

The asymptotic result from Wilks's theorem (details in Section 5 of CLT and in pdg2022, Section 5) yields the $Z$-value $Z_{\mathrm{PL}}$ from the profile likelihood function.
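For the on/off problem this asymptotic $Z_{\mathrm{PL}}$ has a well-known closed form (the Li & Ma significance, the "lima" reference cited by CLT), which a short function can evaluate; a sketch of mine, with $\alpha=1/\tau$ the on/off exposure ratio:

```python
# Asymptotic profile-likelihood Z for the on/off problem (Li & Ma form),
# valid in the signal-like case n_on * tau > n_off.
from math import log, sqrt

def z_pl(n_on, n_off, tau):
    alpha = 1.0 / tau          # exposure ratio in Li & Ma's notation
    n_tot = n_on + n_off
    t_on = n_on * log((1 + alpha) / alpha * n_on / n_tot) if n_on > 0 else 0.0
    t_off = n_off * log((1 + alpha) * n_off / n_tot) if n_off > 0 else 0.0
    return sqrt(2.0 * (t_on + t_off))
```

For the three data sets of Table 1 below, this reproduces $Z_{\mathrm{PL}}=2.41$, $3.58$, and $3.63$.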

4.4 $Z_{\mathrm{BF}}$: marginalization of nuisance parameter $\mu_{\rm b}$ (Bayesian-frequentist hybrid)

First, consider the case in which $\mu_{\rm b}$ is known exactly. The $p$-value (denoted $p_{\rm P}$) for $\mathrm{H}_0$: $\mu_{\rm s}=0$ is the Poisson probability of obtaining $n_{\rm on}$ or more events in the "on" bin with true mean $\mu_{\rm b}$.

Then, introduce uncertainty in $\mu_{\rm b}$ in a Bayesian-like way, with a uniform prior for $\mu_{\rm b}$ and likelihood ${\cal L}(\mu_{\rm b})$ from the Poisson probability of observing $n_{\rm off}$ in the "off" bin, thus obtaining the posterior probability $P(\mu_{\rm b}|n_{\rm off})$, a Gamma distribution. Finally, take the weighted average of $p_{\rm P}$ over $P(\mu_{\rm b}|n_{\rm off})$ to obtain the Bayesian-frequentist hybrid $p$-value $p_{\mathrm{BF}}$:

$$p_{\mathrm{BF}}=\int p_{\rm P}\,P(\mu_{\rm b}|n_{\rm off})\,d\mu_{\rm b},\qquad(6)$$

and map to $Z_{\mathrm{BF}}$ (called $Z_{\Gamma}$ by CLT).

Jim Linnemann discovered numerically, and then proved, that $Z_{\mathrm{BF}}=Z_{\rm Bi}$ (!). His proof is in Appendix C of CLT. This justifies, from a frequentist point of view, the use of the uniform prior for $\mu_{\rm b}$, which is on shaky ground in real Bayesian theory.

The integral in Eq. (6) can be performed with the Monte Carlo method as follows. One samples $\mu_{\rm b}$ from the posterior pdf $P(\mu_{\rm b}|n_{\rm off})$ and computes the frequentist $p_{\rm P}$ for each sampled value of $\mu_{\rm b}$ (using the same observed $n_{\rm on}$). One then takes the arithmetic mean of these values of $p_{\rm P}$ to obtain $p_{\mathrm{BF}}$, and converts to $Z_{\mathrm{BF}}$ as desired. (Note that the $p_{\rm P}$'s are averaged, not the $Z_{\rm P}$'s.)
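This Monte Carlo recipe is easy to sketch (my illustration, not CLT's code): with a uniform prior, the posterior for $\mu_{\rm b}$ given $n_{\rm off}$ is a Gamma distribution with shape $n_{\rm off}+1$ and rate $\tau$:

```python
# Monte Carlo marginalization of mu_b: sample from the Gamma posterior,
# average the frequentist Poisson tail probabilities p_P (not the Z's).
import numpy as np
from scipy.stats import norm, poisson

def z_bf_mc(n_on, n_off, tau, n_samples=400_000, seed=1):
    rng = np.random.default_rng(seed)
    # Posterior P(mu_b | n_off) = Gamma(shape = n_off + 1, rate = tau).
    mu_b = rng.gamma(shape=n_off + 1, scale=1.0 / tau, size=n_samples)
    p_p = poisson.sf(n_on - 1, mu_b)   # P(N >= n_on | mu_b) for each sample
    p_bf = p_p.mean()                  # average p-values, not Z-values
    return p_bf, norm.isf(p_bf)

p_bf, z_bf = z_bf_mc(10, 2, 1.0)
# Linnemann's identity: p_BF equals the binomial tail p_Bi = 79/4096 ~ 0.0193,
# i.e., Z_BF ~ 2.07 (up to Monte Carlo fluctuations).
```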

4.5 $Z_{\mathrm{PB}}$: parametric bootstrap for nuisance parameter $\mu_{\rm b}$

For testing $\mathrm{H}_0$: $\mu_{\rm s}=0$, we first find the profile likelihood estimate of $\mu_{\rm b}$. See lima, cited by CLT, for the mathematics.

Conceptually, we use the data in both bins by noting, as above, that for $\mu_{\rm s}=0$, $n_{\rm tot}=n_{\rm on}+n_{\rm off}$ is a sample from a Poisson distribution with mean $\mu_{\rm b}+\mu_{\rm off}=\mu_{\rm b}(1+\tau)$. Solving for $\mu_{\rm b}$, the profiled MLE of $\mu_{\rm b}$ for $\mu_{\rm s}=0$ is $\widehat{\widehat{\mu}}_{\rm b}=(n_{\rm on}+n_{\rm off})/(1+\tau)$. The parametric bootstrap takes this value of $\mu_{\rm b}$ as truth and proceeds to calculate the $p$-value $p_{\mathrm{PB}}$ as for $p_{\rm P}$ above. This can be done by simulation or direct calculation, leading to $Z_{\mathrm{PB}}$.
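A sketch of the direct calculation (my own code, under my assumption that toy outcomes are ordered by the one-sided PLR statistic $q_0$, with ties at the observed value counted as at least as extreme):

```python
# Parametric bootstrap for the on/off problem by direct enumeration:
# take mu_b_hathat as truth, enumerate toys (n, m) in the on/off bins,
# and sum the probability of q0 at least as large as observed.
import numpy as np
from math import log
from scipy.stats import norm, poisson

def q0(n, m, tau):
    """One-sided PLR statistic for H0: mu_s = 0 (zero if background-like)."""
    if n * tau <= m:
        return 0.0
    alpha = 1.0 / tau
    tot = n + m
    t_on = n * log((1 + alpha) / alpha * n / tot) if n > 0 else 0.0
    t_off = m * log((1 + alpha) * m / tot) if m > 0 else 0.0
    return 2.0 * (t_on + t_off)

def z_pb(n_on, n_off, tau, n_max=80):
    mu_b = (n_on + n_off) / (1.0 + tau)   # profiled MLE under H0
    q_obs = q0(n_on, n_off, tau)
    pn = poisson.pmf(np.arange(n_max + 1), mu_b)        # on-bin toy pmf
    pm = poisson.pmf(np.arange(n_max + 1), tau * mu_b)  # off-bin toy pmf
    p = sum(pn[n] * pm[m]
            for n in range(n_max + 1) for m in range(n_max + 1)
            if q0(n, m, tau) >= q_obs - 1e-12)
    return p, norm.isf(p)
```

With this convention I find $Z_{\mathrm{PB}}\approx 2.30$ for the first column of Table 1, within about 0.02 of the quoted 2.32; the residual difference presumably reflects details of the ordering and tie conventions.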

4.6 Numerical examples

The calculated $Z$-values for the various recipes are shown in Table 1 for three chosen sets of values of $\tau$, $n_{\rm on}$, and $n_{\rm off}$.

Explaining the first set in detail, suppose $\tau=1.0$, $n_{\rm on}=10$, $n_{\rm off}=2$. Following the ROOT commands in Appendix E of CLT yields the "exact" $Z_{\rm Bi}=Z_{\mathrm{BF}}=2.07$. Alternatively, one can use the ROOT class TEfficiency (tefficiency) for two-tailed binomial confidence intervals. Recall that we need a one-tailed binomial test of $\rho=1/(1+\tau)$, given data $n_{\rm on}$, $n_{\rm tot}$. Applying the duality between tests and intervals (cousinshcpss2018), we seek the confidence level (CL) of the confidence interval (CI) that just includes the value $\rho=1/(1+1)=0.5$. By manually iterating $p$ and $\mathrm{CL}=1-2p$ in TEfficiency until the interval endpoint equals 0.5 (at $p=0.019287$), I arrived at the following ROOT commands (converting $p=0.019287$ to $Z$ as in CLT).

double n_on = 10;
double n_off = 2;
double p = 0.019287;
double endpoint = TEfficiency::ClopperPearson(n_on + n_off, n_on, 1. - 2.*p, false);
endpoint

Then I could switch to TEfficiency::MidPInterval to get the Lancaster mid-P interval and similarly obtain $Z_{\mathrm{mid\text{-}P}}=2.28$.

The asymptotic result from Wilks's theorem gives the value from the profile likelihood function, $Z_{\mathrm{PL}}=2.41$.

For the parametric bootstrap, for $\mu_{\rm s}=0$ and $n_{\rm tot}=12$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=6$, and so $\mu_{\rm off}=6$. So I generated toys with $\mu=6$ in each bin (!), to find $Z_{\mathrm{PB}}=2.32$.

The results for the second set of values ($\tau=2.0$, $n_{\rm on}=10$, and $n_{\rm off}=2$) are found similarly. For the parametric bootstrap, for $\mu_{\rm s}=0$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=4$, and so $\mu_{\rm off}=8$. So I generated toys with $\mu=4,8$ in the bins where we observed 10, 2 (!), finding $Z_{\mathrm{PB}}=3.46$.

The third set of values (added since my talk) has no events in the off sideband, $n_{\rm off}=0$. In the discussion following our talks, I expressed concern about generating toys for the parametric bootstrap in this situation. In fact, it is not a problem. With $\tau=2.0$ and $n_{\rm on}=6$, then for $\mu_{\rm s}=0$ and $n_{\rm tot}=6$, the profiled MLE is $\widehat{\widehat{\mu}}_{\rm b}=2$, and so $\mu_{\rm off}=4$. So I generated toys with $\mu=2,4$ in the bins where we observed 6, 0 (!), finding $Z_{\mathrm{PB}}=3.41$. (In more complicated models, zero events observed in multiple control regions might be an issue.)

Table 1: $Z$-values for the three example data sets.

  $\tau$                                                   1.0     2.0     2.0
  $n_{\rm on}$                                              10      10       6
  $n_{\rm off}$                                              2       2       0
  $n_{\rm tot}$                                             12      12       6
  $\widehat{\widehat{\mu}}_{\rm b}$ for $\mu_{\rm s}=0$      6       4       2
  $Z_{\rm Bi}=Z_{\mathrm{BF}}$, Clopper-Pearson            2.07    3.27    3.00
  $Z_{\mathrm{mid\text{-}P}}$, Lancaster mid-P             2.28    3.44    3.20
  $Z_{\mathrm{PL}}$                                        2.41    3.58    3.63
  $Z_{\mathrm{PB}}$                                        2.32    3.46    3.41

4.6.1 Discussion

Superficially, given that $Z_{\rm Bi}=Z_{\mathrm{BF}}$ with Clopper-Pearson intervals is "exact", and the parametric bootstrap gives larger $Z$, the latter looks anti-conservative. But recall that the so-called "exact" intervals overcover due to discreteness. Maybe $Z_{\mathrm{PB}}$ is actually better. We need a closer look.

4.7 Coverage studies

Following CLT, we consider a pair of true values $(\mu_{\rm b},\tau)$ and, for that pair, consider all values of $(n_{\rm on},n_{\rm off})$, calculating both the probability of obtaining those data and the computed $Z$-value for each recipe. We consider some threshold value of claimed significance $Z_{\rm claim}$, say 3 or 5, and compute the probability that $Z\geq Z_{\rm claim}$ according to a chosen recipe; this is the true Type I error rate for that recipe and the significance level corresponding to that value of $Z_{\rm claim}$. One can then convert this true Type I error rate to a $Z$-value using the one-tailed formula, obtaining what we call $Z_{\rm true}$. This could be done by simulation, but we do direct calculations. Figure 2 has "heat maps" from CLT showing $Z_{\rm true}-Z_{\rm claim}$ for claims of $5\sigma$ for the $Z_{\rm Bi}$ and profile likelihood algorithms.
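The enumeration is simple enough to sketch for the $Z_{\rm Bi}$ recipe (my illustration; CLT's own calculations cover more recipes and a grid of true values):

```python
# True Type I error rate, expressed as Z_true, when claiming z_claim with
# the Z_Bi recipe: enumerate (n_on, n_off) under the true (mu_b, tau) with
# mu_s = 0, and sum the probability of the region where Z_Bi >= z_claim.
import numpy as np
from scipy.stats import binom, norm, poisson

def z_true_for_zbi(mu_b, tau, z_claim, n_max=60):
    pn = poisson.pmf(np.arange(n_max + 1), mu_b)        # on bin, H0 true
    pm = poisson.pmf(np.arange(n_max + 1), tau * mu_b)  # off bin
    rho = 1.0 / (1.0 + tau)
    type1 = 0.0
    for n_on in range(n_max + 1):
        for n_off in range(n_max + 1):
            p_bi = binom.sf(n_on - 1, n_on + n_off, rho)
            if norm.isf(p_bi) >= z_claim:
                type1 += pn[n_on] * pm[n_off]
    return norm.isf(type1)   # Z_true; >= z_claim means overcoverage
```

Since Clopper-Pearson guarantees no undercoverage for every $n_{\rm tot}$, this $Z_{\rm true}$ is never below $Z_{\rm claim}$.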

I calculated a couple of points using the parametric bootstrap $Z_{\mathrm{PB}}$: for $\tau=1.0$, $\mu_{\rm b}=10$, and $Z_{\rm claim}=3$, I found $Z_{\rm true}=3.0$ (!); for $\tau=1.0$, $\mu_{\rm b}=20$, and $Z_{\rm claim}=5$, I found $Z_{\rm true}=5.0$ (!). Thus, at least for these two points, any anti-conservatism of the parametric bootstrap is exactly canceled by the overcoverage due to discreteness. That is, I cherry-picked my example problem to be one in which I knew that marginalizing the nuisance parameter (with judicious prior) yielded the standard frequentist $Z$, while doing the parametric bootstrap as commonly done at the LHC would give higher $Z_{\mathrm{PB}}$. But in this example, the discrete nature of the data "saves" the coverage of $Z_{\mathrm{PB}}$!

Note also the similarity of $Z_{\mathrm{PB}}$ to $Z_{\mathrm{mid\text{-}P}}$. As mentioned, cht2010 found that Lancaster mid-P confidence intervals had excellent coverage in the unconditional ensemble, i.e., not conditioning on $n_{\rm tot}$.

4.8 If $\tau$ is not known exactly

An obvious next step is to relax the assumption that $\tau$ is known exactly, so that $\tau$ becomes a nuisance parameter. In HEP analyses, $\tau$ is often estimated from the observed data in two so-called control regions (bins), which are thought to have the same ratio of background means $\tau$ as the original two bins. The control regions are distinguished from the original two bins by using an additional statistic that is thought to be independent of the statistic used to define the original two bins. This is often referred to as the ABCD method, after the letters commonly designating the four bins. I would strongly encourage coverage studies for the various methods applied to this case. One could also study the case in which the estimate of $\tau$ is sampled from a lognormal distribution.

5 Discussion at Banff BIRS 2023

Among the many talks touching on the topic at Banff (davisonbanff2023), statistician Anthony Davison's was paired with mine; he pointed to works including his Banff challenge paper with Sartori, and described a parametric bootstrap, writing:

โ€œLee and Young (2005, Statistics and Probability Letters) and DiCiccio and Young (2008, 2010, Biometrika) show that this parametric bootstrap procedure has third-order relative accuracy in linear exponential families, both conditionally and unconditionally, and is closely linked to objective Bayes approaches.โ€

6 Final remarks

Overall, I am encouraged by what I learned in the last year, while preparing the talks and this writeup, regarding our widespread use of the parametric bootstrap with the PLR test statistic in HEP. Still, there is room to learn more (which I suspect I must leave to the younger generations). First, how much is HEP losing by not studying more thoroughly the higher-order asymptotic theory for the test statistic? Would that give more power, or at least reduce the need for parametric bootstrapping? Second, can one find relevant cases in HEP where marginalizing out-performs profiling? Is there an example having continuous data but otherwise similar to the example in this note, so that one can study the performance without the confounding complication of discreteness? Or, for those familiar with the ABCD method in HEP (where, as mentioned, $\tau$ is measured from the contents of two more control regions), are there pitfalls for the parametric bootstrap in that context? I would urge more exploration of these issues in real HEP analyses!

Acknowledgments

Many thanks go to the multitudes of physicists and statisticians from whom I have learned, and particularly to Louis Lyons for bringing us together regularly at PhyStat meetings since the year 2000 (and for comments on earlier drafts). I also thank him, Sara Algeri, and Olaf Behnke for organizing this discussion. This work was partially supported by the U.S. Department of Energy under Award Number DE-SC0009937.

Part II: Statisticianโ€™s View
by Larry Wasserman

1 Introduction

This paper arose from my presentation at the PhyStat meeting where Bob Cousins and I were asked to give our views on two methods for handling nuisance parameters when conducting statistical inference. This written account expands on my comments and goes into a little more detail.

I'll follow Bob's notation, where $\mu$ is the (real-valued) parameter of interest and $\beta$ is the vector of nuisance parameters. In his paper, Bob points out that marginalising nuisance parameters is rare at the LHC. This is true in statistics as well. There are a few cases where marginalising is used, such as random effects models where the number of parameters increases as the sample size increases. But even profile likelihood is not that common. I would say that the most common method in statistics is the Wald confidence interval $\widehat{\mu}\pm z_{\alpha/2}s_{n}$, where $\widehat{\mu}$ is the maximum likelihood estimator, $s_{n}$ is the standard error (usually obtained from the Fisher information matrix), and $z_{\alpha/2}$ is the $\alpha/2$ upper quantile of a standard Normal distribution. (Note that for the Wald interval, one simply inserts point estimates of the nuisance parameters in the formula for $s_{n}$.)
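For Bob's on/off example, a Wald interval for $\mu_{\rm s}$ can be sketched as follows (my illustration, not from the text: the MLE is $\widehat{\mu}_{\rm s}=n_{\rm on}-n_{\rm off}/\tau$ and the plug-in variance estimate is $n_{\rm on}+n_{\rm off}/\tau^{2}$):

```python
# Wald interval for the signal mean mu_s in the on/off problem, with the
# nuisance parameter handled by plugging point estimates into the
# standard-error formula.
from scipy.stats import norm

def wald_interval_mu_s(n_on, n_off, tau, alpha=0.05):
    mu_hat = n_on - n_off / tau                 # MLE of mu_s
    s = (n_on + n_off / tau ** 2) ** 0.5        # plug-in standard error
    z = norm.isf(alpha / 2)                     # z_{alpha/2}
    return mu_hat - z * s, mu_hat + z * s
```

For $n_{\rm on}=10$, $n_{\rm off}=2$, $\tau=1$ this gives roughly $8\pm 6.8$; like all Wald intervals, it relies on asymptotic Normality and can misbehave near boundaries.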

Let $Y_{1},\ldots,Y_{n}$ be a random sample from a density $p_{\theta}$ which belongs to a model $(p_{\theta}:\ \theta\in\Theta)$. We decompose the parameter as $\theta=(\mu,\beta)$, where $\mu\in\mathbb{R}$ is the parameter of interest and $\beta\in\mathbb{R}^{k}$ is a vector of nuisance parameters. We want to construct a confidence set $C_{n}\equiv C_{n}(Y_{1},\ldots,Y_{n})$ such that

$$P_{\theta}(\mu\in C_{n})\geq 1-\alpha\qquad\text{for all }\theta\qquad(7)$$

or, equivalently,

$$\inf_{\theta}P_{\theta}(\mu\in C_{n})\geq 1-\alpha.$$

We may also want to test

$$H_{0}:\ \mu=\mu_{0}\qquad\text{versus}\qquad H_{1}:\ \mu\neq\mu_{0};$$

however, for simplicity, I'll focus on confidence sets. The present discussion is on likelihood-based methods. In particular, we may use the profile likelihood

$${\cal L}_{p}(\mu)=\sup_{\beta}{\cal L}(\mu,\beta)$$

or the marginalised (or integrated) likelihood

$${\cal L}_{m}(\mu)=\int{\cal L}(\mu,\beta)\,\pi(\beta|\mu)\,d\beta$$

where $\pi(\beta|\mu)$ is a prior for $\beta$. In my view, we cannot say whether a likelihood is good or bad without saying what we will do with the likelihood. I assume that we are using the likelihood to construct confidence intervals and tests. Then judging whether a likelihood is good or bad requires assessing the quality of the confidence sets obtained from the likelihood. The usual central $1-\alpha$ confidence set based on ${\cal L}_{p}$ is

$$C=\Bigl\{\mu:\ {\cal L}_{p}(\mu)\geq{\cal L}_{p}(\widehat{\mu})\,e^{-c/2}\Bigr\}=\Bigl\{\mu:\ \ell_{p}(\mu)\geq\ell_{p}(\widehat{\mu})-\frac{c}{2}\Bigr\}\qquad(8)$$

where $\widehat{\theta}=(\widehat{\mu},\widehat{\beta})$ is the maximum likelihood estimate, $\ell_{p}=\log{\cal L}_{p}$, and $c$ is the $1-\alpha$ quantile of a $\chi^{2}_{1}$ distribution.
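A grid-based sketch of this construction for the Poisson on/off model (my illustration; in serious use a root-finder would replace the grids, and the grid ranges here are arbitrary choices):

```python
# Profile-likelihood confidence set (Eq. 8) for mu_s in the on/off model,
# maximising over the nuisance parameter mu_b on a grid.
import numpy as np
from scipy.stats import chi2, poisson

def profile_loglik(mu_s, n_on, n_off, tau):
    mu_b = np.linspace(1e-6, 30.0, 3001)        # nuisance-parameter grid
    ll = (poisson.logpmf(n_on, mu_s + mu_b)
          + poisson.logpmf(n_off, tau * mu_b))
    return ll.max()                              # profile over mu_b

def wilks_interval(n_on, n_off, tau, alpha=0.05):
    c = chi2.isf(alpha, df=1)                    # 1 - alpha quantile of chi2_1
    grid = np.linspace(0.0, 30.0, 601)           # mu_s scan
    lp = np.array([profile_loglik(m, n_on, n_off, tau) for m in grid])
    keep = grid[lp >= lp.max() - c / 2.0]        # Eq. (8)
    return keep.min(), keep.max()
```

For $n_{\rm on}=10$, $n_{\rm off}=2$, $\tau=1$ the 95% set excludes $\mu_{\rm s}=0$, consistent with Bob's $Z_{\mathrm{PL}}=2.41>1.96$ for that data set.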

2 Profile or Marginalise?

We want confidence intervals with correct coverage, which means that $P_{\theta}(\mu\in C)\geq 1-\alpha$ for all $\theta$. In this case we say that the confidence set is valid. Among valid intervals, we may want one that is shortest. A valid interval with shortest length is efficient. (But this is a scale-dependent notion.)

If the sample size is large and certain regularity conditions hold, then the profile-based confidence set in (8) will be valid and efficient. So in these well-behaved situations, I don't see any reason for using ${\cal L}_{m}$. The main regularity conditions are:

(1) $p_{\theta}(y)$ is a smooth function of $\theta$.

(2) The range of the random variable $Y$ does not depend on $\theta$.

(3) The number of nuisance parameters does not increase as the sample size increases.

In a very interesting paper that Bob mentions, berger1999 provide examples where ${\cal L}_{p}$ apparently does not behave well. This is because, in each case, the sample size is small, the number of nuisance parameters is increasing, or some other regularity condition fails. There may be a few cases where the marginalised likelihood can fix these things if we use a carefully chosen prior, but such cases are rare. We might improve the accuracy of the coverage by using higher-order asymptotics. But this is rarely done in practice. Such methods are not easy to apply, and they won't necessarily help if there are violations of the regularity conditions.

Sometimes we can use specific tricks, like conditioning on a well-chosen statistic, as Bob does in the Poisson ON/OFF example. These methods are useful when they are available, but they tend to be very problem specific. Bootstrapping and subsampling are also useful in some cases. A common misconception is that the bootstrap is a finite-sample method. It is not. These methods are asymptotic, and they do still rely on regularity conditions.

In the next two sections I'll discuss two approaches for getting valid confidence sets without relying on asymptotic approximations or regularity conditions.

3 Universal Inference

Universal inference (wasserman2020universal) is a method for constructing confidence sets from the likelihood that does not rely on asymptotic approximations or on regularity conditions. The method works as follows.

1. Choose a large number $B$.

2. For $j = 1, \ldots, B$:

   (a) Split the data into two disjoint groups $\mathcal{D}_0$ and $\mathcal{D}_1$.

   (b) Let $\widehat{\theta} = (\widehat{\mu}, \widehat{\beta})$ be any estimate of $\theta$ using $\mathcal{D}_1$.

   (c) Let
   $$T_j(\mu) = \frac{\mathcal{L}_0(\widehat{\mu},\widehat{\beta})}{\sup_{\beta}\mathcal{L}_0(\mu,\beta)}$$
   where $\mathcal{L}_0(\theta)$ is the likelihood constructed from $\mathcal{D}_0$.

3. Let $T(\mu) = B^{-1}\sum_{j=1}^{B} T_j(\mu)$.

4. Return $C_n = \{\mu:\ T(\mu) \leq 1/\alpha\}$.

This confidence set is valid, that is,

$$P_\theta(\mu \in C_n) \geq 1 - \alpha \qquad \text{for all } \theta,$$

and this holds for any sample size. It does not rely on large-sample approximations and it does not require any regularity conditions. The number of splits $B$ can be any number; even $B = 1$ is allowed. The advantage of using a large $B$ is that it reduces the randomness of the procedure. The disadvantage of Universal inference is that one needs to compute the likelihood many times, which can be computationally expensive.
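For concreteness, the recipe above can be sketched in a few lines. This is a toy illustration, not code from the paper: I assume a Normal model with the mean $\mu$ as the parameter of interest and the standard deviation as the nuisance $\beta$, and the function name, grid, and split scheme are my own choices.

```python
import numpy as np
from scipy import stats

def universal_ci(y, mu_grid, alpha=0.05, B=10, seed=0):
    """Universal (split-LRT) confidence set for a Normal mean, with the
    standard deviation as nuisance parameter. Toy sketch only."""
    rng = np.random.default_rng(seed)
    n = len(y)
    T = np.zeros(len(mu_grid))
    for _ in range(B):
        idx = rng.permutation(n)
        d1, d0 = y[idx[: n // 2]], y[idx[n // 2:]]
        mu1, s1 = d1.mean(), d1.std()                 # step (b): estimate on D1
        num = stats.norm.logpdf(d0, mu1, s1).sum()    # L0 at the D1 estimate
        for k, mu in enumerate(mu_grid):
            s0 = np.sqrt(((d0 - mu) ** 2).mean())     # profile sigma on D0 given mu
            den = stats.norm.logpdf(d0, mu, s0).sum()
            T[k] += np.exp(num - den) / B             # average the split statistics
    return mu_grid[T <= 1.0 / alpha]                  # {mu : T(mu) <= 1/alpha}
```

Because the validity argument is a simple Markov-inequality bound, no asymptotics enter anywhere in this sketch; the price is that the resulting set is somewhat wider than a Wald interval.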

4 Simulation Based Inference

As the name implies, simulation based inference (SBI) uses simulations to conduct inference. It is usually used to deal with intractable likelihoods (dalmasso2021likelihood; masserano2023simulator; al2024amortized; cranmer2015; cranmer2020; thomas2022; xie2024; masserano2022simulation; stanley2023), but it can be used as a way to get valid inference for any model. The idea is to first get a confidence set for $\theta$ by inverting a test. Let $\theta_*$ denote the true value of $\theta$. For each $\theta$ we test the hypothesis $H_0:\ \theta_* = \theta$. This test is based on some test statistic $T(\theta, Y_1, \ldots, Y_n)$. The p-value for this test is

pโข(ฮธ)=pโข(ฮผ,ฮฒ)=Pฮธโข(Tโข(ฮธ,Y1โˆ—,โ€ฆ,Ynโˆ—)โ‰ฅTโข(ฮธ,Y1,โ€ฆ,Yn))๐‘๐œƒ๐‘๐œ‡๐›ฝsubscript๐‘ƒ๐œƒ๐‘‡๐œƒsuperscriptsubscript๐‘Œ1โ€ฆsuperscriptsubscript๐‘Œ๐‘›๐‘‡๐œƒsubscript๐‘Œ1โ€ฆsubscript๐‘Œ๐‘›p(\theta)=p(\mu,\beta)=P_{\theta}\Bigl{(}T(\theta,Y_{1}^{*},\ldots,Y_{n}^{*})%\geq T(\theta,Y_{1},\ldots,Y_{n})\Bigr{)}italic_p ( italic_ฮธ ) = italic_p ( italic_ฮผ , italic_ฮฒ ) = italic_P start_POSTSUBSCRIPT italic_ฮธ end_POSTSUBSCRIPT ( italic_T ( italic_ฮธ , italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT โˆ— end_POSTSUPERSCRIPT , โ€ฆ , italic_Y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT โˆ— end_POSTSUPERSCRIPT ) โ‰ฅ italic_T ( italic_ฮธ , italic_Y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , โ€ฆ , italic_Y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) )

where $Y_1^*, \ldots, Y_n^* \sim p_\theta$. Then $\widetilde{C} = \{\theta:\ p(\theta) \geq \alpha\}$ is an exact $1-\alpha$ confidence set for $\theta$. The confidence set for $\mu$ is the projection

$$C = \Bigl\{\mu:\ \theta = (\mu,\beta) \in \widetilde{C} \ \text{for some}\ \beta\Bigr\} = \Bigl\{\mu:\ \sup_\beta p(\mu,\beta) \geq \alpha\Bigr\}.$$

Such projected confidence sets can be large, but they have correct coverage. In SBI, we sample a set of values $\theta_1, \ldots, \theta_B$ from some distribution. For each $\theta_j$ we draw a sample $Y_1^*, \ldots, Y_n^* \sim p_{\theta_j}$. Then we define

$$I_j = I\Bigl(T(\theta_j, Y_1^*, \ldots, Y_n^*) \geq T(\theta_j, Y_1, \ldots, Y_n)\Bigr)$$

which is 1 if $T(\theta_j, Y_1^*, \ldots, Y_n^*) \geq T(\theta_j, Y_1, \ldots, Y_n)$ and 0 otherwise. Note that

pโข(ฮธj)=Pฮธjโข(Ij=1)=๐”ผโข[Ij|ฮธj].๐‘subscript๐œƒ๐‘—subscript๐‘ƒsubscript๐œƒ๐‘—subscript๐ผ๐‘—1๐”ผdelimited-[]conditionalsubscript๐ผ๐‘—subscript๐œƒ๐‘—p(\theta_{j})=P_{\theta_{j}}(I_{j}=1)=\mbox{$\mathbb{E}$}[I_{j}|\theta_{j}].italic_p ( italic_ฮธ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_P start_POSTSUBSCRIPT italic_ฮธ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = 1 ) = blackboard_E [ italic_I start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_ฮธ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ] .

So $I_j$ is an unbiased estimate of $p(\theta_j)$. But we can't estimate $p(\theta_j)$ using a single observation $I_j$. Instead, we perform a nonparametric regression of the $I_j$'s on the $\theta_j$'s, which allows us to use nearby values. For example, we can use the kernel regression estimator

p^โข(ฮธ)=โˆ‘jIjโขKhโข(ฮธjโˆ’ฮธ)โˆ‘jKhโข(ฮธjโˆ’ฮธ)^๐‘๐œƒsubscript๐‘—subscript๐ผ๐‘—subscript๐พโ„Žsubscript๐œƒ๐‘—๐œƒsubscript๐‘—subscript๐พโ„Žsubscript๐œƒ๐‘—๐œƒ\widehat{p}(\theta)=\frac{\sum_{j}I_{j}K_{h}(\theta_{j}-\theta)}{\sum_{j}K_{h}%(\theta_{j}-\theta)}over^ start_ARG italic_p end_ARG ( italic_ฮธ ) = divide start_ARG โˆ‘ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_ฮธ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_ฮธ ) end_ARG start_ARG โˆ‘ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_ฮธ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_ฮธ ) end_ARG

where $K_h$ is a kernel with bandwidth $h$, such as $K_h(z) = e^{-\|z\|^2/(2h^2)}$. Here are the steps in detail.

1. Choose any function $T(\theta, Y_1, \ldots, Y_n)$.

2. Draw $\theta_1, \ldots, \theta_B$ from any distribution $h(\theta)$ with full support. This can be the likelihood, but it can be anything.

3. For $j = 1, \ldots, B$: draw $Y_1^*, \ldots, Y_n^* \sim p_{\theta_j}$ and compute $T_j = T(\theta_j, Y_1^*, \ldots, Y_n^*)$.

4. Let $I_j = I(T_j \geq T(\theta_j, Y_1, \ldots, Y_n))$ for $j = 1, \ldots, B$. Here, $I_j = 1$ if $T_j \geq T(\theta_j, Y_1, \ldots, Y_n)$ and $I_j = 0$ otherwise.

5. Estimate $p(\theta_j) = \mathbb{E}[I_j \mid \theta_j]$ by regressing $I_1, \ldots, I_B$ on $\theta_1, \ldots, \theta_B$ to get $\widehat{p}(\theta) = \widehat{p}(\mu, \beta)$.

6. Return $C = \{\mu:\ \sup_\beta \widehat{p}(\mu,\beta) \geq \alpha\}$.

We have that $P_\mu(\mu \in C_n) \to 1 - \alpha$ as $B \to \infty$. The test statistic $T(\theta, Y_1, \ldots, Y_n)$ can be anything. A natural choice is the likelihood function $\mathcal{L}(\theta)$. In fact, even if the likelihood is intractable, it too can be estimated using the simulations. See (dalmasso2021likelihood; masserano2023simulator; al2024amortized; cranmer2015; cranmer2020; thomas2022) for details. A related method is given in xie2022.
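As a concrete toy illustration of steps 1-5, here is the recipe for a scalar $\theta$ with no nuisance parameter, so the projection in step 6 is trivial. The function name, the uniform proposal, and the bandwidth are my own choices, not from the papers cited above.

```python
import numpy as np

def sbi_confidence_set(y, grid, simulate, T, B=4000, h=0.1, alpha=0.05, seed=0):
    """Simulation based confidence set for a scalar theta:
    simulate test statistics under each proposed theta (steps 2-3),
    record the indicator I_j (step 4), smooth the I_j's by kernel
    regression to estimate p(theta) (step 5), then invert the test."""
    rng = np.random.default_rng(seed)
    n = len(y)
    thetas = rng.uniform(grid.min(), grid.max(), size=B)            # proposal h(theta)
    T_sim = np.array([T(t, simulate(t, n, rng)) for t in thetas])   # T under p_theta
    T_obs = np.array([T(t, y) for t in thetas])                     # T on the data
    I = (T_sim >= T_obs).astype(float)                              # the I_j's
    K = np.exp(-(thetas[None, :] - grid[:, None]) ** 2 / (2 * h ** 2))
    p_hat = (K @ I) / K.sum(axis=1)                                 # kernel regression
    return grid[p_hat >= alpha]                                     # invert the test

# Toy model: Y_i ~ N(theta, 1) with statistic T = |mean(y) - theta|
simulate = lambda t, n, rng: rng.normal(t, 1.0, n)
T = lambda t, y: abs(np.mean(y) - t)
```

Nothing here uses the likelihood in closed form: only the ability to simulate from $p_\theta$, which is the whole point of SBI.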

5 Optimality

Suppose we have more than one valid method. How do we choose between them? We could try to optimize some notion of efficiency. For example, we could try to minimize the expected length of the confidence interval. However, this measure is not transformation invariant, so one has to choose a scale. The marginalised likelihood could potentially lead to shorter intervals if the prior happens to be concentrated near the true parameter value but, in general, this will not be the case. (hoff2022 discusses methods to include prior information while maintaining frequentist coverage guarantees.) For testing, we usually choose the test with the highest power. But generally there is no test with highest power at all alternative values. One has to choose a specific alternative, or one puts a prior on $\theta$ and maximizes the average power.

Again, in the large-sample, low-dimension regime the choice of procedure might not have much practical impact. In small-sample situations, careful simulation studies are probably the best approach to assessing the quality of competing methods.

Of course, there may be other relevant criteria such as simplicity and computational tractability. These need to be factored in as well.

6 Why Likelihood?

There are cases where neither the profile likelihood nor the marginalised likelihood should be used. Here is a simple example. Let $c_1, \ldots, c_n$ be a set of fixed, unknown constants such that $0 \leq c_i \leq 1$. Let $\theta = n^{-1}\sum_i c_i$. Let $S_1, \ldots, S_n$ be $n$ independent flips of a coin where $P(S_i = 1) = P(S_i = 0) = 1/2$. If $S_i = 1$ you get to see $c_i$. If $S_i = 0$ then $c_i$ is unobserved. Let

$$\widehat{\theta} = \frac{2}{n}\sum_{i=1}^{n} c_i S_i.$$

Then $\mathbb{E}[\widehat{\theta}] = \theta$ and a simple $1-\alpha$ confidence interval for $\theta$ is $\widehat{\theta} \pm z_{\alpha/2}/\sqrt{n}$. The likelihood is the probability of the $n$ coin flips, which is $(1/2)^n$; it does not even depend on $\theta$. In other words, the likelihood function is flat and contains no information. What's going on here? The variables $S_1, \ldots, S_n$ are ancillary, which means their distribution does not depend on the unknown parameter. Ancillary statistics do not appear in the likelihood. But the estimator above does use the ancillaries, which shows that the ancillaries contain valuable information. So likelihood-based inference can fail when there are useful ancillaries. This issue comes up in real problems such as analyzing randomized clinical trials, where the random assignment of subjects to treatment or control is ancillary.
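A quick simulation confirms that $\widehat{\theta}$ recovers $\theta$ even though the likelihood is flat. The setup is my own toy version of the example: the $c_i$ are drawn once and then treated as fixed constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
c = rng.uniform(0.0, 1.0, n)          # fixed, unknown constants in [0, 1]
theta = c.mean()                      # the estimand: the average of the c_i
S = rng.integers(0, 2, n)             # fair coin flips; c_i is seen only if S_i = 1
theta_hat = (2.0 / n) * np.sum(c * S)  # the estimator uses only the revealed c_i
half_width = 1.96 / np.sqrt(n)        # the simple 95% interval from the text
```

The estimate lands well inside $\widehat{\theta} \pm 1.96/\sqrt{n}$ of the true $\theta$, even though no likelihood-based procedure could distinguish any value of $\theta$ from any other here.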

Another example is Example 7 from berger1999. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ and suppose that the parameter of interest is $\mu = \theta/\sigma$. berger1999 shows that both the profile and marginalised likelihood functions behave poorly here. But there is no need to use either. We can define a confidence set $C = \{\theta/\sigma:\ (\theta,\sigma) \in \widetilde{C}\}$ where $\widetilde{C}$ is a joint confidence set for $(\theta,\sigma)$. This is a valid confidence set which uses neither $\mathcal{L}_p$ nor $\mathcal{L}_m$.

In some cases, we might want to use semiparametric methods instead of likelihood-based methods. Suppose, for example, that the nuisance parameter is infinite dimensional. For example, consider testing for the presence of a signal in the presence of a background. The data take the form

$$Y_1, \ldots, Y_n \sim (1-\lambda)b(y) + \lambda s(y)$$

where $0 \leq \lambda \leq 1$, $b$ is a known background distribution and $s$ is the signal. Suppose that $s$ is supported on a signal region $\Omega$ but is otherwise not known. We want to estimate $\lambda$ or test $H_0: \lambda = 0$. The unknown parameters are $(\lambda, s)$. Here, $s$ is any density such that $\int_\Omega s(y)\,dy = 1$. So $s$ is an infinite-dimensional nuisance parameter. In our previous notation, the parameter of interest $\mu$ is $\lambda$ and the nuisance parameter $\beta$ is $s$.

Models of this form might be better handled using semiparametric methods (tsiatis2006; kosorok2008). It is hard for me to judge whether such methods are useful in physics, but there may indeed be cases where they are. The main idea can be summarized as follows. Suppose that $\mu$ is the parameter of interest and $\beta$ is the (possibly infinite-dimensional) nuisance parameter. Any well-behaved estimator $\widehat{\mu}$ will have the property that $\sqrt{n}(\widehat{\mu}-\mu)$ converges to a Normal with mean 0 and some variance $\sigma^2$. Generally, $\sigma^2 \geq \mathbb{E}[\varphi^2(Y)]$ for a function $\varphi$ which is known as the efficient influence function. We then try to find $\widehat{\mu}$ so that its corresponding $\sigma^2$ is equal to $\mathbb{E}[\varphi^2(Y)]$. Such an estimator is said to be semiparametric efficient.

In our signal detection example, the efficient influence function is

ฯ†โข(y)=(1โˆ’ฮป)โขIฮฉโข(y)โˆ’Pโข(ฮฉ)โขIฮฉcโข(y)โˆซฮฉcbโข(y)โข๐‘‘y๐œ‘๐‘ฆ1๐œ†subscript๐ผฮฉ๐‘ฆ๐‘ƒฮฉsubscript๐ผsuperscriptฮฉ๐‘๐‘ฆsubscriptsuperscriptฮฉ๐‘๐‘๐‘ฆdifferential-d๐‘ฆ\varphi(y)=(1-\lambda)I_{\Omega}(y)-\frac{P(\Omega)I_{\Omega^{c}}(y)}{\int_{%\Omega^{c}}b(y)dy}italic_ฯ† ( italic_y ) = ( 1 - italic_ฮป ) italic_I start_POSTSUBSCRIPT roman_ฮฉ end_POSTSUBSCRIPT ( italic_y ) - divide start_ARG italic_P ( roman_ฮฉ ) italic_I start_POSTSUBSCRIPT roman_ฮฉ start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y ) end_ARG start_ARG โˆซ start_POSTSUBSCRIPT roman_ฮฉ start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_b ( italic_y ) italic_d italic_y end_ARG

where $P(\Omega) = P(Y \in \Omega)$, and the efficient estimator is simply

$$\widehat{\lambda} = 1 - \frac{n^{-1}\sum_i I_{\Omega^c}(Y_i)}{\int_{\Omega^c} b(y)\,dy}.$$

The point is that we have taken care of an infinite-dimensional nuisance parameter very easily, and we never mentioned the likelihood.
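The estimator $\widehat{\lambda}$ is essentially a one-liner. In this toy check (my own choices: uniform background on $[0,1]$ and signal region $\Omega = [0.4, 0.6]$, so $\int_{\Omega^c} b = 0.8$):

```python
import numpy as np

def lambda_hat(y, omega, b_outside):
    """Efficient estimator of the signal fraction:
    1 - (empirical fraction outside Omega) / (b-probability outside Omega)."""
    lo, hi = omega
    frac_outside = np.mean((y < lo) | (y > hi))
    return 1.0 - frac_outside / b_outside

# Toy data: background Uniform(0,1); signal supported on Omega = [0.4, 0.6]
rng = np.random.default_rng(0)
n, lam = 200_000, 0.3
is_signal = rng.random(n) < lam
y = np.where(is_signal, rng.uniform(0.4, 0.6, n), rng.uniform(0.0, 1.0, n))
est = lambda_hat(y, (0.4, 0.6), b_outside=0.8)
```

The estimator never evaluates the unknown signal density $s$: it only counts events outside the signal region, which is exactly how it sidesteps the infinite-dimensional nuisance parameter.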

Another reason to consider non-likelihood methods is robustness. Both the profile likelihood and the marginalised likelihood could lead to poor inferences if the model is misspecified. What should we do in such cases? One possibility is to use minimum Hellinger estimation. In this case we choose $\widehat{\theta}$ to minimize

h2โข(ฮธ)=โˆซ(pฮธโข(y)โˆ’p^โข(y))2โข๐‘‘ysuperscriptโ„Ž2๐œƒsuperscriptsubscript๐‘๐œƒ๐‘ฆ^๐‘๐‘ฆ2differential-d๐‘ฆh^{2}(\theta)=\int(\sqrt{p_{\theta}(y)}-\sqrt{\widehat{p}(y)})^{2}dyitalic_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_ฮธ ) = โˆซ ( square-root start_ARG italic_p start_POSTSUBSCRIPT italic_ฮธ end_POSTSUBSCRIPT ( italic_y ) end_ARG - square-root start_ARG over^ start_ARG italic_p end_ARG ( italic_y ) end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d italic_y

where $\widehat{p}(y)$ is some nonparametric density estimator. This estimator is asymptotically equivalent to the maximum likelihood estimator if the model is correct, but it is less sensitive to model misspecification (beran1977; lindsay1994).
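Here is a minimal sketch of minimum Hellinger estimation for a Normal location with known scale, using a Gaussian kernel density estimate for $\widehat{p}$. The grid sizes and the use of a plain grid search are my own choices; a careful application would tune the bandwidth and use a proper optimizer.

```python
import numpy as np
from scipy import stats

def min_hellinger_loc(y):
    """Minimum Hellinger estimate of theta in N(theta, 1): minimise the
    squared Hellinger distance between N(theta, 1) and a Gaussian KDE
    of the data, by grid search over theta."""
    kde = stats.gaussian_kde(y)                       # nonparametric p_hat
    ygrid = np.linspace(np.min(y) - 3.0, np.max(y) + 3.0, 800)
    dy = ygrid[1] - ygrid[0]
    sqrt_p_hat = np.sqrt(kde(ygrid))
    theta_grid = np.linspace(np.min(y), np.max(y), 400)
    h2 = [np.sum((np.sqrt(stats.norm.pdf(ygrid, t, 1.0)) - sqrt_p_hat) ** 2) * dy
          for t in theta_grid]                        # Riemann sum of h^2(theta)
    return theta_grid[int(np.argmin(h2))]
```

On well-specified data this tracks the MLE, while the square roots in $h^2$ damp the influence of observations in low-probability regions, which is the source of the robustness mentioned above.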

Another recent non-likelihood method is the HulC (hulc). We divide the data into $B = \log(2/\alpha)/\log(2)$ (rounded up to an integer) disjoint groups. Let $\widehat{\mu}_1, \ldots, \widehat{\mu}_B$ be the estimates from the $B$ groups and let $C = [\min_j \widehat{\mu}_j,\ \max_j \widehat{\mu}_j]$. Define the median bias

$$B_n = \frac{1}{2} - \min\Bigl\{P(\widehat{\mu} \geq \mu),\ P(\widehat{\mu} \leq \mu)\Bigr\}.$$

So $0 \leq B_n \leq 1/2$. In the extreme situation where $\widehat{\mu}$ is larger than the true value $\mu$ with probability 1, we have $B_n = 1/2$. Otherwise $B_n < 1/2$. If $B_n \to 0$, as is typically the case, we have $P(\mu \in C_n) \to 1 - \alpha$. This is simple and does not require profiling or marginalising a likelihood function.
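The construction is only a few lines of code. A sketch, with my own function name, assuming (as discussed above) that the estimator's median bias is negligible:

```python
import numpy as np

def hulc_interval(y, estimator, alpha=0.05, seed=0):
    """HulC interval: split the data into B = ceil(log2(2/alpha)) disjoint
    groups and return the range of the per-group estimates."""
    rng = np.random.default_rng(seed)
    B = int(np.ceil(np.log2(2.0 / alpha)))     # alpha = 0.05 gives B = 6
    groups = np.array_split(rng.permutation(len(y)), B)
    ests = np.array([estimator(y[g]) for g in groups])
    return ests.min(), ests.max()
```

With $\alpha = 0.05$ this gives $B = 6$, and the interval misses $\mu$ only if all six estimates fall on the same side of it, which for a median-unbiased estimator has probability $2(1/2)^6 = 1/32 \approx 0.03 \leq \alpha$.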

7 Conclusion

The confidence intervals and tests based on the profile likelihood should have good coverage and reasonable size as long as the sample size is large, the number of nuisance parameters is not too large, and the usual regularity conditions hold. If any of these conditions fails, it might be better to use alternative methods such as Universal inference, simulation based inference, semiparametric inference, or the HulC. The integrated likelihood may offer benefits in some special cases, but I don't believe it is useful in any generality.

I conclude with remarks about power. Methods like Universal inference, SBI, and the HulC give satisfying coverage guarantees, but these come with some loss of efficiency. The confidence sets from Universal inference typically shrink at the optimal rate but will nonetheless be larger than optimal confidence sets (if such optimal sets exist). As a benchmark, consider the confidence set for $Y_1, \ldots, Y_n \sim N(\mu, I)$ where $Y_i \in \mathbb{R}^d$. Of course we don't need Universal inference here, but this case allows for precise results. dunn2023 show that the confidence set has radius of order $\sqrt{d/n}$, which is optimal, but, compared to the best confidence set, the squared radius is about twice as large. More precisely, the ratio of squared radii is bounded by $(4\log(1/\alpha) + 4d)/(2\log(1/\alpha) + d - 5/2)$. Similarly, comparing the HulC to the Wald interval (when the latter is valid), we find that the length is expanded by a factor of $\sqrt{\log_2(\log_2(2/\alpha))}$, which is quite modest. For SBI it is hard to say much, since that method works with any statistic $T$ and the size of the set will depend on the particular choice of $T$. However, it is fair to say that when using generally applicable methods that provide correct coverage under weak conditions, there will be some price to pay. Evaluating these methods in some real physics problems would be very valuable. This would be a great project for statisticians and physicists to collaborate on.

8 Acknowledgments

Thanks to Sara Algeri, Louis Lyons andOlaf Behnke for organizing this discussion.Also, I would like to thankLouis Lyons and Bob Cousins for comments on an earlierdraft that led to improvements in the paper.
