Combine discrete p values (in Python)

This module provides a toolbox for combining p values of rank tests and other tests with a discrete null distribution.

When do you need this?

This module has a scope similar to SciPy’s combine_pvalues:

  • You have a dataset consisting of independent sub-datasets. (So this is not about multiple testing or pseudo-replication.)

  • For each sub-dataset, you have performed a test investigating the same null hypothesis. (Often, this is the same test and the sub-datasets only differ in size.)

  • There is no straightforward test to apply to the entire dataset.

  • You want a single p value for the null hypothesis taking into account the entire dataset, i.e., you want to combine your test results for the sub-datasets.

However, combine_pvalues assumes that the individual tests are continuous (see below for a definition), while applying it to discrete tests will yield a systematically wrong combined p value. For example, for Fisher’s method it systematically overestimates the p value, i.e., you may falsely accept the null hypothesis (false negative). This module addresses this and thus you should consider it if:

  • At least one of the sub-tests is discrete with a low number of possible p values. What is a “low number” depends on the details, but 30 almost always is.

  • The combined p value returned by combine_pvalues is not very low already.

Also see An extensive example for a hands-on example, where only combining p values with accounting for the discreteness of tests yield the correct result.

Also, as a side product, this module also implements Monte Carlo-based weighted variants of methods other than Stouffer’s, which combine_pvalues does not provide.

Discrete and continuous tests

If the null hypothesis of a given test holds, its p values are uniformly distributed on the interval \((0,1]\) in the sense that \(\text{CDF}(p_0) = P(p≤p_0) = p_0\). However, for some tests, there is a limited number of possible outcomes for a given sample size. For example, the only possible outcomes (p values) of the one-sided sign test for a sample size of 5 are \(\frac{ 1}{32}\), \(\frac{ 3}{16}\), \(\frac{ 1}{ 2}\), \(\frac{13}{16}\), \(\frac{31}{32}\), and \(1\), simply because five numbers can only have so many different (unordered) combinations of signs. For the purposes of this module, I call these tests discrete. By contrast, for a continous test, all values on the interval \((0,1]\) are possible outcomes (for any given sample size).

Discrete tests include all rank tests, since there is only a finite number of ways to rank a given number of samples. Moreover, they contain tests of bound integer data. The most relevant discrete tests are:

  • the sign test,

  • the Mann–Whitney U test,

  • Wilcoxon’s signed rank test,

  • any test based on a ranked correlation such as Kendall’s τ and Spearman’s ρ,

  • the Kruskal–Wallis test,

  • Fisher’s exact test and any other test for integer contingency tables.

Tests whose result continuously depends on the samples are continuous. The most relevant continuous tests are:

  • all flavours of the t test,

  • the Kolmogorov–Smirnov test,

  • the test for significance of Pearson’s r,

  • ANOVA.

How this module works

To correctly compute the combined p value, we need to take into account the null distributions of the individual tests, i.e., what p values are possible. This module determines these values for popular tests or lets you specify them yourself. Of course, if you have continuous tests in the mix, you can also include them. Either way, the relevant information is stored in a CTR object (“combinable test result”). These objects can then be combined using the combine function.

The difficulty for determining the combined p value is convolving the respective null distributions. While this is analytically possible for continuous tests or a small number of discrete tests, it is requires numerical approximations otherwise. To perform these approximations, we use a Monte Carlo simulation sampling combinations of individual p values. Thanks to modern computing and NumPy, it is easy to make the number of samples very high and the result very accurate.

Complements

In several cases, this module uses the complement q of a p value. For example, combining methods such as Pearson’s or Mudholkar’s and George’s use it as part of their statistics. For continuous tests, this complement is straightforwardly computed as \(q = 1-p\). However, for discrete tests this leads to implausible results, in particular if \(p=1\). To avoid this, this module uses for q the probability to observe such a p value or a higher one. In analogy to \(\text{CDF}(p_0) = P(p≤p_0) = p_0\), we have \(\text{CCDF}(p_0) = P(p≥0) = q\) (both under the null hypothesis). This applies whenever the complement of a p value is relevant.

A simple example

We have three pairs of independent datasets (A_1,A_2), (B_1,B_2), and (C_1,C_2). Our research hypothesis is that the values in the respective first dataset are lower. However, the datasets have different properties and thus cannot simply be pooled:

  • (A_1,A_2) is a paired dataset. Since we cannot assume any distribution, we want to apply the sign test.

  • (B_1,B_2) is an unpaired dataset (with unequal sizes). Again, we cannot assume a distribution. Hence we want to apply the Mann–Whitney U test.

  • (C_1,C_2) is a paired dataset, however we can assume that the differences to be normally distributed and thus apply the t test for two dependent samples.

We start with performing the sign test on A_1 and A_2. We create a CTR from this:

from combine_pvalues_discrete import CTR, combine
result_A = CTR.sign_test(A_1,A_2,alternative="less")

result_A now contains the p value of the sign test as well as the null distribution of possible p values. It is ready for being combined with other test results, but we have to create these first. We do so by doing something very similar with B_1 and B_2 and the Mann–Whitney U test:

result_B = CTR.mann_whitney_u(B_1,B_2,alternative="less")

Finally, we perform the t test on C_1 and C_2. Since the t test is a continuous test, we do not need a special constructor to create a CTR, but can use generic one using only the p value computed with an existing function, here scipy.stats.ttest_rel. The argument [] specifies that the test is continuous:

from scipy.stats import ttest_rel
p_C = ttest_rel(C_1,C_2).pvalue
result_C = CTR(p_C)

After we have performed the individual tests, we can obtain the compound p value using combine (which we imported earlier).

print( combine([result_A,result_B,result_C]) )

An extensive example

In this example, we illustrate the benefits of this module by comparing it with other approaches to the same dataset, all of which yield wrong results or consume a lot of time.

Suppose we want to to explore the effect on a drug on dogs. We expect our observable (by which we measure the effect of the drug) be more affected by the dog breed than by the drug, e.g., because a poodle is generally weaker than a mastiff. Therefore we group the dogs by breed a priori, creating sub-datasets. Each breed group gets further randomly split into a treatment and control group. Since not all dogs complete the study, our sub-datasets become very inhomogeneous in sample size.

Our data looks like this:

data = [
	( [8,13,37]      , [43,51]       ),
	( [60,68,46,45]  , [30]          ),
	( [92,97,98]     , [84,89]       ),
	( [14]           , [21,45,31,23] ),
	( [24,58,0,24,33], [65,51,61]    ),
	( [93,76,70,83]  , [84]          ),
	( [10,2]         , [28,36,11]    ),
	( [27]           , [38,58]       ),
	( [18]           , [12]          ),
	( [20,44,14,68]  , [73,22,80]    ),
]

Each pair represents one breed, with the first half being the control group and the second the treatment group. If our drug works as desired, the second half should exhibit higher values. Finally, due to the nature of our observable, we only want to use a ranked statistics.

Thus, we want to investigate the null hypothesis that for each sub-dataset, both samples are from the same distribution (or more precisely, the null hypothesis of the Mann–Whitney U test). The alternative hypothesis is that that the first pair of samples are from a distribution with a lower median.

First, suppose we discard our information on breeds and pool the control and treatment groups. We then apply the Mann–Whitney U test to the pooled samples. This way, we do not need to combine tests, but we lose statistical power.

from scipy.stats import mannwhitneyu

pooled_Xs = [ x for X,Y in data for x in X ]
pooled_Ys = [ y for X,Y in data for y in Y ]
print( mannwhitneyu(pooled_Xs,pooled_Ys,alternative="less") )
# MannwhitneyuResult(statistic=282.0, pvalue=0.30908071682819527)

Alternatively, we can summarise the samples in each pair by their median and use the sign test to compare the groups. Again, we discard information and lose statistical power:

# Summarizing data and sign test
import numpy as np
from combine_pvalues_discrete import sign_test

reduced_Xs = [ np.median(X) for X,Y in data ]
reduced_Ys = [ np.median(Y) for X,Y in data ]
print( sign_test( reduced_Xs, reduced_Ys, alternative="less" ) )
# SignTestResult(pvalue=0.171875, not_tied=10, statistic=3)

To properly take into account all information, we have to apply the Mann–Whitney U test to each pair (breed) and then combine the p values. SciPy’s combine_pvalues allows us to do this, but it requires continuous tests. Since the Mann–Whitney U test does not fulfil this requirement, we will overestimate the combined p value:

from scipy.stats import combine_pvalues, mannwhitneyu

pvalues = [ mannwhitneyu(X,Y,alternative="less").pvalue for X,Y in data ]
statistic,pvalue = combine_pvalues(pvalues,method="fisher")
print(statistic,pvalue)
# (27.447712265267114, 0.123131292229715)

Finally, by using this module, we can take into account the discreteness of tests, obtaining a correct combined p value:

from combine_pvalues_discrete import CTR, combine

ctrs = [ CTR.mann_whitney_u(X,Y,alternative="less") for X,Y in data ]
print( combine(ctrs,method="fisher") )
# Combined_P_Value(pvalue=0.0014229998577000142, std=1.1920046440408576e-05)

(We here use Fisher’s method, since it simplifies the following demonstration. In general, I don’t recommend it.)

Checking the Result

Let’s convince ourselves that the result of combine is actually correct. To this end, we first implement the statistic of Fisher’s method for combining Mann–Whitney U tests. Note how the result agrees with that of applying combine_pvalues above:

def fisher_statistic(dataset):
	pvalues = [ mannwhitneyu(X,Y,alternative="less").pvalue for X,Y in dataset ]
	return -2*np.sum(np.log(pvalues))

data_statistic = fisher_statistic(data)
print(data_statistic)
# 27.447712265267114

Next we implement a function that samples analogous datasets corresponding to our null hypothesis (surrogates). (Since we only care about the order of samples, we do not have to recreate the magnitude of values.)

rng = np.random.default_rng()
def null_sample(data):
	return [
		( rng.random(len(X)), rng.random(len(Y)) )
		for X,Y in data
	]

Finally we sample n=10000 times from our null model and estimate the p value by comparing the values of Fisher’s statistic for the null model and the original data.

n = 10000
null_statistic = [ fisher_statistic(null_sample(data)) for _ in range(n) ]
count = np.sum( null_statistic >= data_statistic )
print( (count+1)/(n+1) )
# 0.0016998300169983002

This confirms the low p value we obtained with combine above and that the p values obtained with the other methods were too high. You may note that this value does not agree with the result of combine from above. The reason for this is that the variability of the null-model approach is so high (on account of n being low) and obtaining a precision comparable to compare would require an excessive amount of time.

Implementing your own test

If you want to analyse a given dataset with a test that this module does not provide, you need to determine two things:

  • The p value of the test applied to your dataset.

  • A list of all possible p values that you test can yield for datasets with the same sample size(s).

You can use these as arguments of CTR’s default constructor.

The best way to find all possible p values is to get a rough understanding of the test statistics and look into an existing implementation of the test, so you don’t have to fully re-invent the wheel.

Note that individual tests should always be one-sided for the following reason: If you have two equally significant, but opposing sub-results, they should not add in effect, but cancel each other. This is not possible when you use two-sided subtests, since all information on the directionality of a result gets lost.

Example: Mann–Whitney U test

Suppose, we want to implement the Mann–Whitney U test (with the alternative “less”) into the CTR framework. (This test is already implemented, so we don’t actually need to do this.)

The test is named for its statistics U which can exhibit values between 0 and \(n·m\), where \(n\) and \(m\) are the sizes of the datasets that are compared. These U values are then translated into p values. If we look into SciPy’s implementation of this test, we find that it uses an implementation of the test’s null distribution (scipy.stats._mannwhitneyu._mwu_state), which we can exploit. Finally, we can use SciPy’s implementation to compute the p value of our given datasets.

With that we have all the ingredients to write a small function to return a combinable test result for a given pair of datasets:

from combine_pvalues_discrete import CTR
from scipy.stats._mannwhitneyu import _mwu_state, mannwhitneyu

def mwu_ctr(x,y):
	p = mannwhitneyu(x,y,method="exact",alternative="less").pvalue
	n,m = len(x),len(y)
	possible_ps = [ _mwu_state.cdf(U,n,m) for U in range(n*m+1) ]
	return CTR( p, possible_ps )

Why is the default combining method Mudholkar’s and George’s?

I assume here that you want to investigate the research hypothesis that all datasets are subject to the same trend. The trend may manifest more clearly in some of the datasets (and you don’t know which a priori), but it should not be inverted (other than by chance). In this case, you would perform one-sided subtests. (If you would consider both directions of trend a finding, the combination needs to be two-sided, not the subtests.)

If the p value of such a subtest is small, the sub-dataset exhibits the trend you hypothesised. Conversely, if the complement \(q ≈ 1-p\) of a subtest is small, the sub-dataset exhibits a trend opposite to what you hypothesised – with a p value q. (See Complements on how q is defined for the purposes of this module.) I think that the combined p values should reflect this, i.e., the complement q should indicate the significance of the opposite one-sided hypothesis (not the null hypothesis) just like the p value indicates the significance of the null hypothesis.

To achieve this, the combining method must be treating p and q in a symmetrical fashion. This also means that the following results exactly negate each other:

  • a subtest with \(p=p_0\).

  • a subtest with \(q=p_0\), i.e., \(p≈1-p_0\).

Only two methods fulfil this: the one by Mudholkar and George as well as the one by Stouffer. Since the latter’s statistics becomes infinite if \(p=1\) for any subtest (and thus cannot distinguish between this happening for one or almost all tests), I prefer Mudholkar’s and George’s method.

Supported Tests

Currently, this module supports:

  • the sign test,

  • the Mann–Whitney U test,

  • Fisher’s and Boschloo’s exact tests,

  • Spearman’s ρ and Kendall’s τ.

Ties are not supported in every case. If you require any further test or support for ties, please tell me.

Command reference

class CTR(p, all_ps=None)

CTR = combinable test result

Represents a single test result. Use the default constructor to implement a test yourself or use one of the class methods for the respective result.

Parameters
p

The p value yielded by the test for the investigated sub-dataset.

all_ps

An iterable containing all possible p values of the test for datasets with the same size as the dataset for this individual test. If None or empty, all p values will be considered possible, i.e., the test will be assumed to be continuous.

classmethod mann_whitney_u(x, y, **kwargs)

Creates an object representing the result of a single Mann–Whitney U test (using SciPy’s mannwhitneyu).

The two-sided test is not supported because it makes little sense in a combination scenario. Ties are not supported yet because I expect them not to occur in the scenarios that require test combinations (but I may be wrong about this) and they make things much more complicated.

Parameters
x,y

The two arrays of samples to compare.

kwargs

Further keyword arguments to be passed on to SciPy’s mannwhitneyu, such as alternative.

classmethod sign_test(x, y=0, alternative='less')

Creates an object representing the result of a single sign test.

Parameters
x,y

The two arrays of paired samples to compare. If y is a number, a one-sample sign test is performed with y as the median. With y as an iterable, the test is two-sided.

alternative: “less” or “greater”

The two-sided test is not supported because it makes little sense in a combination scenario.

classmethod spearmanr(x, y, alternative='greater', n_thresh=9)

Creates an object representing the result of a single Spearman’s ρ test. If the size of arrays n! is smaller than n_thresh, p values are exactly determined using a permutation test. Otherwise p values are computed using SciPy’s spearmanr, but with an imposed lower limit of n! and a uniform distribution of p values is assumed.

Parameters
x,y

The two arrays of samples to correlate.

alternative: “greater” or “less”
n_thresh:

Threshold under which a permutation test is used.

classmethod kendalltau(x, y, **kwargs)

Creates an object representing the result of a single Kendall’s τ test using SciPy’s kendalltau to compute p values.

NaNs and ties are not supported.

Parameters
x,y

The two arrays of samples to correlate.

alternative: “greater” or “less”
classmethod fisher_exact(C, alternative='less')

Creates an object representing the result of Fisher’s exact test for a single contingency table C. This is unrelated to Fisher’s method of combining p values.

Parameters
C

The contingency table.

alternative: “less” or “greater”
classmethod boschloo_exact(C, alternative='less', n=32, atol=1e-10)

Creates an object representing the result of Boschloo’s exact for a single contingency table C using SciPy’s implementation.

Parameters
C

The contingency table.

alternative: “less” or “greater”
n

The same parameter of SciPy’s boschloo_exact.

atol

p values that are closer than this are treated as identical.

combine(ctrs, weights=None, method='mudholkar_george', alternative='less', n_samples=10000000, sampling_method='proportional', rtol=1e-15, atol=1e-15, RNG=None)

Estimates the combined p value of combinable test results. Usually, this result is why you are using this module.

Parameters
ctrs: iterable of CTRs

The test results that shall be combined.

method: string or function

One of “fisher”, “pearson”, “mudholkar_george”, “stouffer”, “tippett”, “edgington”, “simes”, or a self-defined function.

In the latter case, the function can have the following arguments (which must be named as given): * A two-dimensional array p containing the p values. * A two-dimensional array q containing their complements. * A one-dimensional array w containing the weights. The function must return the statistics computed along the zero-th axis. For example for the weighted Mudholkar–George method, this function would be lambda p,q,w:  w.dot(np.log(p/q)). The sign of the statistics must be such that low values indicate a high significance.

alternative: “less”, “greater”, or “two-sided”

The direction of the (common) trend that your compound null hypothesis is testing against. Mind that this is not about the sidedness of the individual tests: Those should always be one-sided.

  • If “less”, the compound research hypothesis is that the subtests exhibit a trend towards a low p value.

  • If “less”, the compound research hypothesis is that the subtests exhibit a trend towards high p values (close to 1). In this case, the method of choice will be applied to the complements of the p values (see Complements).

  • If “two-sided”, the compound research hypothesis is that the subtests exhibit either of the two above trends.

weights: iterable of numbers

Weights for individual results. Does not work for minimum-based methods (Tippett and Simes).

n_samples

Number of samples used for Monte Carlo simulation.

rtol: non-negative float
atol: non-negative float

Values of the statistics with a relative difference closer than specified by atol and rtol are regarded as identical. A small value (such as the default) may improve the results if numerical noise makes values different.

RNG

NumPy random-number generator used for the Monte Carlo simulation. Will be automatically generated if not specified.

sampling_method: “proportional” or “stochastic”

If "proportional", the frequency p values for each individual result will be exactly proportional to its probability – except for rounding. Only the rounding and the order of elements will be stochastic.

If method is "stochastic", the values will be randomly sampled and thus their actual frequencies are subject to stochastic fluctuations. This usually leads to slightly less accurate results, but the simulations are statistically independent.

The author of these lines cannot think of any disadvantage to the latter approach.

Returns
pvalue

The estimated combined p value.

std

The estimated standard deviation of p values when repeating the sampling. This is accurate for stochastic sampling and overestimating for proportional sampling.

sign_test(x, y=0, alternative='less')

Sign test.

two-sided: Pass paired samples x and y as arguments. The tested null hypothesis is that x[i] and y[i] are from the same distribution (separately for each i).

one-sided Pass a single sample x and a number y. The tested null hypothesis is that x is sampled from a distribution with a median larger than y.

Returns a tuple consisting of the p value and the number of non-tied samples.