Combine discrete *p* values (in Python)
=======================================

This module provides a toolbox for combining *p* values of rank tests and other tests with a discrete null distribution.

When do you need this?
----------------------

This module has a scope similar to SciPy’s `combine_pvalues`_:

* You have a dataset consisting of **independent** sub-datasets. (So this is not about multiple testing or pseudo-replication.)
* For each sub-dataset, you have performed a test investigating the **same** null hypothesis. (Often, this is the same test and the sub-datasets only differ in size.)
* There is no straightforward test to apply to the entire dataset.
* You want a single *p* value for the null hypothesis taking into account the entire dataset, i.e., you want to combine your test results for the sub-datasets.

**However,** `combine_pvalues` assumes that the individual tests are continuous (see below for a definition); applying it to discrete tests yields a systematically wrong combined *p* value.
For example, for `Fisher’s method`_ it systematically overestimates the *p* value, i.e., you may falsely accept the null hypothesis (false negative).
This module addresses this and thus you should consider it if:

* At least one of the sub-tests is *discrete* with a low number of possible *p* values. What is a “low number” depends on the details, but 30 almost always is.
* The combined *p* value returned by `combine_pvalues` is not very low already.
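
To get a feeling for the size of this effect, the following sketch combines three one-sided sign tests with five samples each, all yielding :math:`p = \frac{3}{16}`, using Fisher’s method. It only uses the standard library and is not this module’s implementation; all function names are made up for illustration. The exact result enumerates every combination of possible *p* values, while the continuous result uses the χ² tail that Fisher’s method assumes for continuous tests.

```python
from fractions import Fraction
from itertools import product
from math import comb, exp, factorial, log, prod

def sign_test_null(n):
    """Possible p values of a one-sided sign test (n samples, no ties), each paired with its null probability."""
    total = 2**n
    # p value when observing k positive signs: P(X >= k) with X ~ Binomial(n, 1/2)
    ps = [Fraction(sum(comb(n, j) for j in range(k, n + 1)), total) for k in range(n + 1)]
    # Probability of observing exactly k positive signs
    masses = [Fraction(comb(n, k), total) for k in range(n + 1)]
    return list(zip(ps, masses))

def fisher_continuous(ps):
    """Fisher's method assuming continuous tests: chi-squared tail with 2*len(ps) degrees of freedom."""
    half_stat = -sum(log(p) for p in ps)  # half of Fisher's statistic
    # Closed form of the chi-squared survival function for an even number of degrees of freedom
    return exp(-half_stat) * sum(half_stat**i / factorial(i) for i in range(len(ps)))

def fisher_exact_discrete(ps, nulls):
    """Exact combined p value: null probability that the product of p values is at most the observed one."""
    observed = prod(ps)
    return sum(
        prod(mass for _, mass in combo)
        for combo in product(*nulls)
        if prod(p for p, _ in combo) <= observed
    )

observed = [Fraction(3, 16)] * 3
nulls = [sign_test_null(5)] * 3
p_continuous = fisher_continuous(observed)
p_exact = fisher_exact_discrete(observed, nulls)
print(p_continuous)    # roughly 0.12
print(float(p_exact))  # roughly 0.033
```

In this constructed case, treating the tests as continuous gives a combined *p* value that is several times too high. Full enumeration only works for a handful of small tests.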

Also see `comparison` for a hands-on example in which only a combination that accounts for the discreteness of the tests yields the correct result.

**Also,** as a side product, this module implements Monte Carlo-based **weighted** variants of methods other than Stouffer’s, which `combine_pvalues` does not provide.

Discrete and continuous tests
`````````````````````````````

If the null hypothesis of a given test holds, its *p* values are uniformly distributed on the interval :math:`(0,1]` in the sense that :math:`\text{CDF}(p_0) = P(p≤p_0) = p_0`.
However, for some tests, there is a limited number of possible outcomes for a given sample size.
For example, the only possible outcomes (*p* values) of the one-sided sign test for a sample size of 5 are
:math:`\frac{ 1}{32}`,
:math:`\frac{ 3}{16}`,
:math:`\frac{ 1}{ 2}`,
:math:`\frac{13}{16}`,
:math:`\frac{31}{32}`, and
:math:`1`,
simply because five numbers can only have so many different (unordered) combinations of signs.
For the purposes of this module, I call these tests *discrete.*
By contrast, for a *continuous* test, all values on the interval :math:`(0,1]` are possible outcomes (for any given sample size).
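
The list of possible *p* values above can be reproduced with a few lines of standard-library Python. This is only a sketch for the sign test without ties; the function name is made up for illustration and is not part of this module.

```python
from fractions import Fraction
from math import comb

def sign_test_p_values(n):
    """All possible p values of a one-sided sign test with n samples (no ties)."""
    total = 2**n
    # p value for observing k or more positive signs: P(X >= k) with X ~ Binomial(n, 1/2)
    return sorted(
        Fraction(sum(comb(n, j) for j in range(k, n + 1)), total)
        for k in range(n + 1)
    )

possible_ps = sign_test_p_values(5)
print([str(p) for p in possible_ps])
# ['1/32', '3/16', '1/2', '13/16', '31/32', '1']
```

For larger sample sizes the list grows only linearly, so even a sign test with dozens of samples remains clearly discrete.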

Discrete tests include all `rank tests <https://en.wikipedia.org/wiki/Rank_test>`_, since there is only a finite number of ways to rank a given number of samples.
Moreover, they include tests on bounded integer data.
The most relevant **discrete tests** are:

* the sign test,
* the Mann–Whitney *U* test,
* Wilcoxon’s signed rank test,
* any test based on a ranked correlation such as Kendall’s *τ* and Spearman’s *ρ*,
* the Kruskal–Wallis test,
* Fisher’s exact test and any other test for integer contingency tables.

Tests whose result continuously depends on the samples are continuous.
The most relevant **continuous tests** are:

* all flavours of the *t* test,
* the Kolmogorov–Smirnov test,
* the test for significance of Pearson’s *r*,
* ANOVA.

How this module works
---------------------

To correctly compute the combined *p* value, we need to take into account the null distributions of the individual tests, i.e., what *p* values are possible.
This module determines these values for popular tests or lets you specify them yourself.
Of course, if you have continuous tests in the mix, you can also include them.
Either way, the relevant information is stored in a `CTR` object (“combinable test result”).
These objects can then be combined using the `combine` function.

The difficulty in determining the combined *p* value is convolving the respective null distributions.
While this is analytically possible for continuous tests or a small number of discrete tests, it requires numerical approximations otherwise.
To perform these approximations, we use a Monte Carlo simulation sampling combinations of individual *p* values.
Thanks to modern computing and NumPy, it is easy to make the number of samples very high and the result very accurate.
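
The idea of such a simulation can be sketched as follows for Fisher’s method and three sign tests with five samples each. This is a simplified standard-library illustration under my own assumptions, not the module’s actual (NumPy-based) code; the function names, the sample count, and the seed are all hypothetical.

```python
import random
from math import comb, log

def sign_test_null(n):
    """Possible p values of a one-sided sign test (n samples, no ties) and their null probabilities."""
    total = 2**n
    ps = [sum(comb(n, j) for j in range(k, n + 1)) / total for k in range(n + 1)]
    masses = [comb(n, k) / total for k in range(n + 1)]
    return ps, masses

def combine_fisher_monte_carlo(observed, nulls, n_samples=100_000, seed=0):
    """Estimate the combined p value by sampling p values from each subtest's null distribution."""
    rng = random.Random(seed)
    # Fisher's statistic for the observed p values
    observed_stat = -2 * sum(log(p) for p in observed)
    # Draw n_samples p values per subtest according to its null distribution
    draws = [rng.choices(ps, weights=masses, k=n_samples) for ps, masses in nulls]
    # Small tolerance so that rounding cannot break ties with the observed statistic
    tolerance = 1e-12
    hits = sum(
        -2 * sum(log(p) for p in combo) >= observed_stat - tolerance
        for combo in zip(*draws)
    )
    return hits / n_samples

estimate = combine_fisher_monte_carlo([3 / 16] * 3, [sign_test_null(5)] * 3)
print(estimate)  # close to the exact value 1074/32768, i.e. about 0.033
```

The statistical error of such an estimate shrinks with the square root of the number of samples, which is why a high sample count matters for small combined *p* values.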

.. _complements:

Complements
-----------

In several cases, this module uses the complement *q* of a *p* value.
For example, combining methods such as Pearson’s or Mudholkar’s and George’s use it as part of their statistics.
For continuous tests, this complement is straightforwardly computed as :math:`q = 1-p`.
However, for discrete tests this leads to implausible results, in particular if :math:`p=1`.
To avoid this, this module uses for *q* the probability of observing such a *p* value or a higher one.
In analogy to :math:`\text{CDF}(p_0) = P(p≤p_0) = p_0`, we have :math:`\text{CCDF}(p_0) = P(p≥p_0) = q` (both under the null hypothesis).
This applies whenever the complement of a *p* value is relevant.
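
As a minimal sketch of this definition (standard library only, with a made-up function name), the complements can be derived from nothing but the list of possible *p* values: under the null hypothesis, each possible *p* value occurs with a probability equal to its distance to the next smaller one.

```python
from fractions import Fraction

def complements(all_ps):
    """Map each possible p value p0 to q = P(p >= p0) under the null hypothesis."""
    ps = sorted(all_ps)
    # Null probability of each p value: difference to the next smaller one (or to 0)
    masses = [p - previous for previous, p in zip([Fraction(0)] + ps, ps)]
    qs = {}
    tail = Fraction(0)
    # Accumulate probabilities from the largest p value downwards
    for p, mass in zip(reversed(ps), reversed(masses)):
        tail += mass
        qs[p] = tail
    return qs

# Possible p values of the one-sided sign test with five samples
sign_test_ps = [Fraction(1, 32), Fraction(3, 16), Fraction(1, 2),
                Fraction(13, 16), Fraction(31, 32), Fraction(1)]
qs = complements(sign_test_ps)
print(qs[Fraction(1)])      # 1/32, whereas 1 - p would be 0
print(qs[Fraction(1, 32)])  # 1
```

For this symmetric test, the complements are just the possible *p* values in reverse order, so :math:`p=1` gets the same weight in the opposite direction as :math:`p=\frac{1}{32}` gets in the hypothesised one.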

A simple example
----------------

.. automodule:: simple_example


.. _comparison:

An extensive example
--------------------

.. automodule:: comparison

Implementing your own test
--------------------------

If you want to analyse a given dataset with a test that this module does not provide, you need to determine two things:

* The *p* value of the test applied to your dataset.
* A list of all possible *p* values that your test can yield for datasets with the same sample size(s).

You can use these as arguments of `CTR`’s default constructor.

The best way to find all possible *p* values is to get a rough understanding of the test statistics and look into an existing implementation of the test, so you don’t have to fully re-invent the wheel.
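
For small samples, brute-force enumeration is a workable fallback. The following sketch does this for a one-sided Wilcoxon signed-rank test without ties or zeros, using only the standard library. The function name and the observed statistic are hypothetical, and the approach scales exponentially with the sample size.

```python
from fractions import Fraction
from itertools import product

def signed_rank_p_values(n):
    """Map each value w of the statistic W+ to its one-sided p value P(W+ >= w) under the null."""
    ranks = range(1, n + 1)
    counts = {}
    # Under the null hypothesis, every assignment of signs to the ranks is equally likely
    for signs in product([0, 1], repeat=n):
        w = sum(rank for rank, positive in zip(ranks, signs) if positive)
        counts[w] = counts.get(w, 0) + 1
    total = 2**n
    p_of_w = {}
    tail = 0
    # Accumulate counts from the most extreme statistic downwards
    for w in sorted(counts, reverse=True):
        tail += counts[w]
        p_of_w[w] = Fraction(tail, total)
    return p_of_w

p_of_w = signed_rank_p_values(3)
p = p_of_w[5]                          # hypothetical observed statistic W+ = 5
all_ps = sorted(set(p_of_w.values()))  # every p value the test can yield for n = 3
print(p, [str(x) for x in all_ps])
```

Here, ``p`` and ``all_ps`` are the two ingredients listed above; how exactly they are passed to `CTR` should be checked in the command reference.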

Note that individual tests should always be one-sided for the following reason:
If you have two equally significant, but opposing sub-results, they should not add in effect, but cancel each other.
This is not possible when you use two-sided subtests, since all information on the directionality of a result gets lost.
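
The loss of directionality can be illustrated with a simple *z* test, which is my own choice of example and not something this module covers. Two sub-results with opposite *z* scores have identical two-sided *p* values, whereas their one-sided *p* values differ and add up to one.

```python
from statistics import NormalDist

standard_normal = NormalDist()

def one_sided_p(z):
    """p value for the alternative of a positive trend."""
    return 1 - standard_normal.cdf(z)

def two_sided_p(z):
    """p value for the alternative of a trend in either direction."""
    return 2 * (1 - standard_normal.cdf(abs(z)))

# Two equally significant, but opposing sub-results
for z in (2.0, -2.0):
    print(z, round(one_sided_p(z), 4), round(two_sided_p(z), 4))
```

Any combining method fed with the two-sided values cannot tell these opposing sub-results from two concordant ones.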

Example: Mann–Whitney *U* test
``````````````````````````````

.. automodule:: mwu


Why is the default combining method Mudholkar’s and George’s?
-------------------------------------------------------------

I assume here that you want to investigate the research hypothesis that all datasets are subject to the same trend.
The trend may manifest more clearly in some of the datasets (and you don’t know which a priori), but it should not be inverted (other than by chance).
In this case, you would perform one-sided subtests.
(If you would consider both directions of trend a finding, the combination needs to be two-sided, not the subtests.)

If the *p* value of such a subtest is small, the sub-dataset exhibits the trend you hypothesised.
Conversely, if the complement :math:`q ≈ 1-p` of a subtest is small, the sub-dataset exhibits a trend opposite to what you hypothesised – with a *p* value *q*.
(See `complements` on how *q* is defined for the purposes of this module.)
I think that the combined *p* values should reflect this, i.e., the complement *q* should indicate the significance of the opposite one-sided hypothesis (not the null hypothesis) just like the *p* value indicates the significance of the null hypothesis.

To achieve this, the combining method must treat *p* and *q* in a symmetrical fashion.
This also means that the following results exactly negate each other:

* a subtest with :math:`p=p_0`,
* a subtest with :math:`q=p_0`, i.e., :math:`p≈1-p_0`.

Only two methods fulfil this: the one by Mudholkar and George as well as the one by Stouffer.
Since the latter’s statistic becomes infinite if :math:`p=1` for any subtest (and thus cannot distinguish between this happening for one or almost all tests), I prefer Mudholkar’s and George’s method.
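
The symmetry can be checked numerically on the per-test contributions to each method’s statistic. The sketch below assumes continuous tests (so :math:`q = 1-p`) and writes each contribution up to sign and normalisation; the function names are made up for illustration.

```python
from math import log
from statistics import NormalDist

def fisher_term(p):
    """Contribution of one subtest to Fisher's statistic."""
    return -2 * log(p)

def mudholkar_george_term(p):
    """Contribution of one subtest to Mudholkar's and George's statistic (logit of p)."""
    return log(p) - log(1 - p)

def stouffer_term(p):
    """Contribution of one subtest to Stouffer's statistic (normal quantile of p)."""
    return NormalDist().inv_cdf(p)

p0 = 0.02
sums = {
    term.__name__: term(p0) + term(1 - p0)
    for term in (fisher_term, mudholkar_george_term, stouffer_term)
}
for name, value in sums.items():
    print(name, value)
```

The two opposing results cancel to (numerically) zero for Mudholkar’s and George’s as well as Stouffer’s method, but not for Fisher’s, which mostly registers the small *p* value.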


Supported tests
---------------

Currently, this module supports:

* the sign test,
* the Mann–Whitney *U* test,
* Fisher’s and Boschloo’s exact tests,
* Spearman’s ρ and Kendall’s τ.

Ties are not supported in every case. If you require any further test or support for ties, please `tell me <https://github.com/BPSB/combine-p-values-discrete/issues/new>`_.


Command reference
-----------------

.. automodule:: combine_pvalues_discrete
	:members: CTR, combine, sign_test

.. _combine_pvalues: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.combine_pvalues.html

.. _Fisher’s method: https://en.wikipedia.org/wiki/Fisher%27s_method