credible.bayesian.utils¶

Functions to evalute the (Bayesian) credible region of measures.

(Bayesian) credible region interpretation, with 95% coverage: The probability that the true proportion will lie within the 95% credible interval is 0.95.

Contrary to frequentist approaches, in which one can only say that if the test were repeated an infinite number of times, and one constructed a confidence interval each time, then X% of the confidence intervals would contain the true rate, here we can say that given our observed data, there is a X% probability that the true value of \(k/n\) falls within the provided interval.

See a discussion in Five Confidence Intervals for Proportions That You Should Know About for a study on coverage for most common methods.

Note

For a disambiguation with Confidence Interval (the frequentist approach), read Credible Regions or Intervals.

Functions

`average_beta`(successes, failures, lambda_, ...)	Return mean, mode, upper and lower bounds of the credible region of an average of measures with beta posteriors.
`average_beta_posterior`(successes, failures, ...)	Simulate the average beta posterior of many systems.
`average_f1_posterior`(tp, fp, fn, lambda_, ...)	Simulate the F1-score posterior of an average system with the provided markings.
`beta`(successes, failures, lambda_, coverage)	Return the mode, upper and lower bounds of the equal-tailed credible region of a probability estimate following Bernoulli trials.
`beta_array`(successes, failures, lambda_, ...)	`beta()`, for multiple systems.
`beta_posterior`(successes, failures, lambda_, ...)	Simulate the beta posterior of a system with the provided markings.
`compare_beta_posteriors`(k1, l1, k2, l2, ...)	Return the probability that system 1 is better than system 2 for a given figure of merit.
`compare_f1_scores`(tp1, fp1, fn1, tp2, fp2, ...)	Return the probability that the F1-score from 1 system is bigger than the F1-score of a second system.
`compare_systems`(n, lambda_, nb_samples, rng)	Compare 2 system (binary) outputs using a Dirichlet posterior.
`evaluate_statistics`(variates, coverage, bins)	Evaluate the left and right margins for a given M-C distribution.
`f1_posterior`(tp, fp, fn, lambda_, ...)	Simulate the F1-score posterior of a system with the provided markings.

credible.bayesian.utils.beta_array(successes, failures, lambda_, coverage)[source]¶

beta(), for multiple systems.

Parameters:

successes (Iterable[int]) – Number of successes (k) observed on the experiment.
failures (Iterable[int]) – Number of failures (l) observed on the experiment.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior. If unsure, use 0.5.
coverage (float) – A floating-point number between 0 and 1.0 indicating the coverage you’re expecting. A value of 0.95 will ensure 95% of the area under the probability density of the posterior is covered by the returned equal-tailed interval.

Return type:

tuple[GenericAlias[float64], GenericAlias[float64], GenericAlias[float64], GenericAlias[float64]]

Returns:

A tuple containing 4 floating point numbers representing:

mean: The mean of the posterior distribution
mode: The mode of the posterior distribution
lower: The lower bounds of the credible region
upper: The upper bounds of the credible region

If the input was a 1-D array with multiple experiments, then the return value of this function is also composed of 1-D arrays, representing the mean, mode, lower and upper bounds of the various credible regions defined for each of the success/failure tuples on the input.

Raises:

TypeError – If the dimensions of successes and failures do not match, or in case the input types are unsupported.

credible.bayesian.utils.beta(successes, failures, lambda_, coverage)[source]¶

Return the mode, upper and lower bounds of the equal-tailed credible region of a probability estimate following Bernoulli trials.

This technique is (not) very conservative - in most of the cases, coverage closer to the extremes (0 or 1) is lower than expected (but still greater than 85%).

This implementation is based on [GOUTTE-2005]. It assumes \(k\) successes and \(l\) failures (\(n = k+l\) total trials) are issued from a series of Bernoulli trials (likelihood is binomial). The posterior is derivated using the Bayes Theorem with a beta prior. As there is no reason to favour high vs. low precision, we use a symmetric Beta prior (\(\alpha=\beta\)):

\[\begin{split}P(p|k,n) &= \frac{P(k,n|p)P(p)}{P(k,n)} \\ P(p|k,n) &= \frac{\frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k}P(p)}{P(k)} \\ P(p|k,n) &= \frac{1}{B(k+\alpha, n-k+\beta)}p^{k+\alpha-1}(1-p)^{n-k+\beta-1} \\ P(p|k,n) &= \frac{1}{B(k+\alpha, n-k+\alpha)}p^{k+\alpha-1}(1-p)^{n-k+\alpha-1}\end{split}\]

The mode for this posterior (also the maximum a posteriori) is:

\[\text{mode}(p) = \frac{k+\lambda-1}{n+2\lambda-2}\]

Concretely, the prior may be flat (all rates are equally likely, \(\lambda=1\)) or we may use Jeoffrey’s prior (\(\lambda=0.5\)), that is invariant through re-parameterisation. Jeffrey’s prior indicate that rates close to zero or one are more likely.

The mode above works if \(k+{\alpha},n-k+{\alpha} > 1\), which is usually the case for a resonably well tunned system, with more than a few samples for analysis. In the limit of the system performance, \(k\) may be 0, which will make the mode become zero.

For our purposes, it may be more suitable to represent \(n = k + l\), with \(k\), the number of successes and \(l\), the number of failures in the binomial experiment, and find this more suitable representation:

\[\begin{split}P(p|k,l) &= \frac{1}{B(k+\alpha, l+\alpha)}p^{k+\alpha-1}(1-p)^{l+\alpha-1} \\ \text{mode}(p) &= \frac{k+\lambda-1}{k+l+2\lambda-2}\end{split}\]

This can be mapped to most rates calculated in the context of binary classification this way:

Precision or Positive-Predictive Value (PPV): p = TP/(TP+FP), so k=TP, l=FP
Recall, Sensitivity, or True Positive Rate: r = TP/(TP+FN), so k=TP, l=FN
Specificity or True Negative Rage: s = TN/(TN+FP), so k=TN, l=FP
Accuracy: acc = TP+TN/(TP+TN+FP+FN), so k=TP+TN, l=FP+FN
Jaccard: j = TP/(TP+FP+FN), so k=TP, l=FP+FN

Note

Important

To calculate the limits given the required coverage, we use the incomplete inverse (regularized, or normalized) beta function, scipy.special.betaincinv instead of scipy.special.betainc. The latter requires we provide the bounds and returns the coverage, whereas here we are interested in the inverse behaviour.

Parameters:

successes (int) – Number of successes (k) observed on the experiment.
failures (int) – Number of failures (l) observed on the experiment.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior. If unsure, use 0.5.
coverage (float) – A floating-point number between 0 and 1.0 indicating the coverage you’re expecting. A value of 0.95 will ensure 95% of the area under the probability density of the posterior is covered by the returned equal-tailed interval.

Return type:

tuple[float, float, float, float]

Returns:

A tuple containing 4 floating point numbers representing:

mean: The mean of the posterior distribution
mode: The mode of the posterior distribution
lower: The lower bounds of the credible region
upper: The upper bounds of the credible region

Raises:

TypeError – If the dimensions of successes and failures do not match, or in case the input types are unsupported.

credible.bayesian.utils.beta_posterior(successes, failures, lambda_, nb_samples, rng)[source]¶

Simulate the beta posterior of a system with the provided markings.

This implementation is based on [GOUTTE-2005], Equation 7.

Figures of merit that are supported by this procedure are those which have the form \(v = k / (k + l)\) (successes over successes plus failures):

Precision or Positive-Predictive Value (PPV): \(p = TP/(TP+FP)\), so \(k=TP\), \(l=FP\)
Recall, Sensitivity, or True Positive Rate: \(r = TP/(TP+FN)\), so \(k=TP\), \(l=FN\)
Specificity or True Negative Rage: \(s = TN/(TN+FP)\), so \(k=TN\), \(l=FP\)
Accuracy: \(acc = TP+TN/(TP+TN+FP+FN)\), so \(k=TP+TN\), \(l=FP+FN\)
Jaccard Index: \(j = TP/(TP+FP+FN)\), so \(k=TP\), \(l=FP+FN\)

Parameters:

successes (int) – Number of successes (k) observed on the experiment. May be a scalar of any integral type, or an array of integral values indicating a series of experiments to calculate the credible regions for.
failures (int) – Number of failures (l) observed on the experiment. May be a scalar of any integral type, or an array of integral values indicating a series of experiments to calculate the credible regions for.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior.
nb_samples (int) – Number of generated gamma distribution values.
rng (Generator) – An initialized numpy random number generator.

Return type:

GenericAlias[float64]

Returns:

Variates: An array with size nb_samples containing a realization of Equation 7 from [GOUTTE-2005]. If successes and failures are sequences of arrays, then returns a 2D array where each row represents one set of realisations.

Raises:

TypeError – If the dimensions of successes and failures do not match.

credible.bayesian.utils.average_beta_posterior(successes, failures, lambda_, nb_samples, rng)[source]¶

Simulate the average beta posterior of many systems.

This implementation is based on [GOUTTE-2005], Equation 7.

Figures of merit that are supported by this procedure are those which have the form \(v = k / (k + l)\) (successes over successes plus failures):

Precision or Positive-Predictive Value (PPV): \(p = TP/(TP+FP)\), so \(k=TP\), \(l=FP\)
Recall, Sensitivity, or True Positive Rate: \(r = TP/(TP+FN)\), so \(k=TP\), \(l=FN\)
Specificity or True Negative Rate: \(s = TN/(TN+FP)\), so \(k=TN\), \(l=FP\)
Accuracy: \(acc = TP+TN/(TP+TN+FP+FN)\), so \(k=TP+TN\), \(l=FP+FN\)
Jaccard Index: \(j = TP/(TP+FP+FN)\), so \(k=TP\), \(l=FP+FN\)

Parameters:

successes (Iterable[int]) – Sequence of integers expressing the number of successes (k) observed on the various experiments whose metric is to be averaged.
failures (Iterable[int]) – Sequence of integers expressing the number of failures (l) observed on the various experiments whose metric is to be averaged.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior.
nb_samples (int) – Number of generated variates for the M-C simulation.
rng (Generator) – An initialized numpy random number generator.

Return type:

GenericAlias[float64]

Returns:

Variates: An array with size nb_samples containing a realization of Equation 7 from [GOUTTE-2005], considering the averaging of all input systems.

credible.bayesian.utils.evaluate_statistics(variates, coverage, bins)[source]¶

Evaluate the left and right margins for a given M-C distribution.

Parameters:

variates (Iterable[float]) – A 1-D array containing the simulated variates.
coverage (float) – A number, between 0 and 1 to indicate the desired coverage. Typically, this number is set to 0.95 (95% coverage).
bins (int | str) – Histogram binning for defining the mode of the original distribution where variates where sampled from. A larger number of bins will make it unlikely to discover the true mode. You may also pass any of the string method names defined at numpy.histogram_bin_edges(). A method such as doane should work fine in most cases.

Return type:

tuple[float, float, float, float]

Returns:

statistics: mean, (a rough estimated of the) mode and lower and upper bounds of the credible intervals for the input simulation.

Note

Mode estimation depends on the number of input variates, and histogram binning (bins). Using scipy.stats.mode() can only evaluate most common value, which is not reliable for continuous variables.

credible.bayesian.utils.average_beta(successes, failures, lambda_, coverage, nb_samples, rng)[source]¶

Return mean, mode, upper and lower bounds of the credible region of an average of measures with beta posteriors.

This implementation is based on [GOUTTE-2005].

Parameters:

successes (Iterable[int]) – Sequence of integers expressing the number of successes (k) observed on the various experiments whose metric is to be averaged.
failures (Iterable[int]) – Sequence of integers expressing the number of failures (l) observed on the various experiments whose metric is to be averaged.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior.
coverage (float) – A floating-point number between 0 and 1.0 indicating the coverage you’re expecting. A value of 0.95 will ensure 95% of the area under the probability density of the posterior is covered by the returned equal-tailed interval.
nb_samples (int) – Number of generated variates for the M-C simulation.
rng (Generator) – An initialized numpy random number generator.

Return type:

tuple[float, float, float, float]

Returns:

statistics: mean, mode, lower and upper bounds of the credible interval for the provided coverage.

credible.bayesian.utils.compare_beta_posteriors(k1, l1, k2, l2, lambda_, nb_samples, rng)[source]¶

Return the probability that system 1 is better than system 2 for a given figure of merit.

This implementation is based on [GOUTTE-2005].

Figures of merit that are supported by this procedure are those which have the form \(v = k / (k + l)\):

Precision or Positive-Predictive Value (PPV): \(p = TP/(TP+FP)\), so \(k=TP\), \(l=FP\)
Recall, Sensitivity, or True Positive Rate: \(r = TP/(TP+FN)\), so \(k=TP\), \(l=FN\)
Specificity or True Negative Rage: \(s = TN/(TN+FP)\), so \(k=TN\), \(l=FP\)
Accuracy: \(acc = TP+TN/(TP+TN+FP+FN)\), so \(k=TP+TN\), \(l=FP+FN\)
Jaccard Index: \(j = TP/(TP+FP+FN)\), so \(k=TP\), \(l=FP+FN\)

Parameters:

k1 (int) – Number of successes for the first system. See various definitions above.
l1 (int) – Number of failures for the first system. See various definitions above.
k2 (int) – Number of successes for the second system. See various definitions above. Must be syntatically equal to k1.
l2 (int) – Number of failures for the second system. See various definitions above. Must be syntatically equal to l1.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior. If unsure, use 0.5.
nb_samples (int) – Number of generated variates for the M-C simulation.
rng (Generator) – An initialized numpy random number generator.

Return type:

float

Returns:

A number between 0.0 and 1.0 that describes the probability that the first system has a bigger measurement than the second.

credible.bayesian.utils.f1_posterior(tp, fp, fn, lambda_, nb_samples, rng)[source]¶

Simulate the F1-score posterior of a system with the provided markings.

This implementation is based on [GOUTTE-2005], Equation 11.

Parameters:

tp (int) – True positive count, AKA “hit”, as an integer scalar.
fp (int) – False positive count, AKA “false alarm”, or “Type I error”.
fn (int) – False Negative count, AKA “miss”, or “Type II error”, as a scalar.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior. If unsure, use 0.5.
nb_samples (int) – Number of generated variates for the M-C simulation.
rng (Generator) – An initialized numpy random number generator.

Return type:

GenericAlias[float64]

Returns:

variates: An array with size nb_samples containing a realization of Equation 11 of [GOUTTE-2005].

credible.bayesian.utils.average_f1_posterior(tp, fp, fn, lambda_, nb_samples, rng)[source]¶

Simulate the F1-score posterior of an average system with the provided markings.

This implementation is based on [GOUTTE-2005], Equation 11.

Parameters:

tp (Iterable[int]) – Arrays containing true positive counts, AKA “hit”, for all systems to be considered on the average.
fp (Iterable[int]) – Arrays containing false positive counts, AKA “false alarm”, or “Type I error” for all systems to be considered on the average.
fn (Iterable[int]) – Arrays containing false Negative counts, AKA “miss”, or “Type II error” for all systems to be considered on the average..
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior. If unsure, use 0.5.
nb_samples (int) – Number of generated gamma distribution values.
rng (Generator) – An initialized numpy random number generator.

Return type:

GenericAlias[float64]

Returns:

variates: An array with size nb_samples containing a realization of Equation 11 from [GOUTTE-2005].

credible.bayesian.utils.compare_f1_scores(tp1, fp1, fn1, tp2, fp2, fn2, lambda_, nb_samples, rng)[source]¶

Return the probability that the F1-score from 1 system is bigger than the F1-score of a second system.

This implementation is based on [GOUTTE-2005].

Parameters:

tp1 (int) – True positive count, AKA “hit” for the first system.
fp1 (int) – False positive count, AKA “false alarm”, or “Type I error” for the first system.
fn1 (int) – False Negative count, AKA “miss”, or “Type II error”, for the first system.
tp2 (int) – True positive count, AKA “hit” for the second system.
fp2 (int) – False positive count, AKA “false alarm”, or “Type I error” for the second system.
fn2 (int) – False Negative count, AKA “miss”, or “Type II error”, for the second system.
lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior.
nb_samples (int) – Number of generated variates for the M-C simulation.
rng (Generator) – An initialized numpy random number generator.

Return type:

float

Returns:

prob: A number between 0.0 and 1.0 that describes the probability that the first system has a bigger F1-score than the second.

credible.bayesian.utils.compare_systems(n, lambda_, nb_samples, rng)[source]¶

Compare 2 system (binary) outputs using a Dirichlet posterior.

This function returns the empyrical probability that a system (1) is better another system (2), based on their binary outputs. The comparison is carried-out as described in [GOUTTE-2005], equations 16 and 19, via a Monte-Carlo simulation, since the integral of the probability cannot be resolved analytically.

To do so, we compute the probability that \(P(\pi_1 > \pi_2)\), i.e., the probability that system 1 gives the expected output while system 2 does not, is greater than the probability that system 1 is incorrect while system 2 gives the correct answer. It assumes, therefore, systems 1 and 2 are tuned (thresholded), and provide binary outputs that can be compared to generate 3 numbers:

\(n_1\): The measured number of times system 1 provides the correct answer, whereas system 2 does not
\(n_2\): The measured number of times system 2 provides the correct answer, whereas system 1 does not
\(n_3\): The measured number of times system 1 and 2 agree, giving the same answer (wrong or write, it does not matter)

Notice that \(\pi_1 = \frac{n_1}{n_1 + n_2 + n_3}\), and so, analogously, you may calculate \(\pi_2\) and \(\pi_3\).

We then plug these numbers to simulate a Dirichlet (generalisation of the Beta distribution for multiple variables) by setting:

\(\alpha_1 = n_1 + \lambda_1\)
\(\alpha_2 = n_2 + \lambda_2\)
\(\alpha_3 = n_3 + \lambda_3\)

Where each \(\lambda_i\) correspond to the prior to be imputed to that particular variable. We typically select \(\lambda_1 = \lambda_2\),

Parameters:

n (Sequence[int]) – A triple with 3 integers representing \(n_1\), \(n_2\) and \(n_3\).
lambda_ (Sequence[float]) – A tuple with length 3, containing floating point numbers describing the parameterisation of the Dirichlet priors to consider. Use \(\lambda_i=0.5\) for Jeffrey’s prior.
nb_samples (int) – Number of generated dirichlet distribution values (make this high, for a higher precision on the simulation).
rng (Generator) – An initialized numpy random number generator.

Return type:

float

Returns:

probability: A number between 0.0 and 1.0 that describes the probability that the first system is better than the second one.