The Rockefeller University » Replic2 – software for evaluating replication datasets

Replic2 – software for evaluating replication datasets

Jurg Ott 8 Feb 2025

Scenario

Assume you have two independent case-control datasets, e.g. males and females, that have been genotyped for M₁ and M₂ variants (SNPs), respectively. From these you select subsets of N₁ and N₂ variants as being particularly promising, for example, because they are significant or because each shows an odds ratio exceeding 10. Upon combining the two selected subsets into a single dataset (spreadsheet), you find that N₁₂ variants are common to both groups. How likely is this to occur just by chance? Or can you conclude that group 2 represents a significant replication of group 1 by virtue of these N₁₂ common variants?

Replic2 program

This question can be answered by simulating the following null situation: You randomly pick N₁ variants out of a larger set of M₁ variants for group 1 and analogously for group 2, then combine them to see how many random matches you find, N_12rand. You do this many times to obtain, for example, 100,000 such N_12rand numbers. The empirical significance level, p, is then given by the proportion of N_12rand values equal to or exceeding the observed number N₁₂ of matches.
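The null simulation described above can be sketched in a few lines of Python. This is only an illustration, not the Pascal code of Replic2. The function name simulate_overlap_pvalue is made up for this example, and the sketch assumes that the variants of each group are labeled 0, 1, 2, ... so that equal labels denote the same variant in both groups. It reports the plain proportion of replicates whose random overlap reaches the observed N₁₂.

```python
import random


def simulate_overlap_pvalue(m1, m2, n1, n2, n12_obs, n_reps=100_000, seed=None):
    """Monte Carlo estimate of P(random overlap >= n12_obs) under the null.

    Assumption (not from Replic2): variants are labeled 0..m1-1 in group 1
    and 0..m2-1 in group 2, and equal labels mean the same variant.
    """
    rng = random.Random(seed)
    panel1 = range(m1)
    panel2 = range(m2)
    hits = 0
    for _ in range(n_reps):
        # Randomly pick n1 variants for group 1 and n2 for group 2.
        pick1 = set(rng.sample(panel1, n1))
        pick2 = rng.sample(panel2, n2)
        # Number of variants that happen to be selected in both groups.
        n12_rand = sum(1 for v in pick2 if v in pick1)
        if n12_rand >= n12_obs:
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    # Same settings as a 1000/1000 panel with 100/100 selected variants.
    for n12 in (3, 10, 20, 30):
        p = simulate_overlap_pvalue(1000, 1000, 100, 100, n12, n_reps=10_000, seed=1)
        print(n12, p)
```

With M₁ = M₂ = 1000 and N₁ = N₂ = 100, about 100 × 100 / 1000 = 10 matches are expected by chance, so small observed values of N₁₂ give p-values near 1, while values well above 10 give small p-values.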

The easiest way to use the Replic2 program is to run it from a command prompt, in either Windows or Linux, with the name of an input file following the program name on the command line. For example, you type Replic2 sample.in, where the sample.in file contains the following lines:

1000 1000 Two large sample sizes
100 100   Two sets of SNPs selected from the large sample sizes
100000    Number of replicates for p-value calculation
3         Number of SNPs N12, common to the two groups...
5         ... repeat as often as desired
8
10
20
30
50
-1        Finish with "-1" or just end the input
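For readers who want to prepare or check such input files with a script, the layout shown above can be read as follows. This is a hedged sketch, not the parser used by Replic2: the function name parse_replic2_input and the returned dictionary keys are invented for illustration, and the sketch assumes that everything after the leading integers on a line is a free-text comment.

```python
def parse_replic2_input(text):
    """Read the input layout shown above (illustrative, not Replic2's own parser)."""

    def leading_ints(line):
        # Collect integers from the start of the line; stop at the first
        # token that is not an integer (assumed to begin a comment).
        values = []
        for token in line.split():
            try:
                values.append(int(token))
            except ValueError:
                break
        return values

    rows = [leading_ints(line) for line in text.splitlines()]
    rows = [row for row in rows if row]  # skip lines without leading numbers

    m1, m2 = rows[0][:2]   # numbers of genotyped variants per group
    n1, n2 = rows[1][:2]   # numbers of selected variants per group
    n_reps = rows[2][0]    # number of replicates

    n12_values = []
    for row in rows[3:]:
        if row[0] < 0:     # "-1" ends the list of N12 values
            break
        n12_values.append(row[0])

    return {"m": (m1, m2), "n": (n1, n2), "n_reps": n_reps, "n12": n12_values}


sample = """1000 1000 Two large sample sizes
100 100   Two sets of SNPs selected from the large sample sizes
100000    Number of replicates for p-value calculation
3         Number of SNPs N12, common to the two groups...
5         ... repeat as often as desired
8
-1        Finish with "-1" or just end the input
"""

parsed = parse_replic2_input(sample)
print(parsed)
```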

The resulting output file will then list each number N₁₂ of common SNPs together with its associated p-value. Because the observed data are counted as one of the null datasets, a p-value can never be zero. For example, with 100,000 replicates, the smallest possible significance level is p = 0.00001, which may be interpreted as p < 0.00001.
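The arithmetic behind the smallest possible p-value can be illustrated as follows. The exact counting convention inside Replic2 is not spelled out here, so this sketch assumes that the observed dataset is one of the n_reps datasets and the remaining n_reps - 1 are simulated. The function name empirical_p is hypothetical.

```python
def empirical_p(n_exceed, n_reps):
    """Empirical p-value with the observed data counted as one null dataset.

    n_exceed: number of simulated datasets whose overlap is >= the observed N12.
    n_reps:   total number of datasets, including the observed one (assumption).
    """
    # The observed dataset always matches itself, hence the "+ 1".
    return (n_exceed + 1) / n_reps


# Even when no simulated dataset reaches the observed overlap,
# the p-value stays above zero.
smallest = empirical_p(0, 100_000)
print(smallest)
```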

The program currently works only for a relatively small number of variants. As time permits, I will modify the source code (Pascal) to accommodate larger numbers. The program is available on GitHub.
