Skip to Main Content

Pseudomarker: Running one marker at a time

Jurg Ott / 15 March 2024

The Pseudomarker program [1-3] can estimate parameters for linkage and/or linkage disequilibrium (association) in family data and case-control data. It is based on the Ilink program, which is part of the Linkage package. This page is based on the official Pseudomarker program manual, modified with updated information. As outlined below, this webpage refers to a shell program, RunPM, that repeatedly calls the Pseudomarker program, which has to be downloaded separately for RunPM to work.

Current version of RunPM program: 15 March 2024

RunPM program features

Pseudomarker is available for 64 bit Linux PCs. One of its main attractions is that it accommodates both family and case-control data and draws information from genetic linkage as well as genetic association (between marker and putative trait locus). Also, it estimates marker allele frequencies both under linkage and no linkage (treats them as nuisance parameters) so that the analysis becomes virtually independent of marker allele frequencies. You are requested to register and then download the pseudomarker executable code.

The Pseudomarker program requires error-free data and can check for mendelian inconsistencies. However, in larger families it tends to initially miss inconsistencies and then crashes when it encounters them in the course of calclulations. To avoid this problem I wrote a shell program, RunPM, that runs Pseudomarker for one marker at a time and prepares an output file containing all marker results. It can run Pseudomarker under dominant (D > +) or recessive (+ > D) modes of inheritance (D = disease allele, + = wild type allele), and under (1) affected-only or (2) incomplete penetrance or (3) full penetrance models. Parameters for these models are shown below and will automatically be implemented by the program depending on the parameter values chosen by the user.

While the pseudomarker program can handle autosomal or X-linked data, the implementation here focuses on autosomal analysis.

Dominant         Model 1 Model 2 Model 3
Genotype ++ +D DD p = 0.005 1-√(1-K) 
Frequency (1 - p)2 2p(1 - p) p2 f1 = 0 0.01 0
Penetrance f1 f2 f2 f2 = 0.00001 0.96 1
        prevalence = 0.02 K (input)
Recessive         Model 1 Model 2 Model 3
Genotype ++ +D DD p = 0.11 √K
Frequency (1 - p)2 2p(1 - p) p2 f1 = 0 0.01 0
Penetrance f1 f1 f2 f2 = 0.00001 0.83 1
        prevalence = 0.02 K (input)

Output messages from Pseudomarker can be somewhat cryptic. In particular, “this pedigree is not whole” requires some explanation – it refers to the fact that for a given family, relationships are incorrectly specified. For example, a child may have only one parent, or an individual has no relation with any other individual with the same family ID.

Executing RunPM

Written in Pascal, this program is available as a Linux executable. It has been tested in 64-bit Ubuntu. Here are brief instructions on its use.

Start with your family data in plink format so you will have files like data.map and data.ped. Make certain that family members are contiguous although, within a family, individuals can occur in any order. Each individual is characterized by a family ID (FID) and individual ID (IID). The FID-IID combination must be unique, but the same IIDs may occur in different families. Initially, try to find mendel errors with the plink option, --mendel, and make corrections as needed. Below, substitute data with whatever file names you are using. Then run plink with the following command line options:

--file data --recode 12 transpose --output-missing-phenotype 0 --out data2

where you may choose any name for data and data2. This operation will result in two new files, data2.tped and data2.tfam, which are required as input files to the RunPM program; the latter will ask for an input file name to which you should reply data2 (or whatever name you have been using). Then simply follow instructions or modify the enclosed runPM.param parameter file. It is highly recommended to initially run 5-10 markers in the foreground to verify that the program runs fine and the output file generated looks ok. The output file name will be, for example, resPMdata2D1.out, where D1 refers to dominant inheritance and analysis model 1 (affecteds only). Each new marker analyzed will write a new line to the output. Markers with mendel errors and monomorphic markers are skipped, where the latter are defined as having fewer than 2 minor alleles in all families combined, that is, the data must contain at least 2 minor alleles. Also, the minimum number of known alleles (alleles numbered 1 or 2) must be 4. If everythig looks good, submit the data for analysis as shown below.

In plink, the --output-missing-phenotype 0 option is important; without it, unknown phenotypes will be coded as -9. In addition to the options listed above you may want to use the --geno option to eliminate variants with call rates less than 0.90, for example, which is the plink default, --geno 0.1. Use of the --maf option is discouraged as linkage analysis has no problem with low marker allele frequencies. Another useful option is to run only one set of chromosomes, for example, chromosomes 1-3, which would be specified with --chrom 1-3.

The RunPM program repeatedly calls the Pseudomarker program and expects relevant program files to reside in a specific folder (directory). The following executables must reside in a folder named in the parameter file: pseudomarker, unknownpseudo, makepedpseudo, ilinkpseudo.

These files are included in the pseudomarker download package. The RunPM program file should be in the /usr/local/bin or the ~/bin folder if the latter folder is in the path accessed by programs. After copying files (with sudo privileges) to /usr/local/bin, make sure they are executable; otherwise, type cd /usr/local/bin followed by sudo chmod +x *.

In some situations, every marker leads to an error, in which case the output file will be empty save for some header lines. It may then be useful to capture screen output, for example, by invoking the program as RunPM < runPM.param > output.txt, where runPM.param is a small text file holding input parameter values. Please be prepared to expect long running times with pseudomarker. In a recent analysis of 1,100 individuals in small to medium-sized families, RunPM took 2.5 seconds to analyze one marker. Thus, analyzing 700,000 markers takes just about 4 weeks of uninterrupted runnning.

Parameter file

Below is an example of a parameter file. The program is preferably run as, for example, RunPM < runPM.param, where the latter file contains the following lines, and joe is your login:

Tun
d        dominant (d) or recessive (r) analysis
2        model 1 (affecte only) or 2 (incomplete penetrance) or 3 (full penetrance)
1 0      start and end line numbers of markers to process
/home/joe/
/usr/local/bin/

EXPLANATIONS
Each line in the above sample parameter file has the following meaning:

  1. Name of input file, without ".tfam" and ".tped"
  2. Dominant (d) or recessive (r)
  3. Model 1 (affected only) or 2 (incomplete penetrance) or 3 (full penetrance). With model 3, a second number after "3" is expected, which is the disease prevalence, for example, 0.001. Occasionally, full penetrance leads to mendel errors for the trait, in which case penetrances may be specified as 0.0001 and 0.9999 rather than 0 and 1. If this is desired, then the prevalence should be furnished as a negative number, for example, -0.001. The program will read this as a prevalence of 0.001 with penetrances of 0.0001 and 0.9999.
  4. Start and end line numbers of markers to process. For example, 200 3000 means that markers #200 - #3000 will be processed. To run all markers, set these numbers equal to 1 0. You may want to initially specify a small number of markers just to see whether the program works correctly.
  5. A directory to which the program will print a progress report for every 10 markers processed so that you will see how many markers total have been analyzed at any given time. Here, this is your home directory, for example, /home/joe
  6. A directory containing the executable programs: pseudomarker, unknownpseudo, makepedpseudo, ilinkpseudo. In some installations, the programs must be in the path, for example, in /usr/local/bin/.
  7. The normal situation is that this line is blank. If it contains the number 1 then a file with statistical details on program execution will be saved for each marker. The file names will be PPPn.out, where n is the sequential number of each marker. Each file will occupy 4 KB, so this feature should not be used for too many markers.

Sample data

Included in the RunPM program package is a small test dataset, Tun.tfam and Tun.tped. It may be run by typing RunPM < runPM.param. The generally recommended approach is to run it unattended by typing (runPM < runPM.param > /dev/null)&, after which the program will keep running even after you log out. Interrupting/killing the program once it started running in the background is a little tricky because RunPM keeps invoking the various pseudo routines in succession, so it is best to know that the program runs fine before you submit a long run in the backgrund. After completion, the program will leave a file, ppp.out, in your work directory. This file refers to the marker last run and may be deleted if you wish. It provides detailed information on the calculations performed.

The two relevant output files are resPMTunD2.out and resPMTunD2.log. The out file is preferably read into a spreadsheet. It contains 11 columns described below (a few columns have been omitted as not being relevant). Traditionally, the likelihood L1, maximized over θ, is expressed as a lod score, Z = log10(L1/L0), where L0 is the likelihood under the null hypothesis of no linkage. In statistics, that likelihood ratio is usually transformed into chi-square, c = 2 loge(L1/L0). In the table below, to make results for all hypotheses tested comparable, the p-value formulation is used.

num Sequential number of markers in a given run. This number refers to all markers in the input file, whether or not calculations were carried out.
chr Chromosome number for given marker
Marker Marker ID, rs number
bp Basepair location of marker. The next 3 columns (cM, Phenotype, Model) may safely be deleted as they are not needed.
LOD Maximum lod score attained by given marker over recombination fractions, log10(L ratio) of H0 vs H1
LD|Linkage p-value for log10(L ratio) of H1 vs H3
LD|NoLinkage p-value for log10(L ratio) of H0 vs H2
Linkage|LD p-value for log10(L ratio) of H2 vs H3, that is, testing for linkage while allowing for LD
Linkage+LD p-value for log10(L ratio) of H0 vs H3, combined effect of linkage and LD
all1 Number of "1" (minor) alleles seen in all individuals. Not used in calculations, for information only.
all2 Number of "2" (major) alleles seen in all individuals. Not used in calculations, for information only.

The p-values are computed from the corresponding chi-square distribution (1 df except for Linkage+LD). Some of the likelihood ratios can be very large and lead to very small p-values, which will underflow and will be shown as “0”. Such p-values should be interpreted as p < 1.5 x 10-45, corresponding to LOD > 43.

The pseudomarker program outputs detailed information about maximum likelihoods and estimated allele frequencies in a file called ppp.out but this file is overwritten by RunPM for each new marker, except it will be saved for the last marker in the output file. To view detailed output files for a given set of markers, it is best to do a separate run just for these markers and set the last line of the parameter file equal to 1 (see above).

Note that, considering the multiple testing involved in genome-wide linkage analysis, p < 0.05 corresponds to a maximum lod score of at least 3.3 [4], which corresponds to chi-square = 15.2 and an empirical one-sided significance level of p = 4.8 x 10-5.

References

1. Gertz EM, Hiekkalinna T, Digabel SL, Audet C, Terwilliger JD, et al. (2014) PSEUDOMARKER 2.0: efficient computation of likelihoods using NOMAD. BMC Bioinformatics 15: 47.

2. Goring HH, Terwilliger JD (2000) Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am J Hum Genet 66: 1310-1327.

3. Hiekkalinna T, Schaffer AA, Lambert B, Norrgrann P, Goring HH, et al. (2011) PSEUDOMARKER: a powerful program for joint linkage and/or linkage disequilibrium analysis on mixtures of singletons and related individuals. Hum Hered 71: 256-266.

4. Lander E, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11: 241-7.

Pseudomarker: Running one marker at a timeJurg Ott / 15 March 2024The Pseudomarker program [1-3] can estimate parameters for linkage and/or linkage disequilibrium (association) in family