gnumap

Running gnumap (quick start)

For help on how to run GNUMAP, type ./bin/gnumap into a terminal and the usage information will be displayed. A typical gnumap run requires several inputs. For example, to map the sequence file examples/example_sequences_prb.txt against the genome examples/Cel_gen.fa, report only the locations with a local alignment score of 90% or better, and write the output to example.output, use:

	./bin/gnumap -g examples/Cel_gen.fa -o example.output -a .9 -p -v 1 examples/example_sequences_prb.txt

(Note: This command can also be run by typing make example.)

- The -g option defines the genome.
- The -o option tells the program where to place the output. Two output files will be created: one with the alignment report for each read, and another in the .sgr format, usable with Integrated Genome Browser (IGB) for convenient graphical display.
- The -a option defines the minimum alignment score that will be accepted for mapped reads.
- The -p option indicates that the score given in the -a option is a percentage instead of a raw score.
- The last parameter is the name of Illumina's *_int.txt or *_prb.txt file to be used for the sequences. This is a tab- and space-delimited file containing either the base intensities or the base quality scores, respectively, with one read per line. In order to improve accuracy, the *_seq.txt file is not used.
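
The file layout described above (one read per line, tab- and space-delimited) can be checked quickly before starting a run. The following is a minimal sketch and not part of GNUMAP: summarize_prb is a hypothetical helper, and it assumes that in a *_prb.txt file tabs separate the per-base groups while spaces separate the scores inside each group. It is not meant for *_int.txt files, which may carry extra leading columns.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with GNUMAP): summarise a *_prb.txt file.
# Assumption: one read per line, per-base score groups separated by tabs.
summarize_prb() {
    awk -F'\t' '
        NR == 1 { bases = NF }      # tab-separated groups in the first read
        NF > 0  { reads++ }         # count every non-empty line as one read
        END     { printf "reads=%d bases_in_first_read=%d\n", reads, bases }
    ' "$1"
}

# Usage (path taken from the quick-start command):
#   summarize_prb examples/example_sequences_prb.txt
```

If the reported read count or read length looks wrong, the input is probably not in the layout assumed here.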

Example Files

Sample files are included so that you can check that GNUMAP is running properly. To run this sample set, type
```
	make example
```
or, for an example with SNP output,
```
	make example-snp
```
For both examples, about 3,000 of the 10,000 sequences should map to locations in the C. elegans genome.

Following are some additional example files that can be used:
- The prb, fasta, and fastq files used for comparison in the Bioinformatics paper.
  In addition, the spiked-in sequences (with original chromosome position) can be found here and the spiked locations can be found here
- A Human Genome binary file (right click and select Save As). This binary file was compiled on a 64-bit system with the following parameters:
  - mer size: 13bp
  - largest hash size: 100k
  - bases skipped: 1

Running GNUMAP with MPI:

	mpiexec -np N_MACH -machinefile MACH_FILE gnumap [options...]

where N_MACH is the number of machines you are using and MACH_FILE is a file listing the machines that are available to use. The -c option, which specifies the number of processors, can also be included with these parameters.
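
Because N_MACH has to agree with the contents of MACH_FILE, it can be convenient to derive one from the other. The following is a minimal sketch, assuming the machine file lists one host name per line; count_machines is a hypothetical helper and not part of GNUMAP or of any MPI distribution.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct machines listed in a machine file.
# Assumption: one host name per line; blank lines are ignored.
count_machines() {
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage:
#   N_MACH=$(count_machines "$MACH_FILE")
#   mpiexec -np "$N_MACH" -machinefile "$MACH_FILE" gnumap [options...]
```

Counting distinct host names keeps N_MACH equal to the number of machines even if the same host appears on more than one line.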

For those who are using BYU's supercomputer (or another PBS-based supercomputer), here is an example submission script:

#! /bin/bash

#PBS -N MPI_test
#PBS -l nodes=30:ppn=1:pmem=12gb,walltime=3:00:00
#PBS -q batch
#PBS -k oe 
#PBS -m bea
#PBS -M your_email@gmail.com

N_MACH=30
MACH_FILE=gnumap_mpi_file

GENOME="/path/to/genome/genome.fasta"
SEQFILES="$(ls /path/to/sequences/*_prb.txt)"
OUTPUT="/path/to/output/gnumap.out"
PROG="/path/to/gnumap/bin/gnumap"
PROGARGS="-g \"$(echo $GENOME | sed -e 's/ /,/g')\" -o $OUTPUT -a .9 -p -c 8 \"$(echo $SEQFILES | sed -e 's/ /,/g')\" -m 12 -j 10 -v 1"

cp $PBS_NODEFILE $MACH_FILE

echo "mpiexec -np $N_MACH -machinefile $MACH_FILE $PROG $PROGARGS"
mpiexec -np $N_MACH -machinefile $MACH_FILE $PROG $PROGARGS

The $PBS_NODEFILE is a file that lists all the nodes your program is allowed to run on. For a large genome, add the flag --MPI_largemem to the end of PROGARGS.
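
One way to add that flag without editing the long PROGARGS line by hand is to append it conditionally. This is only a sketch: add_largemem and its "yes" switch are hypothetical names introduced here, while --MPI_largemem is the flag mentioned above.

```shell
#!/bin/bash
# Hypothetical helper: return the argument string, optionally with the
# --MPI_largemem flag appended at the end.
#   $1: existing argument string (e.g. "$PROGARGS")
#   $2: "yes" to request the large-genome flag
add_largemem() {
    if [ "$2" = "yes" ]; then
        printf '%s --MPI_largemem\n' "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Usage inside the submission script, before calling mpiexec:
#   PROGARGS=$(add_largemem "$PROGARGS" yes)
```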