gnumap

Program Parameters (detailed guide)

Usage: gnumap [options] <file_to_parse>
  -g, --genome=STRING          Genome .fa file(s)
  -o, --output=STRING          Output file
  -a, --align_score=DOUBLE     Limit for sequence alignment (default: 10)
  -p, --percent                Use percentage when determining alignment score
  -q, --read_quality=DOUBLE    Read quality cutoff:  won't align reads if they are
                               below this cutoff value
                               (default=0.0)
  -v, --verbose=INT            Verbose (default=0)
  -c, --num_proc=INT           Number of processors to run on
  -m, --mer_size=INT           Mer size (default=10)
  -B, --buffer=INT             Buffer size
  -T, --max_match=INT          Maximum number of matches for a given
                               sequence (default: 1000)
  -h, --max_hash=INT           Maximum values to store in the hash
                               (default: none)
  -S, --subst_file=STRING      Position-Weight Matrix file for DNA
                               Substitutions
  -G, --gap_penalty=DOUBLE     Gap Penalty for Substitution Matrix
  -M, --max_gap=INT            Maximum Number of Gaps to use in Alignment
  -A, --adaptor=STRING         Adaptor sequence to remove from sequences
  -0, --print_full             Print locations for the entire sequence, not
                               just for the beginning.
  --print_all_sam              Include all possible SAM records in output
  -b, --bs_seq                 Flag to turn on the C to T conversion, used in
                               bisulfite sequence analysis
  --b2                         Flag to turn on bisulfite sequencing, searching the reverse
                               strand of -b
  -d, --a_to_g                 Flag that allows for A to G conversion
  --fast                       Perform a fast alignment (at some reduction
                               of accuracy)
  -s, --gen_skip=INT           Number of bases to skip when the genome is aligned
                               (default: none)
  --bin_size=INT               The resolution for GNUMAP (default: 8)
  -j, --jump=INT               The number of bases to jump in the sequence hashing
                               (default: mer_size / 2)
  -k, --num_seed=INT           The total number of seed hashes that must match to a
                               location before it is considered for alignment
                               (default: 2)
  --snp                        Turn on SNP mapping (will output a .sgrex file)
  --snp_pval=DOUBLE            P-Value cutoff for calling SNPs
                               (default: 0.001)
  --snp_monop                  Flag that turns on monoploid SNP calling
                               (default: diploid SNP calling)
  --illumina                   Defines the fastq file as Illumina file (otherwise
                               does nothing)
  --up_strand                  Will only search the positive strand of the genome
                               for matching location (will not look for reverse
                               compliment match to the genome.
  --down_strand                Will only search the negaitve strand (opposite of
                               --up_strand command)

For MPI usage:
  --MPI_largemem               If the run requires a large amount of memory, this
                               flag will spread it accross several nodes.
                               (default:  not included)

For reading and writing to a binary file:
  --save=FILENAME              Save the genome out to a file
  --read=FILENAME              Read the genome in from a file


Help options:
  -?, --help                   Show this help message

-g: Genome .fa file(s)
gnumap also suports a genome of multiple files, enclosed in quotation marks. If a folder is desired, simply type

	-g "$(ls genome/*.fa)"

and the entire *.fa contents of the folder genome will be used.

-o: Output file This is where the output file should be defined. gnumap will create two output files: one with SAM format data from each sequence, and a second .sgr file usable with the Integrated Genome Browser (IGB)
On the top of the sequence output file, the parameters used in the program will be listed, followed by the statistics for each sequence. According to specifications (and only using the fields that are useful to GNUMAP, SAM format should be as follows:

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> <OPT>

For GNUMAP, <QNAME> will be used if it is a fastq or fasta string--otherwise the query name will be "seq" followed by the sequence number in the file.
The only bits in <FLAG> that are important are 0x0010 (where 0 represents an alignment to the forward strand and 1 represents reverse) or 0x0200, which means the match was too poor to align
<RNAME> is the chromosome mapped to
<POS> is the position on the chromosome
<MAPQ> is the mapping quality, where MAPQ = 10 * log_10(1-p(x)) where p(x) is GNUMAP's posterior probability of mapping to that specific location
<CIGAR> is the alignment differences, where I=Insertion, D=Deletion and M=Mismatch or Match with the genomic strand
<MRNM> will be '=' for all sequences (it's the mate-pair name)
<MPOS> is ignored. Always '0'
<ISIZE> is ignored. Always '0'
<SEQ> is the sequence
<QUAL> is the Phred-based quality of the sequences
<OPT> is an optional field of the format TAG:VTYPE:VALUE.
GNUMAP uses this to report the following fields:
- XA:f:<ALIGN_SCORE> is the raw alignment score
- XP:f:<POST_PROB> is the posterior probability score
- X0:i:<SIM_MATCHES> is the number of similar matches

If the sequence is too ambiguous to be used (the alignment score with itself is less than 0 or it matches too many times to the genome), the sequence will not be matched, but 0x0200 will appear in the second column. If the sequence can be mapped to the genome, one line of output per each best match location will occur in the output. If it does not align with the genome, there will be no corresponding line in the output.

-a: Minimum Alignment Score
The -a option defines the minimum aligment score that will be accepted for mapped reads. The default is a raw score of 10.

-p: Percentage
The -p option indicates that the score given in the -a option should be used as a percentage instead of a raw score.

-q: Read Quality Cutoff
The -q option determines at which level the quality control is turned on. The cutoff score is determined by the alignment score the probabilistic sequence gets with itself. The default score of zero is about as stringent as it can get; using a higher score will prevent poorer sequences from being aligned.

-v: Verbosity
Different levels of verbosity can be used. However, for most purposes, level one is sufficient, ie:

	-v 1

Using this will slow down the program slightly.

--c: Number of threads
gnumap supports multi-threading to increase speed. To use gnumap with 8 processors, use

	-c 8

If the -v 1 parameter is also used, the completion percentage will not be entirely accurate.

--m: Mer length
gnumap uses a hash table to store the genome. The -m option controls the length of these hashes (creating an m-mer hash). Using a longer mer-size, ie:

	-m 14

will increase the sequence alignment, but could potentially decrease the accuracy. As a general rule, as the mer size increases, so does the memory requirement. The current default setting is 9.

--B: Buffer Size
The buffer size determines how much of the genome and sequence files are read in at a time. Eventually, the entire genome will be stored in memory, but if it is necessary to read in a smaller portion at a time, this option can be set lower. Default setting: 10M

--T: Maximum number of matches
When a sequence from the _int.txt file is matched, all the matches above the align_score threshhold are kept. The -M option will determine how many matches are reported and scored. If a given sequence has too many matches (whether its base qualities are too low, or it comes from a heavily-duplicated region), the sequence will not be used. The current default is 1000, but can be changed, ie:

	-T 100000

--h: Maximum Hash Size
For the human genome, there are over 3 million occurances of a 9-mer sequence of A's. By decreasing the hash size, the mers with an occurance greater than the threshhold will not be used. This will speed up the program, as there are fewer mers to compute a match to for each sequence. Currently, this option is not used, but can be set using

	-h 10000

-S: Position-Weight Matrix file
The position-weight matrix used by the algorithm as default is a simple matrix with (-1) as the mismatch score and (+1) as the match score. If a different scoring system is desired, it can be included in a file and passed in as a parameter. The substitution matrix should be in the format:

A C G T

A m mm mm mm

C mm m mm mm

G mm mm m mm

T mm mm mm m

N mm mm mm mm

where m represents a match score and mm represents a mismatch score.

-G: Gap Penalty
The gap penalty used in the probabilistic Needleman-Wunsch algorithm.

-M: Maximum Number of Gaps
When initializing the Needleman-Wunsch DP-matrix, a boundary using the maximum number of gaps is used to not allow an alignment with a large number of gaps.

-A: Adaptor String
Many of the Solexa/Illumina reads also contain portions of the adaptor sequence. Specifying the -A option will remove this adaptor sequence from the end of each read. Currenly, only exact matches are used.

-0: Print Full
Often, especially when sequences are of different lengths because of adaptor trimming, it is useful to have base-pair resolution, not only identifying the beginning of each sequence, but showing length and coverage of each sequence.

--print_all_sam: Print All SAM Records
This flag will print a SAM record for any possible match that GNUMAP finds. For sequences with multiple "good" hits, this flag will print significantly more records than without it. Since the conversion to MAPQ looses a lot of data, look for the XP flag on the end of the SAM line for the posterior probability score of each record.

Important Information on Bisulfite Mapping

Due to the bisulfite treatment problem, there are four different types of reads that can be created: BSW (Bisulfite Watson), BSC (Bisulfite Crick), BSWR (BSW Reverse), and BSCR (BSC Reverse). Pictorally (in ASCII, see citation for full description), this can be described as such:

[ C' -> methylated C ]
[ T* -> bisulfite-treated C ]
[ A* -> bisulftie-treaded C on reverse-compliment strand ]
Watson   >> A C'G T T C G C T T G A G >>
Crick    << T G C'A A G C G A A C T C <<

---Denaturation, Bisulfite Treatment, PRC Amplification---

Watson
BSW  >> A C'G T T T*G T*T T G A G >>
BSWR << T G C A A A*C A*A A C T C <<
Crick
BSC  << T G C'A A G T*G A A T*T T* <<
BSCR >> A C G T T C A*C T T A*A A* >>

(Reference: Xi, Yuanxin et al, "BSMAP: whole genome bisulfite sequence MAPping program." BMC Bioinformatics, vol. 10, no. 1, p. 1471-2105.)

Thus, a complete Bisulfite analysis should include two runs, one each with -b and --b2:

-b: Bisulfite Sequencing
Using bisulfite sequencing analysis, each unmethylated 'c' nucleotide would appear as a 't' in the read. However, methylated c's would not be changed, thus identifying which locations are methylated in the genome. GNUMAP is capable of dealing with these reads, mapping each possibly methylated read to the genome. The final output is in the form of .gmp files. With the -b flag, only BSW and BSC reads are mapped. Including the --up_strand or --down_strand flags will only map the BSW or BSC reads respectively.

--b2: Bisulfite Sequence Mapping (reverse strand mapping)
To be able to map the other two reads (BSWR and BSCR), use the --b2 flag. Using the --up_strand or --down_strand flags will only map BSWR or BSCR, respectively.

-d: A to G Sequencing
In similar manner to the Bisulfite Sequencing process appearing above, the -b flag can be used to allow for substitutios from a genomic a to a genomic g.

--fast: Fast option
The fast option increases the speed of the algorithm at the possible loss of missing some sequencs. Among other things, including --fast will increase the mer size to 14. If an error occurs and the program is not capable of creating more memory, decreasing the mer size will allow it to continue.

-s, --gen_skip: Bases to Skip while hashing
When hashing the genome, instead of creating a hash at every base, the defined number of bases are skipped. This creates a smaller hash in total without losing on accuracy. In order to skip every other base, use:

   -s 1

--bin_size: Number of bases in each bin
When printing the sgr file, this is the number of bases to print. Remember this is a genome-file specific parameter, ie: The bin size of the genome file that is written to memory will need to be the genome size of every run.
To use a bin size of 1 (the sgr output file will have base-pair specificty), use:

  --bin_size=1

-j, --jump: The number of bases to jump
When matching the read, GNUMAP will jump this number of bases when determining hashes. A lower jump value might give a better result, and a jump value of too high will not give very desirable results. Default: MER_SIZE / 2
In order to specify a jump size of 1 (take every possible hash from the sequence), use:

  -j 1

-k, --num_seed: The number of seed hashes that must match
When matching sequences, k number of hashes must match to a given location before the location will be considered for alignment. Significant speedup without loss in accuracy, and tradeoffs between -j and -k can provide optimal performance. Using -k 1 is the exact same as previous versions. Default: 2
In order to require that three hash k-mers align to the read before being considered for alignment, use:

  -k 3

--snp: Turns on SNP calling flag
There are two results from the SNP flag. The first is a sgrex file, which is an extension of the sgr file. It also includes a rudimentary SNP calling program. The second benefit of the SNP flag is that the flags -0 (full print at each base) and --bin_size=1 are set. This will take a little bit more memory, and should not be used with little memory machines (if the genome is large).

--snp_pval: Minimum p-value for making SNP calls
GNUMAP uses a p-value score for calling SNPs in the final .sgrex file. If the p-value is not low enough, it will not label this position as a SNP.

--snp_monop: Monopoid SNP flag
If it is not possible to call diploid SNPs (or if it is not desired to call diploid SNPs), use this flag.

--illumina: Defines fastq as Illumina format
The only difference between Illumina fastq and normal fastq is the ASCII adjustment. Normal fastq will adjust the probability by 33 to obtain an ASCII character, whereas Illumiina uses 64.

--up_strand: Only uses the positive strand for alignment
When searching for matches to a specific sequence, this flag will only allow GNUMAP to search the positive strand of the genome (instead of also looking at the reverse compilment negative strand).

--down_strand: Only uses the negative strand for alignment
This flag does the opposite of the --up_strand command and only searches the negative strand.

--save: Binary file to save Genome
In order to save the genome file to a binary file on disk, include this flag.

  --save=<file_to_save>

--read: Binary file to read Genome
Read a previously stored genome back into memory, using the same parameters (ie, mer size, maximum hash locations, etc) as were included in the writing of the file.

  --read=<file_to_read>

<file_to_parse>

The last parameter is the sequence file created by Illumina. In the latest version of gnumap, any version (_prb.txt, _int.txt, fastq or fasta) can be used.
The _int.txt file looks like this (four numbers identifying the lane, followed by numbers identifying the intensity of each nucleotide):

8	1	889	119	 205.2 -712.3 11023.4 9163.0	9373.4 10373.5  470.0 1059.7	9777.0 12029.2  330.4  819.1...

and the _prb.txt file looks like this (four numbers for each nucleotide, separated by tabs):

8 -21 -21 -21 	-21 -21 -21 8 	8 -21 -21 -21 	-21 -21 -21 8 	-21 8 -21 -21 	8 -21 -21 -21 	-21 8 -21 -21...

If the _seq.txt, or even the fastq file is used, there will not be as much information obtained; only the likelihood of the one base will be included, losing the likelihood of all the other bases.