Usage: gnumap [options] <file_to_parse> -g, --genome=STRING Genome .fa file(s) -o, --output=STRING Output file -a, --align_score=DOUBLE Limit for sequence alignment (default: 10) -p, --percent Use percentage when determining alignment score -q, --read_quality=DOUBLE Read quality cutoff: won't align reads if they are below this cutoff value (default=0.0) -v, --verbose=INT Verbose (default=0) -c, --num_proc=INT Number of processors to run on -m, --mer_size=INT Mer size (default=10) -B, --buffer=INT Buffer size -T, --max_match=INT Maximum number of matches for a given sequence (default: 1000) -h, --max_hash=INT Maximum values to store in the hash (default: none) -S, --subst_file=STRING Position-Weight Matrix file for DNA Substitutions -G, --gap_penalty=DOUBLE Gap Penalty for Substitution Matrix -M, --max_gap=INT Maximum Number of Gaps to use in Alignment -A, --adaptor=STRING Adaptor sequence to remove from sequences -0, --print_full Print locations for the entire sequence, not just for the beginning. --print_all_sam Include all possible SAM records in output -b, --bs_seq Flag to turn on the C to T conversion, used in bisulfite sequence analysis --b2 Flag to turn on bisulfite sequencing, searching the reverse strand of -b -d, --a_to_g Flag that allows for A to G conversion --fast Perform a fast alignment (at some reduction of accuracy) -s, --gen_skip=INT Number of bases to skip when the genome is aligned (default: none) --bin_size=INT The resolution for GNUMAP (default: 8) -j, --jump=INT The number of bases to jump in the sequence hashing (default: mer_size / 2) -k, --num_seed=INT The total number of seed hashes that must match to a location before it is considered for alignment (default: 2) --snp Turn on SNP mapping (will output a .sgrex file) --snp_pval=DOUBLE P-Value cutoff for calling SNPs (default: 0.001) --snp_monop Flag that turns on monoploid SNP calling (default: diploid SNP calling) --illumina Defines the fastq file as Illumina file (otherwise does nothing) --up_strand Will only search the positive strand of the genome for matching location (will not look for reverse compliment match to the genome. --down_strand Will only search the negaitve strand (opposite of --up_strand command) For MPI usage: --MPI_largemem If the run requires a large amount of memory, this flag will spread it accross several nodes. (default: not included) For reading and writing to a binary file: --save=FILENAME Save the genome out to a file --read=FILENAME Read the genome in from a file Help options: -?, --help Show this help message
-g "$(ls genome/*.fa)"and the entire *.fa contents of the folder genome will be used.
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> <OPT>
<QNAME>will be used if it is a fastq or fasta string--otherwise the query name will be "seq" followed by the sequence number in the file.
<FLAG>that are important are 0x0010 (where 0 represents an alignment to the forward strand and 1 represents reverse) or 0x0200, which means the match was too poor to align
<RNAME>is the chromosome mapped to
<POS>is the position on the chromosome
<MAPQ>is the mapping quality, where MAPQ = 10 * log_10(1-p(x)) where p(x) is GNUMAP's posterior probability of mapping to that specific location
<CIGAR>is the alignment differences, where I=Insertion, D=Deletion and M=Mismatch or Match with the genomic strand
<MRNM>will be '=' for all sequences (it's the mate-pair name)
<MPOS>is ignored. Always '0'
<ISIZE>is ignored. Always '0'
<SEQ>is the sequence
<QUAL>is the Phred-based quality of the sequences
<OPT>is an optional field of the format
XA:f:<ALIGN_SCORE>is the raw alignment score
XP:f:<POST_PROB>is the posterior probability score
X0:i:<SIM_MATCHES>is the number of similar matches
-a: Minimum Alignment Score
The -a option defines the minimum aligment score that will be accepted for mapped reads. The default is a raw score of 10.
The -p option indicates that the score given in the -a option should be used as a percentage instead of a raw score.
-q: Read Quality Cutoff
The -q option determines at which level the quality control is turned on. The cutoff score is determined by the alignment score the probabilistic sequence gets with itself. The default score of zero is about as stringent as it can get; using a higher score will prevent poorer sequences from being aligned.
Different levels of verbosity can be used. However, for most purposes, level one is sufficient, ie:
-v 1Using this will slow down the program slightly.
--c: Number of threads
gnumap supports multi-threading to increase speed. To use gnumap with 8 processors, use
-c 8If the -v 1 parameter is also used, the completion percentage will not be entirely accurate.
--m: Mer length
gnumap uses a hash table to store the genome. The -m option controls the length of these hashes (creating an m-mer hash). Using a longer mer-size, ie:
-m 14will increase the sequence alignment, but could potentially decrease the accuracy. As a general rule, as the mer size increases, so does the memory requirement. The current default setting is 9.
--B: Buffer Size
The buffer size determines how much of the genome and sequence files are read in at a time. Eventually, the entire genome will be stored in memory, but if it is necessary to read in a smaller portion at a time, this option can be set lower. Default setting: 10M
--T: Maximum number of matches
When a sequence from the _int.txt file is matched, all the matches above the align_score threshhold are kept. The -M option will determine how many matches are reported and scored. If a given sequence has too many matches (whether its base qualities are too low, or it comes from a heavily-duplicated region), the sequence will not be used. The current default is 1000, but can be changed, ie:
--h: Maximum Hash Size
For the human genome, there are over 3 million occurances of a 9-mer sequence of A's. By decreasing the hash size, the mers with an occurance greater than the threshhold will not be used. This will speed up the program, as there are fewer mers to compute a match to for each sequence. Currently, this option is not used, but can be set using
-S: Position-Weight Matrix file
The position-weight matrix used by the algorithm as default is a simple matrix with (-1) as the mismatch score and (+1) as the match score. If a different scoring system is desired, it can be included in a file and passed in as a parameter. The substitution matrix should be in the format:
-G: Gap Penalty
The gap penalty used in the probabilistic Needleman-Wunsch algorithm.
-M: Maximum Number of Gaps
When initializing the Needleman-Wunsch DP-matrix, a boundary using the maximum number of gaps is used to not allow an alignment with a large number of gaps.
-A: Adaptor String
Many of the Solexa/Illumina reads also contain portions of the adaptor sequence. Specifying the -A option will remove this adaptor sequence from the end of each read. Currenly, only exact matches are used.
-0: Print Full
Often, especially when sequences are of different lengths because of adaptor trimming, it is useful to have base-pair resolution, not only identifying the beginning of each sequence, but showing length and coverage of each sequence.
--print_all_sam: Print All SAM Records
This flag will print a SAM record for any possible match that GNUMAP finds. For sequences with multiple "good" hits, this flag will print significantly more records than without it. Since the conversion to MAPQ looses a lot of data, look for the XP flag on the end of the SAM line for the posterior probability score of each record.
[ C' -> methylated C ] [ T* -> bisulfite-treated C ] [ A* -> bisulftie-treaded C on reverse-compliment strand ] Watson >> A C'G T T C G C T T G A G >> Crick << T G C'A A G C G A A C T C << ---Denaturation, Bisulfite Treatment, PRC Amplification--- Watson BSW >> A C'G T T T*G T*T T G A G >> BSWR << T G C A A A*C A*A A C T C << Crick BSC << T G C'A A G T*G A A T*T T* << BSCR >> A C G T T C A*C T T A*A A* >>(Reference: Xi, Yuanxin et al, "BSMAP: whole genome bisulfite sequence MAPping program." BMC Bioinformatics, vol. 10, no. 1, p. 1471-2105.)
Thus, a complete Bisulfite analysis should include two runs, one each with -b and --b2:
-b: Bisulfite Sequencing
Using bisulfite sequencing analysis, each unmethylated 'c' nucleotide would appear as a 't' in the read. However, methylated c's would not be changed, thus identifying which locations are methylated in the genome. GNUMAP is capable of dealing with these reads, mapping each possibly methylated read to the genome. The final output is in the form of .gmp files. With the -b flag, only BSW and BSC reads are mapped. Including the --up_strand or --down_strand flags will only map the BSW or BSC reads respectively.
--b2: Bisulfite Sequence Mapping (reverse strand mapping)
To be able to map the other two reads (BSWR and BSCR), use the --b2 flag. Using the --up_strand or --down_strand flags will only map BSWR or BSCR, respectively.
-d: A to G Sequencing
In similar manner to the Bisulfite Sequencing process appearing above, the -b flag can be used to allow for substitutios from a genomic a to a genomic g.
--fast: Fast option
The fast option increases the speed of the algorithm at the possible loss of missing some sequencs. Among other things, including --fast will increase the mer size to 14. If an error occurs and the program is not capable of creating more memory, decreasing the mer size will allow it to continue.
-s, --gen_skip: Bases to Skip while hashing
When hashing the genome, instead of creating a hash at every base, the defined number of bases are skipped. This creates a smaller hash in total without losing on accuracy. In order to skip every other base, use:
--bin_size: Number of bases in each bin
When printing the sgr file, this is the number of bases to print. Remember this is a genome-file specific parameter, ie: The bin size of the genome file that is written to memory will need to be the genome size of every run.
To use a bin size of 1 (the sgr output file will have base-pair specificty), use:
-j, --jump: The number of bases to jump
When matching the read, GNUMAP will jump this number of bases when determining hashes. A lower jump value might give a better result, and a jump value of too high will not give very desirable results. Default: MER_SIZE / 2
In order to specify a jump size of 1 (take every possible hash from the sequence), use:
-k, --num_seed: The number of seed hashes that must match
When matching sequences, k number of hashes must match to a given location before the location will be considered for alignment. Significant speedup without loss in accuracy, and tradeoffs between -j and -k can provide optimal performance. Using -k 1 is the exact same as previous versions. Default: 2
In order to require that three hash k-mers align to the read before being considered for alignment, use:
--snp: Turns on SNP calling flag
There are two results from the SNP flag. The first is a sgrex file, which is an extension of the sgr file. It also includes a rudimentary SNP calling program. The second benefit of the SNP flag is that the flags -0 (full print at each base) and --bin_size=1 are set. This will take a little bit more memory, and should not be used with little memory machines (if the genome is large).
--snp_pval: Minimum p-value for making SNP calls
GNUMAP uses a p-value score for calling SNPs in the final .sgrex file. If the p-value is not low enough, it will not label this position as a SNP.
--snp_monop: Monopoid SNP flag
If it is not possible to call diploid SNPs (or if it is not desired to call diploid SNPs), use this flag.
--illumina: Defines fastq as Illumina format
The only difference between Illumina fastq and normal fastq is the ASCII adjustment. Normal fastq will adjust the probability by 33 to obtain an ASCII character, whereas Illumiina uses 64.
--up_strand: Only uses the positive strand for alignment
When searching for matches to a specific sequence, this flag will only allow GNUMAP to search the positive strand of the genome (instead of also looking at the reverse compilment negative strand).
--down_strand: Only uses the negative strand for alignment
This flag does the opposite of the --up_strand command and only searches the negative strand.
--save: Binary file to save Genome
In order to save the genome file to a binary file on disk, include this flag.
--read: Binary file to read Genome
Read a previously stored genome back into memory, using the same parameters (ie, mer size, maximum hash locations, etc) as were included in the writing of the file.
The last parameter is the sequence file created by Illumina. In the latest version of gnumap, any version (_prb.txt, _int.txt, fastq or fasta) can be used.
The _int.txt file looks like this (four numbers identifying the lane, followed by numbers identifying the intensity of each nucleotide):
8 1 889 119 205.2 -712.3 11023.4 9163.0 9373.4 10373.5 470.0 1059.7 9777.0 12029.2 330.4 819.1...and the _prb.txt file looks like this (four numbers for each nucleotide, separated by tabs):
8 -21 -21 -21 -21 -21 -21 8 8 -21 -21 -21 -21 -21 -21 8 -21 8 -21 -21 8 -21 -21 -21 -21 8 -21 -21...If the _seq.txt, or even the fastq file is used, there will not be as much information obtained; only the likelihood of the one base will be included, losing the likelihood of all the other bases.