Consensus

Generate a consensus sequence from a BAM file. With default settings, a simple majority base rule will be used to build the consensus. However, the user may specify a minimum percentage of abundance for base incorporation into the consensus sequence, which may produce IUPAC codes in the consensus.

Basic Usage

quasitools consensus [options] <BAM file> <reference file>

Arguments

BAM File

A BAM file (.bam) of sequences aligned to a related reference. A BAM index file (.bai) is also required and should be named the same as the BAM file, with the extension instead changed from ".bam" to ".bai".
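The index naming rule above can be checked before running the tool. The following is a minimal sketch using only the Python standard library; the helper name `expected_index_path` is hypothetical and is not part of quasitools.

```python
from pathlib import Path


def expected_index_path(bam_path):
    """Return the index path described above: same name, '.bam' replaced by '.bai'.

    Hypothetical helper for illustration only; not a quasitools function.
    """
    bam = Path(bam_path)
    if bam.suffix != ".bam":
        raise ValueError(f"expected a .bam file, got: {bam.name}")
    # Swap the final extension, keeping the directory and base name.
    return bam.with_suffix(".bai")


print(expected_index_path("variant.bam"))
```

If the index was written under a different name by another tool, renaming or copying it to the path returned here should satisfy the naming rule stated above.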

Reference File

A reference file related to the aligned BAM sequences. The provided reference file must be the same reference file used when producing the BAM and BAM index files.

Options

Percentage

-p, --percentage INTEGER

This percentage option causes the consensus tool to operate in one of two distinct modes. When the percentage is set to exactly 100, the consensus is generated by taking the most abundant base at each position. In contrast, when the percentage is less than 100, the consensus is generated by comparing the frequency of each base at a position against the threshold. The default value is 100. These two modes of operation are described in greater detail below.

When the percentage is set to exactly 100, the most abundant base at each position is incorporated into the consensus sequence. In the case of a tie, the base is chosen in reverse alphabetical order. When no base is present (zero coverage, insertion, only N), an ambiguous N base is incorporated. Additionally, insertions whose length is a multiple of 3 (i.e. codon length) are incorporated.
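The per-position rule for the 100 percent mode can be sketched as follows. This is an illustration only, not the quasitools implementation: the function name `majority_base` and the count-dictionary input are assumptions, and "reverse alphabetical order" is interpreted here as preferring T over G over C over A when counts tie. Insertion handling is left out.

```python
BASES = set("ACGT")


def majority_base(counts):
    """Pick the consensus base for one position from a dict of base counts.

    Hypothetical helper: `counts` maps a base letter to its observed count.
    """
    # Ignore N (and anything else that is not A, C, G or T) and empty counts.
    observed = {b: n for b, n in counts.items() if b in BASES and n > 0}
    if not observed:
        # Zero coverage or only N: fall back to the ambiguous base.
        return "N"
    # Highest count wins; on a tie the alphabetically later base wins
    # (assumed reading of "reverse alphabetical order").
    return max(observed, key=lambda b: (observed[b], b))


print(majority_base({"A": 12, "C": 3}))
print(majority_base({"A": 5, "G": 5}))
```

Under these assumptions the first call picks A as the clear majority, and the second resolves the A/G tie in favour of G.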

When the percentage is less than 100, the tool determines how many different bases pass the minimum incorporation threshold (percentage) at each position. Any bases that exceed this threshold are incorporated into the consensus sequence at the given position. When multiple bases exceed the minimum threshold, the base is converted to an ambiguous IUPAC base. A table outlining these conversions is found below:

Bases Exceeding Threshold    Incorporated Base
A                            A
C                            C
G                            G
T                            T
AC                           M
AG                           R
AT                           W
CG                           S
CT                           Y
GT                           K
ACG                          V
ACT                          H
AGT                          D
CGT                          B
ACGT                         N

For example, if the percentage is 20 and A has been observed 30% of the time, G has been observed 30% of the time, and T has been observed 40% of the time, then the base incorporated into the consensus will be D. However, when there is zero coverage, or when no bases meet the percentage threshold, an ambiguous N base will be incorporated into the consensus. Insertions are not incorporated into the consensus in this mode of operation.
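The threshold mode can be sketched for a single position as follows. This is a minimal illustration under stated assumptions, not the quasitools implementation: the function name `threshold_base` is hypothetical, frequencies are computed over the A, C, G and T counts only, and a base is treated as passing when its frequency is greater than or equal to the percentage (the text above does not say whether an exact match counts).

```python
# Conversion table from the documentation above: passing bases -> IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def threshold_base(counts, percentage):
    """Return the consensus base for one position in threshold mode.

    Hypothetical helper: `counts` maps a base letter to its observed count.
    """
    total = sum(counts.get(b, 0) for b in "ACGT")
    if total == 0:
        # Zero coverage: ambiguous base.
        return "N"
    # Collect passing bases in alphabetical order so they match the table keys.
    passing = "".join(
        b for b in "ACGT" if 100 * counts.get(b, 0) / total >= percentage
    )
    # No passing base yields an empty key, which also falls back to N.
    return IUPAC.get(passing, "N")


# The worked example from the text: A 30%, G 30%, T 40%, percentage 20.
print(threshold_base({"A": 30, "G": 30, "T": 40}, 20))
```

With the worked example from the text, the passing bases are A, G and T, so the sketch returns D.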

ID

-i, --id TEXT

Specify the default FASTA sequence identifier to be used for sequences without an RG tag.

Output

-o, --output FILENAME

The file output location for the generated consensus sequence.
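When the tool is driven from a script, the options described above can be assembled into an argument list. The sketch below uses only the short flags documented in this section (-p, -i, -o) and follows the argument order given under Basic Usage; the function name `build_consensus_command` and the output file name `consensus.fasta` are hypothetical.

```python
import shlex


def build_consensus_command(bam, reference, percentage=None, seq_id=None, output=None):
    """Build the argument list for a `quasitools consensus` call.

    Hypothetical helper; it only assembles arguments and does not run the tool.
    """
    cmd = ["quasitools", "consensus"]
    if percentage is not None:
        cmd += ["-p", str(percentage)]
    if seq_id is not None:
        cmd += ["-i", seq_id]
    if output is not None:
        cmd += ["-o", output]
    # Positional arguments come last: BAM file, then reference file.
    cmd += [bam, reference]
    return cmd


cmd = build_consensus_command(
    "variant.bam", "hiv.fasta", percentage=20, output="consensus.fasta"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where quasitools is installed. Building a list rather than a single string avoids shell quoting problems with file names that contain spaces.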

Output

A consensus sequence in FASTA format will be written to standard output unless specified otherwise.

Example

Data

The following example data may be used to run the tool:

Command

quasitools consensus variant.bam hiv.fasta

Output

>variant_100_AF033819.3
ACTCTGGTAACTAGAGATCCCTCAGACCCATTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACCTGA
AAGCGAAAGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGGCGG
CGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTCCGAGAGCGTCAGTATTAAGCG
GGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAATTAAAACATATAGTATGG
GCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGTAGACAAATACTGGGACA
GCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAACCCTCTATTGTGTGCATC

(output truncated)