complexity

Reports various measures of viral quasispecies complexity.

Basic Usage

FASTA Input:

quasitools complexity fasta [OPTIONS] <FASTA READS FILE>

BAM Input:

quasitools complexity bam <FASTA REFERENCE FILE> <BAM FILE> <K-MER SIZE> [OPTIONS]

Arguments

FASTA Reads File

This input file is only necessary when running the tool in FASTA mode.

A FASTA file containing multiple aligned sequences, each representing a haplotype of a genomic region from the mutant spectrum of interest. This file would typically be produced by running a multiple sequence alignment tool on amplicon sequencing data.

FASTA Reference File

This input file is only necessary when running the tool in BAM mode.

A reference file of the sequence of interest. The BAM file must be generated using this reference file.

BAM File

This input file is only necessary when running the tool in BAM mode.

A BAM file describing the alignments of reads to the reference file provided as input. These reads should be derived from a quasispecies mutant spectrum. This BAM file would typically be created using a read aligner that aligns FASTQ reads to a FASTA reference.

k-mer Size

This input is only necessary when running the tool in BAM mode.

The k-mer size defines the length of the sequence fragments (k-mers) that are examined. A sliding window of this length scans across the reference genome, and the sequences found within each window are used to calculate the quasispecies complexity at that position.
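The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name `kmer_windows` is hypothetical, and the step size of one base and the zero-based start positions are assumptions.

```python
def kmer_windows(reference, k):
    """Yield (start, k-mer) pairs for every window of length k.

    Assumptions (not confirmed by the tool's documentation):
    the window advances one base at a time and positions are zero-based.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # The last valid start position is len(reference) - k.
    for start in range(len(reference) - k + 1):
        yield start, reference[start:start + k]


# A six-base toy reference scanned with k = 4 gives three windows.
windows = list(kmer_windows("ACGTAC", 4))
```

A reference shorter than k yields no windows at all, so k should not exceed the length of the region of interest.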

Options

FILTER

-f [INTEGER]

This option is only available when running the tool in BAM mode.

This option sets a user-defined filter threshold between 0 and 100. Haplotypes that fall below the threshold are excluded when calculating the quasispecies complexity at a particular position in the genome.
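The effect of such a filter can be sketched as follows. This is only an illustration under a stated assumption: the value is interpreted here as a percentage of the haplotype population at a position, which the text above does not confirm. The function name `filter_haplotypes` and the toy counts are hypothetical.

```python
def filter_haplotypes(counts, threshold_percent):
    """Drop haplotypes whose share of the population is below the threshold.

    ASSUMPTION: the threshold is a percentage (0 to 100) of the total
    haplotype count at this position. The tool may define it differently.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        haplotype: count
        for haplotype, count in counts.items()
        if 100.0 * count / total >= threshold_percent
    }


# Toy haplotype counts at one position (hypothetical data).
counts = {"AAAA": 90, "AAAT": 9, "AATT": 1}
kept = filter_haplotypes(counts, 5)
```

With a threshold of 0, every haplotype is kept, which matches the idea of running without a filter.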

OUTPUT FILE

-o [USER-DEFINED-FILE-NAME.CSV]

This option is available when running the tool in both BAM and FASTA modes.

This option allows users to define an output file location, where the program output will be written in CSV format.
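For downstream analysis, the CSV output can be read with Python's standard `csv` module. The sketch below is illustrative only: the function name `parse_complexity_csv` is hypothetical and not part of quasitools, and it assumes every column holds a numeric value, as in the example output in this document. The sample lines are an abbreviated subset of those columns.

```python
import csv


def parse_complexity_csv(lines):
    """Parse complexity CSV lines into a list of {column name: float} dicts.

    `lines` may be an open file handle or any iterable of strings.
    ASSUMPTION: all columns are numeric, as in the documented example output.
    """
    reader = csv.DictReader(lines)
    return [{name: float(value) for name, value in row.items()} for row in reader]


# Abbreviated sample using three of the documented column names.
sample_lines = [
    "Position,Number of Haplotypes,Haplotype Population",
    "0,9,30",
]
rows = parse_complexity_csv(sample_lines)

# To read a file written with the -o option (file name is hypothetical):
# with open("complexity.csv", newline="") as handle:
#     rows = parse_complexity_csv(handle)
```

Keying each row by column name avoids depending on column order, which helps given how many measures are reported per position.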

Output

The quasispecies complexity measures are taken from Gregori, Josep, et al. 2016 and include the following indices:

Incidence (Entity Level):

Number of haplotypes
Number of polymorphic sites
Number of mutations

Abundance (Molecular Level):

Shannon entropy
Simpson index
Gini-Simpson index
Hill numbers
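The abundance indices listed above can be computed from haplotype counts using their standard textbook definitions. The sketch below is not taken from the tool's source: the function names are hypothetical, and the use of the natural logarithm and of the plain sum of squared frequencies for the Simpson index are assumptions, although they agree with the values shown in the BAM example output in this document (six haplotypes observed once each).

```python
import math


def hill_number(freqs, q):
    """Hill number of order q for a list of relative frequencies."""
    if q == 1:
        # Order 1 is defined as the limit: exp of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in freqs))
    return sum(p ** q for p in freqs) ** (1.0 / (1.0 - q))


def abundance_indices(counts):
    """Compute abundance-based diversity indices from haplotype counts.

    ASSUMPTIONS: natural logarithm for Shannon entropy, and the Simpson
    index as the sum of squared frequencies.
    """
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    shannon = -sum(p * math.log(p) for p in freqs)
    simpson = sum(p * p for p in freqs)
    return {
        "shannon": shannon,
        "simpson": simpson,
        "gini_simpson": 1.0 - simpson,
        "hill": {q: hill_number(freqs, q) for q in (0, 1, 2, 3)},
    }


# Six haplotypes, each observed once.
result = abundance_indices([1, 1, 1, 1, 1, 1])
```

For a perfectly even population such as this one, every Hill number equals the number of haplotypes, which makes it a convenient sanity check.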

Functional (Incidence):

Minimum mutation frequency (Mf min)
Mutation frequency (Mfe)
Functional attribute diversity (FAD)
Sample nucleotide diversity

Functional (Abundance):

Maximum mutation frequency (Mf max)
Population nucleotide diversity

Applications

Assessing the quasispecies complexity of a genomic region.
Comparing the quasispecies complexity of multiple genomic regions from the same mutant spectrum.

Example: FASTA Reads File

Data

The following example data may be used to run the tool:

aligned.fasta

Command

quasitools complexity fasta aligned.fasta

Output

Position,Number of Haplotypes,Haplotype Population,Number of Polymorphic Sites,Number of Mutations,Shannon Entropy,Shannon Entropy Normalized to N,Shannon Entropy Normalized to H,Simpson Index,Gini-Simpson Index,Hill Number #0,HIll Number #1,Hill Number #2,Hill Number #3,Minimum Mutation Frequency,Mutation Frequency,Functional Attribute Diversity,Sample Nucleotide Diversity (Entity),Maximum Mutation Frequency,Population Nucleotide Diversity,Sample Nucleotide Diversity
0,9,30,38,40,1.8774672554524843,0.5520018525167073,0.8544721713101401,0.19111111111111112,0.8088888888888889,9.0,6.536927510444632,5.232558139534883,4.543368996115371,0.013333333333333334,0.05555555555555555,7.379999999999999,0.10249999999999998,0.03866666666666667,0.06682222222222223,0.06912643678160921

Example: BAM File With a Reference FASTA

Data

The following example data may be used to run the tool:

generated.fasta
generated.bam

Command

quasitools complexity bam generated.fasta generated.bam 200

Output

Position,Number of Haplotypes,Haplotype Population,Number of Polymorphic Sites,Number of Mutations,Shannon Entropy,Shannon Entropy Normalized to N,Shannon Entropy Normalized to H,Simpson Index,Gini-Simpson Index,Hill Number #0,HIll Number #1,Hill Number #2,Hill Number #3,Minimum Mutation Frequency,Mutation Frequency,Functional Attribute Diversity,Sample Nucleotide Diversity (Entity),Maximum Mutation Frequency,Population Nucleotide Diversity,Sample Nucleotide Diversity
0,6,6,6,15,1.7917594692280547,0.9999999999999999,0.9999999999999999,0.16666666666666669,0.8333333333333333,6.0,5.999999999999998,5.999999999999999,6.000000000000001,0.0125,0.01916666666666667,0.7300000000000004,0.02433333333333335,0.019166666666666665,0.020277777777777773,0.02433333333333333