Data Formats
This resource provides a detailed description of the various data formats used by quasitools, either as input or output, and their relationship to each other. In general, the data inputs usually must be consistent with each other. If you change one of the inputs, it is possible that some of the other inputs will need to change as well.
Overview
The following is a summary of the formats used by quasitools:
Format | Description |
---|---|
Reads | A FASTQ file containing sequencing reads. |
Reference | A FASTA reference of either a gene, chromosome, or genome. |
BAM | Specifies sequence alignments between the FASTA reference and FASTQ reads. |
BAI | Indexes the BAM file for faster processing. |
BED4 | Specifies the coordinates of coding sequences in the reference. |
Mutation Database | Specifies meaningful amino acid mutations within coding sequences. |
Codon Variants CSV | Specifies nucleotide variants within codons and resulting amino acid mutations. |
VCF | Specifies observed nucleotide variants and related information. |
AAVF | Specifies observed amino acid variants and related information. |
Relationships
The following is a summary of the relationships between quasitools inputs:
Input | Input | Relationship |
---|---|---|
BAM | Reads | The BAM file describes the alignment of reads to a reference. |
BAM | Reference | The BAM file describes the alignment of reads to a reference. |
BAM | BAI | The BAM index (BAI) file must be created from the BAM file. |
BED4 | Reference | The BED4 file describes coding sequences on the reference. |
BED | Mutation Database | The BED "name" column must be consistent with the MutationDB "coding sequences" column. |
Reads
Most tools in quasitools do not operated directly on reads and instead require BAM alignment files, which must be created manually by the user. Please see the BAM format section for more information.
Reference
The reference file must be in FASTA format. When providing a BAM file, BAM index file (BAI), and a reference file together to the same program within quasitools, the BAM file must have been generated from an alignment with the reference and the BAM index file generated from the BAM file.
BAM
A BAM file (.bam) is the binary version of a SAM file. A SAM (Sequence Alignment/Map) file is a tab-delimited text file that specifies sequence alignments. SAM files are often converted into BAM files to reduce storage space and allow faster processing.
Within quasitools, BAM files are used to provide information about how sequencing reads aligned to a reference file. A BAM index file (.bai) is also required by quasitools and should be named the same as the BAM file, with the extension instead changed from ".bam" to ".bai". When running quasitools, it is important to ensure that the BAM file, BAM index file, and reference file are consistent with each other. The BAM file should be generated from an alignment against the reference and the BAM index file should be generated from the BAM file.
BAM alignment files may be generated using read or sequence alignment software, such as Bowtie2. If your sequence alignment software does not output a BAI file, you may use samtools to index your BAM file.
BED4
A BED file specifies the coordinates of coding sequences, with repsect to sequences within a FASTA reference. The FASTA sequences are larger sequences, such a chromosomes or genes. The specified coding sequences are smaller sequences within the FASTA sequences. These small sequences within FASTA sequences may be genes or gene products. However, the should not contain non-coding sequences. The coordinates of the coding sequences must be specified in 0-based nucleotide coordinates and the length of each coding sequence must be divisible by three.
The BED files used with quasitools must be BED4+ files and therefore must contain at least the first 4 BED file columns. It is very likely that these BED4 files will need to be manually created with knowledge of the locations of coding sequences in the reference. It is very important that the names
of these coding sequences are the same names used in the coding sequences
column of the mutation database, if the user is providing both files as inputs to a tool. The format of the BED files expected by quasitools is as follows:
[identifier] [start] [end] [name]
There must only be one record per line in the BED4 file. The items in each record are specified as follows:
Item | Description | Values |
---|---|---|
identifier | The sequence identifier of the FASTA sequence within the FASTA reference file. | string |
start | The starting nucleotide coordinate of the coding sequence. | integer |
end | The ending nucleotide coordinate of the coding sequence. | integer |
name | The name of the coding sequence. | string |
Example: Genes Witin a Chromsome
Reference File (FASTA)
>chromosome1
ACGTACGT ...
>chromesome2
GGAATTCC ...
BED4 File
chromosome1 300 599 gene1
chromosome1 900 1199 gene2
chromosome2 300 599 gene3
Observe that chromosome1
and chromosome2
are the names of contigs in a reference FASTA file. Our BED4 file specifies three genes within the reference file. gene1
and gene2
reside on the chromosome1
contig and gene3
resides on the chromosome2
contig.
Example: Gene Products Witin a Gene
Reference File (FASTA)
>gene1
ACGTACGT ...
BED4 File
gene1 300 599 product1
gene1 900 1199 product2
Observe that our reference file specifics one contig, gene1
, which corresponds to a gene. The coding sequences specific in the BED4 file are instead gene products: product1
and product2
.
Example: Gene Products Witin a Chromosome
Reference File (FASTA)
>chromosome1
ACGTACGT ...
>chromesome2
GGAATTCC ...
BED4 File
chromosome1 300 399 product1
chromosome1 900 999 product2
chromosome2 300 399 product3
Observe that chromosome1
and chromosome2
are the names of contigs in a reference FASTA file. Our BED4 file specifies three coding sequences which are gene products (assumed to be within genes). product1
and product2
reside on chromosome1
and product3
resides on chromosome2
.
Mutation Database
The mutation database describes specific mutations within the named coding sequences specified previously in the BED4 file. It is used to report animo acid mutations that the user has identified. The mutation database is represented as a specifically formated TSV (tab separated values) file. The mutation database format is as follows:
[coding sequence] [wildtype] [position] [mutation] [category] [surveillance] [comment]
Item | Description | Values |
---|---|---|
coding sequence | The name of the coding sequence (CDS or gene product). | string |
wildtype | The amino acid of wildtype. | character |
position | The position of the amino acid within the coding sequence. | integer |
mutation | The amino acid of the muation. | character |
category | A categorization of the mutation. | string |
surveillance | Whether or not the mutation is part of a surveillance program. | string |
comment | A user-provided comment about the mutation. | string |
In this format, there is one mutation entry per line in the file. Each item of every entry is separated by tabs. A line may be a comment if the very first character of the line is a #
character. Please note that the position
is the amino acid position within that particular coding sequence, in amino acid coordinates. It is very important that the values of the "coding sequence" column are the same names used in the "name" column of the BED4 file, if the user is providing both files as inputs to a tool. Every name that appears in the coding sequence
column must also appear in the name
column of the BED4 file. However, not every name
in the BED4 file must appear under the coding sequence
column in the mutation database.
It is very likely that these mutation database files will need to be manually created with knowledge of the locations of mutations within the coding sequences specified in the BED file.
Example: Mutations Within a Gene
Mutation Database
#genetic region wildtype position mutation category surveillance comment
gene1 M 10 F major Yes comment1
gene1 K 20 I minor No comment2
gene2 W 5 V major No comment3
BED4 File
chromosome1 300 599 gene1
chromosome1 900 1199 gene2
chromosome2 300 599 gene3
Observe that the all names in the first column of the mutation database (gene1
, gene2
) appear in the last column of the BED4 file. However, not every named coding sequence in the BED4 (gene3
) has to appear in the mutation database coding sequence
column.
Codon Variants CSV
The codon variants CSV file is consistent with the standard CSV format. The file describes nucleotide variants within codons and their resulting amino acid variants. It also clarifies whether the nucleotide variants are synonymous or a non-synonymous mutations. The codon variants CSV file has the following comment header:
#gene,nt position (gene),nt start position,nt end position,ref codon,mutant codon,ref AA,mutant AA,coverage,mutant frequency,mutant type,NS count,S count
The items on each line correspond to the following information:
Position | Name | Description |
---|---|---|
0 | gene | The name of the coding region. |
1 | nt position (gene) | The start and end nucleotide positions of the gene which contains the codon. |
2 | nt start position | The start nucleotide position of the codon. |
3 | nt end position | The end nucleotide position of the codon. |
4 | ref codon | The codon in the reference. |
5 | mutant codon | The mutant codon in the data. |
6 | ref AA | The corresponding amino acid in the reference. |
7 | mutant AA | The corresponding amino acid in the mutant. |
8 | coverage | The coverage of the codon. |
9 | mutant frequency | The observed frequency of the mutant codon. |
10 | mutant type | Whether or not the mutation is synonymous (S) or nonsynonymous (NS). |
11 | NS count | The expected number of nonsynonymous sites in the codon. A number between [0, 3]. |
12 | S count | The expected number of synonymous sites in the codon. A number between [0, 3]. |
Please refer to Nei and Gojobori 1986 for more information about how to calculate NS count
and S count
.
VCF
The VCF format used within quasitools is consistent with the standard VCF format. However, quasitools uses custom values in the FILTER
and INFO
columns.
FILTER
Name | Meaning |
---|---|
dp100 | This variant was filtered because the coverage depth was less than 100. |
q30 | This variant was filtered because the quality of the variant was less than 30. |
ac5 | This variant was filtered because the variant was observed less than 5 times. |
Of particular interest is the q30
flag. This flag is set when the estimated quality of the variant is less than 30. quasitools calculates the probability of a variant being legitimate using the Poisson cumulative distribution function. In this framework, λ is the expected number of errors at a particular position (the coverage depth of that position multiplied by the error rate).
In order for the variant to accepted as a real mutation, the probability of the observed variant being caused entirely by errors must be sufficiently low. In other words, at least some of the observed variant was probably caused by at least one real mutation. In order for the probability to be sufficiently low, it must be less than Q30 (1 in 1000 chance). When performing this probability calculation, we assume the worst case scenario: all expected substution errors at a particular position are the same nucleotide as the variant being tested, rather than being evenly distributed evenly across all possible substitions.
INFO
Name | Meaning |
---|---|
DP | The total coverage depth of the pileup at this position. |
AC | The number of times this particular variants was observed in the pileup at this position. |
AF | The frequency of this particular variants was observed in the pileup at this position. |
AAVF
AAVF is a text file format, inspired by the Variant Call Format (VCF). It contains meta-information lines, a head line, and then data lines each containing information about a position in a gene within a genome. Please refer to the AAVF documentation for more information.