Walkthrough

Overview

The purpose of this walkthrough will be to illustrate a simple, but complete example of using Neptune to locate discriminatory sequences. We will identity signature sequences within an artificial data set containing three inclusion sequences and three exclusion sequences. The output will be a list of signatures, sorted by score, for each inclusion input, and one consolidated signatures file, sorted by signature score, containing signatures from all inclusion inputs.

Input Data

We will be using very small, artificial genomes for this walkthrough. However, these small genomes will be sufficient to illustrate the operation of Neptune. The artificial genome sequence content is derived from Escherichia coli and has been modified to introduce simple variation between genomes.

The example inclusion genomes are located in the following location:

neptune/tests/data/example/inclusion/

The example exclusion genomes are located in the following location:

neptune/tests/data/example/exclusion/

The inclusion and exclusion directories each contain three FASTA-format genomes. The genomes all have some insertions and deletions that differentiate them from each other. However, the three inclusion genomes primarily differ from the three exclusion genomes in that they share large sequences that are absent from all exclusion genomes.

Running Neptune

Neptune will automatically calculate many of the parameters that might otherwise be specified by the user, such as the minimum number of inclusion inputs the signature sequence must be present within for it to be considered shared sequence. At minimum, Neptune requires the user specify the inclusion sequences, exclusion sequences, and an output directory. We will provide Neptune inclusion and exclusion sequences in the form of FASTA file genomes located within directories. The following command will run Neptune on the example data and output to the specified directory:

neptune
    --inclusion tests/data/example/inclusion/
    --exclusion tests/data/example/exclusion/
    --output output/

Output

Standard Output

After running Neptune, very similar output will be printed to standard output, indicating that Neptune is starting and completing different stages of operation:

Neptune v2.0.0

Estimating k-mer size ...
k = 15

k-mer Counting...
Submitted 6 jobs.
0.04893639404326677 seconds

k-mer Aggregation...
Submitted 65 jobs.
0.0374402878805995 seconds

Signature Extraction...
Submitted 3 jobs.
0.024856548057869077 seconds

Signature Filtering...
Submitted 2 jobs.
Submitted 3 jobs.
1.965036110021174 seconds

Consolidate Signatures...
Submitted 1 jobs.
0.21438432089053094 seconds

Complete!

Consolidated Signatures

As we did not specify references from which to extract signatures, Neptune will automatically investigate all inclusion genomes for signatures and consolidate those signatures into a single consolidated signature file. The output/consolidated/consolidated.fasta file contains these consolidated signatures. This file may be understood as the final output of the application. The following FASTA-format output is from the consolidated signatures file produced from this example:

>0.0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion3 pos=99
TAGTCTCCAGGATTCCCGGGGCGGTTCAGATAATCTTAGCATTGACCGCCTTTATATAGAAGCTGTTATTCAAGAAGCATTTTCAAGCAGTGATGTAAGAAAA
>1.1 score=0.9979 in=0.9979 ex=0.0000 len=640 ref=inclusion2 pos=3494
CGCGGGCGATATTTTCACAGCCATTTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAGAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAAGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA
>2.2 score=0.9966 in=0.9966 ex=0.0000 len=98 ref=inclusion1 pos=5209
GCGAGTTTTGCGAGATGGTGCCGGAGTTCATCGAAAAAATGGACGAGGCACTGCTGAAATTGGTTTTGTATTTGGGGAGCAATGGCGATGAAGCATCC

The FASTA header contains information relavent to the identified signature. A detailed explanation of this information is located within the output section of this documentation. The score is the sum of the in (inclusion/sensitivity) and ex (exclusion/specificity) scores, and represents a combined measure of sensitivity and specificity. The length describes the length of the signature in bases. The ref (reference) and pos (position) describe the location of the signature within the reference FASTA record it was extracted from.

In this example, Neptune identified three signatures: 0.0, 1.1, and 2.2 of lengths 103, 640, and 98, respectively. These IDs (0.0, 1.1, 2.2) are arbitrarily assigned by Neptune and are only used to uniquely identify each signature. They have no other meaning.

We can see that the 0.0 signature is derived from the inclusion3 reference, the 1.1 from the inclusion2 reference, and the 2.2 from the inclusion1 reference. These signatures were located at positions 99, 3497, and 5209 within the inclusion1 reference. These signatures are of very high quality, within the context of our data set, with scores of 1.0000, 0.9979, and 0.9969, within the possible range of score values from -1.00 to +1.00.

Sorted Signatures

If we're interested in looking at the signatures produced from each individual inclusion input, we need to investigate the output in the output/sorted directory. The following are the signatures extracted exclusively from the inclusion1.fasta input:

>0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion1 pos=99
TAGTCTCCAGGATTCCCGGGGCGGTTCAGATAATCTTAGCATTGACCGCCTTTATATAGAAGCTGTTATTCAAGAAGCATTTTCAAGCAGTGATGTAAGAAAA
>1 score=0.9979 in=0.9979 ex=0.0000 len=640 ref=inclusion1 pos=3497
CGCGGGCGATATTTTCACAGCCATTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAGAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAAGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTGAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATAGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA
>2 score=0.9966 in=0.9966 ex=0.0000 len=98 ref=inclusion1 pos=5209
GCGAGTTTTGCGAGATGGTGCCGGAGTTCATCGAAAAAATGGACGAGGCACTGCTGAAATTGGTTTTGTATTTGGGGAGCAATGGCGATGAAGCATCC

The following are the signatures extracted exclusively from the inclusion2.fasta input:

>0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion2 pos=99
TAGTCTCCAGGATTCCCGGGGCGGTTCAGATAATCTTAGCATTGACCGCCTTTATATAGAAGCTGTTATTCAAGAAGCATTTTCAAGCAGTGATGTAAGAAAA
>1 score=0.9979 in=0.9979 ex=0.0000 len=640 ref=inclusion2 pos=3494
CGCGGGCGATATTTTCACAGCCATTTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAGAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAAGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA
>2 score=0.9933 in=0.9933 ex=0.0000 len=99 ref=inclusion2 pos=5206
GCGAGTTTTGACGAGATGGTGCCGGAGTTCATCGAAAAAATGGACGAGGCACTGCTGAAATTGGTTTTGTATTTGGGGAGCAATGGCGATGAAGCATCC

The following are the signatures extracted exclusively from the inclusion3.fasta input:

>0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion3 pos=99
TAGTCTCCAGGATTCCCGGGGCGGTTCAGATAATCTTAGCATTGACCGCCTTTATATAGAAGCTGTTATTCAAGAAGCATTTTCAAGCAGTGATGTAAGAAAA
>2 score=0.9833 in=0.9833 ex=0.0000 len=100 ref=inclusion3 pos=5203
GCGAGTTTTAACGAGATGGTGCCGGAGTTCATCGAAAAAATGGACCGAGGCACTGCTGAAATTGGTTTTGTATTTGGGGAGCAATGGCGATGAAGCATCC
>1 score=0.9792 in=0.9979 ex=0.0187 len=640 ref=inclusion3 pos=3492
CGCGGGCGATATTTTCACAGCCATTTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAATGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTGAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA

The output from these files appears very similar, as is expected when Neptune identifies highly discriminatory signatures from a homogeneous data set. However, there are some slight differences between some of these signatures. For example, signature ID 2 in inclusion1.fasta corresponds to 2 in inclusion2.fasta, but corresponds to ID 1 in inclusion3.fasta. The lengths (98, 99, 100) and scores (0.9966, 0.9933, 0.9833) are slightly different. Another slight difference is the sequence similarity of signature ID 1 in inclusion3.fasta with exclusion sequence:

>1 score=0.9792 in=0.9979 ex=0.0187 len=640 ref=inclusion3 pos=3492
CGCGGGCGATATTTTCACAGCCATTTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAATGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTGAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA

This signature had some similarity with exclusion sequence, represented by the ex=0.0187, and indicates a very small amount of imprecise discrimination in this signature. This example further illustrates that the score (0.9792) is the sum of the in (0.9979) and ex (0.0187) values.

These differences in signatures from each inclusion target are a consequence of input sequence variation. The user's discretion will be required in determining which of these are most appropriate. Nonetheless, as described above, Neptune will attempt to consolidate these signatures into a single output file, if a single answer is desirable.