### abstract ###
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before.
The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths.
Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms.
Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology.
From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides.
Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly.
This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
### introduction ###
Genome sequencing technology has moved into a new era with the introduction of extremely fast sequencing technologies that can produce over one billion base pairs of DNA in a single run.
Some of the fastest methods today, based on strategies such as cyclic reversible termination CITATION and ligation-based sequencing CITATION, produce the shortest read lengths, ranging from 15 to 50 bp.
These lengths are sufficient for resequencing projects, including efforts to sample the human population, but they have yet to prove as useful for sequencing of novel species.
The difficulty is that no existing assembly algorithms can accurately reconstruct a genome from such short reads CITATION .
The first published report of a bacterial genome sequence from short reads used pyrosequencing technology, which was able to generate reads averaging 110 bp.
That study CITATION demonstrated the feasibility of assembling the small bacterial genome of Mycoplasma genitalium from reads that covered the genome 40-fold.
This combination of coverage and read length allowed Margulies et al. to generate contiguous stretches of DNA averaging 22.4 kilobases.
Results using pyrosequencing have improved steadily as read lengths have increased to 250 bp and longer, but the difficulty of de novo assembly has raised questions about the utility of alternative sequencing technologies (those that produce reads shorter than 50 bp) for genome sequencing projects.
Assembling novel strains and species, whose genomes have not previously been sequenced, from very short reads has proven more difficult, although simulation studies have indicated that it should be possible CITATION.
A recent study showed that a combination of pyrosequencing reads and paired-end sequencing could be used to assemble a 4 million base pair genome into just 139 contigs, linked together in 22 scaffolds CITATION.
Another recent effort used a hybrid strategy that mixed pyrosequencing and traditional Sanger sequencing to produce draft assemblies of marine microbes CITATION.
In contrast, the very short reads generated by the Solexa Sequence Analyzer have thus far been useful primarily for polymorphism discovery in the human genome, for resequencing and polymorphism discovery in Caenorhabditis elegans CITATION, and for other applications such as ChIP-seq CITATION, which identifies genomic regions bound by transcription factors.
The very short reads (currently 30 to 35 bp) produced by CRT technologies such as Solexa present a far more difficult assembly problem.
Standard assembly algorithms such as Arachne CITATION, CITATION and Celera Assembler CITATION cannot process such short reads at all, spurring the development of several new algorithms designed for short reads, including SSAKE CITATION, Velvet CITATION, Edena CITATION, and ALLPATHS CITATION.
These latter methods can handle Solexa data, but they produce highly fragmented assemblies when provided with whole-genome data from a bacterial genome.
The inherent problem with very short reads is that every repetitive sequence longer than the read length causes breaks in the assembly.
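The effect of repeats on short-read assembly can be illustrated with a toy example. The sketch below is not one of the assemblers named above; it assumes error-free reads that tile every position of a made-up sequence and counts the points in a simple de Bruijn-style graph where the next read is ambiguous. The function name `branching_nodes` and the toy sequence are hypothetical.

```python
from collections import defaultdict


def branching_nodes(genome: str, k: int) -> int:
    """Count (k-1)-mers that have more than one distinct successor.

    Every k-mer of the genome is treated as an error-free read of length k,
    which is the most favorable case for an assembler.
    """
    successors = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# Toy sequence: the 6-bp repeat "ACGTAC" occurs twice with different flanks.
repeat = "ACGTAC"
genome = "TTG" + repeat + "GGA" + repeat + "CCT"

print(branching_nodes(genome, 5))  # reads shorter than the repeat: prints 1
print(branching_nodes(genome, 9))  # reads longer than the repeat: prints 0
```

With 5-bp reads the graph has an ambiguous extension at the end of the repeat, so a contig must stop there; with 9-bp reads every read spans enough unique flanking sequence that the ambiguity disappears.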
To demonstrate the feasibility of assembling a bacterial genome from 33 bp reads, using related genomes to assist the process, we chose Pseudomonas aeruginosa strain PAb1, a highly virulent strain isolated from a frostbite patient.
P. aeruginosa is a ubiquitous environmental bacterium of clinical importance as the leading cause of gram-negative nosocomial infections CITATION, CITATION.
Several P. aeruginosa genomes have been sequenced previously, including two laboratory strains: PAO1, originally isolated from a wound, and PA14 isolated from a burn CITATION, CITATION.
PA14 and PAO1 are 99 percent identical across the 6.05 Mbp shared by both genomes, and their similarity to PAb1 allowed us to improve the assembly and provided a means to check its accuracy.
One of our goals in sequencing PAb1 was to identify genomic differences that contribute to its altered pathogenicity.
Here we report the assembly of P. aeruginosa PAb1 entirely from 33 bp reads, using a novel assembly strategy that takes advantage of related genomes and homologous protein sequences.
The assembly is of very high quality, comparable to or better than draft assemblies produced using earlier sequencing technologies.
This study shows that a novel bacterial genome can be sequenced entirely with very short read technology, without the use of paired-end sequences, and assembled into a high-quality genome.
Even at 40-fold coverage, the amount of sequence represents just one-quarter of a single sequencing run on a Solexa instrument, which brings the sequencing cost easily within the reach of most scientists.
By making all of our assembly software free and open source, we hope to further bring down the barriers to desktop whole-genome sequencing.
We generated 8,627,900 random shotgun reads from P. aeruginosa PAb1 using Solexa technology.
All reads were exactly 33 bp in length.
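As a quick check on the coverage figure, the read count and read length can be converted to fold coverage. The sketch below assumes, for illustration only, that the genome size is approximated by the assembled total reported in the abstract (scaffold plus unplaced contigs); the function name `fold_coverage` is hypothetical.

```python
def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is covered by a read."""
    return num_reads * read_length / genome_size


NUM_READS = 8_627_900
READ_LENGTH = 33
# Assumption: approximate the genome size by the assembled total from the
# abstract (6,290,005 bp in the scaffold + 416,897 bp in unplaced contigs).
ASSEMBLED_BP = 6_290_005 + 416_897

total_bases = NUM_READS * READ_LENGTH
print(total_bases)  # prints 284720700
print(fold_coverage(NUM_READS, READ_LENGTH, ASSEMBLED_BP))
```

Under this assumption the result is a little over 42-fold, consistent with the roughly 40-fold coverage cited in the text.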
We used four distinct computational steps to assemble the genome of PAb1.
For the initial step, we used the comparative assembly algorithm AMOScmp CITATION, which aligns all reads to a reference genome, and then builds contigs based on these alignments.
The algorithm gains efficiency by avoiding the costly all-versus-all overlapping step, which is particularly difficult with very short reads due to the high incidence of false overlaps CITATION.
We modified AMOScmp by tuning the MUMmer software CITATION, which is run within AMOScmp, to look for exact matches to the reference genome of at least 17 bp, allowing at most two mismatches in each read.
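The matching criterion described above (an exact seed of at least 17 bp, then at most two mismatches across the read) can be sketched as a simple seed-and-verify procedure. This is an illustration only: it uses a plain hash index rather than MUMmer's data structures, handles only the forward strand, and all function names are hypothetical.

```python
import random
from collections import defaultdict

SEED_LEN = 17        # minimum exact-match length stated in the text
MAX_MISMATCHES = 2   # mismatches tolerated per read, as stated in the text


def build_seed_index(reference: str, seed_len: int = SEED_LEN) -> dict:
    """Map every seed_len-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index,
             seed_len=SEED_LEN, max_mismatches=MAX_MISMATCHES):
    """Return sorted (ref_start, mismatches) placements on the forward strand."""
    hits = {}
    for offset in range(len(read) - seed_len + 1):
        for pos in index.get(read[offset:offset + seed_len], ()):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


def mutate(seq: str, positions) -> str:
    """Substitute a different base at each listed position (test helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Made-up reference and a 33-bp read with two substitutions.
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(500))
index = build_seed_index(reference)
read = mutate(reference[100:133], [5, 28])
print(map_read(read, reference, index))
```

A read whose mismatches leave no exact 17-bp stretch, or that has three or more mismatches in total, is not placed and would be left as a singleton in this sketch.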
We found that careful trimming of the reads based on their matches to the reference produced better assemblies than un-trimmed reads.
The initial assembly used 7,500,501 reads, leaving 1,127,399 as singletons.
The PAb1 genome is closer to PA14 than to PAO1, and we therefore used PA14 as the primary reference for orienting the contigs.
Our second step was a novel enhancement to the comparative assembly strategy, in which we used multiple reference genomes.
We used the complete genomes of both PAO1 CITATION and PA14 CITATION separately to build multiple comparative assemblies, and found that PA14 produced the better assembly, comprising 2,053 contigs containing 6,206,284 bp.
The bulk of the sequence was contained in 157 contigs longer than 10 Kbp, which collectively covered 5,568,616 bp.
There were 331,364 bp in the PA14 genome that were not covered by the initial assembly, due to divergence between the two strains.
However, the gaps in the comparative assembly based on PAO1 occurred in different locations due to differences between the strains.
The best assembly based on PAO1 comprised 2,797 contigs covering 6,043,652 bp.
We aligned the two assemblies to one another to identify locations where a contig in the PAO1-based assembly might span two or more contigs in the PA14-based assembly.
For each such case, we filled the gap with the sequence from the PAO1 assembly using the Minimus assembler CITATION to stitch together the contigs.
This algorithm closed 203 gaps, reducing the number of contigs to 1,850, of which all but 305 were shorter than 200 bp.
The bulk of the genome, 5,949,162 bp, was contained in just 113 contigs of 10,000 bp or longer.
Note that the overlapping contigs between the two assemblies did not agree perfectly.
In order to produce a clean merged assembly, we re-mapped the reads to the contigs using AMOScmp to create consistent multi-alignments.
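The gap-spanning search in this step can be sketched as follows. The sketch assumes that the alignment between the two assemblies has already been computed and reduced to one interval per contig pair, and it ignores orientation; the `Alignment` record, the contig names, and `find_bridges` are hypothetical and are not part of AMOScmp or Minimus.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    primary_contig: str  # contig from the primary (PA14-based) assembly
    bridge_contig: str   # contig from the secondary (PAO1-based) assembly
    bridge_start: int    # 0-based start of the match on the bridge contig
    bridge_end: int      # exclusive end of the match on the bridge contig


def find_bridges(ordered_primary, alignments):
    """Find secondary contigs that span the gap between adjacent primary contigs.

    Returns (left, right, bridge, fill_start, fill_end) tuples, where
    bridge[fill_start:fill_end] is the candidate sequence for filling the gap.
    """
    by_primary = {}
    for aln in alignments:
        by_primary.setdefault(aln.primary_contig, {})[aln.bridge_contig] = aln

    bridges = []
    for left, right in zip(ordered_primary, ordered_primary[1:]):
        left_hits = by_primary.get(left, {})
        right_hits = by_primary.get(right, {})
        for bridge in sorted(left_hits.keys() & right_hits.keys()):
            a, b = left_hits[bridge], right_hits[bridge]
            if a.bridge_end <= b.bridge_start:  # same order, gap in between
                bridges.append((left, right, bridge, a.bridge_end, b.bridge_start))
    return bridges


# Made-up example: contig p7 from the secondary assembly spans the c1/c2 gap.
alignments = [
    Alignment("c1", "p7", 0, 500),
    Alignment("c2", "p7", 620, 1100),
    Alignment("c3", "p9", 0, 300),
]
print(find_bridges(["c1", "c2", "c3"], alignments))
# prints [('c1', 'c2', 'p7', 500, 620)]
```

In this toy case, bases 500 to 620 of the bridging contig are the candidate fill sequence. Stitching the contigs together and reconciling disagreements between overlapping contigs are separate steps that the sketch does not attempt.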
The third step used a novel algorithm, gene-boosted assembly.
For this step, we took the contigs from the previous step and identified protein-coding genes using our annotation pipeline, which is based on Glimmer CITATION and Blast CITATION.
Because amino acid sequences are much more conserved than nucleotide sequences, we were able to use the predicted protein sequences to fill gaps even where the DNA sequences diverged.
The annotation pipeline identified 5,769 proteins in the 305 longest contigs.
From the initial annotation, we identified those genes that extended beyond the ends of contigs or that spanned the gaps between contigs.
We extracted the amino acid sequences corresponding to these gap positions, with a small buffer sequence included on each side of each gap.
Next we used tblastn CITATION to align each protein sequence to all the unused reads translated in all 6 frames.
This step identified, for each gap, a small set of reads that would fill in the missing protein sequence, and the tblastn results provided initial locations for a multiple alignment.
We then used a new program, ABBA, to assemble the reads together with the flanking contigs and close the gaps.
This gene-boosted assembly protocol extended many contigs and closed 185 gaps, ranging in length from 14 to 1,095 bp, reducing the number of long contigs to 120.
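The six-frame translation that underlies the tblastn search in this step can be sketched in a few lines. This is an illustration of the translation only, using the standard genetic code; it is not the tblastn or ABBA implementation, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read: str) -> dict:
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCAAGTAA")
print(frames["+1"])  # prints MAK*
print(frames["-1"])  # prints LLGH
```

A 33-bp read yields 11 amino acids in the first frame of each strand and 10 in the other two, so each read contributes only a short peptide to the alignment against a gap-spanning protein sequence.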
As a separate test, we conducted a gene-boosted assembly of PAb1 using only the annotated proteins from PA14 without any reference genomic sequence.
For this experiment, we aligned all the translated reads to each protein and used ABBA to assemble each one.
For 4,572 of the proteins, ABBA produced a single contig that covered the entire reference protein, and another 831 proteins assembled into a few contigs.
Thus 5,403 out of 5,602 of the PAb1 proteins can be assembled using a pure gene-boosting approach, and additional proteins would likely be assembled if we used a larger set of proteins for boosting.
This demonstrates that in the absence of a closely related genome sequence, gene-boosted assembly can use protein sequences, which diverge much more slowly than genomic DNA, to assemble most of the genes of a new bacterial strain, although the results will lack global genome structure information.
The fourth step of our method identified any remaining DNA sequences that were unique to PAb1 and outside predicted gene regions.
We separately constructed pure de novo assemblies of the 8.6 million Solexa reads using SSAKE, Edena, and Velvet.
The Velvet assembly was the best of the three, creating 10,684 contigs, the longest being 16,239 bp.
We used MUMmer to align these contigs to the 120 long contigs in our scaffold from the previous step, and identified cases where de novo contigs spanned gaps or extended contigs.
This step allowed us to close 46 gaps, reducing the number of contigs in our main scaffold to 74.
After removing Velvet contigs that were already contained in our scaffold, we had 436 unplaced de novo contigs spanning 416,897 bp.
The longest unplaced contig was 10,493 bp.
