Science, discussed.

Discovering genetic variants from de novo assembly of high-throughput sequencing

High-throughput sequencing is dominating genetics research (and a lot of biotech) because of its speed, low cost, and the resulting abundance of data.  But it still has some major shortcomings.  One big problem is that analysis of this data usually requires aligning reads to a reference genome.  Since this tends to bias any results heavily toward the reference, it can be rather problematic if your sequenced genome is very different from the reference, especially for novel sequence insertions and large rearrangements.  I personally struggled with this problem a few years ago when I needed to analyze yeast data for a strain without a reference — I did the analysis using the standard S288c strain as a reference, but a few percent of the genome was substantially different, which meant we couldn’t reliably identify true variants in those regions.
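To make the reference-bias problem concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the sequences, the names (`kmers`, `reads_from`, `is_placeable`), and above all the "alignment" rule, which uses exact k-mer matching as a crude stand-in for a real aligner. It only shows that reads covering a novel insertion have nowhere to go when the reference lacks that sequence.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reads_from(seq, read_len):
    """Tile seq with overlapping, error-free reads, one starting at each position."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]


def is_placeable(read, ref_kmers, k):
    """Toy stand-in for alignment (an assumption, not how real aligners work):
    a read is placeable only if every k-mer in it occurs in the reference."""
    return kmers(read, k) <= ref_kmers


# Hypothetical reference and a sample that carries a novel insertion.
reference = "ACGTTGCAAGGCTTACCGATAGGCTCAATG"
insertion = "TTTTGGGGCCCCAAAA"
sample = reference[:15] + insertion + reference[15:]

K, READ_LEN = 8, 12
ref_kmers = kmers(reference, K)
reads = reads_from(sample, READ_LEN)

# Reads that overlap the insertion contain k-mers the reference has never seen.
unplaced = [r for r in reads if not is_placeable(r, ref_kmers, K)]
print(f"{len(unplaced)} of {len(reads)} reads cannot be placed on the reference")
```

Real aligners tolerate mismatches and gaps, so the picture is less black and white in practice, but the reads inside a large insertion are still the ones that end up unmapped or mismapped.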

An alternative is to perform de novo assembly of the reads yourself and call variants with respect to that assembly, rather than an already-known reference.  De novo assembly has its own problems, but with long enough reads and sufficient coverage, assembly might be better than forcing your data to align to a suboptimal reference.  A new pipeline that streamlines de novo assembly and variant calling together was posted to arXiv last week:

FermiKit: assembly-based variant calling for Illumina resequencing 
Heng Li

The author, Heng Li, is well known in bioinformatics circles for Samtools, a widely used package of tools for analyzing sequencing data.  The new pipeline described in this paper, FermiKit, strings together existing packages for assembly and variant calling along with a novel data compression technique (critical for human data, which is usually enormous).  The pipeline seems to run rather fast (~1 day for a typical human data set) and also seems pretty easy to use.

Reading about this new package jogged my memory of another pipeline that does both de novo assembly and variant calling:

High-throughput microbial population genomics using the Cortex variation assembler
Zamin Iqbal, Isaac Turner, and Gil McVean

At present I cannot speak to the relative merits of these tools, but I am definitely looking forward to trying them in the future.

NB: On the topic of software and computational methods in biology, last week there was a provocative blog post about their importance and role in biology research, specifically whether we should consider them an intellectual contribution on par with a typical research paper.  It triggered quite a wave of comments from people in the community, many of which are worth reading in my opinion.




This entry was posted on 02/05/2015 in Computational Biology, Science.


Creative Commons License
This blog is licensed under a Creative Commons Attribution 3.0 License.

