Integrative methods for reference-independent genome assembly and error detection /

Bibliographic Details
Author / Creator: Bun, Christopher Chean, author.
Ann Arbor : ProQuest Dissertations & Theses, 2016
Description: 1 electronic resource (140 pages)
Format: E-Resource Dissertations
Local Note: School code: 0330
Hidden Bibliographic Details
Other authors / contributors: University of Chicago, degree granting institution.
Notes: Advisor: Rick Stevens. Committee members: James Davis; Ian Foster; Robert Grossman; Fangfang Xia.
Dissertation Abstracts International, Volume: 78-06(E), Section: B.
Summary: High-throughput genetic sequencing technologies have driven the proliferation of new genomic data. From the advent of long-read Sanger sequencing to today's low-cost short-read generation and the upcoming era of single-molecule techniques, methods to address the complex genome assembly problem have evolved alongside the technology and continue to be introduced at a rapid pace. These algorithms attempt to produce an accurate representation of a target genome from datasets filled with errors and ambiguities. Unfortunately, many of the resulting challenges must be addressed through an algorithm's ad hoc criteria and heuristics, and as a result, assemblers can output assembly hypotheses that contain significant errors. Without an inexpensive or computational approach to assess the quality of a given assembly hypothesis, researchers must make do with draft-level genome projects for downstream analysis. Solving three fundamental challenges will alleviate this issue: (i) automating and incorporating algorithms from the dynamic landscape of genome assembly tools, (ii) developing optimal assembly algorithms best suited for various types, or mixtures, of sequencing data, and (iii) developing an approach to assess de novo genome assembly quality independent of a reference genome.
We provide several contributions towards this effort. We first introduce AssemblyRAST, a general compute orchestration framework and accompanying domain-specific language that facilitates rapid workflow design for genome assembly, analysis, and method discovery. Next, we demonstrate the improvement of genome assemblies through novel integrative algorithm techniques. Finally, we devise a method for reference-independent assembly evaluation and error identification through supervised learning, along with several applications to further improve existing techniques.