Amazon Research Proposal - Improving Genomic Analysis on Clouds in Preparation for Next-Generation Sequencing

The following application was submitted to Amazon on August 13th, 2010 for a cloud research grant.

Improving Genomic Analysis on Clouds in Preparation for Next-Generation Sequencing

Philip C. Church, Adam Wong, Andrzej Goscinski, Christophe Lefèvre

Cloud computing is gradually being adopted by bioinformatics researchers; however, a number of performance issues arise when common bioinformatics applications are run on a distributed computing platform such as a cloud. With next-generation sequencing techniques, such as pore-based sequencing, genomic data will become even cheaper and faster to generate. Private labs will have many gigabytes of data to transfer, store and analyse. Transferring this much genomic data across a network and out to distributed nodes is time consuming and generates a large amount of network traffic. In addition, many bioinformatics tools were not developed for high-performance computing and therefore, when run on the cloud, cannot take advantage of multiple cores and nodes. The goal of this project is to devise methods that improve the performance of genomic analysis on clouds by minimising the data transferred over the network and by simplifying the parallelization of bioinformatics applications.

With the increased throughput of genomic data collection, the transfer of genomic data over the network, and its storage and querying through databases, are becoming a bottleneck. We plan to develop a cloud service that compresses and decompresses genomic data in order to minimise the data that is stored and transferred. This specialized compression algorithm will combine a number of genomic compression techniques. For example, encoding each nucleotide with a reduced 2-bit character set, instead of one 8-bit character per base, shrinks the sequence data in common FASTA-formatted files to roughly a quarter of its original size. The developed method will balance compression speed against compressed data size. Software will also be developed to facilitate genomic compression and decompression on the client's side. In addition to improving network transfer, compressing genomic data can improve the performance of comparisons, a common operation in database querying and sequence alignment.
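The 2-bit idea mentioned above can be illustrated with a minimal Python sketch. It is not the proposed compression service: the names `pack_sequence` and `unpack_sequence` are hypothetical, and the sketch assumes the input contains only the uppercase bases A, C, G and T, with FASTA header lines, ambiguity codes such as N, and the sequence length stored separately.

```python
# Illustrative 2-bit packing of a nucleotide string (hypothetical helper names).
# Assumption: the sequence contains only A, C, G and T.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_sequence(seq: str) -> bytes:
    """Pack four bases into each byte, two bits per base."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        code = BASE_TO_BITS[base]  # raises KeyError for any other symbol
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_sequence(packed: bytes, length: int) -> str:
    """Recover the original string; the length must be supplied by the caller."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )
```

Ten bases occupy three bytes in this representation instead of ten, and `unpack_sequence(pack_sequence(s), len(s))` returns `s` for any string of the four bases. A practical format would also need an escape mechanism for symbols outside the four-letter alphabet.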

A number of bioinformatics tools, such as those provided through the statistical scripting language R, are executed sequentially. Because these R scripts are not parallelized, they cannot take full advantage of the elastic nature of cloud computing when run on the cloud. This problem can be addressed by developing tools that simplify the distribution of bioinformatics applications across multiple cores and nodes. The implementation will take the form of a configurable application that automates embarrassingly parallel distribution through a three-step workflow: separation of the data, execution of multiple processes on cores and nodes, and collection and merging of the results. Multiple modes of data separation and merging will be supported, allowing use in a range of applications. Ensuring that all resources are utilized will give bioinformatics researchers faster analyses.
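The three-step workflow described above (separate, execute, merge) can be sketched on a single machine with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions, not the proposed tool: `split_data`, `count_gc`, `merge_results` and `run_pipeline` are hypothetical names, the GC-count worker is a stand-in for a real analysis step, and distribution across nodes is not covered.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_data(items, n_chunks):
    """Step 1: separate the input into contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_gc(chunk):
    """Example worker (an assumption): count G and C bases in a chunk of sequences."""
    return sum(seq.count("G") + seq.count("C") for seq in chunk)


def merge_results(partials):
    """Step 3: merge partial results; a plain sum suits this example worker."""
    return sum(partials)


def run_pipeline(items, worker, merge, n_workers=2,
                 executor_cls=ProcessPoolExecutor):
    """Run split -> parallel execute -> merge on one machine."""
    chunks = split_data(items, n_workers)
    # Step 2: one task per chunk; map() returns results in chunk order.
    with executor_cls(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return merge(partials)
```

A call such as `run_pipeline(sequences, count_gc, merge_results, n_workers=4)` would spread the work over four processes; passing `executor_cls=ThreadPoolExecutor` swaps in threads, which is convenient for quick checks. Because the split and merge functions are passed in as arguments, other modes of separation and merging can be substituted without changing the driver.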

By combining improvements in genomic data transfer and genomic data analysis, we seek to speed up bioinformatics analysis. These methods will be compatible with the Amazon cloud platform and will improve the use of computational resources for bioinformatics research. To test the effectiveness of our methods, we plan to run performance tests over multiple nodes using a number of bioinformatics applications and an R-based microarray analysis pipeline of our own design. To facilitate this testing, we will use the Amazon Elastic Compute Cloud to perform computations, Amazon SimpleDB for data storage and Amazon Elastic MapReduce to distribute the workflow.

Projects

  • R Microarray Pipeline
  • Genomic Compression Service
  • Accela