Genome Wide Association Studies (Computational Biology)

Publications and Talks

Journal Articles

  1. Large-Scale Linear Regression: Development of High-Performance Routines
    Applied Mathematics and Computation, Volume 275, pp. 411-421, 15 February 2016.
    @article{Frank2016:90,
        author  = "Alvaro Frank and Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "Large-Scale Linear Regression: Development of High-Performance Routines",
        journal = "Applied Mathematics and Computation",
        year    = 2016,
        volume  = 275,
        pages   = "411--421",
        month   = feb,
        url     = "http://arxiv.org/abs/1504.07890"
    }
    In statistics, series of ordinary least squares problems (OLS) are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of large-scale datasets increases regularly, so does the challenge in dealing with them. Indeed, traditional solvers---which rely on the use of ``black-box'' routines optimized for one single OLS---are highly inefficient and fail to provide a viable solution for big-data analyses. As a case study, in this paper we consider a linear regression consisting of two-dimensional grids of related OLS problems that arise in the context of genome-wide association analyses, and give a careful walkthrough for the development of {\sc ols-grid}, a high-performance routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc ols-grid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.
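The key structural idea in the abstract, eliminating redundant computation across a grid of related OLS problems, can be illustrated with the Frisch-Waugh-Lovell theorem: when every problem shares the same fixed covariates, one projects those covariates out of all SNP and trait vectors once, after which each slope is a ratio of dot products. This is only a minimal in-memory NumPy sketch under that assumption (the function name and interface are illustrative), not the paper's {\sc ols-grid} routine, which additionally handles out-of-core data, blocking, and parallelism:

```python
import numpy as np

def ols_grid(C, X, Y):
    """Coefficient of each x_i in the regressions y_j ~ [C, x_i], for all
    (i, j) pairs. By Frisch-Waugh-Lovell, projecting the shared covariates
    C out of X and Y once reduces every slope to a ratio of dot products."""
    Q, _ = np.linalg.qr(C)        # orthonormal basis of span(C), computed once
    Xr = X - Q @ (Q.T @ X)        # residuals of the SNP columns
    Yr = Y - Q @ (Q.T @ Y)        # residuals of the trait columns
    # slope[i, j] = (x_i~ . y_j~) / (x_i~ . x_i~): two matrix products
    # instead of one factorization per (i, j) problem
    return (Xr.T @ Yr) / np.sum(Xr * Xr, axis=0)[:, None]
```

The payoff is that the cost per (i, j) pair drops from a full least-squares solve to a few vector operations, which is what makes $10^{11}$ problems tractable.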
  2. High Performance Solutions for Big-data GWAS
    Parallel Computing, Volume 42, pp. 75-87, February 2015.
    Special issue on Parallelism in Bioinformatics.
    @article{Peise2015:754,
        author  = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "High Performance Solutions for Big-data GWAS",
        journal = "Parallel Computing",
        year    = 2015,
        volume  = 42,
        pages   = "75--87",
        month   = feb,
        note    = "Special issue on Parallelism in Bioinformatics",
        url     = "http://arxiv.org/pdf/1403.6426v1"
    }
    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as clouds or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double-buffering technique, while the relationship matrix is kept across the main memory of a distributed-memory system. With the help of these solutions, we develop separate algorithms for studies involving a single trait and for those involving many traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets.
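The double-buffering technique mentioned in the abstract can be sketched in a few lines: a reader thread loads the next block from secondary storage while the main thread processes the current one, overlapping IO with computation. This is a minimal single-node illustration (the function names and the Queue-based handoff are assumptions of the sketch), not the paper's distributed implementation:

```python
import threading
import queue

def stream_blocks(load_block, n_blocks, process):
    """Double buffering: while block k is being processed, a reader
    thread is already loading block k+1 from secondary storage."""
    buf = queue.Queue(maxsize=1)      # one spare block "in flight" at a time

    def reader():
        for k in range(n_blocks):
            buf.put(load_block(k))    # blocks while the spare buffer is full
        buf.put(None)                 # sentinel: no more data

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while (block := buf.get()) is not None:
        results.append(process(block))
    return results
```

With `maxsize=1`, exactly one block is read ahead while another is processed, which is the two-buffer scheme the abstract refers to; as long as processing a block takes at least as long as loading one, the IO cost is fully hidden.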
  3. Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis
    Diego Fabregat-Traver, Sodbo Sharapov, Caroline Hayward, Igor Rudan, Harry Campbell, Yurii S. Aulchenko and Paolo Bientinesi
    F1000Research, Volume 3(200), August 2014.
    Open peer reviews.
    @article{Fabregat-Traver2014:430,
        author       = "Diego Fabregat-Traver and Sodbo Sharapov and Caroline Hayward and Igor Rudan and Harry Campbell and {Yurii S.} Aulchenko and Paolo Bientinesi",
        title        = "Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis",
        journal      = "F1000Research",
        year         = 2014,
        volume       = 3,
        number       = 200,
        month        = aug,
        note         = "Open peer reviews"
    }
    To raise the power of genome-wide association studies (GWAS) and avoid false-positive results, one can rely on mixed-model-based tests. When large samples are used, and especially when multiple traits are to be studied in the omics context, this approach becomes computationally unmanageable. Here, we develop new methods for the analysis of an arbitrary number of traits, and demonstrate that for single-trait and multiple-trait GWAS, different methods are optimal. We implement these methods in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, thus increasing throughput while reducing memory usage and energy consumption. We show that compared to existing libraries, the OmicABEL software---which implements these methods---achieves speed-ups of up to three orders of magnitude. As a consequence, samples of tens of thousands of individuals as well as samples phenotyped for many thousands of omics measurements can be analyzed for association with millions of omics features without the need for super-computers.

Peer Reviewed Conference Publications

  1. Streaming Data from HDD to GPUs for Sustained Peak Performance
    Proceedings of Euro-Par 2013, the 19th International European Conference on Parallel and Distributed Computing, Lecture Notes in Computer Science, Volume 8097, pp. 788-799, Springer Berlin Heidelberg, May 2013.
    @inproceedings{Beyer2013:618,
        author    = "Lucas Beyer and Paolo Bientinesi",
        title     = "Streaming Data from HDD to GPUs for Sustained Peak Performance",
        booktitle = "Proceedings of Euro-Par 2013, 19th International European Conference on Parallel and Distributed Computing",
        year      = 2013,
        volume    = 8097,
        series    = "Lecture Notes in Computer Science",
        pages     = "788--799",
        month     = may,
        publisher = "Springer Berlin Heidelberg",
        url       = "http://arxiv.org/pdf/1302.4332v1.pdf"
    }
    In the context of genome-wide association studies (GWAS), one has to solve long sequences of generalized least-squares problems; such a task has two limiting factors: execution time, often in the range of days or weeks, and data management, with datasets on the order of terabytes. We present an algorithm that overcomes both issues. By pipelining the computation, and thanks to a sophisticated transfer strategy, we stream data from hard disk to main memory to GPUs and achieve sustained peak performance; with respect to a highly optimized CPU implementation, our algorithm shows a speedup of 2.6x. Moreover, the approach lends itself to multiple GPUs and attains almost perfect scalability. When using 4 GPUs, we observe speedups of 9x over the aforementioned implementation, and 488x over a widespread biology library.
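The generalized least-squares problems mentioned in the abstract reduce to ordinary least squares after whitening with a Cholesky factor of the covariance matrix; because the covariance matrix is fixed across a GWAS sequence, the expensive factorization is paid only once. A minimal NumPy sketch of this standard reduction (the function name and interface are illustrative, not the paper's routine, and no streaming or GPU offload is shown):

```python
import numpy as np

def gls(X, y, Sigma):
    """Generalized least squares via whitening: factor Sigma = L L^T once,
    then solve the ordinary least-squares problem on L^{-1} X and L^{-1} y.
    In a GWAS sequence, L can be reused across all problems."""
    L = np.linalg.cholesky(Sigma)            # one-time O(n^3) factorization
    Xw = np.linalg.solve(L, X)               # whitened design matrix
    yw = np.linalg.solve(L, y)               # whitened observations
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta
```

After the one-time factorization, each problem in the sequence costs only triangular solves plus a small least-squares solve, which is what makes pipelining the data movement the dominant concern.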
  2. Algorithms for Large-scale Whole Genome Association Analysis
    Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229-234, ACM, 2013.
    @inproceedings{Peise2013:504,
        author    = "Elmar Peise and Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi",
        title     = "Algorithms for Large-scale Whole Genome Association Analysis",
        booktitle = "Proceedings of the 20th European MPI Users' Group Meeting",
        year      = 2013,
        series    = "EuroMPI '13",
        pages     = "229--234",
        address   = "New York, NY, USA",
        publisher = "ACM",
        url       = "http://arxiv.org/pdf/1304.2272v1"
    }
    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double-buffering technique, while the covariance matrix is kept across the main memory of a distributed-memory system. We show that these methods sustain high performance and allow the analysis of enormous datasets.

Technical Report

  1. Streaming Data from HDD to GPUs for Sustained Peak Performance
    Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen, February 2013.
    Technical Report AICES-2013/02-1.
    @techreport{Beyer2013:398,
        author      = "Lucas Beyer and Paolo Bientinesi",
        title       = "Streaming Data from HDD to GPUs for Sustained Peak Performance",
        institution = "Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen",
        year        = 2013,
        month       = feb,
        note        = "Technical Report AICES-2013/02-1"
    }
    In the context of genome-wide association studies (GWAS), one has to solve long sequences of generalized least-squares problems; such a task has two limiting factors: execution time, often in the range of days or weeks, and data management, with datasets on the order of terabytes. We present an algorithm that overcomes both issues. By pipelining the computation, and thanks to a sophisticated transfer strategy, we stream data from hard disk to main memory to GPUs and achieve sustained peak performance; with respect to a highly optimized CPU implementation, our algorithm shows a speedup of 2.6x. Moreover, the approach lends itself to multiple GPUs and attains almost perfect scalability. When using 4 GPUs, we observe speedups of 9x over the aforementioned implementation, and 488x over a widespread biology library.

Talks

  1. Algorithms for Large-scale Whole Genome Association Analysis
    PBio 2013: International Workshop on Parallelism in Bioinformatics.
    EuroMPI 2013, Madrid, September 2013.
    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. We present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double-buffering technique, while the covariance matrix is kept across the main memory of a distributed-memory system. We show that these methods sustain high performance and allow the analysis of enormous datasets.
  2. High Performance Computational Biology: Asynchronous IO and Elemental for GWAS
    Annual Report 1.
    AICES, RWTH Aachen, August 2013.