Multi-trait GWAS as a two-dimensional grid of generalized least-squares problems. GWAS with multiple traits requires the solution of m x t correlated GLS problems, originating a three-dimensional object B of size m x t x p.

Genome-Wide Association Studies

Multi-trait GWAS based on linear mixed models pose a challenge in terms of both the computational requirements (solution of up to trillions of correlated GLS problems) and the size of the datasets (up to terabytes) to be processed. Considering the collection of GLS problems as a whole, and exploiting application-specific knowledge is key to the design efficient algorithms.
  1. Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies
    ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. Article 27, June 2014.
        author  = "Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies",
        journal = "ACM Transactions on Mathematical Software (TOMS)",
        year    = 2014,
        volume  = 40,
        number  = 4,
        pages   = "Article 27",
        month   = jun
    In many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.

Distributed Algorithm

enables large scale genome wide association studies.
  1. High Performance Solutions for Big-data GWAS
    Parallel Computing, Volume 42, pp. 75 - 87, February 2015.
    Special issue on Parallelism in Bioinformatics.
        author  = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
        title   = "High Performance Solutions for Big-data GWAS",
        journal = "Parallel Computing",
        year    = 2015,
        volume  = 42,
        pages   = "75 - 87",
        month   = feb,
        note    = "Special issue on Parallelism in Bioinformatics"
    In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high-performance and allow the analysis of enormous datasets.

Graphical representation of the multilevel summation method

Molecular Dynamics

A typical bottleneck in molecular dynamics simulations is the performance of the long-range solver. Because of increasing simulation sizes, the limitations of current long-range methods become more and more problematic. Therefore, we are interested in investigating alternatives such as the multilevel summation method.

Contact: Daniel Tameling

Algorithm Generation

yields hundreds of implementations for tensor contractions.
  1. On the Performance Prediction of BLAS-based Tensor Contractions
    High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, 2015.
        author    = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
        title     = "On the Performance Prediction of BLAS-based Tensor Contractions",
        booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
        year      = 2015,
        editor    = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
        volume    = 8966,
        series    = "Lecture Notes in Computer Science",
        pages     = "193-212",
        publisher = "Springer International Publishing"
    Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.

About HPAC

The High-Performance and Automatic Computing group is concerned with the development and analysis of accurate and efficient numerical algorithms, with focus on numerical linear algebra. We target applications from materials science, molecular dynamics and computational biology, and the whole range of high-performance architectures.


  • Numerical linear algebra
    • Sequences of problems
    • Small scale operations
    • Tensor operations
    • Error analysis
    • Parallel eigensolvers
  • Parallelism
    • Vectorization
    • Multicore
    • Distributed-memory
    • Coprocessors: GPU, Xeon Phi, ...
    • Hybrid
  • Automation
    • Algorithm and code generation
    • Performance modeling and prediction
    • Algorithm ranking
  • Applications
    • Genome analysis
    • Molecular dynamics simulations
    • Symbolic algorithmic differentiation for matrix operations
    • Electronic structure calculations

Projects for Bachelor and Master theses

Check list of available projects.