
Multitrait GWAS as a twodimensional grid of generalized leastsquares problems.
GWAS with multiple traits requires the solution of m x t correlated
GLS problems, originating a threedimensional object B of size
m x t x p.
GenomeWide Association Studies
Multitrait GWAS based on linear mixed models pose a challenge in terms
of both the computational requirements (solution of up to trillions of correlated
GLS problems) and the size of the datasets (up to terabytes) to be processed.
Considering the collection of GLS problems as a whole, and exploiting
applicationspecific knowledge is key to the design efficient algorithms.
 Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association Studies
ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. Article 27, June 2014. @article{FabregatTraver2014:798,
author = "Diego FabregatTraver and Paolo Bientinesi",
title = "Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association Studies",
journal = "ACM Transactions on Mathematical Software (TOMS)",
year = 2014,
volume = 40,
number = 4,
pages = "Article 27",
month = jun
} In many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by
the GenomeWide Association Studies (GWAS),
a widely spread tool in computational biology.
GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized leastsquares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floatingpoint operations) over terabytes of data residing on disk.
In this paper, we design an efficient algorithm for performing GWAS on multicore architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems,
thus reducing the overall computational complexity.
Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be
distributed among the available cores, to attain high performance and scalability.
We believe the paper contributes valuable guidelines
of general applicability for computational scientists on how to develop and optimize numerical algorithms. abstractwebPDFbibtexhide

Distributed Algorithm
enables large scale genome wide association studies.
 High Performance Solutions for Bigdata GWAS
Parallel Computing, Volume 42, pp. 75  87, February 2015. Special issue on Parallelism in Bioinformatics. @article{Peise2015:754,
author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi",
title = "High Performance Solutions for Bigdata GWAS",
journal = "Parallel Computing",
year = 2015,
volume = 42,
pages = "75  87",
month = feb,
note = "Special issue on Parallelism in Bioinformatics"
} In order to associate complex traits with genetic polymorphisms, genomewide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain highperformance and allow the analysis of enormous datasets. abstractwebPDFbibtexhide

Graphical representation of the multilevel summation method
Molecular Dynamics
A typical bottleneck in molecular dynamics simulations is the
performance of the longrange solver. Because of increasing
simulation sizes, the limitations of current longrange methods
become more and more problematic. Therefore, we are interested
in investigating alternatives such as the multilevel summation
method.
More
Contact: Daniel Tameling

Algorithm Generation
yields hundreds of implementations for tensor contractions.
 On the Performance Prediction of BLASbased Tensor Contractions
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193212, Springer International Publishing, 2015. @inproceedings{Peise2015:380,
author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi",
title = "On the Performance Prediction of BLASbased Tensor Contractions",
booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
year = 2015,
editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
volume = 8966,
series = "Lecture Notes in Computer Science",
pages = "193212",
publisher = "Springer International Publishing"
} Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of highperformance kernels for such operations is known to be a challenging task. While for operations on one and twodimensional tensors there exist standardized interfaces and highlyoptimized libraries (BLAS), for higher dimensional tensors neither standards nor highlytuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating highperformance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cacheaware microbenchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the bestperforming algorithms in a tiny fraction of the time taken by the direct execution of the algorithms. abstractwebPDFbibtexhide
