# Publications - Diego Fabregat-Traver

### Submitted Paper

**Accelerating scientific codes by performance and accuracy modeling**Journal of Computational Science, May 2017.abstractPDFScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.

### Journal Articles

**Large-Scale Linear Regression: Development of High-Performance Routines**Applied Mathematics and Computation, Volume 275, pp. 411-421, 15 February 2016.@article{Frank2016:90, author = "Alvaro Frank and Diego Fabregat-Traver and Paolo Bientinesi", title = "Large-Scale Linear Regression: Development of High-Performance Routines", journal = "Applied Mathematics and Computation", year = 2016, volume = 275, pages = "411-421", month = feb, url = "http://arxiv.org/abs/1504.07890" }

abstractwebPDFbibtexIn statistics, series of ordinary least squares problems (OLS) are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of large-scale datasets increases regularly, so does the challenge in dealing with them. Indeed, traditional solvers---which rely on the use of ``black-box'' routines optimized for one single OLS---are highly inefficient and fail to provide a viable solution for big-data analyses. As a case study, in this paper we consider a linear regression consisting of two-dimensional grids of related OLS problems that arise in the context of genome-wide association analyses, and give a careful walkthrough for the development of {\sc ols-grid}, a high-performance routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc ols-grid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.**High Performance Solutions for Big-data GWAS**Parallel Computing, Volume 42, pp. 75 - 87, February 2015.

Special issue on Parallelism in Bioinformatics.@article{Peise2015:754, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "High Performance Solutions for Big-data GWAS", journal = "Parallel Computing", year = 2015, volume = 42, pages = "75 - 87", month = feb, note = "Special issue on Parallelism in Bioinformatics", url = "http://arxiv.org/pdf/1403.6426v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high-performance and allow the analysis of enormous datasets.**Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis**F1000Research, Volume 3(200), August 2014.

Open peer reviews.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and Sodbo Sharapov and {Caroline } Hayward and {Igor } Rudan and {Harry } Campbell and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis", journal = "F1000Research", year = 2014, volume = 3, number = 200, month = aug, note = "Open peer reviews", howpublished = "F1000Research" }

abstractwebbibtexTo raise the power of genome-wide association studies (GWAS) and avoid false-positive results, one can rely on mixed model based tests. When large samples are used, and especially when multiple traits are to be studied in the omics context, this approach becomes computationally unmanageable. Here, we develop new methods for analysis of arbitrary number of traits, and demonstrate that for the analysis of single-trait and multiple-trait GWAS, different methods are optimal. We implement these methods in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, thus increasing throughput while reducing memory usage and energy consumption. We show that compared to existing libraries, the OmicABEL software---which implements these methods---achieves speed-ups of up to three orders of magnitude. As a consequence, samples of tens of thousands of individuals as well as samples phenotyped for many thousands of ''omics'' measurements can be analyzed for association with millions of omics features without the need for super-computers.**Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies**ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. 27:1-27:22, June 2014.@article{Fabregat-Traver2014:798, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2014, volume = 40, number = 4, pages = "27:1--27:22", month = jun, publisher = "ACM", url = "http://arxiv.org/pdf/1210.7683v1" }

abstractwebPDFbibtexIn many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.**Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions**Applied Mathematics and Computation, Volume 235, pp. 454-468, May 2014.@article{Di_Napoli2014:210, author = "Edoardo {Di Napoli} and Diego Fabregat-Traver and Gregorio Quintana-Orti and Paolo Bientinesi", title = "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions", journal = "Applied Mathematics and Computation", year = 2014, volume = 235, pages = "454--468", month = may, publisher = "Elsevier", url = "http://arxiv.org/pdf/1307.2100" }

abstractwebPDFbibtex**Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures**Applied Mathematics and Computation, Volume 234, pp. 606-617, May 2014.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures", journal = "Applied Mathematics and Computation", year = 2014, volume = 234, pages = "606--617", month = may, url = "http://arxiv.org/pdf/1210.7325v1" }

abstractwebPDFbibtexGeneralized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high performance in-core and out-of-core shared-memory algorithms for GWAS: By taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size.**Application-tailored Linear Algebra Algorithms: A Search-Based Approach**International Journal of High Performance Computing Applications (IJHPCA), Volume 27(4), pp. 425 - 438, November 2013.@article{Fabregat-Traver2013:4, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Application-tailored Linear Algebra Algorithms: A Search-Based Approach", journal = "International Journal of High Performance Computing Applications (IJHPCA)", year = 2013, volume = 27, number = 4, pages = "425 - 438", month = nov, publisher = "Sage Publications, Inc.", url = "http://arxiv.org/abs/1211.5904" }

abstractwebPDFbibtexIn this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problem-specific knowledge. In most situations, users possess much more information about the problem at hand than what current libraries and computing environments accept; evidence shows that if properly exploited, such information leads to uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra compiler that allows users to input matrix equations together with properties about the operands and the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances are related to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising as part of sensitivity and genome studies; the algorithms produced exhibit, respectively, 100- and 1000-fold speedups.

### Peer Reviewed Conference Publications

**Program Generation for Small-Scale Linear Algebra Applications**Proceedings of the International Symposium on Code Generation and Optimization, February 2018.@inproceedings{Spampinato2018:858, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Paolo Bientinesi and Markus Pueschel", title = "Program Generation for Small-Scale Linear Algebra Applications", year = 2018, address = "Vienna, Austria", month = feb }

abstractbibtexWe present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as library-based and template-based implementations.**Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods**Proceedings of the JARA-HPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200-211, Springer, 2017.@inproceedings{Fabregat-Traver2017:4, author = "Diego Fabregat-Traver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200--211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }

abstractwebPDFbibtex**On the Performance Prediction of BLAS-based Tensor Contractions**High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.@inproceedings{Peise2015:380, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "On the Performance Prediction of BLAS-based Tensor Contractions", booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation", year = 2015, editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.", volume = 8966, series = "Lecture Notes in Computer Science", pages = "193-212", month = apr, publisher = "Springer International Publishing", url = "http://arxiv.org/pdf/1409.8608v1" }

abstractwebPDFbibtexTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**Algorithms for Large-scale Whole Genome Association Analysis**Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229-234, ACM, 2013.@inproceedings{Peise2013:504, author = "Elmar Peise and Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Algorithms for Large-scale Whole Genome Association Analysis ", booktitle = "Proceedings of the 20th European MPI Users' Group Meeting", year = 2013, series = "EuroMPI '13", pages = "229--234 ", address = "New York, NY, USA", publisher = "ACM", url = "http://arxiv.org/pdf/1304.2272v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain high-performance and allow the analysis of enormous datasets.**A Domain-Specific Compiler for Linear Algebra Operations**High Performance Computing for Computational Science - VECPAR 2012, Lecture Notes in Computer Science, Volume 7851, pp. 346-361, Springer, 2013.@inproceedings{Fabregat-Traver2013:340, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "A Domain-Specific Compiler for Linear Algebra Operations", booktitle = "High Performance Computing for Computational Science - VECPAR 2012", year = 2013, editor = "M. Dayde, O. Marques, and K. Nakajima", volume = 7851, series = "Lecture Notes in Computer Science", pages = "346--361", address = "Heidelberg", publisher = "Springer", url = "http://arxiv.org/pdf/1205.5975v1" }

abstractwebPDFbibtexWe present a prototypical linear algebra compiler that automatically exploits domain-specific knowledge to generate high-performance algorithms. The input to the compiler is a target equation together with knowledge of both the structure of the problem and the properties of the operands. The output is a variety of high-performance algorithms to solve the target equation. Our approach consists in the decomposition of the input equation into a sequence of library-supported kernels. Since in general such a decomposition is not unique, our compiler returns not one but a number of algorithms. The potential of the compiler is shown by means of its application to a challenging equation arising within the*genome-wide association study*. As a result, the compiler produces multiple "best" algorithms that outperform the best existing libraries.**Automatic Generation of Loop-Invariants for Matrix Operations**Computational Science and its Applications, International Conference, pp. 82-92, IEEE Computer Society, 2011.@inproceedings{Fabregat-Traver2011:238, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Loop-Invariants for Matrix Operations", booktitle = "Computational Science and its Applications, International Conference", year = 2011, pages = "82-92", address = "Los Alamitos, CA, USA", publisher = "IEEE Computer Society", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2011-02-01.pdf" }

abstractwebPDFbibtexIn recent years it has been shown that for many linear algebra operations it is possible to create families of algorithms following a very systematic procedure. We do not refer to the fine tuning of a known algorithm, but to a methodology for the actual generation of both algorithms and routines to solve a given target matrix equation. Although systematic, the methodology relies on complex algebraic manipulations and non-obvious pattern matching, making the procedure challenging to be performed by hand, our goal is the development of a fully automated system that from the sole description of a target equation creates multiple algorithms and routines. We present CL1ck, a symbolic system written in Mathematica, that starts with an equation, decomposes it into multiple equations, and returns a set of loop-invariants for the algorithms -- yet to be generated -- that will solve the equation. In a successive step each loop-invariant is then mapped to its corresponding algorithm and routine. For a large class of equations, the methodology generates known algorithms as well as many previously unknown ones. Most interestingly, the methodology unifies algorithms traditionally developed in isolation. As an example, the five well known algorithms for the LU factorization are for the first time unified under a common root.**Knowledge-Based Automatic Generation of Partitioned Matrix Expressions**Computer Algebra in Scientific Computing, Lecture Notes in Computer Science, Volume 6885, pp. 144-157, Springer, 2011.@inproceedings{Fabregat-Traver2011:54, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Knowledge-Based Automatic Generation of Partitioned Matrix Expressions", booktitle = "Computer Algebra in Scientific Computing", year = 2011, editor = "Gerdt, Vladimir and Koepf, Wolfram and Mayr, Ernst and Vorozhtsov, Evgenii", volume = 6885, series = "Lecture Notes in Computer Science", pages = "144-157", address = "Heidelberg", publisher = "Springer", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2011-01-03.pdf" }

abstractwebPDFbibtexIn a series of papers it has been shown that for many linear algebra operations it is possible to generate families of algorithms by following a systematic procedure. Although powerful, such a methodology involves complex algebraic manipulation, symbolic computations and pattern matching, making the generation a process challenging to be performed by hand. We aim for a fully automated system that from the sole description of a target operation creates multiple algorithms without any human intervention. Our approach consists of three main stages. The first stage yields the core object for the entire process, the Partitioned Matrix Expression (PME), which establishes how the target problem may be decomposed in terms of simpler sub-problems. In the second stage the PME is inspected to identify predicates, the Loop-Invariants, to be used to set up the skeleton of a family of proofs of correctness. In the third and last stage the actual algorithms are constructed so that each of them satisfies its corresponding proof of correctness. In this paper we focus on the first stage of the process, the automatic generation of Partitioned Matrix Expressions. In particular, we discuss the steps leading to a PME and the knowledge necessary for a symbolic system to perform such steps. We also introduce Cl1ck, a prototype system written in Mathematica that generates PMEs automatically.**Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.**Proceedings of the 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010), AIP Conference Proceedings, Volume 1281, pp. 774-777, 2010.@inproceedings{Fabregat-Traver2010:304, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "774-777", url = "http://hpac.rwth-aachen.de/aices/preprint/documents/AICES-2010-10-01.pdf" }

abstractwebPDFbibtexWe target the automatic generation of formally correct algorithms and routines for linear algebra operations. Given the broad variety of architectures and configurations with which scientists deal, there does not exist one algorithmic variant that is suitable for all scenarios. Therefore, we aim to generate a family of algorithmic variants to attain high-performance for a broad set of scenarios. One of the authors has previously demonstrated that automatic derivation of a family of algorithms is possible when the Partitioned Matrix Expression (PME) of the target operation is available. The PME is a recursive definition that states the relations between submatrices in the input and the output operands. In this paper we describe all the steps involved in the automatic derivation of PMEs, thus making progress towards a fully automated system.

### Thesis

**Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code**RWTH Aachen, April 2014.@phdthesis{Fabregat-Traver2014:278, author = "Diego Fabregat-Traver", title = " Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code", school = "RWTH Aachen", year = 2014, month = apr, url = "http://arxiv.org/abs/1404.3406" }

abstractPDFbibtexThis dissertation focuses on the design and the implementation of domain-specific compilers for linear algebra matrix equations. The development of efficient libraries for such equations, which lie at the heart of most software for scientific computing, is a complex process that requires expertise in a variety of areas, including the application domain, algorithms, numerical analysis and high-performance computing. Moreover, the process involves the collaboration of several people for a considerable amount of time. With our compilers, we aim to relieve the developers from both designing algorithms and writing code, and to generate routines that match or even surpass the performance of those written by human experts.

### Other

**Program Generation for Linear Algebra Using Multiple Layers of DSLs**Proceedings of the DSLDI 2016, Extended abstract, October 2016.bibtex@misc{Spampinato2016:960, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Markus Pueschel and Paolo Bientinesi", title = "Program Generation for Linear Algebra Using Multiple Layers of DSLs", month = oct, year = 2016, type = "Extended abstract" }

### Technical Report

**High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes**Aachen Institute for Computational Engineering Science, RWTH Aachen, July 2012.

Technical Report AICES-2012/07-1.@techreport{Fabregat-Traver2012:20, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = jul, note = "Technical Report AICES-2012/07-1", url = "http://arxiv.org/abs/1207.2169" }

abstractPDFbibtexThe variance component tests used in genomewide association studies of thousands of individuals become computationally exhaustive when multiple traits are analysed in the context of omics studies. We introduce two high-throughput algorithms -- CLAK-CHOL and CLAK-EIG -- for single and multiple phenotype genome-wide association studies (GWAS). The algorithms, generated with the help of an expert system, reduce the computational complexity to the point that thousands of traits can be analyzed for association with millions of polymorphisms in a course of days on a standard workstation. By taking advantage of problem specific knowledge, CLAK-CHOL and CLAK-EIG significantly outperform the current state-of-the-art tools in both single and multiple trait analysis.