# Performance Modeling and Prediction for Dense Linear Algebra

### Team

## Summary

The solution to problems from various fields such as chemistry,
physics, and biology can usually be obtained through a number
of alternative approaches. In most cases, the only way to
determine the fastest approach is to implement, test, and
compare all of them. Deriving and implementing these
approaches is very well studied and supported by numerous tools
and frameworks, yet it still requires profound knowledge of
computer science and hardware. The goal of our project is to
devise strategies and tools to support and automate the whole
evaluation process: Given a specific hardware, we aim at
predicting the execution time of computer programs *without
executing them*, in order to both select the fastest
program and its optimal configuration.

A myriad of computational problems involve the solution of linear algebra equations. Research of the past 30 years has shown that efficient algorithms for the solution for these problems can be structured in several layers of libraries, at the very bottom of which is only a handful of kernel operations. Due to their crucial role, implementations of the kernels, which are highly architecture dependent, are among the most optimized software available. Just as bricks are used to build a wall, these kernels are the building blocks of scientific software, and different types of bricks correspond to different linear algebra kernels. Given a computational problem, programmers are supported by an ensemble of tools to use these building blocks to generate various alternative solution algorithms and implementations. However, the execution time of the resulting programs remains extremely difficult to predict accurately. In fact, not only does it depend on the problem size and the kernels used, but it is also influenced by numerous factors such as the particular kernel implementation, the computer architecture, and even the correlations between kernels.

In this project, we address the problem of predicting properties, such as the execution time, of a whole algorithm (wall), by studying properties of the kernel operations (bricks), and by analyzing their interactions (the mortar). At the core of this process, we generate accurate statistical performance models that describe our building blocks. Then, according to information based on their interactions, we combine these models and predict the performance of the full algorithm.

The strength of our approach mainly originates from the building blocks' performance models. Generated once for a given computer system, they can be used to analyze any number of algorithms that are constructed from them. Our ultimate goal is an software framework that, given a linear algebra problem, mechanically constructs possible solution algorithms from the building blocks and uses the performance models for these to select and optimize the best one.

As part of the Ph.D. project, we aim at a methodology that applies to a variety of systems, ranging from standalone computers (possibly with specialized hardware such as graphics cards), to networks and clusters of them.

## Publications and Talks

### Peer Reviewed Conference Publications

**On the Performance Prediction of BLAS-based Tensor Contractions**High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.@inproceedings{Peise2015:380, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "On the Performance Prediction of BLAS-based Tensor Contractions", booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation", year = 2015, editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.", volume = 8966, series = "Lecture Notes in Computer Science", pages = "193-212", month = apr, publisher = "Springer International Publishing", url = "http://arxiv.org/pdf/1409.8608v1" }

abstractwebPDFbibtexTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**High Performance Computing for Computational Science -- VECPAR 2014, Lecture Notes in Computer Science, Volume 8969, pp. 245-258, Springer International Publishing, April 2015.@inproceedings{Peise2015:250, author = "Elmar Peise and Paolo Bientinesi", title = "A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels", booktitle = "High Performance Computing for Computational Science -- VECPAR 2014", year = 2015, editor = "Daydé, Michel and Marques, Osni and Nakajima, Kengo", volume = 8969, series = "Lecture Notes in Computer Science", pages = "245-258", month = apr, publisher = "Springer International Publishing", url = "https://arxiv.org/pdf/1402.5897v1" }

abstractwebPDFbibtexIt is universally known that caching is critical to attain high- performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Performance Modeling for Dense Linear Algebra**Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12), SCC '12, pp. 406-416, IEEE Computer Society, November 2012.@inproceedings{Peise2012:50, author = "Elmar Peise and Paolo Bientinesi", title = "Performance Modeling for Dense Linear Algebra", booktitle = "Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12)", year = 2012, series = "SCC '12", pages = "406--416", address = "Washington, DC, USA", month = nov, publisher = "IEEE Computer Society", url = "http://arxiv.org/pdf/1209.2364" }

abstractwebPDFbibtexIt is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.

### Theses

**Performance Modeling and Prediction for Dense Linear Algebra**HPAC, RWTH Aachen, May 2017.

1st version (under evaluation).@phdthesis{Peise2017:620, author = "Elmar Peise", title = "Performance Modeling and Prediction for Dense Linear Algebra", school = "HPAC, RWTH Aachen", year = 2017, month = may, note = "1st version (under evaluation)", url = "https://arxiv.org/pdf/1706.01341" }

abstractwebPDFbibtexThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**AICES, RWTH Aachen, May 2012.@mastersthesis{Peise2012:168, author = "Elmar Peise", title = "Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms", school = "AICES, RWTH Aachen", year = 2012, month = may, url = "http://arxiv.org/pdf/1207.5217v3" }

abstractwebPDFbibtexA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains a different performances. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool - the Sampler - measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as block-size can be optimally tuned.

### Technical Reports

**The ELAPS Framework: Experimental Linear Algebra Performance Studies**November 2016.@techreport{Peise2016:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", year = 2016, month = nov, url = "http://arxiv.org/pdf/1504.08035v1" }

abstractwebPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.**Cache-aware Performance Modeling and Prediction for Dense Linear Algebra**November 2014.@techreport{Peise2014:904, author = "Elmar Peise and Paolo Bientinesi", title = "Cache-aware Performance Modeling and Prediction for Dense Linear Algebra", year = 2014, month = nov, url = "http://arxiv.org/pdf/1409.8602v1" }

abstractPDFbibtexCountless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation---without executing it---is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, normally one has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms within few percents from the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.

### Talks

**The ELAPS Framework: Experimental Linear Algebra Performance Studies**SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, April 2016.abstractwebPDFThe multi-platform open source framework ELAPS facilitates easy and fast, yet powerful performance experimentation and prototyping of dense linear algebra algorithms. In contrast to most existing performance analysis tools, it targets the initial stages of the development process and assists developers in both algorithmic and optimization decisions. Users construct experiments to investigate how performance and efficiency vary from one algorithm to another, depending on factors such as caching, algorithmic parameters, problem size, and parallelism.**On the Performance Prediction of BLAS-based Tensor Contractions**5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS14).

SC14, New Orleans, LA, USA, 16 November 2014.abstractwebPDF**Estimating the Efficiency of BLAS-based Tensor Contractions**Annual Report 2.

AICES, RWTH Aachen, 6 November 2014.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**The Ninth International Workshop on Automatic Performance Tuning (iWAPT2014), VECPAR 2014.

University of Oregon and Hilton Conference Center, Eugene, Oregon, USA, July 2014.abstractwebPDFIt is universally known that caching is critical to attain high-performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Performance Modeling for DLA Kernels**BLIS Retreat.

University of Texas at Austin, September 2013.**Performance Modeling for Dense Linear Algebra**3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS12).

SC12, Salt Lake City, Utah, USA, November 2012.abstractPDF**Performance Modeling for Ranking Blocked Algorithms**Parallel Matrix Algorithms and Applications (PMAA) 2012.

Birkbeck University of London, June 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains a different performances. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool ,the Sampler, measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as block-size can be optimally tuned.**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**Master's Thesis colloquium.

AICES, RWTH Aachen, April 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on architecture, libraries and problem size, each variant attains a different performances. In my project, I developed methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, I analyze the routines upon which the algorithms are built and introduce a tool that, based on measurements, models their performance. The generated performance models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow me not only to rank the variants but to determine the optimal algorithm configuration.