# Recent Publications

### Submitted Papers

**ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems**May 2018.

submitted to ACM TOMS.abstractwebPDFSolving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.**Assessment of sound spatialisation algorithms for sonic rendering with headsets**Journal of New Music Research, November 2017.abstractwebGiven an input sound signal and a target virtual sound source, sound spatialisation algorithms manipulate the signal so that a listener perceives it as though it were emitted from the target source. There exist several established spatialisation approaches that deliver satisfactory results when loudspeakers are used to playback the manipulated signal. As headphones have a number of desirable characteristics over loudspeakers, such as portability, isolation from the surrounding environment, cost and ease of use, it is interesting to explore how a sense of acoustic space can be conveyed through them. This article first surveys traditional spatialisation approaches intended for loudspeakers, and then reviews them with regard to their adaptability to headphones.

### Journal Articles

**MatchPy: Pattern Matching in Python**Journal of Open Source Software, Volume 3(26), pp. 2, June 2018.**Accelerating scientific codes by performance and accuracy modeling**Journal of Computational Science, April 2018.

Accepted.@article{Fabregat-Traver2018:408, author = "Diego Fabregat-Traver and {Ahmed E.} Ismail and Paolo Bientinesi", title = "Accelerating scientific codes by performance and accuracy modeling ", journal = "Journal of Computational Science", year = 2018, month = apr, note = "Accepted", url = "http://arxiv.org/abs/1608.04694" }

abstractPDFbibtexScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.**The ELAPS Framework: Experimental Linear Algebra Performance Studies**International Journal of High Performance Computing, pp. 1-13, March 2018.@article{Peise2018:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", journal = "International Journal of High Performance Computing", year = 2018, pages = "1-13", month = mar, publisher = "Sage", url = "http://arxiv.org/pdf/1504.08035v1" }

abstractwebPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.**Design of a high-performance GEMM-like Tensor-Tensor Multiplication**ACM Transactions on Mathematical Software (TOMS), Volume 44(3), pp. 28:1-28:29, January 2018.@article{Springer2018:554, author = "Paul Springer and Paolo Bientinesi", title = "Design of a high-performance GEMM-like Tensor-Tensor Multiplication", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2018, volume = 44, number = 3, pages = "28:1--28:29", month = jan, url = "https://arxiv.org/pdf/1607.00145.pdf" }

abstractwebPDFbibtexWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), 2018.

Accepted.@article{Springer2018:60, author = "Paul Springer and Devin Matthews and Paolo Bientinesi", title = "Spin Summations: A High-Performance Perspective", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2018, note = "Accepted", url = "https://arxiv.org/pdf/1705.06661.pdf" }

abstractwebPDFbibtexBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.

### Peer Reviewed Conference Publications

**Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition**Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE Press, April 2018.@inproceedings{Raissi2018:320, author = "{Tina } Raissi and Alessandro Tibo and Paolo Bientinesi", title = "Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition", year = 2018, month = apr, publisher = "IEEE Press", url = "https://arxiv.org/pdf/1805.05324.pdf" }

abstractwebPDFbibtexWe present a feature engineering pipeline for the construction of musical signal characteristics, to be used for the design of a supervised model for musical genre identification. The key idea is to extend the traditional two-step process of extraction and classification with additive stand-alone phases which are no longer organized in a waterfall scheme. The whole system is realized by traversing backtrack arrows and cycles between various stages. In order to give a compact and effective repre- sentation of the features, the standard early temporal integra- tion is combined with other selection and extraction phases: on the one hand, the selection of the most meaningful char- acteristics based on information gain, on the other hand, the inclusion of the nonlinear correlation between this subset of features, determined by an autoencoder. The results of the experiments conducted on GTZAN dataset reveal a noticeable contribution of this methodology towards the model’s perfor- mance in classification task.**The Generalized Matrix Chain Algorithm**Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 11, 24 February 2018.@inproceedings{Barthels2018:130, author = "Henrik Barthels and Marcin Copik and Paolo Bientinesi", title = "The Generalized Matrix Chain Algorithm", booktitle = "Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization", year = 2018, pages = 11, address = "Vienna, Austria", month = feb, url = "http://arxiv.org/abs/1804.04021" }

abstractwebPDFbibtexIn this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.**Program Generation for Small-Scale Linear Algebra Applications**Proceedings of the International Symposium on Code Generation and Optimization, February 2018.@inproceedings{Spampinato2018:858, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Paolo Bientinesi and Markus Pueschel", title = "Program Generation for Small-Scale Linear Algebra Applications", year = 2018, address = "Vienna, Austria", month = feb, url = "https://arxiv.org/pdf/1805.04775.pdf" }

abstractwebPDFbibtexWe present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as library-based and template-based implementations.