- Distributed parallel non-equilibrium Green’s function approach to inelastic charge transport. GAMM 2017.
7 March 2017.
- A Compiler for Linear Algebra Operations. ACM Student Research Competition at SPLASH 2016.
3 November 2016.
- IPCC @ RWTH Aachen University: Optimization of multibody and long-range solvers in LAMMPS. IPCC Showcase November 2016.
- The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability. SC 2016.
- The Matrix Chain Algorithm to Compile Linear Algebra Expressions. DSLDI 2016.
31 October 2016.
- Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods. JHPCS'16.
4 October 2016.
- Accelerating Particle-Particle Particle-Mesh Methods for Molecular Dynamics. IPCC Toulouse.
- Cl1ck + LGen: FLAME for small scale linear algebra. BLIS Retreat 2016.
University of Texas at Austin, 19 September 2016.
- Cl1ck: A code generator for linear algebra kernels. Programming Languages Lunch Colloquium.
University of Texas at Austin, 12 September 2016.
Abstract: We present Cl1ck, a code generator for specialized linear algebra kernels. Cl1ck adopts the FLAME methodology for the derivation of formally correct loop-based algorithms, and takes a three-stage approach: first, the input operation is transformed into one or more Partitioned Matrix Expressions (PMEs), i.e., a recursive definition of the operation; then, the PMEs are decomposed to identify a family of loop invariants; finally, loop-based algorithms are built around these loop invariants using formal methods techniques. Different back-ends then enable the translation of the algorithms into Matlab and optimized C code.
- Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. BLIS Retreat 2016.
September 2016.
Abstract: We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so-called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open-source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential to become the method of choice: while GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
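
To illustrate the Transpose-Transpose-GEMM-Transpose (TTGT) baseline named in the GETT abstract above, here is a minimal pure-Python sketch. It is not code from TCCG or GETT, and it makes no attempt at performance. The example contraction C[i,j,k] = sum_m A[i,m,k] * B[m,j], the function names `contract_direct`, `matmul`, and `contract_ttgt`, and the sample data are all assumptions chosen for illustration. The sketch shows the extra transposed copies that TTGT needs, which is the auxiliary memory the abstract says GETT avoids.

```python
from itertools import product


def contract_direct(A, B, I, J, K, M):
    """Reference result: C[i][j][k] = sum_m A[i][m][k] * B[m][j], via nested loops."""
    C = [[[0] * K for _ in range(J)] for _ in range(I)]
    for i, j, k, m in product(range(I), range(J), range(K), range(M)):
        C[i][j][k] += A[i][m][k] * B[m][j]
    return C


def matmul(X, Y):
    """Plain matrix-matrix product, standing in for a tuned GEMM call."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[p] * Y[p][c] for p in range(inner)) for c in range(cols)]
            for row in X]


def contract_ttgt(A, B, I, J, K, M):
    """Same contraction, restructured as transpose, GEMM, transpose."""
    # Transpose and flatten: A[i][m][k] -> A_mat[i*K + k][m], shape (I*K) x M.
    A_mat = [[A[i][m][k] for m in range(M)] for i in range(I) for k in range(K)]
    # B[m][j] is already an M x J matrix, so it needs no transposition here.
    C_mat = matmul(A_mat, B)  # shape (I*K) x J
    # Transpose back: C_mat[i*K + k][j] -> C[i][j][k].
    return [[[C_mat[i * K + k][j] for k in range(K)] for j in range(J)]
            for i in range(I)]


# Hypothetical sample data with small integer entries.
I, J, K, M = 2, 3, 2, 4
A = [[[i + 2 * m + 3 * k for k in range(K)] for m in range(M)] for i in range(I)]
B = [[m - j for j in range(J)] for m in range(M)]

C_ref = contract_direct(A, B, I, J, K, M)
C_ttgt = contract_ttgt(A, B, I, J, K, M)
print(C_ref == C_ttgt)
```

Both routines compute the same tensor. The difference is that `contract_ttgt` materializes `A_mat` and `C_mat` as full copies before and after the matrix product.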