# Publications - Roman Iakymchuk

### Journal Articles

**Modeling Performance through Memory-Stalls**
ACM SIGMETRICS Performance Evaluation Review, Volume 40(2), pp. 86-91, October 2012.

We aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and on memory stalls. The main concern of this paper is to build analytical models that accurately predict memory stalls. We consider the scenario in which data resides in the L2 cache; under this assumption, only L1 cache misses occur. We construct an analytical formula for modeling the L1 cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms, such as a matrix factorization, is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER, a BLAS operation and a building block of the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.

```bibtex
@article{Iakymchuk2012:120,
  author    = "Roman Iakymchuk and Paolo Bientinesi",
  title     = "Modeling Performance through Memory-Stalls",
  journal   = "ACM SIGMETRICS Performance Evaluation Review",
  year      = 2012,
  volume    = 40,
  number    = 2,
  pages     = "86--91",
  month     = oct,
  publisher = "ACM",
  address   = "New York, NY, USA",
  url       = "http://hpac.rwth-aachen.de/~pauldj/pubs/pmbs.pdf"
}
```

**On an One-Step Modification of Gauss-Newton Method under Generalized Lipschitz Conditions for Solving the Nonlinear Least Squares Problem**
PAMM, Volume 9(1), Special issue: 80th Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM), Gdansk, pp. 565-566, 2009.

```bibtex
@article{Shakhno2009:638,
  author  = "Stepan Shakhno and Roman Iakymchuk",
  title   = "On an One-Step Modification of Gauss-Newton Method under Generalized Lipschitz Conditions for Solving the Nonlinear Least Squares Problem",
  journal = "PAMM. Vol. 9. Special issue: 80th Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM), Gdansk",
  year    = 2009,
  volume  = 9,
  number  = 1,
  pages   = "565--566"
}
```

**On a Secant Type Method for Nonlinear Least Squares Problems**
Journal of Numerical and Applied Mathematics, Volume 97, pp. 112-121, 2009.

```bibtex
@article{Shakhno2009:558,
  author  = "Stepan Shakhno and Oleksandra Gnatyshyn and Roman Iakymchuk",
  title   = "On a Secant Type Method for Nonlinear Least Squares Problems",
  journal = "Journal of Numerical and Applied Mathematics",
  year    = 2009,
  volume  = 97,
  pages   = "112--121"
}
```
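The composition idea behind the memory-stalls paper above, predicting the cache misses of a factorization by summing a per-step model of its BLAS building block, can be illustrated with a minimal sketch. The miss formula below is a made-up cold-miss count, not the paper's actual model; it assumes 64-byte cache lines holding 8 doubles and an unblocked LU that performs one GER per step:

```python
LINE_DOUBLES = 8  # 64-byte cache line / 8-byte double -- an assumption for illustration

def ger_misses(m, n):
    """Hypothetical cold-miss count for GER, A(m x n) += x * y^T:
    one miss per cache line of the matrix and of the two vectors."""
    return (m * n + m + n) / LINE_DOUBLES

def lu_misses(n):
    """Unblocked LU on an n x n matrix: step k applies GER to the
    trailing (n-k-1) x (n-k-1) submatrix, so sum the per-step model."""
    return sum(ger_misses(n - k - 1, n - k - 1) for k in range(n - 1))

print(lu_misses(1024))
```

The point is only the structure: once a kernel-level model like `ger_misses` is calibrated, the higher-level count falls out by summation, without executing the factorization.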

### Book Chapter

**HPC on Competitive Cloud Resources**
In Handbook of Cloud Computing, pp. 493-516, Springer, 2010.

```bibtex
@inbook{Bientinesi2010:54,
  author      = "Paolo Bientinesi and Roman Iakymchuk and Jeff Napper",
  title       = "HPC on Competitive Cloud Resources",
  booktitle   = "Handbook of Cloud Computing",
  editor      = "Borko Furht and Armando Escalante",
  pages       = "493--516",
  publisher   = "Springer",
  year        = 2010,
  institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University",
  url         = "http://hpac.rwth-aachen.de/~iakymchuk/pub/AICES-2010-06-01.pdf"
}
```

### Peer Reviewed Conference Publications

**Execution-Less Performance Modeling**
Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11), held as part of the Supercomputing Conference (SC11), Seattle, USA, November 2011.

```bibtex
@inproceedings{Iakymchuk2011:258,
  author      = "Roman Iakymchuk and Paolo Bientinesi",
  title       = "Execution-Less Performance Modeling",
  booktitle   = "Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11)",
  year        = 2011,
  address     = "Seattle, USA",
  month       = nov,
  institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University"
}
```

**Improving High-Performance Computations on Clouds Through Resource Underutilization**
Proceedings of the ACM 26th Symposium on Applied Computing, pp. 119-126, ACM, Taichung, Taiwan, 2011.

We investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that underutilizing resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single-node computations are even more impressive: underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to increased parallelism, even with a slow interconnect. In the best case, underutilizing the nodes improved performance enough to entirely offset the cost of an extra node in the cluster.

```bibtex
@inproceedings{Iakymchuk2011:404,
  author    = "Roman Iakymchuk and Jeff Napper and Paolo Bientinesi",
  title     = "Improving High-Performance Computations on Clouds Through Resource Underutilization",
  booktitle = "Proceedings of ACM 26th Symposium on Applied Computing",
  year      = 2011,
  pages     = "119--126",
  address   = "Taichung, Taiwan",
  publisher = "ACM"
}
```

**Performance Prediction through Time Measurements**
Proceedings of the First International Conference on High-Performance Computing (HPC-UA 2011), 2011.

In this article we address the problem of predicting the performance of linear algebra algorithms for small matrices. The approach is based on reducing performance prediction to modeling the execution time of algorithms. The execution time of higher-level algorithms, such as the LU factorization, is predicted by modeling the computational time of kernel linear algebra operations such as the BLAS subroutines. As the time measurements confirm, the execution time of the BLAS subroutines exhibits piecewise-polynomial behavior. Therefore, the subroutines' time is modeled by taking only a few samples and then applying polynomial interpolation. The approach is validated by comparing the predicted execution time of the unblocked LU factorization, which is built on top of two BLAS subroutines, with the separately measured one. Its applicability is illustrated through performance experiments on Intel and AMD processors.

```bibtex
@inproceedings{Iakymchuk2011:330,
  author      = "Roman Iakymchuk",
  title       = "Performance Prediction through Time Measurements",
  booktitle   = "The Proceedings of the First International Conference on High-Performance Computing (HPC-UA 2011)",
  year        = 2011,
  institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University",
  url         = "http://hpac.rwth-aachen.de/~iakymchuk/pub/hpc_ua.pdf"
}
```
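The sample-and-interpolate scheme described in the abstract above can be sketched as follows. A synthetic quadratic timing function stands in for real wall-clock measurements of a BLAS-2 kernel; its coefficients are invented for illustration, and the paper's actual piecewise models and breakpoints are not reproduced here:

```python
import numpy as np

def measured_time(n, a=2.0e-9, b=5.0e-7):
    """Stand-in for timing a BLAS-2 kernel on an n x n problem.
    Assumes t(n) = a*n^2 + b*n within one polynomial 'piece';
    the coefficients are made up for this sketch."""
    return a * n**2 + b * n

# Take only a few samples inside one piece of the piecewise-polynomial regime.
sizes = np.array([128, 256, 512, 1024], dtype=float)
times = np.array([measured_time(n) for n in sizes])

# Fit a degree-2 polynomial to the samples.
coeffs = np.polyfit(sizes, times, deg=2)

# Predict the execution time for an unmeasured size and compare.
n_new = 768
predicted = np.polyval(coeffs, n_new)
actual = measured_time(n_new)
print(f"predicted {predicted:.3e} s vs actual {actual:.3e} s")
```

In practice the few samples would come from actual timings, one fit per polynomial piece; the prediction for a higher-level routine such as the unblocked LU is then assembled from the fitted kernel models.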

### Other

**HPC Sharing in the Cloud**
HPCWire.com, August 2010.

Many scientific applications require significant coordination between large numbers of nodes. Massive effort is spent in the HPC arena to hide much of the coordination latency with expensive low-latency networks and fine-tuned communication libraries. Such efforts have not yet been translated to the commercial cloud computing arena, which still typically offers systems with varying amounts of installed memory but rarely a choice of interconnect quality.

```bibtex
@misc{Napper2010:958,
  author       = "Jeff Napper and Roman Iakymchuk and Paolo Bientinesi",
  title        = "HPC Sharing in the Cloud",
  howpublished = "HPCWire.com",
  month        = aug,
  year         = 2010
}
```