Hi there 👋

Welcome to my blog. I write about Machine Learning, NLP, and other topics I find interesting.

Why Measuring Time is Not Enough: a Practical Roofline Model for ML Training

Originally published on HackerNoon.

Say we are training an LLM or any other DL model. Sometimes we just have this feeling that training is too slow. We didn't pay for our GPUs to sit idle. We want to use them as efficiently as possible. How can we speed things up? What is the best performance we can aim for? Properly benchmarking code is a challenge of its own. In a previous article, we've discussed how to do this correctly and what pitfalls are hidden in benchmarking: CPU overhead, L2 cache, etc. Go check that post out if you haven't already! But let's imagine we have a properly set-up benchmark which shows that the matmul kernel takes 1 ms. Is this bad or good? Today, we're going to learn how to answer these kinds of questions and understand how far we are from hitting the hardware limits. ...

February 9, 2026 · 14 min · 2933 words · Vlad Savinov

What Really Determines the Speed of Your PyTorch Code? CUDA Benchmarking Guide

Originally published on HackerNoon.

Anyone who works with PyTorch model code starts asking the same questions: Why is this taking so long? How do I make my training loop faster? Whether you're an ML engineer, a researcher, or someone who just decided to play around with a random ML repository over the weekend, you will eventually try to understand how to speed your code up. But before we can do that, we need to learn how to measure performance correctly, and then draw the right conclusions from those measurements. This article is about exactly that: properly benchmarking CUDA and PyTorch code. ...

January 27, 2026 · 13 min · 2627 words · Vlad Savinov