Why Measuring Time is Not Enough: a Practical Roofline Model for ML Training

Originally published on HackerNoon.

Say we are training an LLM or any other deep learning model. Sometimes we just have a feeling that training is too slow. We didn't pay for our GPUs to sit idle; we want to use them as efficiently as possible. How can we speed things up, and what is the best performance we can aim for?

Properly benchmarking code is a challenge of its own. In a previous article, we discussed how to do this correctly and what pitfalls lurk in benchmarking: CPU overhead, the L2 cache, and so on. Go check that post out if you haven't already! But let's imagine we have a properly set-up benchmark which shows that a matmul kernel takes 1 ms. Is that good or bad? Today, we're going to learn how to answer these kinds of questions and understand how far we are from hitting the hardware limits. ...
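The "is 1 ms good or bad?" question comes down to comparing achieved throughput against what the hardware can deliver. A minimal sketch of that comparison, using hypothetical matrix sizes and an assumed peak of 100 TFLOP/s (not figures from the article):

```python
def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """A dense (m, k) @ (k, n) matmul performs roughly 2*m*n*k FLOPs;
    dividing by the measured time gives achieved throughput in TFLOP/s."""
    return 2 * m * n * k / seconds / 1e12

# Hypothetical numbers: a 2048x2048x2048 matmul measured at 1 ms,
# on a GPU with an assumed peak of 100 TFLOP/s.
PEAK_TFLOPS = 100.0

tflops = achieved_tflops(2048, 2048, 2048, 1e-3)
utilization = tflops / PEAK_TFLOPS
print(f"achieved: {tflops:.1f} TFLOP/s, utilization: {utilization:.0%}")
# With these assumed numbers, utilization is well below peak, which
# suggests the kernel is not compute-bound at this size and timing.
```

The same measured 1 ms can mean near-peak efficiency or badly underutilized hardware depending on problem size and the machine's limits, which is exactly the question a roofline model makes precise.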

February 9, 2026 · 14 min · 2933 words · Vlad Savinov