Hi there 👋

Welcome to my blog. I write about Machine Learning, NLP, and other topics I find interesting.

Why Measuring Time is Not Enough: a Practical Roofline Model for ML Training

Originally published on HackerNoon.

Say we are training an LLM or any other DL model. Sometimes we just have this feeling that training is too slow. We didn't pay for our GPUs to sit idle. We want to use them as efficiently as possible. How can we speed things up? What is the best performance we can aim for? Properly benchmarking code is a challenge of its own. In a previous article, we've discussed how to do this correctly and what pitfalls are hidden in benchmarking: CPU overhead, L2 cache, etc. Go check that post out if you haven't already! But let's imagine we have a properly set-up benchmark which shows that the matmul kernel takes 1 ms. Is this bad or good? Today, we're going to learn how to answer these kinds of questions and understand how far we are from hitting the hardware limits. ...

February 9, 2026 · 14 min · 2933 words · Vlad Savinov

What Really Determines the Speed of Your PyTorch Code? CUDA Benchmarking Guide

Originally published on HackerNoon.

Anyone who works with PyTorch model code starts asking the same questions: Why is this taking so long? How do I make my training loop faster? Whether you're an ML engineer, a researcher, or someone who just decided to play around with a random ML repository over the weekend, you will eventually try to understand how to speed your code up. But before we can do that, we need to learn how to measure performance correctly, and then draw the right conclusions from those measurements. This article is about exactly that: properly benchmarking CUDA and PyTorch code. ...

January 27, 2026 · 13 min · 2627 words · Vlad Savinov