I made a kernel 2.2x faster. It made my training loop 3x slower (kyrieblunders.bearblog.dev)

<a href="https://news.ycombinator.com/item?id=48373266">Comments</a>