Hi, I have a general question. I read that operator fusion combines multiple operators into a single kernel without saving the intermediate results in memory. From my understanding, it achieves efficient communication by avoiding frequent accesses to global memory, right? What I want to know is how this differs from using shared memory. Let's say we keep the operators as separate kernels: couldn't we also avoid the inter-kernel communication by passing the intermediate results through shared memory?
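To make the question concrete, here is a minimal sketch of how I picture the difference between unfused and fused operators. It is a toy model in plain Python, not real GPU code: the `GlobalMemory` class, the `unfused` and `fused` functions, and the two example operators (scale by 2, then add 1) are all hypothetical names I made up for illustration. It only counts how often a simulated global memory is touched.

```python
# Toy model (NOT real GPU code): count accesses to a simulated "global memory".
# All names here are hypothetical and exist only for illustration.

class GlobalMemory:
    """Stand-in for GPU global memory that counts reads and writes."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, buf, i):
        self.reads += 1
        return buf[i]

    def store(self, buf, i, value):
        self.writes += 1
        buf[i] = value


def unfused(x, mem):
    """Two separate 'kernels'; the intermediate tmp lives in global memory."""
    n = len(x)
    tmp = [0.0] * n
    y = [0.0] * n
    for i in range(n):  # "kernel 1": tmp = 2 * x
        mem.store(tmp, i, 2.0 * mem.load(x, i))
    for i in range(n):  # "kernel 2": y = tmp + 1
        mem.store(y, i, mem.load(tmp, i) + 1.0)
    return y


def fused(x, mem):
    """One 'kernel'; the intermediate stays in a local variable (like a register)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        t = 2.0 * mem.load(x, i)  # intermediate is never stored to global memory
        mem.store(y, i, t + 1.0)
    return y


x = [1.0, 2.0, 3.0, 4.0]
m1, m2 = GlobalMemory(), GlobalMemory()
y_unfused = unfused(x, m1)
y_fused = fused(x, m2)

print(y_unfused == y_fused)                        # True
print(m1.reads + m1.writes, m2.reads + m2.writes)  # 16 8
```

In this model the fused version touches global memory half as often because `tmp` is never written out and read back. My question is whether keeping `tmp` in shared memory between the two separate kernels could give the same saving.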
Could anyone help me?