Question on operator fusion

Hey, I have a general question. Operator fusion is described as combining multiple operators into a single kernel without saving the intermediate results in memory. From my understanding, it is efficient because it avoids frequent accesses to global memory. Is that right? What I want to know is how this differs from using shared memory. Couldn't we also avoid the inter-kernel traffic by passing the intermediate results between kernels through shared memory?

Could anyone help me?

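I think shared memory only has the same lifetime as the thread block: it no longer exists once the kernel has finished, so a second kernel cannot read what the first one left there. Data passed between separate kernels has to go through global memory. Fusion puts both operators in the same kernel, and that is what allows the intermediate result to stay on-chip, in registers or shared memory.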
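To make the difference concrete, here is a minimal CUDA sketch, not taken from any particular framework. The operators (a scale followed by an add and a ReLU), the kernel names, and the host helper `run_pipeline` are all made up for illustration, and error checking is left out for brevity. The unfused path needs a global buffer `tmp` to carry the intermediate from one launch to the next; the fused path keeps it in a register.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Unfused, step 1: the intermediate must be written to global memory,
// because that is the only device memory that survives between launches.
__global__ void scale_kernel(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}

// Unfused, step 2: reads the intermediate back from global memory.
__global__ void add_relu_kernel(const float* tmp, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i] + b, 0.0f);
}

// Fused: both operators in one kernel. The intermediate t stays in a
// register and is never written to global memory.
__global__ void fused_kernel(const float* x, float* y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];
        y[i] = fmaxf(t + b, 0.0f);
    }
}

// Hypothetical host helper: computes relu(a * x + b) either way.
std::vector<float> run_pipeline(const std::vector<float>& h_x,
                                float a, float b, bool fused) {
    int n = static_cast<int>(h_x.size());
    if (n == 0) return {};
    size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr, *d_tmp = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    if (fused) {
        fused_kernel<<<blocks, threads>>>(d_x, d_y, a, b, n);
    } else {
        // Extra global buffer that exists only to hand data to kernel 2.
        cudaMalloc(&d_tmp, bytes);
        scale_kernel<<<blocks, threads>>>(d_x, d_tmp, a, n);
        add_relu_kernel<<<blocks, threads>>>(d_tmp, d_y, b, n);
        cudaFree(d_tmp);
    }

    // This blocking copy also waits for the kernels above to finish.
    std::vector<float> h_y(n);
    cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}
```

Calling `run_pipeline(x, a, b, false)` and `run_pipeline(x, a, b, true)` should give the same values, apart from possible last-bit differences if the compiler contracts the fused multiply and add into one instruction. The unfused version does one extra global write and one extra global read per element. If the second operator needed values produced by neighbouring threads, the fused kernel could stage them in `__shared__` memory with a `__syncthreads()` in between, but that still only works inside a single kernel.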