Inline assembly micro-kernel and performance experiment for ARM and x86

Followup on

Inline micro-kernel asm, as proposed in the RFC, is now officially supported, and there have been interesting efforts in the community to explore how much performance we can gain by using inline asm. Since this is an exploratory direction without a dedicated end goal (unlike support for the feature itself), I am opening this thread in the forum. Anyone is welcome to discuss and follow up here on the status of their experiments in this direction.

This is a very useful effort. Is there a tutorial which I can use to ramp up? I think this would need tensorize as well.

I would be open to working on the tutorial (maybe starting with tensorize) if there is nothing out there yet.

There is a coordinated effort at https://github.com/dmlc/tvm/issues/1379

In short, there have been some discussions on tutorials for tensorize, but none yet for inline asm. Contributions are welcome!