@xqdan One thing that might be nice is to understand the set of hardware intrinsics that TVM should lower a schedule down to (for instance are we using tensorization intrinsics, or different types of DMA load/stores). In addition, it might be good to understand how a programmer can expose more parallelism for the chip to take advantage of. For instance with the VTA reference design we used virtual threads that would be lowered to low-level dataflow-like synchronization operations to uncover task-level parallelism within the chip.
Highlighting these challenges when targeting the DaVinci chip would be nice, and perhaps contrasting it with VTA so that programmers can understand how it relates in terms of challenges.
Overall this is very exciting stuff!