Upstreaming tensorize implementation



I’d like to contribute my implementation of conv2d that relies on tensorize. The tensorized kernel calls an external library API for GEMM. Could I ask what the best way is to add this implementation to TVM? Right now I have a Python script that uses tensorize: the script loads the external library (.so), and the tensorize implementation uses call_extern. The script benchmarks the individual convolutional layers in ResNet.

My question is: where in the TVM repository should I upstream the script? Also, what is the best way to add the dependency on the external library? Should I add it to src/contrib?

I can give more details and the implementation source if required.



What hardware platform are you targeting? For tensorize, you could add a compute / schedule template like ‘winograd_nnpack_fp32’.

Is the external library a must for tensorize, or could you include just the code we really need? In my opinion, one file should be enough for a tensorize microkernel; we could have 4x8 / 8x8 ukernels and so on. If the external library is a must, you should add it under src/contrib, like NNPack.
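To make the "4x8 / 8x8 ukernel" idea concrete, here is a pure-Python reference of what such a GEMM micro-kernel computes: it accumulates an mr x nr tile of C from an mr x K panel of A and a K x nr panel of B. It is only a sketch for checking semantics; the name `gemm_ukernel` and the list-of-lists layout are illustrative assumptions, and a real micro-kernel would be C or assembly reached through call_extern.

```python
def gemm_ukernel(a, b, c):
    """Reference semantics of a hypothetical mr x nr GEMM micro-kernel.

    Computes c += a @ b in place, where c is mr x nr, a is mr x k and
    b is k x nr (4x8 and 8x8 tiles are just different mr/nr choices).
    """
    mr, nr, k = len(c), len(c[0]), len(b)
    assert len(a) == mr and all(len(row) == k for row in a)
    assert all(len(row) == nr for row in b)
    for i in range(mr):
        for j in range(nr):
            acc = c[i][j]
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


# 4x8 tile with k = 2: A is all ones and B[p][j] = j, so C[i][j] = 2 * j.
A = [[1.0, 1.0] for _ in range(4)]
B = [[float(j) for j in range(8)] for _ in range(2)]
C = [[0.0] * 8 for _ in range(4)]
gemm_ukernel(A, B, C)
print(C[0])  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
```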


Thanks. I’m targeting a tensorize implementation for x86, and I’ll submit a git pull request soon. The external library is required for us. I’m guessing I should edit the CMake module files to add the library myself?


Yes. You could refer to NNPack / cudnn / sort and so on.


Previously I was using tvm.decl_buffer and access_ptr much in the same way as the following tutorial:

But I am unclear how to use call_extern when I’m putting the function implementation within src/contrib… An alternative might be to use call_packed… But is it possible to pass an access_ptr to call_packed? I want to pass in a float pointer…

Thanks in advance,
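As a rough mental model (plain ctypes, not TVM code), what an extern function receives from an access_ptr is a raw pointer to the buffer's data advanced by its element offset, which the C side then treats as a float *. The sketch below mimics that by computing such a pointer into a flat row-major float buffer and handing it to a C-level routine; `ctypes.memmove` stands in for the external GEMM, and the helper name `float_ptr` is hypothetical.

```python
import ctypes

FLOAT_SIZE = ctypes.sizeof(ctypes.c_float)


def float_ptr(buf, elem_offset=0):
    """Return a float* to buf[elem_offset], i.e. data + elem_offset in C."""
    addr = ctypes.addressof(buf) + elem_offset * FLOAT_SIZE
    return ctypes.cast(addr, ctypes.POINTER(ctypes.c_float))


# Flat 2x4 row-major buffer; row 1 starts at element offset 4.
src = (ctypes.c_float * 8)(*range(8))
dst = (ctypes.c_float * 4)()

# Copy row 1 through raw pointers, the way an extern C routine taking
# (float *dst, const float *src, size_t n) would see its arguments.
ctypes.memmove(dst, float_ptr(src, 4), 4 * FLOAT_SIZE)
print(list(dst))  # [4.0, 5.0, 6.0, 7.0]
```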


I wanted to add another problem I’m facing. When I call tensorize with call_extern within an AutoTVM template function, it is not able to resolve the external call. The problem does not appear when the function is not decorated as an AutoTVM template function. I load the dynamic library via

```python
import ctypes
import os


def load_lib():
    """Load the library; its functions will be registered into TVM."""
    curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__)))
    # Load as global so the extern symbol is visible to other DLLs.
    lib = ctypes.CDLL(
        os.path.join(curr_path, "…/"), ctypes.RTLD_GLOBAL)
    return lib


_LIB = load_lib()
```

in the script that I invoke. The library contains my external function, which is called with call_extern, but the call is not resolved: I get a symbol lookup error. The script uses tensorize, and one of the intrinsic functions is implemented by call_extern to a function within the library.
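One way to narrow down a symbol lookup error like this is to check, inside the process that actually fails, whether the extern symbol is visible in the process-global namespace, which is where a library loaded with RTLD_GLOBAL places its symbols. The sketch below does that with `ctypes.CDLL(None)` (a wrapper around dlopen(NULL) on POSIX); the helper name is hypothetical, and "malloc" stands in for your own extern function name.

```python
import ctypes


def missing_global_symbols(names):
    """Return the names that cannot be resolved in the global namespace.

    ctypes.CDLL(None) wraps dlopen(NULL) on POSIX systems, so lookups
    search the main program and every library loaded with RTLD_GLOBAL.
    """
    everything = ctypes.CDLL(None)
    missing = []
    for name in names:
        try:
            getattr(everything, name)
        except AttributeError:  # ctypes raises this for undefined symbols
            missing.append(name)
    return missing


# "malloc" comes from libc, which is always in the global namespace.
print(missing_global_symbols(["malloc", "no_such_symbol_xyz"]))  # ['no_such_symbol_xyz']
```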


AutoTVM likely starts another TVM process, so the library you load may not be loaded in that process. That is why it works if you force it to load globally in the TVM library. You could also use RPC and force the library to load when the RPC server starts.

In your particular case, it would still be great if you could put things directly in contrib. If it relies on external dependencies, it can still serve as an example, and you can put it in apps for now as an additional example. If you have the source code of the micro kernel, it would be great to inline it directly in the schedule. See


Thanks, this clarifies things. I’ll add it to the apps directory first; in the future I’ll isolate the micro kernel source code and check it into contrib.


Is there an example of how to use RPC and force the library to load when the server starts? It looks like I should start an RPC server listening on a particular port on localhost and then use the RPCRunner in AutoTVM? The problem is that I’m running in a Slurm batch-scheduled environment, and I don’t know whether this will work. Is there any other way to export the library when using AutoTVM?


If you inline the code directly into the tensorize intrinsic, it should work, as the exported library already contains the dependency.


I understand I could do that, but the library containing the function referenced by call_extern in tensorize is under active development. I’d like to call it as an external function by loading the library, so that I always get the latest implementation from the current version of the dynamic library.

What I’m looking for is something like calling an external BLAS function from within tensorize by passing buffer (float *) pointers to the function. I see there are BLAS examples in contrib, but those are not called from tensorize…


If you just want to use the library yourself, the simplest approach is still to hack TVM to always load that DLL by default.
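A minimal sketch of "always load that DLL", assuming you would rather not patch TVM itself: a small helper that loads every library listed in an environment variable in RTLD_GLOBAL mode, called at import time in each process that needs the symbols (for instance from a module imported by your tuning script; whether AutoTVM's worker processes import it in your setup is something to verify). The variable name `EXTRA_EXTERN_LIBS` and the helper are hypothetical, not TVM features.

```python
import ctypes
import os

ENV_VAR = "EXTRA_EXTERN_LIBS"  # hypothetical variable name, not a TVM setting


def preload_extern_libs(spec=None):
    """Load every library in spec (os.pathsep-separated) with RTLD_GLOBAL.

    spec defaults to the ENV_VAR environment variable. Returns the list of
    ctypes.CDLL handles that were loaded (empty if nothing is configured).
    """
    if spec is None:
        spec = os.environ.get(ENV_VAR, "")
    handles = []
    for path in spec.split(os.pathsep):
        if path:  # skip empty entries, e.g. from a trailing separator
            # RTLD_GLOBAL makes the symbols visible to libraries loaded later.
            handles.append(ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL))
    return handles


# Typically placed at module level in code that each process imports.
_EXTERN_LIBS = preload_extern_libs()
```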


OK, for now, preloading the external library with LD_PRELOAD seems to work around the issue with AutoTVM…
Is there a way to pass a float * (buffer pointer) as an argument to call_packed?
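When the tuning run is launched from another Python driver (for example a Slurm job wrapper), the LD_PRELOAD workaround can be applied by building the child's environment explicitly. This is a sketch; `with_preload` and the library paths are hypothetical names.

```python
def with_preload(env, lib_path):
    """Return a copy of env with lib_path prepended to LD_PRELOAD.

    glibc's loader accepts colon- or space-separated LD_PRELOAD entries;
    this helper uses a colon and leaves the input mapping unchanged.
    """
    new_env = dict(env)
    current = new_env.get("LD_PRELOAD", "")
    new_env["LD_PRELOAD"] = f"{lib_path}:{current}" if current else lib_path
    return new_env


# Hypothetical library paths, for illustration only.
print(with_preload({"LD_PRELOAD": "/opt/libold.so"}, "/opt/libmygemm.so")["LD_PRELOAD"])
# /opt/libmygemm.so:/opt/libold.so
```

A driver would then launch the run with something like `subprocess.run([sys.executable, "tune_resnet.py"], env=with_preload(os.environ, "/path/to/libmygemm.so"), check=True)`, where the script name and path are placeholders; processes started by that run normally inherit the variable.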



For efficiency reasons, I would always recommend trying call_extern first if we want to inline a piece of code. A PackedFunc call is fine if it is called once or twice in a function, but it might cause a performance hit if you call it many times in a deep inner loop. For the same reason, I would even recommend inlining the asm kernel instead of using a .so file if the kernel is very small.
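To see why per-call overhead matters here, one can count how often a tensorized inner-loop call fires. The sketch assumes an im2col-style GEMM view of conv2d and a micro-kernel that consumes one K-slice per call; the helper name and the tilings are illustrative. For a ResNet 3x3 layer on a 56x56x64 input with 64 filters (stride 1, same padding), the GEMM per image is M = 56*56 = 3136, N = 64, K = 3*3*64 = 576.

```python
def ukernel_calls(m, n, k, mr=4, nr=8, kc=None):
    """Number of micro-kernel invocations for an m x n x k GEMM.

    mr x nr is the register tile; kc is the K-slice handled per call
    (None means the whole K dimension in one call). The tiles are
    assumed to divide the dimensions evenly.
    """
    kc = k if kc is None else kc
    assert m % mr == 0 and n % nr == 0 and k % kc == 0
    return (m // mr) * (n // nr) * (k // kc)


M, N, K = 56 * 56, 64, 3 * 3 * 64
print(ukernel_calls(M, N, K))         # 6272 calls, whole K per call
print(ukernel_calls(M, N, K, kc=64))  # 56448 calls with 64-wide K slices
```

Each of those calls pays the fixed cost of the call mechanism, so per-call argument packing and dispatch in a PackedFunc add up in a way that a direct call_extern to a C symbol largely avoids.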