Do not write tensor data in MicroTVM with AutoTVM

I’m trying to do some autotuning with MicroTVM on RISC-V Spike. It seems like the current implementation writes input arrays to the device over OpenOCD, which is prohibitively slow as the input gets bigger (e.g. 512x512 int8 arrays). To my understanding, we should be able to omit the data copies, as they’re not crucial for getting a correct performance reading. Is there a way to do this?

@weberlo would you please take a look at this? I’m trying to get the RISC-V Spike flow complete, after which I would continue with testing on a Rocket Chip system on FPGA. Thanks!

We could also explore a possible alternative approach: hook up Spike directly (not via OpenOCD), which might let us get around the data copy speed problem via direct memory copy shortcuts into the simulator.

OpenOCD is only one way to implement https://github.com/apache/incubator-tvm/blob/master/src/runtime/micro/low_level_device.h, and the JTAG interface of Spike could indeed be slow, because memory needs to be set on a bit-by-bit basis.

I talked to a Spike developer before who said it is possible, but I haven’t looked into how to hack that.

Spike accepts the remote-bitbang protocol, which is exactly how OpenOCD itself communicates with Spike. The protocol should not be too complex to implement, since we only need Read, Write, and Execute.
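For a feel of how small the transport is, here is a toy sketch of a remote-bitbang client in Python. The port number is an assumption, and actual memory Read/Write/Execute would still need the RISC-V Debug Module protocol layered on top of this bit-level transport:

```python
import socket

# Toy remote-bitbang client; assumes Spike was started with
# `spike --rbb-port=9824 ...`. Per OpenOCD's remote_bitbang protocol,
# the characters '0'-'7' encode (tck << 2) | (tms << 1) | tdi,
# 'R' samples TDO, and 'Q' ends the session.
sock = socket.create_connection(("localhost", 9824))

def jtag_cycle(tms, tdi):
    """Drive one full JTAG clock cycle and sample TDO."""
    for tck in (0, 1):
        cmd = ord("0") + ((tck << 2) | (tms << 1) | tdi)
        sock.sendall(bytes([cmd]))
    sock.sendall(b"R")
    return sock.recv(1) == b"1"

# Five cycles with TMS held high put the TAP state machine into
# Test-Logic-Reset, the usual first step before layering Debug
# Module reads/writes on top of this transport.
for _ in range(5):
    jtag_cycle(tms=1, tdi=0)
sock.sendall(b"Q")
sock.close()
```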

I’m actually working on a second LowLevelDevice implementation for Rocket Chip (RISC-V) on Zynq targets. Something working should land in a few weeks. Stay tuned!


Besides, I’m wondering whether the AutoTVM implementation for µTVM in https://github.com/apache/incubator-tvm/pull/4274 is complete: I can only find some RPC server modifications, but not how the micro build process is bridged into LocalBuilder. As I’m on a deadline, I’m simply hacking into the codebase to get things working, but I wonder if I missed anything.

Maybe @weberlo can comment on this as well?

@KireinaHoro Yes, OpenOCD is unfortunately very slow. The way I’ve sped things up in my personal usage has been to run AutoTVM with 8 microcontrollers, but I’d like to have a solution that doesn’t require “brute forcing” it with devices.
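For reference, that multi-device setup is just the standard tracker flow with n_parallel raised. A sketch, where the device key, tracker address, and port are all assumptions:

```python
from tvm import autotvm

# Spread AutoTVM trials across several boards registered under the same
# tracker key. "micro", 0.0.0.0, and 9190 are placeholders for your setup.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner(
        "micro",          # tracker key all 8 boards register under
        host="0.0.0.0",
        port=9190,
        n_parallel=8,     # measure on all boards concurrently
        number=4,         # runs per measurement
    ),
)
```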

Do you have the check_correctness option set? If so, it could be this line that’s forcing a copy.

But if you don’t have check_correctness set, it looks like this else block will trigger, which will also perform a device copy.

As a short-term hack, you could comment out both of those blocks. There might still be other device copies that I’m unaware of.
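Concretely, the hack would boil down to something like this hypothetical helper (a sketch, not the actual measure_methods.py code; `arg_info` is the list of (shape, dtype) pairs from the build result):

```python
from tvm import nd

def make_measure_args(arg_info, ctx, ref_input=None):
    """Hypothetical replacement for the two copying paths above.

    With no reference input we hand back uninitialized device buffers,
    so no tensor data is streamed over the OpenOCD transport.
    """
    if ref_input is not None:
        # check_correctness still needs the real values on the device
        return [nd.array(x, ctx=ctx) for x in ref_input]
    # nd.empty only allocates; nothing is written to device memory
    return [nd.empty(shape, dtype=dtype, ctx=ctx) for shape, dtype in arg_info]
```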

In the long term, it might be worth having code on the device that generates random tensors, rather than having the host generate and transmit them over OpenOCD. Or we could keep the same input/output tensors on the device for the entire tuning run, though that’s not recommended if the operator being tuned is sensitive to the values of its inputs.
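The second option could look roughly like the following hypothetical helper. It keys the cache only on the argument signature; a real version would also have to invalidate it whenever the RPC session (and hence ctx) is recreated:

```python
from tvm import nd

_arg_cache = {}

def get_persistent_args(arg_info, ctx):
    """Allocate one buffer set per argument signature and reuse it.

    The tensors stay resident on the device for the whole tuning run,
    so repeated trials pay no per-measurement copy. Don't use this if
    the operator's timing depends on its input values.
    """
    key = tuple(arg_info)
    if key not in _arg_cache:
        _arg_cache[key] = [nd.empty(shape, dtype=dtype, ctx=ctx)
                           for shape, dtype in arg_info]
    return _arg_cache[key]
```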

Besides, I’m wondering if the AutoTVM implementation for uTVM in https://github.com/apache/incubator-tvm/pull/4274 is complete

As far as I know, the implementation of AutoTVM for µTVM in mainline TVM is complete. It’s worked in my own experiments, but I’m not an expert in the AutoTVM side of the codebase, so it’s possible I’ve missed some functionality. There are definitely some ergonomic changes that need to be made, some of which I’ve already implemented in my fork.

One big problem is that each RPC server has a fixed µTVM device config, so if you want to change the memory profile of an operator (maybe it takes up more of the text section but doesn’t require as large a workspace section), you need to restart the server with a new device config that reflects that memory layout. My current solution is to watch the file system for changes to dev_config.json and automatically restart the server, but it would be great if we could change the memory layout dynamically.
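The watcher can be as simple as polling the config file’s mtime and restarting the server process; a sketch, with the server command line as a placeholder for however you launch yours:

```python
import os
import subprocess
import time

# Placeholder command: replace with however you launch the µTVM RPC server.
SERVER_CMD = ["python", "-m", "tvm.exec.rpc_server",
              "--tracker=0.0.0.0:9190", "--key=micro"]
CONFIG = "dev_config.json"

def run_forever():
    """Restart the RPC server whenever dev_config.json changes on disk."""
    last_mtime = os.path.getmtime(CONFIG)
    server = subprocess.Popen(SERVER_CMD)
    while True:
        time.sleep(1.0)
        mtime = os.path.getmtime(CONFIG)
        if mtime != last_mtime:
            last_mtime = mtime
            server.terminate()  # stop the server holding the stale config
            server.wait()
            server = subprocess.Popen(SERVER_CMD)  # picks up the new layout

if __name__ == "__main__":
    run_forever()
```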

Anyways, let me know if you have any more questions. Good luck on your deadline! I’m excited to see your upcoming PR.

Okay, I think I found the right place to hack in order to disable copying contents when creating an NDArray: https://github.com/KireinaHoro/tvm/commit/6c5aafd4fe41a65eb1474d5c3dbcbb00339795e9.

Will submit a PR when I get the time.

You can use tvm.nd.empty to achieve the same goal. Note, however, that the memory may still need to be initialized (perhaps via a remote RPC call); otherwise it can impact the perf numbers.

That sounds reasonable, but it doesn’t solve the problem: the default behavior in AutoTVM when ref_input is None already uses nd.empty:

```python
args = [nd.empty(x[0], dtype=x[1], ctx=ctx) for x in build_result.arg_info]
```

Regarding the performance numbers, the cycle counts with copy=True and copy=False didn’t show a difference, so I think it’s okay for me for now. Maybe the copyfrom logic should check the empty case, but I haven’t looked into it yet; a rough idea is sketched below. Let’s discuss this in detail when I get the PR ready.
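Something along these lines, purely hypothetical and not in TVM today:

```python
from tvm import nd

# Remember which buffers were created empty so the copy path can skip them.
_uninitialized = set()

def empty_no_copy(shape, dtype, ctx):
    arr = nd.empty(shape, dtype=dtype, ctx=ctx)
    _uninitialized.add(id(arr))
    return arr

def maybe_copyfrom(dst, src):
    if id(src) in _uninitialized:
        return dst               # nothing meaningful to transfer
    return dst.copyfrom(src)     # normal path: stream the data to the device
```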

It depends on the case: some floating-point units take longer for NaN and other special values. If it is a fixed-cycle unit, we might be fine.

Got it. Let’s bring this up when I get the PR ready. Thanks for taking a look.
