[RFC] TVM Target Specification

tqchen · May 29, 2020, 10:52pm

In collaboration with @areusch

A target object in TVM provides all the information needed for code lowering and code generation. Currently, a target is specified by a string in the format of <target-id> [attributes], where the <target-id> is defined by the name of the final code generator(llvm, cuda, opencl). Many new challenges arise as we introduce advanced features such as cross-compilation, heterogeneous targets and customized code generator to the compiler stack:

C0: Represent the host compiler information: when we cross-compile to WebGPU, we also need to specify the wasm as the target for the host code.
C1: Represent composite targets in heterogeneous compilation settings.
C2: Customizing compilation passes and code generator based on the target.
C3: Simple canonical name(e.g. nvidia/tx2, broadcom/rpi4b) to refer to the target of interest.
C4: Pass in additional target-specific attributes like ISA/library extension to a target.

This RFC proposes a strawman to address these problems. See also our previous discussions on this topic Target and Attributes

Scope of a Target

While a Target implies things about the runtime environment of generated code, the Target configuration is only intended to configure the TVM compiler. Options that are specific to the runtime environment should configure the DLContext rather than the Target. As a simple way to decide whether a configuration option belongs in the Target config, ask the question

Would this option change the contents of the generated Module?

Some examples of things that belong in the target string:

ISA and ISA extensions (i.e. FPU presence) for the target platform
Availability of C libraries

Some examples of things that purely change the runtime execution environment:

Linker options, for CPU-targeted code:
- Link-time optimization
- Code location (i.e. FLASH vs RAM)
Options that influence code loading
- i.e. for micro targets: configuration for the flash tool

Strawman Proposal

We propose to use json encode a target. The following example shows a target that represents the GPU on an nvidia TX2 board.

{
    "id": "cuda",
    "tag": "nvidia/tx2-cudnn",
    "keys": ["cuda", "gpu"],
    "libs": ["cudnn"],
    "target_host": {
        "id": "llvm",
        "system_lib": True,
        "mtriple": "aarch64-linux-gnu",
        "mattr": "+neon"
    }
}

The top-level fields of this target include:

id(string): specifies the kind of the target
tag, optional(string): is a special attribute that specifies the alias of the target(think of tag in docker images).
- A tag can be used to reconstruct the entire target uniquely.
- tag is also used as keys in the autotvm logs, to make sure tag is immutable, we can also hash the content of the target and record that in the log.
keys(Array<string>): List of keys that can be used to derive specific autotvm strategies.
- keys provides a more coarse grained information about the target than the tags
- An alternative is to simply unify the keys with the tag, by allowing special tags like cuda that does not corresponds to a concrete target
attrs(Map<String, ObjectRef>): Other optional attributes
- target_host(Target): The host execution target, if different from the current host.
- libs(Array<string>): List of additional extensions that can affect lowering.

In the c++ side, we can store the special attributes as typed fields and additional attributes in a Map.

class TargetNode {
 public:
  // id
  TargetId id;
  // special attributes
  String tag;
  Array<String> keys;
  // other attributes
  Map<String, ObjectRef> attrs;
};

TargetId Registry

To support C2, we introduce a registry for per target-id global attributes. The following code block shows how to register the target specific passes and attribute options for the cuda target.

TVM_REGISTER_TARGET_ID("cuda")
.set_attr<TPreCodegenPass>("TPreCodegenPass", [](Target target) -> Array<Pass> {
    //...
})
// add target host as an additional option.
.add_attr_option<Target>("target_host");

Target Tag

The tag is a short alias for the target. We will maintain a collection of tags and the corresponding target configurations in the project. Users can directly specify their target via a tag. We can create multiple aliases can for each tag. However, there should always be a canonical tag that all aliases maps to. The canonical tag is used in the tag field of the target.

Here is an example list of tags. A typical naming choices for tags include the <vendor>/<soc-name>[-device] for an SOC and <cloud-provider>/<instance-type>[-device] for a cloud instance.

nvidia/gtx2080ti
apple/iphone8-cpu
aws/c4.xlarge
rockchip/rk3399-gpu

It is also useful to create a hierachy among tags. For example, rockchip/rk3399-gpu is a special case of mali GPU, and its performance information can be useful for other SoCs with the same type of GPU. This information can be captured by keys(feel free to suggest alternative names) .

Depending on the need, we could also optionally attach a version suffix in the end, apple/iphone8-cpu:v1.0, this convention might be useful for upgrading the target spec of a given tag.

Schema Validation

The new target specification is quite flexible, as we can introduce new attributes to the target. However, additional flexibility also increases the chance of misconfigurations and typos. We propose to validate the keys of the attributes, as well as the type of the corresponding values when constructing from a json object using information registered to the target registry. The following example shows how to register attribute options for the LLVM(llvm cpu) target.

TVM_REGISTER_TARGET_ID("llvm")
.add_attr_option<Bool>("system_lib");
.add_attr_option<String>("mtriple");
.add_attr_option<String>("mattr");

Composite Target

Under the current proposal, we can introduce a special target id to represent a composite target. The composite target lowering invokes partition passes to search and partition the function into several functions. Each of the partitioned functions corresponds to a primitive target in the composition. The compiler will call into the custom lowering pass of each specific target, then link everything together.

{
    "id": "composite",
    "targets": [ { "id": "llvm" }, { "id": "cuda" } ]
}

Notably, we can also introduce multiple (named) composite targets if we need to customize the lowering strategy.

Bring your own codegen

Under the new proposal, BYOC can be supported naturally via the target-specific lowering infrastructure. A customized backend can register its own target id and a codegen generator attribute.

TVM_REGISTER_TARGET_ID("dnnl")
.set_attr<TRelayCodegen>(DNNLRelayCodegen);

Discussions

The hierarchical nature of composite targets brings many new challenges. For example, we can no longer use a plain string option format for general targets. It would be great to bring more discussions, possibly around the following, but not restricted to:

Naming alternatives:
- id: (target_key, target_type, name , kind)
- tag: (hardware, device)
- keys: (category, groups )
TargetNode c++ design choices:
- N0: keep typed special attribute fields
- N1: fold most fields into attrs

Target host covention

T0: keep target_host as an optional field in the device target. Rationale: host driving code is a special part of the device driver code. It is easier to pick up the keys in the top-level target when we run auto-tvm, we need to preserve target host info util the very end.

{ "id": "cuda", "target_host": { "id": llvm }}

T1: treat target host configuration as a composite-style target configuration. Rationale: it is not that different from composite.

{ "id": "composite", 
	"target_host": { "id": llvm }, 
	"targets": [ {"id": "opencl"}]
}

junrushao · May 29, 2020, 10:58pm

Thank you for the RFC!

Also thank you for listing the naming alternatives. I think the name “keys” is a bit less informative than others, because anything can be a key. Probably a better naming could be “category” or “categories” or “cats”, etc.

About the version suffix, though looks nice, it is worth thinking about possible scenarios in which they can be useful. Would love to hear about the usecases from others

kevinthesun · May 30, 2020, 3:56am

Thanks for bringing this up. I have a question about the functionality of target tag: does it mainly serve as the name to differentiate autotvm logs for specific platform? For x86 cpu, this name might not be very important since we mainly care about the ISA supported by the cpu.

tqchen · May 30, 2020, 6:22am

it serves as a way for user to quickly specify the target. We could imagine that the aws target tag is an alias of intel/skylake

ANSHUMAN.TRIPATHY · May 30, 2020, 3:10pm

Thanks for the RFC. This really is a way forward for TVM CodeGen as i perceive.

I have few points to share as below:

P0: I believe we can merge these two as < TAG > + < ID > . Also it would be good to use numeric rather than target name in ID.

P1: I could not understand clearly the purpose of Keys. If we have Tag & Targets, why Keys are required? I believe we should have a TAG as master record to deduce everything in order to identify a particular Target. I think it is simpler too.

P2: If we maintain everything in Attrs as plain Map, we might lose the intention to achieve hierarchical structure as in case of Composite Targets. So maybe we want another structure like below:

TargetNodes {

String tag;

Array < TargetNode> targets;

}

Thanks!

kevinthesun · May 30, 2020, 9:23pm

Does it mean we will have a set of predefined tag names so that user can directly refer to specific hardware platform, instead of having to setup architecture details such as “mcpu”? This can make it a lot easier for tvm beginners.

tqchen · May 30, 2020, 9:28pm

Yes, that is the goal

tqchen · June 1, 2020, 3:37pm

We will need the separate target ID because that is the key to target registry. So does not make sense to make the id numeric as the names directly corresponds to the backends in many cases(cuda, llvm). keys are needed for generic strategy reuse as in the current autotvm.

comaniac · June 1, 2020, 7:01pm

Thanks for the RFC. Some comments and questions:

While I agree that the JSON encoding is powerful to cover all possibilities, I am a bit confusing about the usage from a user’s perspective. Are we expecting users to write a separate JSON file like target.json and use tgt = tvm.target.create('target.json')? Or could you elaborate some expected usages?
Similar to the first question, the flexibility of JSON also implies complication, as no one knows what would be the available keys and values. Even in the current target system, we’ve seen may complains from users about how to correctly use targets (e.g., -libs can enable 3rd party library). For example, a user wants to use llvm, how can she know all the available options (i.e., system_lib, mtriple, mattr)?
The value of id looks a bit weird to me, because id is usually unique in a data model. Accordingly, the naming alternative (e.g., id: (target_key, target_type, name , kind)) in the discussion makes more sense.
I like the idea of composite target and believe it would be useful for BYOC. We could ask the developers to first register a target ID and create a specialized tag to specify the desire usage of the customized codegen (e.g., 1) offload to DNNL and fallback to LLVM if not suppport; 2) offload to TensorRT and fallback to CUDA if not support; 3) offload to an accelerator, fallback to DNNL or LLVM if not support.)

tqchen · June 1, 2020, 10:31pm

The json format is an analogy, and we certainly do not have to strictly use a json file or string. For example, in the python API, we could directly from a dictionary(in the style of json).

target = tvm.target.create({
   "id": "cuda", 
   "target_host": {"id" : "llvm"}
});

We can also put first level API wrappers around to create the object.

target = tvm.target.cuda(target_host="xyz")

Alternatively, we should allow user to directly use a tag for target creation. e.g.

target = tvm.target.create("nvidia/gtx1080ti")

The schema registration and validation are exactly designed to tame the complexity. While right now we only focuses on validation, we could certainly imagine generating docs that include these options. Alternatives, we can simply document the corresponding fields in a specific doc about the targets or as arguments of the convenient functions(e.g. tvm.target.cuda), love to hear more thoughts.

I am not particularly attached to the choice of id, so feel free to suggest alternatives.

kparzysz · June 2, 2020, 1:57pm

Scope of a Target: Why do linker options belong to runtime configuration? Linking is a step in the module creation process, and it does affect the contents of the module. Are there any implicit assumption made about modules?

Composite Target: From the point of view of modularity, having a target host as a member of a composite target looks like a better solution. As the next step I suggest that we drop the target host completely. A function can be split into multiple parts meant for different targets. Instead of explicitly designating a certain target as a target host, we can use the same mechanism that assigns the individual parts to their targets to assign the “host” part to its target. This would remove the distinction between a host and a device code for the purposes of code generation.

tqchen · June 2, 2020, 4:30pm

I agree that it is attempting to remove target host completely. However doing so may bring some trouble to our analysis in the early stage.

Specifically, when we talk about say a “GPU program”, there are two kinds of mindsets here.

In the high level(relay, topi), we would make a composite GPU kernel(e.g. softmax) as a “gpu program”. The softmax could actually contain multiple kernel launches and needs host code for dimension calculations, but because the code itself only reads/writes GPU mem, we view such kernel as a GPU program. It is also useful to view it in that way, because the ML Kernel writer and scheduler view them as GPU program, rather than heterogeneous program in high level scheduling.

At the lowest level, the “GPU program” only refers to the device code, but not the host code that drives the program.

So we can find the design choice really boils down to how do we view the device program:

V0: a gpu(device) program is a program that involves a single device target and related host code to drive that target.
V1: a gpu(device) program is a program that only involves device code but not the host driving part.

While from the low-level driver’s PoV it is certainly easier to take the V1 view. The V0 view can be more useful in the following regard:

Provide a useful device key for dispatching high level schedules.
It is the natural way high level developers use to think about program with a single device target.
It offers simplicity for users who want to specify the target(e.g. they don’t have to specify cuda as a composite target).

It also acknowledges the fact that there is a difference between a single target(host/device mix) program and multiple device targets program. We can still use the composite target for the later ones. That does mean though usually such per target split could happen earlier in the graph stage instead of the later stage.

As some additional fruits for thoughts, V0 and V1 also corresponds to two different kinds of mindsets that the CUDA programming model and OpenCL programming model advocated for. As we know nvcc allows GPU kernels to directly blend into “cu files” and to programmers the cu files becomes what we know GPU program. The OpenCL model is closer to the V1. As we know that CUDA model “won” the GPGPU programming over the other one, in my opinion, due to the mindset offered in V0

kparzysz · June 3, 2020, 2:58pm

V0 is not really well defined. Consider some hardware that has both a GPU, and a DSP (as well as some host CPU). If you write a program for this system, is it a GPU program, or a DSP program? What target will TOPI assume for the operators in this program?

When you consider an example for CUDA, it will only be concerned about that one GPU device. Everything that doesn’t run on it will be assumed to be “target host”. This binary view doesn’t extend into situations with multiple devices present in the system.

We need to decide how we’re going to handle multi-targeted functions first. This discussion (how to describe a target) is secondary to this.

tqchen · June 3, 2020, 3:21pm

If a program contains both a GPU and DSP, then the target is composite(which is supported), with both of the device target’s target_host being points to the same host. Given that the target host is optional, we could also not specify the target host in this case assuming the host is clear in the composite.

The discussion is not to aim to remove composite targets, but to clarify how do we deal with the single device target case specifically. As in our current analysis, we might do the multiple targets(DSP GPU) split first, run quite a few transformations, then do the host device split in the very end.

So it is just to acknowledge that the analysis might be different for the single device target case, and multiple device target case

kparzysz · June 3, 2020, 3:30pm

I’m not opposed to composite targets, I’m arguing that the way we handle composite targets should not depend on what targets are members of the composite target. Whether it’s “CPU+GPU” or “CPU+GPU+DSP”, the logic of the analysis should be the same. The decisions it makes can be different, obviously, but the way we deal with both of these targets should be the same.

tqchen · June 3, 2020, 3:57pm

I do not disagree. The sticky pt is how do we categorize the “host driving”(memory allocation, kernel launch parameter computation) part of the target program.

We do not intent to categorize arbitrary CPU + GPU program as “gpu program”.

Under V0, a device target(with target host) program can include:

Host code that calls into device workspace allocation.
Launch of device kernels

It cannot include:

Write/Read to memory related to another device(including the host CPU).

To give some examples:

E0

This is not a “gpu program”, because it contains a cuda part and cpu part.

fn cpugpumix() {
   // cpu part
   a = cpu_malloc(size);
   for i:
      a[i] = ...
   b = cuda_malloc(size)
   memcpy_to_cuda(b, a)
   // cuda part
   launch cuda kernel 1 {
   }
   launch  cuda kernel 2 { 
   }
}

E1

This is a gpu program

fn () {
   // cuda part
   b = alloc("global", size)
   launch cuda kernel 1 {
   }
   launch  cuda kernel 2 { 
   }
}

E2

fn multidevice() {
   // cuda program starts
   a = alloc(size, scope="global")
   launch kernel 1{
      a[threadIdx.x] = ...
   } 
   // cpu region
   memcpy_cuda_to_cpu(mcpu, a)
   memcpy_cuda_to_cpu(b, mcpu)

   // dsp program region
   launch dsp kernel {
   }
}

Discussions

In the case of E1, one can imagine that we can completely move all the code, including workspace allocation into the device itself(this is what can happen in the DSP case). That is why it can be categorized as a single target gpu program. So from the programming model’s PoV, it would be useful to annotate the region using a non composite target. The reality is that the alloc and kernel launch parameter calculation happens on the host. The target host is a useful way to specify how to generate the host support code for this kind of kernels.

kparzysz · June 3, 2020, 5:39pm

If it’s possible that the entire E1 can be compiled for a single device then it makes sense to treat it as device code. In such case, moving alloc to the host could be treated as an “optimization” that is specific to this target. However, if E1 has a non-composite target, how would that optimization know what “target host” to use? Where would it get it from?

tqchen · June 3, 2020, 5:55pm

That is why we might want to have a target host field in the device target as in the above examples. The split host device pass can pick up the target host field and split out the host driving part into a program set to have the target_host.

Due to the restrictions of the target device(e.g. alloc only happens in host), the split host device can be viewed as a Legalization pass from the “virtual target”(that allows alloc and launch params calculation to be part of the program) to the real device target.

kparzysz · June 3, 2020, 6:11pm

In such case we can have

GPU_target = [ id: "cuda" ]                // non-composite, no target_host
System_target = [ id: "cuda", id: "cpu" ]  // composite
optimize(E1, GPU_target)                   // func, target
set<pair<func, target_id>> S = legalize(E1, System_target)
for s in S:
  low_level_optimize(s.first, s.second)    // func, target

In cases where the (larger) system target is not needed, we could use the same target for optimization and legalization. As a result, all pairs in the set S would have the same second element.

tqchen · June 3, 2020, 7:05pm

Thanks for the example. One of our goal is to consolidate the setting into a single target so that the configuration becomes simple. In this case it should be the system target.

I still think it is useful to allow an optionally target_host(we can also change the name if we find a better alternative) for the following reasons:

It can be viewed as a “compilation option” for the “virtual target”(about how to generate the driving code).
It ensures the conciseness of the target specification when the we target a single gpu(which is the most common setting). In most cases target_host is implied to be the system cpu target, but in the case of cross compilation to a different board we will need to specify it.
We can differentiate from the real CPU + GPU mix setting(E0), as the composite target fits the later case, and we can provide specific optimizations for the programs in the form of E1.

Of course, it is important for the code generator to:

validate the target host matches the system cpu target if a composite target is specified
Fill in the target host using the target in the composition if target host is not specified in the multiple device case.