[RFC] Canonicalizing AutoTVM Log Format

anwang · June 19, 2020, 11:45pm

Currently AutoTVM logs are serialized very casually here and the exact format is often dependent on how individual classes scattered across the codebase choose their string representations.

This may lead to a scenario where the serialization could implicitly change when an unrelated change is made to a field involved in AutoTVM log encoding.

We propose to canonicalize the AutoTVM log format and give it a defined structure by constructing typed Python classes that produce logs from a programmatically defined schema. Encoding method definitions are omitted.

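from __future__ import annotations  # lets classes reference types defined further down

from abc import ABC, abstractmethod
from typing import Any, Dict, List, Union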

class AutoTVMLog:
  input: Input
  config: Config
  result: Results
  version: str
  tvm_version: str

class Input:
  target: str
  task_name: str
  args: List[Argument]
  kwargs: Dict[str, Any]

class Argument(ABC):
  pass  # no fields

class Tensor(Argument):
  name: str
  shape: List[int]
  dtype: str

class Tuple(Argument):
  values: List[Any]

class String(Argument):
  value: str

class Config:
  code_hash: str
  entity: List[Entity]
  index: int

class Entity:
  knob_name: str
  knob_type: str
  entity_repr: Union[SplitEntity, ReorderEntity, AnnotateEntity, OtherOptionEntity]

class Results:
  costs: List[float]
  error_no: int
  all_cost: float
  timestamp: float
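
Since the encoding methods are omitted above, here is a minimal sketch of what one could look like, assuming the classes are written as dataclasses and the output stays JSON. `LogRecord` is a hypothetical, trimmed-down stand-in for `AutoTVMLog` that keeps only the result and version fields, and the `tvm_version` value is a placeholder.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Results:
    costs: List[float]
    error_no: int
    all_cost: float
    timestamp: float


@dataclass
class LogRecord:
    # Hypothetical, trimmed-down stand-in for AutoTVMLog.
    result: Results
    version: str
    tvm_version: str

    def encode(self) -> str:
        # asdict() recurses into nested dataclasses; sort_keys makes the
        # output independent of the order in which fields are declared.
        return json.dumps(asdict(self), sort_keys=True)


record = LogRecord(
    result=Results(costs=[0.002], error_no=0, all_cost=1.5, timestamp=1592609100.0),
    version="0.3",
    tvm_version="0.7.dev1",  # placeholder value
)
line = record.encode()
print(line)
```

With this shape, the serialized form follows from the declared fields rather than from each class's string representation.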

Note that adding this code is not necessarily intended to change the output log schema in any way; it is intended to clarify the schema so that there is a single source for future log modifications. However, I have listed below some possible cosmetic changes that would be nice to resolve as a drive-by.

Clarifications and fixes:

tvm_version is a float in the codebase; we will correct this to str. This will require bumping the schema version to “0.3”.
“Config” is currently represented in list form, e.g. [["tile_oh", "ot", 1], ...]. We can modify this format to something more readable but longer, like [{"knob_name": "tile_oh", "knob_type": "ot", "entity_repr": 1}, ...]. Thoughts about keeping the original representation vs. changing to a longer, more readable format?
autotvm.task.kwargs is unused in AutoTVM as indicated here. Keep or remove AutotvmRecord.Input.kwargs from the record schema?
In my own experiments with AutoTVM I have consistently observed code_hash's value is null in output logs. This is also true for all schedules on tophub. Remove or file issue to fix?
AutoTVMRecord or AutoTVMLog? Other naming concerns?
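
To make the trade-off between the two config representations concrete, here is a small sketch that converts between the current list form and the longer keyed form. The helper names `entities_to_readable` and `entities_to_compact` are hypothetical and not part of AutoTVM.

```python
from typing import Any, Dict, List


def entities_to_readable(entities: List[List[Any]]) -> List[Dict[str, Any]]:
    """Convert [[name, type, repr], ...] into a list of keyed dicts."""
    readable = []
    for knob_name, knob_type, entity_repr in entities:
        readable.append(
            {"knob_name": knob_name, "knob_type": knob_type, "entity_repr": entity_repr}
        )
    return readable


def entities_to_compact(entities: List[Dict[str, Any]]) -> List[List[Any]]:
    """Inverse of entities_to_readable."""
    return [[e["knob_name"], e["knob_type"], e["entity_repr"]] for e in entities]


compact = [["tile_oh", "ot", 1], ["tile_ow", "sp", [-1, 1]]]
readable = entities_to_readable(compact)
print(readable)
```

The two forms carry the same information, so the choice is mostly about line length versus self-description.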

anwang · June 19, 2020, 8:33pm

@jroesch @merrymercy @tqchen

Here are my own takes:

I prefer the more readable but longer format.
I prefer to remove kwargs as it’s unused for cleanliness.
I’m unsure how useful the idea of code_hash is for developers currently.
I think “Log” is more canonical in the tvm community but will defer to other takes.

mdw-octoml · June 19, 2020, 8:50pm

Since this is intended as an interchange format, I would strongly advocate for using an industry standard, rather than relying on serialized Python objects. Protobuf would be my first choice. This enables ingestion and processing of AutoTVM logs by tools not written in Python, and provides for versioning, the ability to add, change, and deprecate fields, strong typing, and so forth – all things that may not be on the radar screen right now, but no doubt will be as the content of the logs evolves over time.

tqchen · June 19, 2020, 9:08pm

Given that the format will likely evolve in Ansor, we might need to leave certain fields opaque and keep things at the top level for now.

In particular, the current top level fields include:

input (describes the computation, or workload)
config (describes the set of schedule configs to apply to the workload)
results (performance result)
version

The specific definitions of input and config can change as we evolve from AutoTVM to Ansor. The results remain relatively stable, so that is what we can discuss and nail down. It might also be interesting to think about which parts of the input are non-opaque. Perhaps we can first agree on the top-level schema and refine it as the config/input becomes clearer in Ansor.
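
As a rough illustration of this idea, the sketch below types only the top level and the results, and carries input and config through as opaque JSON objects. The class name `TopLevelLog` and the sample values are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Results:
    costs: List[float]
    error_no: int
    all_cost: float
    timestamp: float


@dataclass
class TopLevelLog:
    input: Dict[str, Any]   # opaque: passed through without interpretation
    config: Dict[str, Any]  # opaque: passed through without interpretation
    results: Results        # the stable part, fully typed
    version: str

    @classmethod
    def from_json(cls, line: str) -> "TopLevelLog":
        raw = json.loads(line)
        return cls(
            input=raw["input"],
            config=raw["config"],
            results=Results(**raw["results"]),
            version=raw["version"],
        )


line = json.dumps({
    "input": {"anything": ["the", "tuner", "needs"]},
    "config": {"index": 7},
    "results": {"costs": [0.01], "error_no": 0, "all_cost": 2.0, "timestamp": 1592600000.0},
    "version": "0.3",
})
log = TopLevelLog.from_json(line)
print(log.results.costs, log.config)
```

A reader written this way keeps working when the shape of input or config changes, as long as the top-level keys and the results stay the same.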

comaniac · June 19, 2020, 9:19pm

Please also refer to this topic that moves AutoTVM log version to 0.2. Some issues have been discussed there: AutoTVM log format

anwang · June 19, 2020, 11:41pm

@mdw-octoml I don’t think there’s currently enough interest to justify adding Protobuf as a dependency in TVM. TVM users are used to readable json for their autotvm logs. If there is more interest from the broader community, we can revisit this.

@tqchen I feel we may not need to re-discuss the schema heavily since it was already discussed very recently in the RFC @comaniac shared and, as you say, ansor will likely introduce modifications. Maybe instead of schema specifics this RFC should be more about adding code structure to log production so that there is a single source for future log modifications (I have no problems with preserving the current log format exactly as it is). I will clarify this in the original post.

haichen · June 20, 2020, 12:46am

I agree with @tqchen. We should probably wait to see what the Ansor log looks like and include it in the design. We could have @merrymercy comment on this.

At a high level, I suggest we have five fields: target, workload, config, results, version. The only change is taking the target out of the original input field, while having the workload describe the computation.

I agree that protobuf is good for this purpose. But I’d prefer that we still output the log into a text format so that it’ll be easy to quickly check the details.

anwang · June 22, 2020, 8:11pm

My thoughts are that the suggested change to add Python structure shouldn’t necessarily depend on what the log format will look like, so I don’t think there is a need to wait for the Ansor log format. (I imagine Ansor coders have their hands full, and that they’d prefer to consider polish later in their process).

The main value add of this proposal is to enable clearer conversations about schema changes in the future.

For example, @haichen is this an accurate summary of your suggested changes?

class AutoTVMLog:
  target: str                     # added
  workload: Workload              # modified from "input: Input"
  config: Config
  result: Results
  version: str
  tvm_version: str

class Workload:                   # added
  task_name: str
  args: List[Argument]
  kwargs: Dict[str, Any]

haichen · June 22, 2020, 9:31pm

We can probably canonicalize the target as well (e.g., as a protobuf message) instead of keeping it as a string. For the target format, we can refer to [RFC] TVM Target Specification. @tqchen

anwang · June 23, 2020, 9:25pm

I’ve thought about this some more, and I’m changing my stance with respect to ProtoBuf. While adding a Python class schema is a less invasive change than introducing ProtoBuf and allows us to stick to the current log format exactly, protos do have the added benefit of being language-neutral. Moving forward, sticking to “industry standard” practices (as @mdw-octoml indicated) will also likely enable more clarity around schema changes and enforce, to some extent, more backwards compatibility than we’ve seen so far.

To that end, here is a resummarization of the proposed schema in .proto. Comments are left for modifications. Note this will certainly require an update from 0.2 -> 0.3 schema format and implementation details may change slightly. I would also send a PR to tophub accordingly if people agree to this change.

syntax = "proto3";
package autotvm.log;
import "google/protobuf/any.proto";

message Target {
  // For now this is the string representation of a target; e.g. "llvm -mcpu=broadwell"
  // This should be replaced once the rfc "TVM Target specification" is finalized
  string target_string = 1;
}

message AutoTVMLog {
  Target target = 1;
  Workload workload = 2;
  Config config = 3;
  Result result = 4; 
  string version = 5;
  string tvm_version = 6;
}

message Workload {
  string task_name = 1;
  repeated Argument args = 2;
  // kwargs is no longer included as it is unused
}

message Argument {
  oneof arg {
    Tensor tensor = 1;
    // Possible tuple values are not well specified and may require more sorting out
    // https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/task/task.py#L43-L63
    Tuple tuple = 2;
    string value = 3;
  }
}

message Tensor {
  string name = 1;
  repeated uint32 shape = 2;
  string dtype = 3;
}

message Tuple {
  repeated google.protobuf.Any values = 1;
}

message Config {
  string code_hash = 1;
  repeated Entity entities = 2;
  uint32 index = 3;
}

message Entity {
  // Entities are previously output as `[["tile_ow", "sp", [-1, 1]], <other_entities>]`
  // The proposed encoding clarifies entity type in the schema itself instead of as a string
  string knob_name = 1;
  oneof entity {
    SplitEntity split = 2;
    ReorderEntity reorder = 3;
    AnnotateEntity annotate = 4;
    OtherOptionEntity other_option = 5;
  }
}

message SplitEntity {
  repeated int32 size = 1;
}

message ReorderEntity {
  repeated uint32 order = 1;
}

message AnnotateEntity {
  repeated string annotations = 1;
}

message OtherOptionEntity {
  google.protobuf.Any value = 1;
}

message Result {
  repeated float costs = 1;
  int32 error_no = 2;
  float all_cost = 3;
  float timestamp = 4;
}

As an example, the json will look like

{
  "target": {
    "target_string": "llvm -mcpu=broadwell"
  },  
  "workload": {
    "task_name": "conv2d_x86_64",
    "args": [{"tensor": {"name": "tensor_name","shape": [1,2,3],"dtype": "float32"}}]
  },  
  "config": {
    "code_hash": "codehashtest",
    "entities": [{"knob_name": "tile_ic","split": {"size": [4,32]}}],
    "index": 1
  },  
  "version": "0.3",
  "tvm_version": "todo get tvm version"
}

To avoid breaking workflows that assume readable log output by default, I suggest we simply add “protobuf” as an encode/decode/file logging option in https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/record.py. The default serialization format will still be “json”, but all serialization schemes will be backed with the proto-generated schema. @haichen @jroesch @tqchen what do you think?
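
A minimal sketch of how such a protocol option could be dispatched, with “json” remaining the default. The function `encode_record` and the registry are hypothetical and do not mirror the actual signatures in record.py; the protobuf branch is left as a placeholder because it would depend on the proto-generated classes.

```python
import json
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def _encode_json(record: Record) -> bytes:
    return json.dumps(record, sort_keys=True).encode("utf-8")


def _encode_protobuf(record: Record) -> bytes:
    # Placeholder: a real implementation would populate the generated
    # AutoTVMLog message and call SerializeToString() on it.
    raise NotImplementedError("requires the proto-generated classes")


_ENCODERS: Dict[str, Callable[[Record], bytes]] = {
    "json": _encode_json,
    "protobuf": _encode_protobuf,
}


def encode_record(record: Record, protocol: str = "json") -> bytes:
    """Serialize a record with the requested protocol (JSON by default)."""
    try:
        encoder = _ENCODERS[protocol]
    except KeyError:
        raise ValueError(f"Invalid log protocol: {protocol}") from None
    return encoder(record)


payload = encode_record({"version": "0.3", "target": {"target_string": "llvm"}})
print(payload.decode("utf-8"))
```

Callers that never pass a protocol keep getting readable JSON, which is the property the proposal wants to preserve.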

tqchen · June 23, 2020, 10:02pm

The proposal looks good. Notably, the config will need to evolve as we migrate to Ansor, so perhaps we could try to keep it opaque, or find a way to upgrade later.

anwang · June 25, 2020, 12:31am

I think the main benefit of keeping the ProtoBuf opaque is avoiding the unnecessary effort of fleshing out a schema that will change very soon. However, since I already have a full specification described here, I prefer to go ahead with it, unless there are other concerns I have missed.

I suggest that the process for upgrading this schema should be opening an RFC like this one (ideally linking a PR with the desired .proto changes).

I would also like to point out some caveats with ProtoBuf usage.

It’s highly encouraged that proto fields are never removed, but instead marked with a “deprecated” flag unless you are aware you will break backwards compatibility.

For the Ansor changes, if we are deprecating AutoTVM 1.0 entirely, I think it would be OK to remove fields as needed. In that situation, resolving this RFC with a fully specified schema makes even more sense, as it would be good for people to have an explicit schema to refer to for pre-Ansor logs.

tqchen · June 25, 2020, 1:09am

cc @merrymercy @zhiics @haichen @FrozenGene @comaniac @ajtulloch @antinucleon @junrushao

mdw-octoml · June 25, 2020, 4:16am

The proto representation looks good to me. I have a couple of suggestions based on prior experience designing proto-based data formats.

I recommend the use of enums rather than strings for values that are constrained to a small, fixed-size set. For example, the dtype field in the Tensor message should probably (I think!) be an enum.
I don’t know the use case for the google.protobuf.Any fields in the spec, but in general I would recommend making these specific message types or ‘oneof’ fields whenever possible.
There may be places that you wish to tighten up the semantics of the existing log format, rather than simply encoding the existing format as a proto. For example, I would recommend being explicit about the meaning of the ‘version’ field (e.g., should this be a SemVer-type version string?). Likewise, use of a float value for timestamps can lead to imprecision, unless timestamp means something different here than it does in most other systems – uint64 storing microseconds since the epoch, or a string holding an ISO-8601 formatted timestamp would be better.
For the case of the Config message, if you believe it will soon change or differ based on new functionality coming along, consider using a oneof field with a single submessage for the existing Config.
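
The timestamp concern can be checked directly. A proto `float` is 32 bits wide, with a 24-bit significand, so near current Unix timestamps (around 1.6e9) adjacent representable values are 128 seconds apart. The sketch below round-trips a timestamp through 32 bits using only the standard library and compares it with integer microseconds; the helper names are made up for this example.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Store value as a 32-bit float (like a proto `float`) and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_epoch_micros(seconds: float) -> int:
    """Convert seconds since the epoch to integer microseconds."""
    return round(seconds * 1_000_000)


timestamp = 1593043260.25  # a moment in late June 2020
as_float32 = roundtrip_float32(timestamp)
as_micros = to_epoch_micros(timestamp)

print("float32 error in seconds:", abs(as_float32 - timestamp))
print("micros round-trip exact:", as_micros / 1_000_000 == timestamp)
```

A proto `double` would avoid most of this loss, but an integer field also removes any ambiguity about units and rounding.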

tqchen · June 25, 2020, 3:12pm

Some comments on the dtype: the dtype field in Tensor is actually quite flexible (it goes beyond an enumeration, since arbitrary vector lengths, bit widths, and customized data types are also allowed). So perhaps a string, or a structured variant, makes sense. We can continue to use a string for simplicity and consistency with the Python side of the repr. Alternatively, one could design a further composite encoding, but that would involve parsing and printing of the type string, which could be overkill here.

mdw-octoml · June 25, 2020, 3:59pm

I see. In my experience, it is worth making this a structured type, even if it seems painful at first. In the long run, having to maintain custom parsing logic for just one of your fields (where the others are all structured) ends up being a maintenance burden. I’m a strong advocate for using structured types as they were intended to be used.

tqchen · June 25, 2020, 4:01pm

In this case the parsing is already necessary and built in, because the numpy convention uses a string for dtype. So we are trying to build compatibility for interoperating with something that already exists. The types on the C++ side are structured.

mdw-octoml · June 25, 2020, 4:16pm

Gotcha. In that case I think it’s important to document that the format of the field is the type string used by numpy.
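
For reference, this is roughly what the parsing and printing of a type string would involve if a structured variant were ever wanted. It is a sketch that assumes strings of the form `<kind><bits>` with an optional `x<lanes>` suffix (for example "float32" or "int8x4"); customized data types and other special cases are deliberately not handled, and the names `DType`, `parse_dtype`, and `format_dtype` are hypothetical.

```python
import re
from typing import NamedTuple


class DType(NamedTuple):
    kind: str   # e.g. "int", "uint", "float"
    bits: int
    lanes: int  # 1 for scalars


_DTYPE_RE = re.compile(r"^(?P<kind>[a-z]+)(?P<bits>\d+)(?:x(?P<lanes>\d+))?$")


def parse_dtype(text: str) -> DType:
    """Split a type string such as "int8x4" into kind, bits and lanes."""
    match = _DTYPE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized dtype string: {text!r}")
    lanes = int(match.group("lanes") or 1)
    return DType(match.group("kind"), int(match.group("bits")), lanes)


def format_dtype(dtype: DType) -> str:
    """Inverse of parse_dtype."""
    suffix = f"x{dtype.lanes}" if dtype.lanes != 1 else ""
    return f"{dtype.kind}{dtype.bits}{suffix}"


print(parse_dtype("float32"))
print(parse_dtype("int8x4"))
```

Whichever encoding is chosen, documenting the accepted grammar next to the field is what makes the string safe for non-Python consumers.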

merrymercy · June 26, 2020, 3:46am

Difference between the logs for Ansor and AutoTVM

There are two major differences between ansor’s log and autotvm’s log

The workload for Ansor is a subgraph defined by multiple tvm.compute, while the workload for AutoTVM is a single operator. To index logs quickly, Ansor stores a hash value of the subgraph as the workload key.
Ansor saves the whole serialized schedule as config (in json format), while autotvm only stores the parameters.

However, Ansor’s new log format can still fit into @tqchen’s design of top-level fields.

Other thoughts

The current log file is an append-able text file, where one line corresponds to one log item. I can edit it with a text editor. If we use a binary format, I want this property to be preserved.
If we make the log longer and more readable, there will be a lot of redundancy in the file. For example, for a single tuning job, the same long target string will appear in all lines. Do we have methods to compress it?
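
One possible answer to the redundancy question is to keep the one-record-per-line text format and compress the file as a whole. The sketch below builds 200 synthetic log lines that repeat the same target string and compresses them with gzip from the standard library; the record contents are invented for the example.

```python
import gzip
import json

target = "llvm -mcpu=broadwell"  # repeated verbatim in every record

lines = []
for index in range(200):
    record = {
        "target": {"target_string": target},
        "config": {"index": index},
        "version": "0.3",
    }
    lines.append(json.dumps(record))  # json.dumps emits a single line by default

raw = ("\n".join(lines) + "\n").encode("utf-8")
compressed = gzip.compress(raw)

# Decompressing gives back the same line-oriented text.
restored = gzip.decompress(compressed).decode("utf-8").splitlines()

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
```

The text stays editable after decompression, and because concatenated gzip members form a valid gzip stream, new batches can still be appended to the compressed file.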

comaniac · June 26, 2020, 5:33pm

General Comments

IMHO, @merrymercy’s comments on log files are valuable. Many users now look into the log file for the information they need, and even manually modify some logs for experiments or optimizations. This is possible because 1) the log files are in text format, and 2) one config (line) in a log file is of reasonable length. As a result, at a high level I agree with @anwang’s proposal to keep the log file in JSON format but use a proto-generated schema to (de)serialize it. IIUC, this approach still allows users to modify the log file manually if needed.

On the other hand, one concern I have with the current proposal is the workload. In terms of semantics, the workload mentioned in the proposal is more like a task, as it has task_name and args. A workload should be a list of input tensors that is independent of tasks. Here is a complete example of a conv2d task:

"task": {
  "task_name": "conv2d_NCHWc.x86",
  "args": [{"tensor": {"name": "data","shape": [1,3,224,224],"dtype": "float32"}},
           {"tensor": {"name": "weight","shape": [32,3,3,3],"dtype": "float32"}},
           [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "NCHW", "float32"]
},

In addition, one problem is that args is just a list of task arguments, so it’s hard for people to understand the actual meaning. It’d be great if we could also improve the task initialization process to take keyword arguments instead of positional arguments. As a result, we could have:

"task": {
  "task_name": "conv2d_NCHWc.x86",
  "args": {"data": {"tensor": {"name": "data","shape": [1,3,224,224],"dtype": "float32"}},
           "weight": {"tensor": {"name": "weight","shape": [32,3,3,3],"dtype": "float32"}},
           "strides": [1, 1],
           "padding": [1, 1, 1, 1],
           "dilation": [1, 1],
           "data_layout": "NCHW",
           "output_layout": "NCHW",
           "dtype": "float32"}
},
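
Until task initialization itself takes keyword arguments, a log writer could attach names to the positional list. The following is a sketch under the assumption that each task can supply an ordered list of its parameter names; `name_arguments` and the shortened three-argument example are hypothetical.

```python
from typing import Any, Dict, Sequence


def name_arguments(param_names: Sequence[str], args: Sequence[Any]) -> Dict[str, Any]:
    """Pair positional task arguments with their parameter names."""
    if len(param_names) != len(args):
        raise ValueError(
            f"expected {len(param_names)} arguments, got {len(args)}"
        )
    return dict(zip(param_names, args))


# Shortened example: only the first three arguments of a conv2d-like task.
positional = [
    {"tensor": {"name": "data", "shape": [1, 3, 224, 224], "dtype": "float32"}},
    {"tensor": {"name": "weight", "shape": [32, 3, 3, 3], "dtype": "float32"}},
    [1, 1],
]
named = name_arguments(["data", "weight", "strides"], positional)
print(named["strides"])
```

The length check matters here: a silent mismatch between names and values would produce a log that looks readable but is wrong.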

Ansor’s Log Format

As @merrymercy mentioned, since Ansor targets a subgraph instead of a single operator, the task_name would be an issue. The current approach of using a hashed subgraph is definitely not user friendly, and we cannot reconstruct the subgraph by interpreting its hash value. A better solution would be to provide one utility to serialize a compute DAG to a string and another to deserialize the string back to the compute DAG.
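
To illustrate the difference between a hash key and a reversible string, here is a toy sketch. The DAG is modelled as a plain list of nodes, which is an assumption made for the example and not Ansor's actual compute DAG representation; the function names are hypothetical as well.

```python
import hashlib
import json
from typing import Any, Dict, List

# Toy stand-in for a compute DAG: each node names an op and its input nodes.
Dag = List[Dict[str, Any]]


def serialize_dag(dag: Dag) -> str:
    """Canonical, single-line string form; sorted keys keep it deterministic."""
    return json.dumps(dag, sort_keys=True, separators=(",", ":"))


def deserialize_dag(text: str) -> Dag:
    """Rebuild the DAG from its string form."""
    return json.loads(text)


def workload_key(dag: Dag) -> str:
    """Short lookup key derived from the string form; not reversible."""
    return hashlib.sha256(serialize_dag(dag).encode("utf-8")).hexdigest()


dag = [
    {"name": "data", "op": "placeholder", "inputs": []},
    {"name": "conv", "op": "conv2d", "inputs": ["data"]},
    {"name": "relu", "op": "relu", "inputs": ["conv"]},
]
text = serialize_dag(dag)
print(text)
print(workload_key(dag))
```

The hash remains useful for fast indexing, while the string form is what lets a reader recover the subgraph from the log alone.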