Unifying Object Protocol in the Stack


#1

Unifying Object Protocol in the Stack

We start to introduce many useful reference counted objects in the stack, including:

  • runtime::Object used by virtual machine, needed in runtime.
  • Node used to store ASTs.

These objects have quite a lot in common, as a matter of fact, the implementation
of Object and Node system is roughly the same as Node.
(or tag if we use the terminology tagged-union objects). Besides these two objects runtime::NDArray is also quite similar, and could potentially be unified.

This is an RFC to discuss for us to unify some of these(at least Node and Object) implementations into a single one. Doing so would give us a more unified runtime system that can help us to clearly define an extensive runtime for deep learning models.

It also allows us to have reusable containers such as Array and Map that can contain both runtime objects and Node.

Features and High-level Considerations

Features that benefits all objects

  • F1: reference counted object.
  • F2: self-managing (have a customized deleter) that can be called upon release.
  • F3: being future compatible to new allocation strategy(e.g. allocate object from a pool)
  • F4: runtime type casting: we want to be able to check the object type during runtime and cast them.
// Example of runtime type casting.
void Process(Expr ast) {
  if (const AddNode* ptr = ast.as<AddNode>()) {
    // Logic for add
  } else if (const SubNode* ptr = ast.as<SubNode>()) {
    // Logic for sub
  }
}

Features that needed by Node(AST), they are implemented by additional
virtual functions in the Node base class.

  • F5: Support dynamic type index allocation.
    For AST nodes, we want to add new ones as we go, we support that in Node
    by dynamically allocate the type index during runtime keyed by an unique type_key string.
  • F6: Support for recursive serialization and front-end access.
    We support serialization of arbitrary AST node via a visitor pattern that every node implements.

Features that needed by runtime Object. Note that one key requirement for runtime is minimum code size and dependency.

  • F7: static type index if possible. For a few core data structures, we want to statically
    decide their type_index as constant, to make sure we have the most efficient
    and minimum code.

Finally, we want to make objects close to POD-C types if possible. That will open path for more native language integrations in other languages(e.g. rust) that goes beyond a simple frontend API based integration. We can also start to think about introducing a DSL that allows us to generate the
implementation and code in different languages. Unifying the object protocol will open path towards
that goal, but automatic code generation for objects is beyond the scope of this RFC.

Object Protocol Proposal

This section summarizes the base object protocol proposal.
It contains three fields:

  • A type_index that can be used for runtime type checking(see more details about type checking in next section).
  • A 32 bit reference counter.
  • A deleter function
class Object {
 public:
  /*!
   * \brief Object deleter
   * \param self pointer to the Object.
   */
  typedef void (*FDeleter)(Object* self);
  /*!
   * Check if the object is an instance of TargetType.
   * \tparam TargetType The target type to be checked.
   * \return Whether the target type is true.
   */
  template<typename TargetType>
  inline bool IsInstance() const;

 protected:
  // The fields of the base object cell.
  /*! \brief Type index(tag) that indicate the type of the object. */
  uint32_t type_index_{0};
  /*! \brief The internal reference counter */
  RefCounterType ref_counter_{0};
  /*!
   * \brief deleter of this object to enable customized allocation.
   * If the deleter is nullptr, no deletion will be performed.
   * The creator of the object must always set the deleter field properly.
   */
  FDeleter deleter_ = nullptr;
};

The object header is in total 16 bytes. It is an intrusive pointer object with an additional deleter and type_index.

The existence of deleter_ gives more flexibility about where the object comes from. For example, if the object has a POD-C ABI signature, in theory we could allocate it from one DLL(or another language e.g. rust) pass to another dll, while still allows us to call the correct deleter.

Future Compatibility for Special Allocators

Although we do not do it now, we want to think about the future possibility of
introducing customized allocators. The following code gives an example of an object pool.

class ObjectPool {
 public:
   // make an new object.
   ObjectPtr<A> make();
   // release object to the pool.
   void Release(A* obj);
};

void Example(ObjectPool* pool) {
  // allocate from the pool
  ObjectPtr<A> ptr1 = pool->make();

  // Triggers ptr1.reset() when ptr1 goes out of scope.
  // instead of calling delete, we send the object goes back to obj_pool
  // Equivalent to calling: pool->Release(ptr1.get());
}

If there is a global singleton reference to the pool, we can simply set the deleter_ function to call CurrentPool()->Release. If the object pool is created during runtime, and we don’t have a global singleton, we need one additional field in addition to the object, and make the object pool allocate the chunk instead.

template<typename A>
class ObjectChunk {
 public:
   A data;
   ObjectPool* pool_;
}

template<typename A>
void Deleter(Object* obj) {
  auto* ptr = static_cast<ObjectChunk<A> >(obj);
  ptr->pool_->Release(ptr);
}

Note that we only need this if we want the object to go back to a specific pool. That is why the pool_ field is not introduced in the Object protocol, but can be added by a subsequent implementation of object pool if necessary.

Type Hierachy and Runtime Type Checking

One of the design choices that can be discussed is how to implement the type hierachy and type checking. There are several ways to do so and each of them has pros and cons.
Specifically, how can we implement the IsInstance API.

// Example of usage of IsInstance

class A : public Object {
};
class BaseB : public Object {
};
class C: public BaseB {
};

void Example() {
  // type-erased object
  ObjectPtr<Object> aptr = make_object<A>();
  ObjectPtr<Object> cptr = make_object<C>();
  // easy, check if aptr->type_index_ == A::type_index()
  CHECK(aptr->IsInstance<A>());
  CHECK(cptr->IsInstance<C>());
  // harder cannot simply do equivalence checking.
  CHECK(cptr->IsInstance<BaseB>());
}

See the above code example, things are easy if we just care about
whether an object is a terminal(final) type.
We can just check the type index.
It becomes harder if we want to support checking of an intermediate
base class (BaseB) in the above case.

There are three choices:

  • A1: Introduce _type_child_slots attributes for BaseB, make sure all its children’s type are in a fixed range.
    We can assign A::type_index = 1, BaseB::type_index=2, BaseB::type_child_slots=1, C::type_index=3
    Make sure the type indices in [2, 4) only contain sub-class of BaseB
    • Fast to type check, but has to declare the maximum number of children before-hand and pre-allocate the type index slots.
  • A2: Create a global type hierachy table, that has information about each type’s parent type
    • Works for all cases, no need to pre-declare number of child classes.
    • Might need locking the global type table if it can be dynamically updated during runtime.
  • A3: Create a chain of vtable during type construction.
    • Used by current Node
    • Need to create vtable, or introduce virtual function to the Node.

A1 gives the fatest code, but requires additional information(pre-reserve the child slots). Which could be fine for a project with a fixed type hierachy, but might be painful to allow new plugins that contains new types. A2 is the most flexible one, but could be slow(due to access to global type table and lock). A3 needs virtual function and vtable support(means additional fields in the Object),
which may or may not be what we want(if we want to move toward pure C ABI).

My current take is to support a mix of A1 and A2: allow each type declare _type_child_slots, but also allows
additional children that overflows the slots. During type checking, we first check if the type_index is
within the range(fast path), if not, we go to the global type table(slower path for rare cases).
It would be great to get more thoughts on this matter.

Unifying NDArray as an Object

Another debatable point is whether we want to unify NDArray as an Object. The advantage is clear, we could reuse the containers(Array) and remove runtime VM wrapper. This however, will mean we need to change the signature of NArray, it also brings an interesting question about how to handle sub-classes of NDArray and type checks them. We could defer the decision to a later point.

Path for Upgrade

I tried a rough prototype to get the basic mechanism setup, and can send an RFC PR for review
once we stablize the details. If we agree this is the path to go, here are the steps for an upgrade refactor.

  • Introduce the new object protocol, use it for the old runtime::Object.
  • Redirect Node to be sub-class of the new Object, redirect the functions.
  • Finish the tests and remove redundant API if any.

#2

Thank you for the proposal!

As the RFC implied, all the objects are reference counted. However, it occurs to me that there might be circular reference when RefRead/RefWrite exists. Therefore, I am wondering if it is in our scope of discussion?


#3

The user of the object will need to make sure there is no circular reference which can usually be done by creating a relation object that reference both sides.


#4

Manually guarantee no circular reference may be non-trivial. We should limit the use of mutation to a very limited scope. Otherwise, anything could happen.

For example,

A = RefCreate(...)
B = Tuple(A, ...) # B may ref to A
RefWrite(A, B)    # A refs to B

#5

In the above case, the ast (RefWrite) does not have circular deps. In the case of runtime, instead of retaining the Ref container, we could retain the container that is being referenced(if you want to do lazy eval) to break the cycle.


#6

If I understand correctly, most of the RFC discusses about type hierarchy and runtime type checking, without RTTI or virtual function.

Might be a silly question, but I was wondering what “support dynamic type index allocation” means. Is that “creating a new class which subtypes an existing one in runtime”?


#7

Thanks for the question! We assign a type_index to each class. One way to do that is to directly list all the type index before hand statically(as in type_code in TVM).

However, because we may not anticipate how many classes there are, and could have dependent modules creating their own object types. In the node system, we allow each sub-class to provide a unique str key(e.g relay.Function), and will use it to allocate a unique type index for the class during runtime.


#8

In the node system, we allow each sub-class to provide a unique str key(e.g relay.Function ), and will use it to allocate a unique type index for the class during runtime.

This is exactly what TVM Node system does, but this requires vtable. A common trick to avoid vtable is to explicitly alloc function points like how we use FDeleter. Is that a feasible solution?

However, because we may not anticipate how many classes there are, and could have dependent modules creating their own object types.

I see. So how many dependent modules there might be? If there are not many, we can re-calculate the euler-tour order every time we load the library.


#9

Yes, we have listed three possible options(A1, A2, A3). A3 is like a vtable based solution. The current proposal is a combination of A1, and A2.

Unfortunately it is a bit hard to re-allocate the type index, because we cache them in static variable for efficiency reasons and we cannot go back and correct the already allocated objects


#10

Unfortunately it is a bit hard to re-allocate the type index, because we cache them in static variable for efficiency reasons and we cannot go back and correct the already allocated objects

Why not we maintain an array like euler_order_range[type_index_]?


#11

This is what was proposed in A1(force type index of all the sub-classes to be in a range). The con is that we need to pre-declare the maximum amount of sub-classes. So we use A2 as a fallback to handle overflow (if it happens)


#12

This is what was proposed in A1(force type index of all the sub-classes to be in a range)

Hmmm…While euler tour of tree exactly means “force some index to be contiguous within a tree”, I am not saying that is type index. What I am trying to say is that we can map those immutable type_index to a mutable index using an array called euler_order_range.

In other words, we allow type_index to be arbitrary, but enforce $euler_order_range(type_index(child)) \subseteq euler_order_range(type_index(parent))$, where euler_order_range can be implemented as a global array. Every time after we load a new module, we can re-calculate this array.

Is that this solution much like A2?


#13

I am not sure if I understand correctly. So the basic reason that we need A1 + A2 is that A2 alone seems slow for 2 reasons.

  1. access global structure
  2. might need a global lock

For 2), I am not sure if we really need a lock, or we can ask developers to ensure modules are loaded sequentially. For 1), I am not sure this would cause slowdown…


#14

Interesting, I like your idea about having the new indirection. Right now I am just implementing a parent tree traversal for checks given that A1 is already fast for most cases.

The main reason for lock is to make things thread-safe and simple(because we are using vector). It could be possible that lock was not needed(but atomic is still needed to guarantee the effect being observed by other thread) if we ensure the update won’t happen in execution phase.


#15

Given the fact (?) that loading modules is rare, I think we could either enforce loading to be sequential and loading and reading shouldn’t happen concurrently; or we can have a reader-writer lock


#16

Actually dmlc registry is not quite thread safe either


#17

I can implement the idea that I proposed. Could be done in a single method within 100 lines without recursion.


#18

OK, I will try to send in a draft PR, once we land the first version, perhaps we can then improve it possibly with your suggestion, how does that sound? Implementation wise there is no breaking changes to interface. Now that I think about it, if we can remove locking, the cost of parent traversal might be OK because most class does not have a deep hierachy


#19

Sure. My code is available here. Should be correct as I tested against random generated tree of 10k nodes.


#20

I prefer A2 as its design is more clean and systematic. If we find performance issues later, we can then add back A1 for fast pass.