[DISCUSS] The meaning of "float" in Relay

Hi,

so here is something I bumped into: "float" means different things in different places, not always as expected. The background is that C/C++, PyTorch, and others interpret float to mean 32-bit floating point numbers, aka float32, and arguably float32 is the most common datatype in deep learning. NumPy, on the other hand, interprets it to be float64.

>>> torch.float                                                                                                                                                                                               
torch.float32

>>> numpy.ones(1, dtype='float').dtype                                                                                                                                                                         
dtype('float64')

This carries over to the PyTorch JIT, too - notice the Float:

>>> torch.jit.trace(lambda x: x, torch.ones(1, dtype=torch.float32)).graph                                                                                                                                      
graph(%0 : Float(1:1)):
  return (%0)

Now TVM Relay is sympathetic to both views:

>>> tvm.relay.expr.const(1, dtype="float").data.dtype                                                                                                                                                          
'float64'

>>> tvm.relay.expr.const(1.).data.dtype                                                                                                                                                                        
'float32'

>>> tvm.relay.expr.var("x", "float")                                                                                                                                                                           
Var(x, ty=TensorType([], float32))

To the naive user that I am, there seems to be an inconsistency between what dtype="float" means for const and var. In the ill-fated PR #5756 I proposed to make const - which currently defers to NumPy for its decision - consider float to mean float32.

This has, however, met opposition and the request that I highlight it here for discussion.

This hit me while looking at the PyTorch frontend.

Some obvious potential routes of action w.r.t. the behaviour of the dtype argument in relay.expr.var vs relay.expr.const:

  • Keep as is,

  • standardize on float=float32,

  • standardize on float=float64,

  • prohibit passing float as dtype but insist on float32 or float64. This seems safest but would mean that we would want to fix everything using it.

    Variants could be with or without some deprecation warning as an intermediate step.
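For illustration, the "float = float32 with a deprecation warning" variant could look roughly like this sketch; `normalize_dtype` is a hypothetical helper for this discussion, not an actual TVM API:

```python
import warnings

def normalize_dtype(dtype):
    # Hypothetical helper: map the ambiguous "float" to "float32",
    # emitting a DeprecationWarning so callers can migrate to an
    # explicit "float32"/"float64" before the alias is removed.
    if dtype == "float":
        warnings.warn(
            '"float" is ambiguous; pass "float32" or "float64" explicitly',
            DeprecationWarning,
        )
        return "float32"
    return dtype
```

A stricter variant would raise instead of warning, which corresponds to the last bullet above.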

In terms of cleaning up code using "float", I have a PR to submit today or tomorrow that attempts to clean up the use of types in the PyTorch frontend (mainly distinguishing more clearly between "this is a TVM dtype" and "this is a PyTorch scalar type name"), but I won't be looking at other use(r)s.

I would like to add that I have no immediate stake in it - I thought it would be useful to allow consistent use of float as a dtype, but I personally have since resolved to simply not use "float" for identifying a dtype in TVM.

Best regards

Thomas


cc @jroesch @zhiics @comaniac @liangfu @junrushao1994

CC: @antinucleon as well

The problem is whether we want to be 100% numpy compatible.

As far as I understand, the goal and benefit of being XXX-compatible is to bring convenience to end users who are familiar with XXX.

In this case, as a deep learning compiler stack - like any other DL framework - fp32 is used as the default, because fp64 is way too expensive and not all that useful. This is the consensus among the end users we are facing.

Therefore, I would prefer to use fp32 as the default; although not compliant with NumPy, this is a good choice that deep learning practitioners are comfortable with :slight_smile:

Agree with @junrushao1994. I think we should use fp32 as default instead of fp64 as it’s more common in deep learning.

I think fp32 makes sense. The problem is that NumPy defaults are arch-specific, and on 32-bit platforms the default sizing for both integers ("int") and floating point ("float") is different - a problem that has plagued us repeatedly. The behavior of const, iirc, tries to normalize away the platform-specific behavior. In general I think it's best to be a) consistent, b) opinionated.
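As a concrete check of those defaults (a quick demo, assuming NumPy is installed): on current NumPy the dtype string "float" resolves to float64, while "int" follows the platform's C long, which is where the platform-specific sizing shows up:

```python
import numpy as np

# The dtype string "float" maps to the Python builtin float,
# i.e. a 64-bit IEEE double, regardless of platform.
print(np.dtype('float'))           # float64
print(np.dtype('float').itemsize)  # 8

# The dtype string "int" follows the platform's C long, so its
# size differs across platforms (e.g. 4 bytes on 32-bit targets).
print(np.dtype('int').itemsize)
```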

Any thoughts about disambiguating by giving users a warning when float is used (and asking them to use float32 or float64)?


I agree with @tqchen that making float throw a warning is a good option. It's an ambiguous declaration - even in C it means different things on different systems and in different contexts. Being more precise is probably better.

+1 for making fp32 as default as fp64 may not be that useful and it could possibly increase memory footprint and reduce performance (i.e. occupying more SIMD lanes).

I also agree that we can make float more explicit.

Yeah, I agree with @tqchen and @mbrookhart because float is probably arch-dependent, which is not desirable in most cases. Also, there might be a lot of uses of Float in the frontends, like the PyTorch importer, so we might have to keep the warning level low.

Agree to make float32 the default and throw a warning if float is given without specifying the bit width. On the other hand, simply putting the warning at a lower level may not be effective; I believe most users do not change the log level during development. I would prefer the approach that @haichen used before for the warning about the old AutoTVM format version.


So it seems that “float = float32” but with a warning might be good?

Personally, I had been thinking of a Python warning, so anyone can decide to treat it as an error / ignore it / …, but @comaniac, is this AutoTVM warning what you had in mind?

Best regards

Thomas

Here is another idea:

  • “float = float32” but with a warning
  • Add an env variable TVM_STRICT_MODE to force the usage of “float” to throw, and enable the flag in the CI, so that we fix all the usage in our current codebase
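Sketched out, that two-part idea might look like the following; `TVM_STRICT_MODE` is the proposed variable name from this thread, while `coerce_float` is an illustrative helper, not actual TVM code:

```python
import os
import warnings

def coerce_float(dtype):
    # Illustrative sketch of the proposal: in strict mode (enabled in
    # CI via the TVM_STRICT_MODE env variable), a bare "float" raises,
    # so remaining uses in the codebase get flushed out; otherwise it
    # warns and is interpreted as "float32".
    if dtype != "float":
        return dtype
    if os.environ.get("TVM_STRICT_MODE"):
        raise ValueError('ambiguous dtype "float"; use "float32" or "float64"')
    warnings.warn('interpreting "float" as "float32"')
    return "float32"
```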

Does TVM_STRICT_MODE fail the CI when warnings are thrown? That looks unsustainable to me, because it is not part of a normal logging system, so people can easily forget it.

My understanding of how to choose log levels is that if we hope to show messages to end users, we should use INFO and WARNING. If we only want to show them to developers, we should use DEBUG (FATAL is not a case in this discussion, as it intentionally crashes the process). The "interpreting float as float32" message is one that we wish end users to see (correct me if I am wrong), so it should be a WARNING. Here are two proposals to deal with this:

P1. We intentionally display that many warnings to users, to let them know how many cases were processed when converting a model. This behavior is similar to other compilers like gcc. However, it could be annoying.

P2. A mechanism to show the same warning only once. That's why I pointed out the AutoTVM case (the link @t-vi posted). Maybe we could make this mechanism more general.
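A generic warn-once mechanism along the lines of P2 could be as small as the following sketch (illustrative only, not the AutoTVM implementation):

```python
import warnings

_seen_messages = set()

def warn_once(message):
    # Emit each distinct warning message only the first time it occurs,
    # so converting a large model does not flood the user with repeats
    # of the same "interpreting float as float32" warning.
    if message not in _seen_messages:
        _seen_messages.add(message)
        warnings.warn(message)
```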

I actually meant a TVM_STRICT_MODE that changes the "float" handling behavior to throw directly, not one that intercepts the warnings. This way we can clean up the use of "float" in our own codebase but still allow users to use it.

Ah, I see. That makes sense. Then how about putting it in config.cmake as something like SET(STRICT_MODE ON)?

Something along that direction. In the meanwhile, it seems we are converging:

  • convert default to fp32 and add warning
  • fix the float occurrences to use fp32

I agree with this. Defaulting to float64 would kill performance for most beginner users, which is not friendly.

OK, to summarize, the actionable items include:

  • convert default to fp32
  • fix the float occurrences to use fp32

@t-vi thanks for bringing up the topic, perhaps we can reopen your PR about fp32 default change?

So far the PR only changes the default. Is there an example of the strict mode that could be followed? Also, my changes for the PyTorch frontend are intertwined with fixes to deal with non-fp32 types in general (probably a property of my branch rather than a necessity), and I would not want to reopen that.

Best regards

Thomas