Improve testing times using pytest-xdist or pytest-parallel

Is there any reason why we don’t look at using pytest-xdist or pytest-parallel? Currently it takes forever to run all the tests before submitting, and it does look like using pytest-xdist and the pytest -n option might help with improving test times. We don’t need to use all cores - I suspect we will have issues with TVM_NUM_THREADS et al. in some parts of our testing - but it still feels quite wasteful for some of the relay tests to be running purely single-core.
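If we go down this route, the TVM_NUM_THREADS concern could perhaps be handled centrally in a conftest.py along these lines (just a sketch, not benchmarked; pinning each worker to one TVM thread is an assumption):

```python
# conftest.py (sketch): pin each pytest-xdist worker to a single TVM thread
# so that `pytest -n <N>` does not oversubscribe the cores.
import os

def pytest_configure(config):
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0") in worker processes.
    if os.environ.get("PYTEST_XDIST_WORKER"):
        # Only set it if the user has not already chosen a value.
        os.environ.setdefault("TVM_NUM_THREADS", "1")
```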

I don’t have any measurements yet, but playing with pytest -n 16 on my box shows quite a few speed-ups while running tests/scripts/task_python_integration.sh.

Ramana


+1 Yes, we need this. Currently, a PR takes 6-7 hours to run including waiting times. This is a development bottleneck.

The main bottleneck so far has sometimes been the over-saturation of CI nodes, in which case running tests in parallel won’t help as much. We might need to look into adding more machines. But I agree that for local test runs, having a parallel option would help.

In my past experience with large projects, CI queuing time inevitably grows into the hours range. The solution is usually to do most of the testing locally before sending changes upstream, and to stage changes in a way that merging can be batched. As long as we do not use CI as a test server (and test actively locally), the key bottleneck is still getting PRs reviewed properly.

That makes sense. Is it possible to have a sanity test script that generates a message (a hash or something similar) that we can paste when we push a PR? Not pasting the message would not block the PR. But, indirectly, it would nudge us into the good habit of running tests locally first.

Here is my current flow:

  • Try to write unit tests for all the code I wrote
  • Make sure unit tests and lint pass, run local tests
  • Hmm, perhaps time to try the CI server
  • Oops, test_x failed
  • Reproduce locally via docker/bash.sh, fix the tests
  • Update the PR again

It is inevitable that people will just commit a lot while work is still in progress. To reduce the cost, I think a better solution would be to:

  1. Configure Jenkins to automatically run only the very basic checks (e.g. lints) on each commit;
  2. Require committers who approve a PR to manually trigger the heavy CI.

It would be very helpful in this discussion to know the number of cores and the number of executors per machine. Is there a way that we can get this information?

http://ci.tvm.ai:8080/job/tvm/ shows the executors and machines, though I’m not sure if some of the machines listed are actually the same machine (e.g. node.tvmci0.cuda0, node.tvmci0.cuda1 and node.tvmci0.cudabuild - is that 3 machines or 1, and 2 GPUs or 1 split-up one?), and there are no details listed for the machines (e.g. how many cores?).

I hear that aws.g4.n0.cuda0 has 2 cores, and only 1 executor, so there is potential there for speed-up. Do we know what happens if two tests both attempt to access the single GPU on a system at the same time? Do they take turns to use the GPU, do they run at the same time (with unpredictable results like OOM), does one error out saying the GPU is busy, or does something else happen? (this is relevant since I expect most GPU tests to only use the GPU for a smaller part of their runtime)

For now we assign one worker with one executor to each GPU, in order to make sure GPU workloads have sole access to the card and avoid interference, as CUDA is really bad at multi-process support.

CPU workers can have more than one executor.

Thanks. So node.tvmci0.cuda0, node.tvmci0.cuda1 and node.tvmci0.cudabuild are all the same machine, exposing the first GPU as one executor, the second GPU as another executor, and 4 CPU executors?

How many cores and hardware threads does this machine have?

The machine itself has 16 hardware threads. We keep the CPU executor count at 4 so that each worker can still make use of some multi-threading without interfering with the GPU workers. One could argue for increasing the number of CPU workers.

The main bottleneck of the CI pipeline is the availability of GPU workers, so adding CPU workers won’t help.

Separating out the GPU tests in a more fine-grained way might help to some extent, by skipping CPU-related tests in the GPU build, but perhaps it is not worth the effort (assuming we get 2x at most).

So the 16 hw threads are 8 cores with hyperthreading, giving 16 hardware threads?

When running a test that uses the GPU, the CPU has to do a lot: compile for the GPU, generate test inputs randomly (this can be slow), compute a reference result (which a good GPU should be faster at), and compare the reference and real outputs, along with anything else the test does on the CPU. The GPU has to transfer the data (though ideally that is pipelined away) and run the program. It is quite likely that the CPU load exceeds the GPU load by quite a bit, especially if the tests have not been optimized on the CPU, which appears to be the case. A measurement would have to be made, but I would not be surprised if the GPU tests are actually CPU bound at a 1:4 or 1:8 ratio of GPUs to CPU cores. That suggests a potential >4x speedup of the GPU tests on the same hardware, versus the 1:1 GPU:CPU ratio implied by no parallelization.

This would require using something like pytest-parallel or pytest-shard (we probably want both) and having them coordinate access to the GPU(s), e.g. via a mutex or something more advanced, where the mutex would only be held while the GPU is in use. That’s ideal but is probably a larger change. Nvidia has support for virtualizing GPUs, so that might be an option that’s less ideal but easier. I wonder if there isn’t a better third option.
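To make the mutex idea concrete, here is a minimal sketch of what a cross-process GPU lock could look like as a pytest fixture. It assumes the third-party filelock package and a fixture name (gpu_lock) that don’t exist in TVM today:

```python
import pytest
from filelock import FileLock  # assumption: filelock is available in the CI image

@pytest.fixture
def gpu_lock(tmp_path_factory):
    # A file lock works across the separate worker processes spawned by
    # pytest-xdist/pytest-parallel, unlike an in-process threading.Lock.
    lock_path = tmp_path_factory.getbasetemp().parent / "gpu0.lock"
    with FileLock(str(lock_path)):
        # Held for the whole test body here; ideally a test would only hold
        # the lock around the actual device-side work.
        yield

def test_cuda_something(gpu_lock):
    # compile, upload and run on the GPU while holding the lock
    pass
```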

At the moment, I tend to opt for a no-op if the change would mean a drastic change to the test infra.

We are already doing some of the CPU-side optimizations; for example, all the test data are memoized, which saves around 2x of the cost.

Having processes share GPUs inevitably causes problems like slow-down of the job (possibly due to problems in the Nvidia driver) and eating all the GPU memory so that other jobs go OOM (in the case of the TF runtime). That leads me to believe we should not go down that route, as a racing GPU program can cause unexpected failures, which can be more annoying than a bit longer queuing time.

I certainly agree that a test strategy based on race conditions is a poor idea. GPUs should be shared in a safe way, not an unsafe one. There are some examples above of how that can be done, e.g. a mutex makes the GPU shared across time, but not shared in the sense of two tests running on the same GPU at the same time. Nvidia GPU virtualization makes the sharing well defined.

My concern is that the future ought to hold an expansion of testing for TVM, while at the same time tests already take quite a long time to run. The bigger question is how that will be resolved.

The use of a mutex would be a bit too intrusive, as we are also running tutorials, which cannot do explicit locks. NV GPU virtualization might be interesting if there were some experiments on it.

A followup note, after thinking more about this topic.

When we eventually need to face the GPU multiplexing problem (and I don’t think we have reached that point yet), we already have a solution: the RPC tracker.

The current TVM RPC tracker already does resource multiplexing. By having each worker run jobs remotely through the tracker, the GPU will be time-sliced and shared across multiple workers.
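For reference, from the client side a tracker request looks roughly like this (the host, port, and device key below are placeholders, not the actual CI setup, and the exact device-handle API names depend on the TVM version):

```python
import numpy as np
import tvm
from tvm import rpc

# Connect to a running tracker and ask for any worker registered under the
# "cuda" key; the call blocks until one is free, which is what gives the
# time-slicing behaviour described above.
tracker = rpc.connect_tracker("tracker-host", 9190)
remote = tracker.request("cuda", priority=1, session_timeout=60)

# Use the remote GPU much as if it were local.
ctx = remote.context("cuda", 0)
a = tvm.nd.array(np.zeros(16, dtype="float32"), ctx)
```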

It would require quite some change to the test infra, though, which I think we should defer to a later point when it really becomes necessary.

I believe that there are many reasons beyond parallelism to have tests that run graphs be expressed as just providing the graph (inputs can be random or provided, outputs can be derived from a reference backend or provided), instead of having each test hook all of that up itself, as now. At that point there is a common place to hook in any desired general change to how tests run, be that running a test against a remote service like Neo, running against a model inference server, doing test augmentation (e.g. changing layouts or compiler settings), using a mutex for the device, or a wealth of other options; the magic is that all such features can be implemented independently of the tests themselves. Tests would be a lot shorter and simpler that way, too. I believe that we need the same for benchmarks, which at that point become more similar to regular tests. That is getting somewhat afield from the topic here, though it is relevant as it is the place where parallelism becomes easy to plug in even with devices in the picture.
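As a purely hypothetical illustration of that shape (check_graph and its arguments are made-up names, and plain numpy stands in for the compile-and-run path):

```python
import numpy as np

def check_graph(build_fn, input_shapes, reference_fn):
    # One central place to decide how tests run: which targets are exercised,
    # whether a device lock is taken, whether execution goes through an RPC
    # session, which layouts or compiler settings to try, and so on.
    inputs = [np.random.uniform(size=s).astype("float32") for s in input_shapes]
    expected = reference_fn(*inputs)
    actual = build_fn(*inputs)  # placeholder for "compile the graph and run it"
    np.testing.assert_allclose(actual, expected, rtol=1e-5)

def test_dense():
    # The test only declares the computation and its shapes.
    check_graph(
        build_fn=lambda x, w: x @ w.T,        # placeholder for the TVM path
        input_shapes=[(1, 16), (8, 16)],
        reference_fn=lambda x, w: np.dot(x, w.T),
    )
```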

In any case, I’m hearing a decision that pytest-parallel or something like pytest-shard will not be accepted into TVM at this time, but maybe sometime later when test load is greater than today. Is that accurate?

OK, let me try to summarize the discussion:

  • A1: Currently the main bottleneck of the CI server is GPU utilization.
  • A2: Because of the GPU racing condition, enabling pytest-parallel will lead to potential errors; a solution (e.g. a mutex on the GPU) is needed, which will introduce additional engineering complexity.
  • A3: The concern has to do with both memory usage (TF eats all memory) and device lockup.

This means that the current proposal as it is – directly enabling pytest-parallel – won’t help alleviate the situation atm. On the contrary, we cannot do it as-is on GPU tests.

However, I am not against optimizing the test infrastructure in general in a more organized way. In fact, we have already tried many of the proposals in this thread. For example, we already use memoization to speed up the test data generation: https://github.com/apache/incubator-tvm/blob/master/topi/tests/python/test_topi_bitserial_dense.py#L43

While the current proposal should not be accepted as it is, we do hope to see future proposals. For example, explicitly mark the device-related section of the test code (which might allow automatic injection of the mutex as you mentioned), or provide a user-agnostic session that gives you a (possibly remote) GPU while requiring minimal changes to what users would write. Note that such a solution has to be all-opt-in (we need to convert all the test cases into that style), which I think would be worth the effort, but it needs a more careful discussion of the design choices.
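As a strawman, such a session could combine the mutex and RPC sketches from earlier in the thread. Every name below is hypothetical, and the device-handle calls reflect one particular TVM version:

```python
import os
from contextlib import contextmanager

import tvm
from filelock import FileLock  # assumption, as in the earlier sketch

@contextmanager
def gpu_session():
    # The harness, not the test, decides how the GPU is obtained.
    if os.environ.get("TVM_TEST_USE_RPC"):  # hypothetical switch
        from tvm import rpc
        tracker = rpc.connect_tracker("tracker-host", 9190)
        remote = tracker.request("cuda", session_timeout=60)
        yield remote.context("cuda", 0)
    else:
        with FileLock("/tmp/tvm-ci-gpu0.lock"):
            yield tvm.gpu(0)

def test_something_on_gpu():
    with gpu_session() as ctx:
        # compile, upload and run against ctx here
        pass
```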

Let us open another RFC for these followup discussions :)

After looking into this thread again, I find that the communication could have been handled in a better way :)

In each part of the thread, we are discussing motivations (the need to improve tests), solutions (pytest-parallel, mutex, virtualization, remote), and problems (GPU racing, tutorials cannot be changed, Amdahl’s law).

In most cases we are “arguing” about things we already agree on. For example, we all agree that improving the test infra is a good thing. We also mostly agree on possibly useful solutions (e.g. a mutex).

The only disagreement is about how (the solution): I was pointing out that directly adding mutex code to the GPU sections may be bad taste in terms of engineering. However, I do think having a clear device annotation (which hides the GPU mutex code in the annotator) would help.

Finally, there is the assessment, only with respect to that narrowed combination: the pytest-parallel solution is not good enough to solve our problems and can introduce additional racing atm.

Hope this clarifies the discussion. It also reminds us that in the future it would be better to dissect the discussions into these components.

Agreed that clear communication is valuable.

We also have C++ tests in the critical path on CI and locally. It’s just a bit surprising, since they are embedded into the Jenkins stage called “build”, but that stage actually spends most of its time testing on the CPU, e.g. here taking half an hour on testing:

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-4277/1/pipeline/36

The rest of the test pipeline cannot proceed before this stage is done, so for this stage CPU tests are the bottleneck. After linting, this is the stage most likely to yield errors on changes that have problems, so there is a greater benefit to fast results here compared to later stages in CI. Also, as I think is pointed out above, locally run tests are often CPU tests. So I think that speaks in favor of using pytest-parallel on CPU tests, and then we can consider expanding to other devices later.
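A low-effort way to make that split might be a pytest marker that keeps device tests out of the parallel run; the "gpu" marker name below is an assumption, not an existing TVM convention (it would also need registering in pytest.ini to avoid warnings):

```python
import pytest

@pytest.mark.gpu
def test_conv2d_cuda():
    ...  # runs serially, GPU stage only

def test_topi_dense_cpu():
    ...  # safe to run under pytest -n

# The CPU stage could then run something like
#   pytest -n auto -m "not gpu" tests/python
# while the GPU stage keeps running
#   pytest -m gpu tests/python
# serially, as it does today.
```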

Though I wonder what’s up with the machine load, since in some other runs this takes less time (but is still the bottleneck), e.g. here:

http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-4335/5/pipeline/36

This could be due to an overloaded machine, in which case parallelism is not helpful on that machine at that time, though it could also just be due to landing on different machines with different performance (is there some way to tell which machine ran a job in Jenkins? Nothing obvious jumped out at me.)

What do you think?

There is definitely interference between CPU workers (sometimes resulting in a longer build time). The reason to run light unit tests at the build stage is exactly to try to catch some obvious errors in the build (e.g. symbols not available).

I am not familiar enough with Jenkins to tell which machine ran the job. Perhaps we could do that by changing the Jenkinsfile config.

We can change CPU test cases to pytest-parallel, as long as we discuss the implications for interference and thread safety, especially in the case of a fully swarmed pipeline. In particular:

  • What will happen if multiple workers all run pytest-parallel? Should we restrict the maximum parallelism to bound the worst case?
  • Some of the test cases will use multiple threads, e.g. in the case of GEMM (although we have changed the BIND_THREADS setting, so that could be fine).

Looking into http://ci.tvm.ai:8080/job/tvm/job/master/, most of the time cost of the build stage is around 14 to 20 min when the machine is not overloaded, but it can grow to 30 min when the machines are overloaded. It would be nice to have ways to conduct some controlled experiments to see the implications.