XGBoost's CUDA acceleration

It seems XGBoost supports GPU acceleration via CUDA (9?) by setting tree_method to 'gpu_hist' in xgb_params.

In xgboost_cost_model.py I added 'tree_method': 'gpu_hist' and ran a few tests (16-core CPU, GTX 1080).
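For reference, the change boils down to setting tree_method in the parameter dict passed to xgb.train. The snippet below is a standalone sketch with synthetic data (the array sizes and hyperparameters are placeholders, not the values used in the cost model), and it assumes an XGBoost build compiled with CUDA support.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the autotvm feature matrix and throughput labels
# (sizes are assumptions, not the tuner's actual data).
X = np.random.rand(4096, 256).astype(np.float32)
y = np.random.rand(4096).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "max_depth": 3,
    "eta": 0.3,
    "objective": "reg:squarederror",
    # The one-line change discussed above; requires an XGBoost build with
    # CUDA enabled, otherwise training fails at startup.
    "tree_method": "gpu_hist",
}

booster = xgb.train(params, dtrain, num_boost_round=400)
```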

WITH 'gpu_hist'

First run:

[Task 1/42] Current/Best: 178.70/2169.96 GFLOPS | Progress: (256/256) | 901.07 s Done.

Second run:

[Task 1/42] Current/Best: 1669.95/1804.79 GFLOPS | Progress: (256/256) | 904.57 s Done.

WITHOUT 'gpu_hist'

First run:

[Task 1/42] Current/Best: 48.44/1714.60 GFLOPS | Progress: (256/256) | 980.04 s Done.

Second run:

[Task 1/42] Current/Best: 113.77/1672.49 GFLOPS | Progress: (256/256) | 1038.44 s Done.

Even though I only ran each test twice, you can see that 'gpu_hist' does complete a bit faster. I also saw about 2-4% CUDA usage on my GPU while the XGBoost cost model was running. Is this something that should be exposed in the public API, or was there a reason it was excluded?

Can someone else verify the benefit?
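If anyone wants to reproduce this without going through a full tuning run, a rough micro-benchmark like the one below isolates just the XGBoost fit (synthetic data; the matrix size and boosting rounds are guesses at what the cost model sees, not measured from autotvm). It only shows whether 'gpu_hist' speeds up the model training on a given machine, not end-to-end tuning time.

```python
import time

import numpy as np
import xgboost as xgb

# Synthetic regression problem; sizes are an assumption, not taken from autotvm.
X = np.random.rand(20000, 256).astype(np.float32)
y = np.random.rand(20000).astype(np.float32)
dtrain = xgb.DMatrix(X, label=y)

for method in ("hist", "gpu_hist"):
    params = {
        "max_depth": 6,
        "eta": 0.3,
        "objective": "reg:squarederror",
        "tree_method": method,
    }
    start = time.time()
    xgb.train(params, dtrain, num_boost_round=300)
    print(f"{method}: {time.time() - start:.2f} s")
```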

I think you misunderstood the intention. XGBoost is only used to find better parameters for the ops, so the first run can't show the information you need; the graph tuner will actually run those parameters over your RPC target to measure the real compute capability. Am I right, @tqchen?

I'm saying that enabling GPU acceleration computes the jobs faster than using multiple CPU cores. The problem I'm finding is that it hard-crashes when it moves on to the next task.

I don't think that would help a lot, though. GPU acceleration for XGBoost speeds up the cost-model training, but the auto-tuning bottleneck is compilation and measurement, not determining the next batch of candidates, and any such speedup can easily be masked by server load. For example, I once tuned an op for 2,000 trials on a V100: the tuning used to finish in about 2 hours (~3.6 sec/trial on average), but the same task can sometimes take almost 3 hours (~5 sec/trial on average) on the same server.
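To make that concrete, here is a back-of-envelope comparison using the V100 numbers above. The per-batch cost-model fit time and batch size are assumed placeholders, not measurements, but even a generous estimate of the model-fitting overhead ends up far smaller than the run-to-run swing caused by server load.

```python
trials = 2000
fast = 2 * 3600 / trials      # ~3.6 s/trial on a quiet server (from above)
slow = 3 * 3600 / trials      # ~5 s/trial when the server is loaded

batch_size = 64               # candidates measured per cost-model refit (assumption)
fit_sec_per_batch = 1.0       # assumed XGBoost fit time per refit (assumption)

model_overhead = (trials / batch_size) * fit_sec_per_batch  # ~31 s total
load_variance = trials * (slow - fast)                      # ~3600 s swing

print(f"cost-model fitting:  {model_overhead:.0f} s")
print(f"server-load swing:   {load_variance:.0f} s")
```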