This is the second post on using OpenCL on the Chromebook ARM. The previous one gives instructions for installing the OpenCL drivers and SDK on the Samsung Chromebook ARM using crouton, without having to boot a separate Ubuntu installation. This post compares the OpenMP and OpenCL performance of the Chromebook ARM with that of a 4-year-old laptop.
In these tests, I compare my 4-year-old Dell laptop with the Samsung ARM Chromebook. It's obviously not a very fair comparison: the laptop is quite obsolete now (and will actually be replaced soon). On the other hand, the Samsung ARM is a budget device with ridiculously low power consumption.
| | Dell Latitude E6400 | Samsung Chromebook ARM |
|---|---|---|
| CPU | Intel Core2 Duo T9950 @ 2.66 GHz | Samsung Exynos 5 Dual (5250) (Cortex A15; 1.7 GHz dual-core CPU) |
| RAM | 4 GB | 2 GB |
| GPU | NVIDIA Corporation G98M (Quadro NVS 160M), 256 MB dedicated RAM | ARM Mali T604 |
| OS | | Ubuntu 12.04 in crouton |
As a benchmark test suite, we use Rodinia, a CUDA/OpenMP/OpenCL test suite from the University of Virginia.
The test suite does not compile unmodified, and for some OpenCL tests, the number of threads needs to be reduced to fit in the limited memory of both computers. Complete instructions and patches can be found in my github repository.
Test results - OpenMP
First, we compare the two machines using OpenMP, which only makes use of the CPU. We expect the Intel laptop to be far superior, and this is what we get:
Test results - OpenCL
We can then compare the GPUs, using OpenCL tests in Rodinia:
For some reason, I could not get the OpenCL code to compile on the Samsung ARM for LavaMD (it fails with CL_INVALID_KERNEL_ARGS), but I didn't try very hard. Let me know if you find a way! It would also be interesting to figure out why ParticleFilter is so slow.
Test results - correctness
The benchmark timings need to be taken with a bit of caution, in case the results are garbled. Some tests do not produce any output, so it's hard to tell if the computation is correct. On the other hand, hotspot produces some output that can be plotted:
As you can see, the results look identical in all cases.
To analyse the differences more precisely, we can measure an average numerical error between the outputs x and y of 2 different implementations, as follows:
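As an illustration only, here is a minimal sketch of one common way to define such an average error: the mean relative difference between paired output values. This is an assumption, not necessarily the exact metric used for the numbers in this post; the function name `mean_relative_error` and the `eps` guard are hypothetical choices made for this example.

```python
def mean_relative_error(x, y, eps=1e-12):
    """Average of |x_i - y_i| / max(|x_i|, |y_i|, eps) over paired values.

    Assumed metric for illustration; eps avoids dividing by zero when
    both values are exactly 0.
    """
    if len(x) != len(y):
        raise ValueError("outputs must have the same length")
    if len(x) == 0:
        raise ValueError("outputs must not be empty")
    total = 0.0
    for a, b in zip(x, y):
        scale = max(abs(a), abs(b), eps)
        total += abs(a - b) / scale
    return total / len(x)
```

With this definition, two identical outputs give an error of exactly 0, and the result is independent of the units of the output values, which makes it easier to compare tests whose outputs have very different magnitudes.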
And a summary of these errors, for 2 of the tests, where some output is created:
As you can see, running the OpenMP code on ARM and x86 gives identical results. The OpenCL results are also very close. That basically means the comparisons above between the 2 laptops are fair.
When it comes to differences between OpenMP and OpenCL code, the HotSpot test shows good agreement between the 2 versions. On the other hand, the CFD test outputs very different results between the 2 implementations. This is worrisome, as CFD is one of the tests that shows the most improvement using OpenCL compared to OpenMP...
Test results - overall
The next graph shows all the results aggregated.
Assuming OpenCL and OpenMP implementations give similar results (which is actually doubtful in some cases, see the previous section), running OpenCL code on the Samsung Chromebook ARM can help a lot in terms of performance: On the Dell laptop, using OpenCL improves performance by a factor of 2.25 on average. On the Chromebook ARM, the ratio is 4.1! And this is without any attempt at optimizing the code for the Mali architecture, which is quite different from a normal GPU (in particular, it has no local memory, so data does not need to be copied back and forth).
I'd also like to try some real applications. My next project is to get darktable (a RAW photo developer) running. Do let me know if you have some real applications using OpenCL! I'll follow up with another post if I get them to work.
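For readers who want to reproduce this kind of summary from their own runs, here is a minimal sketch of how an average speedup can be computed from per-test timings. The timings and test names below are hypothetical placeholders, not measurements from this post, and whether an arithmetic or geometric mean is the better summary is a judgment call; both are shown.

```python
from statistics import geometric_mean, mean

def speedups(openmp_seconds, opencl_seconds):
    """Per-test ratio OpenMP time / OpenCL time, for tests present in both."""
    return {
        name: openmp_seconds[name] / opencl_seconds[name]
        for name in openmp_seconds
        if name in opencl_seconds
    }

# Hypothetical timings in seconds (placeholders, not real measurements).
openmp = {"test_a": 8.0, "test_b": 3.0}
opencl = {"test_a": 2.0, "test_b": 3.0}

ratios = speedups(openmp, opencl)
arithmetic = mean(list(ratios.values()))
geometric = geometric_mean(list(ratios.values()))
print(ratios, arithmetic, geometric)
```

The geometric mean is less sensitive to a single test with a very large speedup, so the two averages can differ noticeably when the per-test ratios are spread out.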