2019-04-29 SIMD Integration
Posted in: TerreSculptor, by: dgreen, Tags: SIMD
TerreSculptor Version 2 is currently under development.
This new version includes more than 250 significant new features and major software updates.
Follow the Twitter feed for near-daily updates.
One of the major features being developed for TerreSculptor is the use of processor SIMD instructions.
SIMD is an acronym for “Single Instruction Multiple Data”: a single instruction applies the same mathematical operation, such as ‘add’ or ‘multiply’, to a whole set of data values in parallel.
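As a rough illustration of the idea, the sketch below models a four-lane SIMD add in plain Python. It is a conceptual model only: Python does not issue SIMD instructions here, and the names `LANES` and `simd_add` are hypothetical and not taken from the TerreSculptor code-base.

```python
# Conceptual model only: each loop iteration stands in for ONE SIMD
# instruction that adds LANES pairs of values at once.
LANES = 4  # assumption: four 32-bit floats in one 128-bit register


def simd_add(a, b):
    """Add two equal-length lists, LANES elements per modelled instruction."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    instructions = 0
    for i in range(0, len(a), LANES):
        # One modelled instruction handles a whole group of lanes.
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        instructions += 1
    return out, instructions


result, count = simd_add([1.0] * 8, [2.0] * 8)
print(result, count)
```

A scalar loop would need eight add instructions for the same eight values; the modelled SIMD version needs two.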
What does SIMD mean for TerreSculptor?
TerreSculptor already supports multi-threading on most of its devices and filters, so if you have a quad-core processor, the device or filter will execute approximately four times faster than the single-threaded version of the software.
Note that Hyper-threads result in very little performance gain so they will not be mentioned here.
The SIMD features vary by the processor and typically provide 128-bit and 256-bit parallel registers for math operations.
TerreSculptor internally manages most data as single-precision floating-point values, which are 32 bits each.
A 128-bit SIMD register can therefore perform four floating-point operations simultaneously, for a theoretical four-times (4x) speedup.
A 256-bit SIMD register can perform eight floating-point operations simultaneously, for a theoretical eight-times (8x) speedup.
The introduction of SIMD on many of the TerreSculptor devices and filters on most current processors will result in as much as a four-times to eight-times performance boost.
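The four and eight figures follow directly from dividing the register width by the 32-bit float size. A minimal sketch of that arithmetic, with the hypothetical helper name `lanes`:

```python
FLOAT_BITS = 32  # single-precision float size used for most TerreSculptor data


def lanes(register_bits, element_bits=FLOAT_BITS):
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits


for bits in (128, 256):
    print(f"{bits}-bit register: {lanes(bits)} floats per instruction")
```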
Over the coming months SIMD will be implemented in the majority of TerreSculptor functions and methods, including arrays, devices (generators and modifiers), and filters.
On a 6-core AVX2-equipped processor, multi-threading plus SIMD should in theory run an algorithm up to 48 times (6 cores × 8 floats per register) faster than its single-threaded execution.
A few generalized notes about the processor SIMD registers:
SIMD Register Sizes
|Instruction Set|Register Size|
|AVX2|256-bit (Haswell and later)|
Processor Family SIMD Support
Note: this is a general list and is not complete; see Intel's documentation for your specific processor.
|Processor|Microarchitecture|SIMD Support|
|Intel iX-2000|Sandy Bridge|SSE4.1 SSE4.2 AVX|
|Intel iX-3000|Ivy Bridge|SSE4.1 SSE4.2 AVX|
|Intel iX-4000|Haswell|SSE4.1 SSE4.2 AVX2|
|Intel iX-5000|Broadwell|SSE4.1 SSE4.2 AVX2|
|Intel iX-6000|Skylake / Broadwell-E|SSE4.1 SSE4.2 AVX2|
|Intel iX-7000|Kaby Lake|SSE4.1 SSE4.2 AVX2|
|Intel iX-8000|Coffee Lake|SSE4.1 SSE4.2 AVX2|
|Intel iX-9000|Coffee Lake|SSE4.1 SSE4.2 AVX2|
Initial testing and benchmarking have already begun in the TerreSculptor code-base.
Executing an SSE/AVX SIMD version of the Altitude Center algorithm results in roughly a threefold (about 300%) performance boost.
This specific test was conducted on a Windows 7 computer system with an i7-2600K processor.
Additional benchmarking on other systems will be conducted.
This test was performed on an 8192×8192 floating-point datamap (heightmap).
In the application logs, the Altitude Top, Altitude Bottom, and slower Altitude Center benchmark values of 300ms+ are from the current single-threaded code.
The 101ms benchmark value is with just the 128-bit SSE/AVX SIMD, which performs four operations simultaneously.
This value is expected to be almost halved again with AVX2, and then reduced to about a quarter of that on a quad-core processor once this algorithm is multi-threaded.
Overall, that would take the algorithm from ~300ms down to ~75ms on a quad-core processor with multi-threading alone, and down to ~10ms on a quad-core processor with AVX2 support.
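One way to arrive at figures close to these is to divide the single-threaded time by the number of cores and SIMD lanes. The sketch below assumes perfect scaling, which real hardware will not quite reach, and `projected_ms` is a hypothetical helper name.

```python
def projected_ms(baseline_ms, cores=1, lanes=1):
    """Ideal time if the work divides perfectly across cores and SIMD lanes."""
    return baseline_ms / (cores * lanes)


baseline = 300.0  # approximate single-threaded time from the benchmark above

quad_core = projected_ms(baseline, cores=4)               # multi-threading only
quad_core_avx2 = projected_ms(baseline, cores=4, lanes=8)  # plus 8-wide AVX2

print(f"quad-core: {quad_core:.1f} ms")
print(f"quad-core + AVX2: {quad_core_avx2:.1f} ms")
```

Under these assumptions the projections come out at 75 ms and about 9.4 ms, in line with the ~75ms and ~10ms estimates.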
Additional preliminary performance tests have been completed using AVX2 instructions with 256-bit registers, allowing for 8 parallel operations.
The 8192×8192 floating-point datamap altitude modification that takes ~300ms single-threaded takes only ~45ms using AVX2 SIMD.
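The measured times can be compared against the ideal lane counts with a little arithmetic. The sketch below uses the approximate timings quoted in this post (~300ms, 101ms, ~45ms); the helper names `speedup` and `lane_efficiency` are hypothetical.

```python
def speedup(baseline_ms, simd_ms):
    """How many times faster the SIMD run is than the baseline."""
    return baseline_ms / simd_ms


def lane_efficiency(baseline_ms, simd_ms, lanes):
    """Fraction of the ideal lanes-times speedup that was achieved."""
    return speedup(baseline_ms, simd_ms) / lanes


BASELINE_MS = 300.0  # approximate single-threaded time

for label, ms, n in (("SSE 128-bit", 101.0, 4), ("AVX2 256-bit", 45.0, 8)):
    s = speedup(BASELINE_MS, ms)
    e = lane_efficiency(BASELINE_MS, ms, n)
    print(f"{label}: {s:.2f}x speedup, {e:.0%} of the {n}x ideal")
```

Because the baseline is only approximate, the resulting percentages should be read as rough indications rather than precise measurements.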
Additional tests will be completed shortly with full multi-threading plus SIMD on multiple hardware platforms.
I am interested to see how a 12-thread i7 with AVX2 SIMD performs; the theoretical performance gain is 12 × 8, or 96 times faster than single-threaded, although I would expect it to be closer to 56 times faster in practice.
The end goal with all of this work is to have all of the arrays, generators, modifiers, and filters execute at the highest possible speed, since upcoming features in TerreSculptor include the Terrain Stack and a Mask Editor with Filter Stack.
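The gap between the theoretical and expected figures can be expressed as a scaling ratio. A minimal sketch, where `theoretical_speedup` is a hypothetical helper and 56 is simply the estimate quoted above, not a measurement:

```python
def theoretical_speedup(threads, lanes):
    """Ideal speedup if every thread and every SIMD lane scaled perfectly."""
    return threads * lanes


ideal = theoretical_speedup(12, 8)  # 12 threads x 8 floats per AVX2 register
expected = 56                       # rough real-world estimate from the text
ratio = expected / ideal

print(f"ideal {ideal}x, expected {expected}x, ratio {ratio:.0%}")
```

In other words, the estimate assumes a little under 60% of the ideal combined scaling.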
The goal with these features is to have them execute in real-time or near-real-time.
A Mask Editor that displays a grayscale weightmap with a stack of filters such as Brightness, Contrast, Intensity, Normalize, and Gaussian Blur, all executing in near-real-time while the filter parameters are being adjusted, will be a nice feature.