SSSE3
SSSE3 (Supplemental Streaming SIMD Extensions 3) is Intel’s name for the fourth iteration of the SSE instruction set. It adds 16 new instructions, each of which is also available as an MMX extension operating on the __m64 intrinsic data type. SSSE3 was introduced with Intel’s Core microarchitecture. SSSE3 intrinsic functions are available in Visual C [1] and Intel C [2].
SSSE3 instructions were not available on AMD processors until Bulldozer, which also implements SSE4 and AVX.
PH(ADD/SUB)W
Packed Horizontal Add/Subtract Words. Each of the eight 16-bit (short) results is the sum or difference of a pair of adjacent elements of the input operands: the low four results come from the first operand, the high four from the second. Saturating versions, PHADDSW and PHSUBSW, are also available.
The main drawback of these instructions is that they decode into multiple micro-ops on most CPUs and are therefore slow, so alternative instruction sequences are almost always faster.
Intrinsic
__m128i _mm_hadd_epi16 (__m128i a, __m128i b);
Pseudocode
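As a rough illustration of the lane arithmetic described above, here is a scalar reference model in plain C. It is a sketch, not the intrinsic itself: the function name `phaddw_model` and the array-based interface are assumptions made for this example, with each array standing for the eight 16-bit lanes of a 128-bit operand.

```c
#include <stdint.h>

/* Scalar model of PHADDW (_mm_hadd_epi16); hypothetical helper, not a library call.
   a and b hold the eight 16-bit lanes of the two operands.
   r must not alias a or b. */
void phaddw_model(const int16_t a[8], const int16_t b[8], int16_t r[8])
{
    for (int i = 0; i < 4; ++i) {
        /* Low four results: adjacent pairs of the first operand.
           The sum wraps modulo 2^16; PHADDSW would saturate instead. */
        r[i]     = (int16_t)(uint16_t)((uint16_t)a[2 * i] + (uint16_t)a[2 * i + 1]);
        /* High four results: adjacent pairs of the second operand. */
        r[i + 4] = (int16_t)(uint16_t)((uint16_t)b[2 * i] + (uint16_t)b[2 * i + 1]);
    }
}
```

The conversion back to `int16_t` assumes the usual two's-complement wraparound that mainstream compilers provide.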
PMADDUBSW
Packed Multiply and Add multiplies a vector of 16 unsigned bytes (unsigned char) by a vector of 16 signed bytes (the operation is not commutative! [3]). Each pair of adjacent products is added, and the sums, saturated to signed 16 bits, are stored as a vector of eight signed shorts.
Intrinsic
__m128i _mm_maddubs_epi16 (__m128i a, __m128i b);
Pseudocode
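The multiply-add with saturation can be sketched as a scalar reference model in plain C. The names `pmaddubsw_model` and `saturate16` are hypothetical helpers introduced for this example; the arrays stand for the byte and word lanes of the 128-bit operands.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Scalar model of PMADDUBSW (_mm_maddubs_epi16); hypothetical helper.
   a: 16 unsigned bytes (first operand), b: 16 signed bytes (second operand). */
void pmaddubsw_model(const uint8_t a[16], const int8_t b[16], int16_t r[8])
{
    for (int i = 0; i < 8; ++i) {
        /* Each unsigned-by-signed product fits in 16 bits. */
        int32_t p0 = (int32_t)a[2 * i]     * b[2 * i];
        int32_t p1 = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Only the sum of the two adjacent products can overflow,
           so saturation is applied to the sum. */
        r[i] = saturate16(p0 + p1);
    }
}
```

Swapping the operands changes which bytes are read as unsigned, which is why the operation is not commutative.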
PMULHRSW
Packed Multiply High with Round and Scale is an instruction designed for fixed-point math. It is similar to the existing pmulhw, but shifts the 32-bit product right by only 15 bits instead of 16 and adds a rounding term before the final shift.
Intrinsic
__m128i _mm_mulhrs_epi16 (__m128i a, __m128i b);
Pseudocode
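The rounding and scaling step can be sketched as a scalar reference model in plain C. The name `pmulhrsw_model` is a hypothetical helper for this example, and the code assumes that right-shifting a negative `int32_t` is an arithmetic shift, as it is on mainstream compilers.

```c
#include <stdint.h>

/* Scalar model of PMULHRSW (_mm_mulhrs_epi16); hypothetical helper.
   Treats each lane as a Q15 fixed-point number. */
void pmulhrsw_model(const int16_t a[8], const int16_t b[8], int16_t r[8])
{
    for (int i = 0; i < 8; ++i) {
        int32_t product = (int32_t)a[i] * b[i];
        /* Shift by 14, add 1 to round to nearest, then drop the last bit:
           a total right shift of 15 with rounding. */
        r[i] = (int16_t)(((product >> 14) + 1) >> 1);
    }
}
```

For example, with Q15 inputs 16384 (0.5) and 16384 (0.5), the model yields 8192 (0.25).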
PSHUFB
Packed Shuffle Bytes is a very powerful instruction that performs a fast, arbitrary byte shuffle of a register. An output byte is set to zero, instead of being selected from the input, when the most significant bit of its control byte is set. The SSSE3 version of Hyperbola Quintessence uses PSHUFB to perform byte swaps. There are other interesting applications as well, such as the SSSE3 Population Count. VPPERM from XOP is an even more powerful variant of this instruction.
Intrinsic
__m128i _mm_shuffle_epi8 (__m128i a, __m128i b);
Pseudocode
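The byte selection and zeroing rule can be sketched as a scalar reference model in plain C. The name `pshufb_model` is a hypothetical helper for this example; `a` stands for the source bytes and `mask` for the control bytes of the second operand.

```c
#include <stdint.h>

/* Scalar model of PSHUFB (_mm_shuffle_epi8); hypothetical helper.
   r must not alias a. */
void pshufb_model(const uint8_t a[16], const uint8_t mask[16], uint8_t r[16])
{
    for (int i = 0; i < 16; ++i) {
        if (mask[i] & 0x80)
            r[i] = 0;                  /* bit 7 set: output byte is zeroed */
        else
            r[i] = a[mask[i] & 0x0F];  /* low four bits select the source byte */
    }
}
```

A control vector of 15, 14, ..., 0 reverses the byte order of the register, which is the kind of byte swap used on bitboards.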
PSIGN
Multiplies each element of the first vector by the sign function {-1, 0, 1} of the corresponding element of the second vector. The instruction is available for signed bytes (8-bit char, psignb), signed words (16-bit short, psignw) and doublewords (32-bit int, psignd). If both input operands are equal, the result is a vector of absolute values, though PABS is probably preferable for this purpose.
Intrinsic
__m128i _mm_sign_epi8 (__m128i a, __m128i b);
__m128i _mm_sign_epi16 (__m128i a, __m128i b);
__m128i _mm_sign_epi32 (__m128i a, __m128i b);
Pseudocode
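The sign multiplication can be sketched as a scalar reference model in plain C, shown here for the 16-bit variant only. The name `psignw_model` is a hypothetical helper for this example; the byte and doubleword variants would follow the same pattern with different lane widths.

```c
#include <stdint.h>

/* Scalar model of PSIGNW (_mm_sign_epi16); hypothetical helper. */
void psignw_model(const int16_t a[8], const int16_t b[8], int16_t r[8])
{
    for (int i = 0; i < 8; ++i) {
        if (b[i] < 0)
            /* Negate with wraparound, so -32768 stays -32768. */
            r[i] = (int16_t)(uint16_t)(0u - (uint16_t)a[i]);
        else if (b[i] == 0)
            r[i] = 0;
        else
            r[i] = a[i];
    }
}
```

Passing the same array as both inputs gives absolute values, except that -32768 has no positive counterpart and is returned unchanged.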
Applications
SSSE3 Population Count
In 2008, Wojciech Muła introduced an SSSE3 population count that uses a pair of PSHUFB instructions as 16 parallel in-register nibble lookups [4]. With AVX2 and AVX-512, the same approach has since become possible at twice and four times the register width, making it competitive with the native x86-64 popcount instruction [5].
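The nibble-lookup idea can be illustrated with a scalar sketch in plain C. This is not Muła's code: the function name `popcount128_nibble_lut` and the byte-array interface are assumptions for this example, and the loop performs one byte at a time what a PSHUFB-based version would do for all 16 bytes in parallel.

```c
#include <stdint.h>

/* Bit counts of the 16 possible nibble values.
   In a SIMD version this table would sit in a register and serve
   as the source operand of PSHUFB. */
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Population count of a 128-bit value given as 16 bytes; hypothetical helper. */
unsigned popcount128_nibble_lut(const uint8_t x[16])
{
    unsigned total = 0;
    for (int i = 0; i < 16; ++i) {
        uint8_t lo = x[i] & 0x0F;  /* index for the first lookup  */
        uint8_t hi = x[i] >> 4;    /* index for the second lookup */
        total += nibble_popcount[lo] + nibble_popcount[hi];
    }
    return total;
}
```

Because every nibble index is below 16, it is always a valid PSHUFB control byte, which is what makes the 16-entry table fit the instruction so well.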
Byte-wise Dot-Product
This SSSE3 dot product multiplies a vector of 64 unsigned chars by a vector of 64 signed chars and adds all 64 intermediate signed 16-bit products with saturation. It uses the SSSE3 instructions pmaddubsw and phaddsw and takes 11 SSE instructions in total.
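The arithmetic of such a dot product can be sketched in scalar C. The function name `dot64_u8s8_sat` and the pairwise reduction order are assumptions for this example and do not reproduce the 11-instruction sequence. With saturating adds the result can depend on the order of summation, so a different reduction order may give a different value once saturation occurs.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Saturating dot product of 64 unsigned bytes with 64 signed bytes;
   hypothetical helper with an assumed reduction order. */
int16_t dot64_u8s8_sat(const uint8_t a[64], const int8_t b[64])
{
    int16_t s[32];

    /* PMADDUBSW step: add adjacent products with saturation. */
    for (int i = 0; i < 32; ++i)
        s[i] = sat16((int32_t)a[2 * i] * b[2 * i] +
                     (int32_t)a[2 * i + 1] * b[2 * i + 1]);

    /* PHADDSW-style reduction: saturating sums of adjacent pairs,
       halving the number of partial sums on each pass. */
    for (int n = 32; n > 1; n /= 2)
        for (int i = 0; i < n / 2; ++i)
            s[i] = sat16((int32_t)s[2 * i] + s[2 * i + 1]);

    return s[0];
}
```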
SSSE3 Hyperbola Quintessence
The following routine calculates bishop attacks using Hyperbola Quintessence. Anti-diagonal and diagonal attacks are processed in parallel, one in each half of a 128-bit xmm register, using pre-calculated arrays of bitboard pairs for the line masks and for the single bits and, where required, their reversed counterparts. PSHUFB is used to swap the bytes inside a bitboard [6].
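For orientation, the underlying Hyperbola Quintessence computation can be sketched in scalar C, one line at a time. This is not the SSSE3 routine: the helper names (`byteswap64`, `line_mask_ex`, `line_attacks`, `bishop_attacks`) are assumptions for this example, the line masks are computed on the fly rather than read from pre-calculated arrays, and the square numbering assumes a1 = 0, h1 = 7, h8 = 63. An SSSE3 version would handle both lines at once and replace `byteswap64` with a PSHUFB.

```c
#include <stdint.h>

/* Reverse the byte order of a bitboard, which flips the ranks. */
static uint64_t byteswap64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* All squares on the line through sq in direction (df, dr) and the
   opposite direction, excluding sq itself. */
static uint64_t line_mask_ex(int sq, int df, int dr)
{
    uint64_t mask = 0;
    for (int sign = -1; sign <= 1; sign += 2) {
        int f = (sq & 7) + sign * df;
        int r = (sq >> 3) + sign * dr;
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            mask |= 1ULL << (8 * r + f);
            f += sign * df;
            r += sign * dr;
        }
    }
    return mask;
}

/* Hyperbola Quintessence for one diagonal or anti-diagonal.
   Each rank holds at most one square of the line, so the byte swap
   reverses the order of the squares along the line. */
static uint64_t line_attacks(uint64_t occ, int sq, uint64_t mask_ex)
{
    uint64_t bit     = 1ULL << sq;
    uint64_t forward = occ & mask_ex;        /* blockers on the line */
    uint64_t reverse = byteswap64(forward);  /* same blockers, reversed */
    forward -= bit;                          /* borrow runs up to the first blocker */
    reverse -= byteswap64(bit);              /* same in the opposite direction */
    forward ^= byteswap64(reverse);
    return forward & mask_ex;
}

uint64_t bishop_attacks(uint64_t occ, int sq)
{
    return line_attacks(occ, sq, line_mask_ex(sq, 1, 1))    /* diagonal */
         | line_attacks(occ, sq, line_mask_ex(sq, 1, -1));  /* anti-diagonal */
}
```

The returned set includes the first blocker in each direction, so captures of own pieces still have to be masked out by the caller.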
Intrinsic Version
Peshkov’s Optimization
The pioneer of Hyperbola Quintessence, Aleks Peshkov, applies a more sophisticated, optimized C++ approach. It further exploits the disjointness of ray attacks and the facts that xor is its own inverse and distributes over mirroring or flipping. Once per node he instantiates an occupied object, based on a type-safe curiously recurring template pattern (also known as the Barton–Nackman trick) that encapsulates the SSE intrinsic data types. The object keeps a twin of normal and flipped occupancy as the basis for subsequent file, diagonal or anti-diagonal attack generation, which then requires only one final shuffle per piece attack getter, whether for a bishop or even a queen.
See also
External Links
- SSSE3 from Wikipedia
- SSEPlus Project Documentation
- Agner’s CPU blog by Agner Fog
- SSSE3: fast popcount by Wojciech Muła, May 24, 2008 » Population Count
- Intel Intrinsics Guide
References
- [1] Supplemental Streaming SIMD Extensions 3 Instructions
- [2] Intel C++ Compiler for Linux Intrinsics Reference (pdf)
- [3] fixing SSSE3 pmaddubsw description from the mail archive of the [] mailing list for the GCC project
- [4] SSSE3: fast popcount by Wojciech Muła, May 24, 2008
- [5] Wojciech Muła, Nathan Kurz, Daniel Lemire (2016). Faster Population Counts Using AVX2 Instructions. arXiv:1611.07612
- [6] Re: Plain and fancy magic on modern hardware by Robert Purves, CCC, August 23, 2010