Bit Manipulation Instruction Sets
Bit manipulation instruction sets (BMI sets) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. Their purpose is to improve the speed of bit manipulation. All the instructions in these sets are non-SIMD and operate only on general-purpose registers. Intel published two sets, BMI (here referred to as BMI1) and BMI2, both introduced with the Haswell microarchitecture. AMD published another two: ABM (Advanced Bit Manipulation, which is also a subset of SSE4a; Intel implements its instructions as part of SSE4.2 and BMI1), and TBM (Trailing Bit Manipulation, introduced with Piledriver-based processors as an extension to BMI1 but dropped again in Zen-based processors).[1]
ABM (Advanced Bit Manipulation)
AMD implements ABM only as a single instruction set: every AMD processor supports both instructions or neither. Intel considers POPCNT part of SSE4.2 and LZCNT part of BMI1. POPCNT has a separate CPUID flag; however, Intel uses AMD's ABM flag to indicate LZCNT support (since LZCNT completes the ABM set).[2]
Instruction | Description[3] |
---|---|
POPCNT | Population count |
LZCNT | Leading zeros count |
LZCNT is related to the Bit Scan Reverse (BSR) instruction, but sets the ZF (if the result is zero) and CF (if the source is zero) flags, whereas BSR sets only ZF (if the source is zero). LZCNT also produces a defined result (the source operand size in bits) if the source operand is zero. For a non-zero argument, the sum of the LZCNT and BSR results is the argument bit width minus 1 (for example, if the 32-bit argument is 0x000f0000, LZCNT gives 12 and BSR gives 19).
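The relationship between LZCNT and BSR can be checked with a small portable software model. This is only an illustrative sketch: the function names are hypothetical, the code does not use the hardware instructions, and returning -1 from the BSR model for a zero input is an assumption made here to mark the case the real instruction leaves undefined.

```c
#include <stdint.h>

/* Model of 32-bit LZCNT: number of leading zero bits.
   Returns 32 (the operand size) when the input is zero. */
unsigned lzcnt32_model(uint32_t x)
{
    unsigned n = 0;
    uint32_t probe = UINT32_C(1) << 31;   /* start at the top bit */
    while (probe != 0 && (x & probe) == 0) {
        n++;
        probe >>= 1;
    }
    return n;
}

/* Model of 32-bit BSR: index of the highest set bit.
   Returns -1 for zero input (sketch-only convention; the real
   instruction leaves its destination undefined in that case). */
int bsr32_model(uint32_t x)
{
    int index = -1;
    while (x != 0) {
        x >>= 1;
        index++;
    }
    return index;
}

/* Model of 32-bit POPCNT: number of set bits. */
unsigned popcnt32_model(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For the argument 0x000f0000 used in the text, `lzcnt32_model` gives 12 and `bsr32_model` gives 19, which sum to 31.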
BMI1 (Bit Manipulation Instruction Set 1)
The instructions below are those enabled by the BMI bit in CPUID. Intel officially considers LZCNT as part of BMI, but advertises LZCNT support using the ABM CPUID feature flag.[2] BMI1 is available in AMD's Jaguar,[4] Piledriver[5] and newer processors, and in Intel's Haswell[6] and newer processors.
Instruction | Description[2] | Equivalent C expression[7] |
---|---|---|
ANDN | Logical and not | ~x & y |
BEXTR | Bit field extract (with register) | (src >> start) & ((1 << len) - 1) |
BLSI | Extract lowest set isolated bit | x & -x |
BLSMSK | Get mask up to lowest set bit | x ^ (x - 1) |
BLSR | Reset lowest set bit | x & (x - 1) |
TZCNT | Count the number of trailing zero bits | N/A |
TZCNT is almost identical to the Bit Scan Forward (BSF) instruction, but sets the ZF (if the result is zero) and CF (if the source is zero) flags, whereas BSF sets only ZF (if the source is zero). For a non-zero argument, TZCNT and BSF produce the same result.
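The C expressions in the table can be wrapped in small functions for experimentation. The sketch below is a portable model on 32-bit values, not the hardware instructions; the function names are hypothetical. Two details go beyond the table and are assumptions of this sketch based on the vendor instruction descriptions: the BEXTR model guards against shift counts of 32 or more (where the plain C expression would be undefined), and the TZCNT model returns the operand size for a zero input.

```c
#include <stdint.h>

/* ANDN: ~x & y */
uint32_t andn32(uint32_t x, uint32_t y) { return ~x & y; }

/* BEXTR: extract len bits starting at bit position start. */
uint32_t bextr32(uint32_t src, unsigned start, unsigned len)
{
    if (start >= 32 || len == 0)
        return 0;                        /* nothing left to extract */
    uint32_t shifted = src >> start;
    if (len >= 32)
        return shifted;                  /* field reaches the top bit */
    return shifted & ((UINT32_C(1) << len) - 1);
}

/* BLSI: isolate the lowest set bit, x & -x (unsigned negation). */
uint32_t blsi32(uint32_t x) { return x & (uint32_t)(0u - x); }

/* BLSMSK: mask up to and including the lowest set bit. */
uint32_t blsmsk32(uint32_t x) { return x ^ (x - 1); }

/* BLSR: clear the lowest set bit. */
uint32_t blsr32(uint32_t x) { return x & (x - 1); }

/* TZCNT model: count trailing zero bits; 32 for a zero input. */
unsigned tzcnt32_model(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```

A common idiom combines two of these: looping with `tzcnt32_model(x)` followed by `x = blsr32(x)` visits the index of every set bit from lowest to highest.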
BMI2 (Bit Manipulation Instruction Set 2)
Intel introduced BMI2 together with BMI1 in its line of Haswell processors. Only AMD has produced processors supporting BMI1 without BMI2; BMI2 is supported by AMD's Excavator architecture and newer.[8]
Instruction | Description |
---|---|
BZHI | Zero high bits starting with specified bit position: src & ((1 << inx) - 1) |
MULX | Unsigned multiply without affecting flags, and arbitrary destination registers |
PDEP | Parallel bits deposit |
PEXT | Parallel bits extract |
RORX | Rotate right logical without affecting flags |
SARX | Shift arithmetic right without affecting flags |
SHRX | Shift logical right without affecting flags |
SHLX | Shift logical left without affecting flags |
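The arithmetic performed by the scalar BMI2 instructions can be modelled in portable C, although C has no notion of the flags these instructions leave untouched. The following is a sketch on 32-bit operands with hypothetical function names. It assumes, based on the vendor instruction descriptions rather than on the table above, that BZHI returns the source unchanged when the index is 32 or more and that shift and rotate counts are taken modulo 32.

```c
#include <stdint.h>

/* BZHI: clear all bits at position inx and above. */
uint32_t bzhi32(uint32_t src, unsigned inx)
{
    if (inx >= 32)
        return src;                       /* nothing to clear */
    return src & ((UINT32_C(1) << inx) - 1);
}

/* MULX: full 32x32 -> 64-bit unsigned product.
   Returns the low half and stores the high half through hi. */
uint32_t mulx32(uint32_t a, uint32_t b, uint32_t *hi)
{
    uint64_t product = (uint64_t)a * b;
    *hi = (uint32_t)(product >> 32);
    return (uint32_t)product;
}

/* RORX: rotate right by n (mod 32). */
uint32_t rorx32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* SHRX / SHLX: logical shifts with the count taken mod 32. */
uint32_t shrx32(uint32_t x, unsigned n) { return x >> (n & 31); }
uint32_t shlx32(uint32_t x, unsigned n) { return x << (n & 31); }

/* SARX: arithmetic right shift, written so that it does not rely on
   implementation-defined right shifts of negative values. */
int32_t sarx32(int32_t x, unsigned n)
{
    n &= 31;
    return x < 0 ? ~(~x >> n) : x >> n;
}
```

The out-parameter shape of `mulx32` follows the two-destination nature of the instruction; a struct return would work equally well.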
Parallel bit deposit and extract
The PDEP and PEXT instructions are new generalized bit-level compress and expand instructions. They take two inputs; one is a source, and the other is a selector. The selector is a bitmap selecting the bits that are to be packed or unpacked. PEXT copies selected bits from the source to contiguous low-order bits of the destination; higher-order destination bits are cleared. PDEP does the opposite for the selected bits: contiguous low-order bits are copied to selected bits of the destination; other destination bits are cleared. This can be used to extract any bitfield of the input, and even do a lot of bit-level shuffling that previously would have been expensive. While what these instructions do is similar to bit-level gather-scatter SIMD instructions, PDEP and PEXT instructions (like the rest of the BMI instruction sets) operate on general-purpose registers.[9]
The instructions are available in 32-bit and 64-bit versions. An example using arbitrary source and selector in 32-bit mode is:
Instruction | Selector mask | Source | Destination |
---|---|---|---|
PEXT | 0xff00fff0 | 0x12345678 | 0x00012567 |
PDEP | 0xff00fff0 | 0x00012567 | 0x12005670 |
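The behaviour described above can be reproduced with a bit-by-bit software model. This is a minimal sketch with hypothetical function names; it walks the selector one set bit at a time and is far slower than the hardware instructions, so it is meant for understanding and testing rather than as a replacement.

```c
#include <stdint.h>

/* PEXT model: gather the source bits selected by mask into the
   contiguous low-order bits of the result. */
uint32_t pext32_model(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out_bit = 1;                 /* next low-order result bit */
    while (mask != 0) {
        uint32_t lowest = mask & (uint32_t)(0u - mask);  /* lowest selector bit */
        if (src & lowest)
            result |= out_bit;
        out_bit <<= 1;
        mask &= mask - 1;                 /* drop that selector bit */
    }
    return result;
}

/* PDEP model: scatter the contiguous low-order source bits to the
   positions selected by mask; all other result bits are zero. */
uint32_t pdep32_model(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t in_bit = 1;                  /* next low-order source bit */
    while (mask != 0) {
        uint32_t lowest = mask & (uint32_t)(0u - mask);
        if (src & in_bit)
            result |= lowest;
        in_bit <<= 1;
        mask &= mask - 1;
    }
    return result;
}
```

With the selector 0xff00fff0, `pext32_model(0x12345678, 0xff00fff0)` yields 0x00012567 and `pdep32_model(0x00012567, 0xff00fff0)` yields 0x12005670, matching the table.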
TBM (Trailing Bit Manipulation)
TBM consists of instructions complementary to the instruction set started by BMI1; their complementary nature means they do not necessarily need to be used directly but can be generated by an optimizing compiler when supported. AMD introduced TBM together with BMI1 in its Piledriver[5] line of processors; later AMD Jaguar and Zen-based processors do not support TBM.[4] No Intel processors (at least through Coffee Lake) support TBM.
Instruction | Description[3] | Equivalent C expression[10] |
---|---|---|
BEXTR | Bit field extract (with immediate) | (src >> start) & ((1 << len) - 1) |
BLCFILL | Fill from lowest clear bit | x & (x + 1) |
BLCI | Isolate lowest clear bit | x \| ~(x + 1) |
BLCIC | Isolate lowest clear bit and complement | ~x & (x + 1) |
BLCMSK | Mask from lowest clear bit | x ^ (x + 1) |
BLCS | Set lowest clear bit | x \| (x + 1) |
BLSFILL | Fill from lowest set bit | x \| (x - 1) |
BLSIC | Isolate lowest set bit and complement | ~x \| (x - 1) |
T1MSKC | Inverse mask from trailing ones | ~x \| (x + 1) |
TZMSK | Mask from trailing zeros | ~x & (x - 1) |
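Because each TBM instruction corresponds to a short C expression, code can be written portably and left to the compiler to map onto TBM where it is available. The sketch below wraps the table's expressions in functions on 32-bit values; the function names are hypothetical and whether a given compiler actually emits TBM instructions for them is not guaranteed.

```c
#include <stdint.h>

/* Operations keyed on the lowest clear bit of x. */
uint32_t blcfill32(uint32_t x) { return x & (x + 1); }   /* clear trailing ones         */
uint32_t blci32(uint32_t x)    { return x | ~(x + 1); }  /* all ones except that bit    */
uint32_t blcic32(uint32_t x)   { return ~x & (x + 1); }  /* only that bit set           */
uint32_t blcmsk32(uint32_t x)  { return x ^ (x + 1); }   /* mask up to that bit         */
uint32_t blcs32(uint32_t x)    { return x | (x + 1); }   /* set that bit                */

/* Operations keyed on the lowest set bit of x. */
uint32_t blsfill32(uint32_t x) { return x | (x - 1); }   /* set trailing zeros          */
uint32_t blsic32(uint32_t x)   { return ~x | (x - 1); }  /* all ones except that bit    */

/* Masks derived from trailing runs. */
uint32_t t1mskc32(uint32_t x)  { return ~x | (x + 1); }  /* zeros where trailing ones   */
uint32_t tzmsk32(uint32_t x)   { return ~x & (x - 1); }  /* ones where trailing zeros   */
```

For example, with x = 0x57 (binary 0101 0111, lowest clear bit at position 3), `blcic32` gives 0x08 and `blcmsk32` gives 0x0f; with x = 0x58 (binary 0101 1000), `tzmsk32` gives 0x07.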
Supporting CPUs
- Intel
  - Intel Nehalem processors and newer (like Sandy Bridge, Ivy Bridge) (POPCNT supported)
  - Intel Silvermont processors (POPCNT supported)
  - Intel Haswell processors and newer (like Skylake, Broadwell) (ABM, BMI1 and BMI2 supported)[6]
- AMD
  - K10-based processors (ABM supported)
  - "Cat" low-power processors
    - Bobcat-based processors (ABM supported)[11]
    - Jaguar-based processors and newer (ABM and BMI1 supported)[4]
    - Puma-based processors and newer (ABM and BMI1 supported)[4]
  - "Heavy Equipment" processors
    - Bulldozer-based processors (ABM supported)
    - Piledriver-based processors (ABM, BMI1 and TBM supported)[1]
    - Steamroller-based processors (ABM, BMI1 and TBM supported)
    - Excavator-based processors (ABM, BMI1, BMI2 and TBM supported)[8]
  - Zen-based processors (ABM, BMI1 and BMI2 supported)
  - Zen+-based processors (ABM, BMI1 and BMI2 supported)
  - Zen 2 processors (ABM, BMI1 and BMI2 supported)
Note that support for an instruction set extension only means that the processor can execute those instructions for software compatibility purposes; it does not guarantee that they perform well. For example, Zen, Zen+ and Zen 2 processors implement the PEXT and PDEP instructions in microcode, so they execute significantly slower than the same behaviour recreated using other instructions.[12] For optimum performance, compiler developers are advised to select individual instructions from the extensions based on architecture-specific performance profiles rather than on extension availability alone.
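Extension availability itself is reported through CPUID feature flags, which software can decode before deciding which code path to use. The sketch below only decodes register values that the caller has already obtained from the CPUID instruction; it does not execute CPUID. The struct and function names are hypothetical, and the bit positions are quoted from the Intel and AMD manuals as best recalled, not from this article, so they should be verified against those manuals before use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the flags discussed in this article. */
struct bmi_features {
    bool popcnt;     /* POPCNT instruction            */
    bool abm_lzcnt;  /* ABM flag, i.e. LZCNT support  */
    bool bmi1;
    bool bmi2;
    bool tbm;
};

/* Decode feature bits from raw CPUID register values.
   leaf1_ecx: ECX from CPUID leaf 01H
   leaf7_ebx: EBX from CPUID leaf 07H, subleaf 0
   ext1_ecx:  ECX from CPUID leaf 80000001H
   Bit positions are assumptions to be checked against vendor manuals. */
struct bmi_features decode_bmi_features(uint32_t leaf1_ecx,
                                        uint32_t leaf7_ebx,
                                        uint32_t ext1_ecx)
{
    struct bmi_features f;
    f.popcnt    = (leaf1_ecx >> 23) & 1;  /* leaf 01H, ECX bit 23        */
    f.abm_lzcnt = (ext1_ecx  >> 5)  & 1;  /* leaf 80000001H, ECX bit 5   */
    f.bmi1      = (leaf7_ebx >> 3)  & 1;  /* leaf 07H, EBX bit 3         */
    f.bmi2      = (leaf7_ebx >> 8)  & 1;  /* leaf 07H, EBX bit 8         */
    f.tbm       = (ext1_ecx  >> 21) & 1;  /* leaf 80000001H, ECX bit 21  */
    return f;
}
```

As the paragraph above points out, a set `bmi2` flag says nothing about PDEP or PEXT speed, so a dispatcher would typically combine these flags with knowledge of the processor family.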
See also
- Advanced Vector Extensions (AVX)
- AES instruction set
- CLMUL instruction set
- F16C
- FMA instruction set
- Intel ADX
- XOP instruction set
- Intel BCD opcodes (also used for advanced bit manipulation techniques)
References
- [1] "New "Bulldozer" and "Piledriver" Instructions". http://developer.amd.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf. Retrieved 2014-01-03.
- [2] "Intel Advanced Vector Extensions Programming Reference" (PDF). intel.com. Intel. June 2011. http://software.intel.com/file/36945. Retrieved 2014-01-03.
- [3] "AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions" (PDF). amd.com. AMD. October 2013. http://support.amd.com/TechDocs/24594.pdf. Retrieved 2014-01-02.
- [4] "Family 16h AMD A-Series Data Sheet" (PDF). amd.com. AMD. October 2013. http://support.amd.com/TechDocs/52169_KB_A_Series_Mobile.pdf. Retrieved 2014-01-02.
- [5] Hollingsworth, Brent. "New "Bulldozer" and "Piledriver" instructions" (PDF). Advanced Micro Devices, Inc. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf. Retrieved 11 December 2014.
- [6] Locktyukhin, Max. "How to detect New Instruction support in the 4th generation Intel® Core™ processor family". Intel. https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family. Retrieved 11 December 2014.
- [7] "bmiintrin.h from GCC 4.8". https://gcc.gnu.org/viewcvs/gcc/branches/gcc-4_8-branch/gcc/config/i386/bmiintrin.h?revision=201047&view=markup. Retrieved 2014-03-17.
- [8] "AMD Excavator Core May Bring Dramatic Performance Increases". X-bit labs. October 18, 2013. Archived from the original on October 23, 2013. https://web.archive.org/web/20131023074809/http://www.xbitlabs.com/news/cpu/display/20131018224745_AMD_Excavator_Core_May_Dramatic_Performance_Increases.html. Retrieved November 24, 2013.
- [9] "A New Basis for Shifters in General-Purpose Processors for Existing and Advanced Bit Manipulations" (PDF). palms.princeton.edu. IEEE Transactions on Computers. August 2009. pp. 1035–1048. http://palms.princeton.edu/system/files/IEEE_TC09_NewBasisForShifters.pdf. Retrieved 2014-02-10.
- [10] "tbmintrin.h from GCC 4.8". https://gcc.gnu.org/viewcvs/gcc/branches/gcc-4_8-branch/gcc/config/i386/tbmintrin.h?revision=196696&view=markup. Retrieved 2014-03-17.
- [11] "BIOS and Kernel Developer's Guide for AMD Family 14h". http://developer.amd.com/wordpress/media/2012/10/43170_14h_Mod_00h-0Fh_BKDG.pdf. Retrieved 2014-01-03.
- [12] "Dolphin Emulator". https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-report-dec-2019-and-jan-2020/.
Further reading
- Warren Jr., Henry S. (2013). Hacker's Delight (2 ed.). Addison Wesley - Pearson Education, Inc.. ISBN 978-0-321-84268-8.