x86 Bit manipulation instruction set

Bit manipulation instruction sets (BMI sets) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD. The purpose of these instruction sets is to improve the speed of bit manipulation. All the instructions in these sets are non-SIMD and operate only on general-purpose registers.

Intel published two sets: BMI (now referred to as BMI1) and BMI2. Both were introduced with the Haswell microarchitecture, with BMI1 matching features offered by AMD's ABM instruction set and BMI2 extending them. AMD published another two sets: ABM (Advanced Bit Manipulation, which is also a subset of SSE4a and is implemented by Intel as part of SSE4.2 and BMI1), and TBM (Trailing Bit Manipulation, introduced with Piledriver-based processors as an extension to BMI1, but dropped again in Zen-based processors).{{cite web|url=http://developer.amd.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf|title=New "Bulldozer" and "Piledriver" Instructions|access-date=2014-01-03}}

{{anchor|ABM}}ABM (Advanced Bit Manipulation)

AMD was the first to introduce the instructions that now form Intel's BMI1 as part of its ABM (Advanced Bit Manipulation) instruction set, then later added support for Intel's new BMI2 instructions. AMD today advertises the availability of these features via Intel's BMI1 and BMI2 cpuflags and instructs programmers to target them accordingly.{{cite web|url=https://www.amd.com/system/files/TechDocs/24594.pdf|title=AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions|access-date=2022-07-20}}

While Intel considers POPCNT as part of SSE4.2 and LZCNT as part of BMI1, both Intel and AMD advertise the presence of these two instructions individually. POPCNT has a separate CPUID flag of the same name, and Intel and AMD use AMD's ABM flag to indicate LZCNT support (since LZCNT combined with BMI1 and BMI2 completes the expanded ABM instruction set).{{cite web |url=http://software.intel.com/file/36945 |title=Intel Advanced Vector Extensions Programming Reference |date=June 2011 |access-date=2014-01-03 |publisher=Intel |work=intel.com |format=PDF}}
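The feature flags discussed above can be checked by decoding raw CPUID register values. The following is a minimal sketch in C under stated assumptions: the `cpuid_bits` struct and the `has_*` helper names are hypothetical, and the bit positions used (POPCNT in leaf 01H ECX bit 23, ABM/LZCNT in leaf 80000001H ECX bit 5, TBM in leaf 80000001H ECX bit 21, BMI1 and BMI2 in leaf 07H EBX bits 3 and 8) are not stated in this article and should be verified against the current Intel and AMD manuals. Obtaining the register values (for example through a compiler-provided CPUID helper) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw CPUID register values gathered by the caller (hypothetical container). */
struct cpuid_bits {
    uint32_t leaf1_ecx;        /* CPUID leaf 01H, ECX */
    uint32_t leaf7_ebx;        /* CPUID leaf 07H, sub-leaf 0, EBX */
    uint32_t leaf80000001_ecx; /* CPUID leaf 80000001H, ECX */
};

/* POPCNT has its own flag, separate from ABM and BMI1. */
static inline bool has_popcnt(const struct cpuid_bits *c) {
    return (c->leaf1_ecx >> 23) & 1u;
}

/* LZCNT is advertised through the ABM flag on both vendors. */
static inline bool has_lzcnt(const struct cpuid_bits *c) {
    return (c->leaf80000001_ecx >> 5) & 1u;
}

/* TBM is an AMD-only flag in the same extended leaf. */
static inline bool has_tbm(const struct cpuid_bits *c) {
    return (c->leaf80000001_ecx >> 21) & 1u;
}

/* BMI1 and BMI2 are reported in the structured extended feature leaf. */
static inline bool has_bmi1(const struct cpuid_bits *c) {
    return (c->leaf7_ebx >> 3) & 1u;
}

static inline bool has_bmi2(const struct cpuid_bits *c) {
    return (c->leaf7_ebx >> 8) & 1u;
}
```

Keeping the decoding separate from the CPUID query makes the helpers easy to test with synthetic register values.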

Encoding ! Instruction ! Description{{cite web|url=https://www.amd.com/system/files/TechDocs/24594.pdf|title=AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions|date=March 2021 |access-date=2021-04-08 |publisher=AMD |archive-url=https://web.archive.org/web/20210408181855/https://www.amd.com/system/files/TechDocs/24594.pdf |archive-date=2021-04-08 |url-status=live |version=Revision 3.32}}
`F3 0F B8 /r` \| `POPCNT` \| Population count
`F3 0F BD /r` \| `LZCNT` \| Leading zeros count

LZCNT is related to the Bit Scan Reverse (BSR) instruction, but it sets ZF if the result is zero and CF if the source is zero, whereas BSR sets ZF if the source is zero. LZCNT also produces a defined result (the source operand size in bits) if the source operand is zero. For a non-zero argument, the sum of the LZCNT and BSR results is the argument bit width minus 1 (for example, if the 32-bit argument is 0x000f0000, LZCNT gives 12 and BSR gives 19).

The encoding of LZCNT is such that if ABM is not supported, then the BSR instruction is executed instead.{{rp|227}}
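The relationship between the LZCNT and BSR results can be illustrated with portable C. This is only a sketch: `lzcnt32_soft` and `bsr32_soft` are hypothetical helper names that model the result values, not the flag behaviour, and they are not how a compiler would emit the instructions.

```c
#include <stdint.h>

/* Leading-zero count with LZCNT's defined result for a zero source (32). */
static inline unsigned lzcnt32_soft(uint32_t x) {
    unsigned n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Index of the highest set bit, as BSR yields for a non-zero source.
   The caller must not pass 0, mirroring BSR's undefined result there. */
static inline unsigned bsr32_soft(uint32_t x) {
    unsigned i = 0;
    while (x >>= 1) {
        i++;
    }
    return i;
}
```

For the argument 0x000f0000 used in the text, the two helpers return 12 and 19, which sum to 31.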

{{anchor|BMI1}}BMI1 (Bit Manipulation Instruction Set 1)

The instructions below are those enabled by the BMI bit in CPUID. Intel officially considers LZCNT as part of BMI, but advertises LZCNT support using the ABM CPUID feature flag. BMI1 is available in AMD's Jaguar, Piledriver{{cite web|last1=Hollingsworth|first1=Brent|title=New "Bulldozer" and "Piledriver" instructions|url=http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf|publisher=Advanced Micro Devices, Inc.|access-date=11 December 2014}} and newer processors, and in Intel's Haswell{{cite web|last1=Locktyukhin|first1=Max|title=How to detect New Instruction support in the 4th generation Intel® Core™ processor family|url=https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family|website=www.intel.com|publisher=Intel|access-date=11 December 2014}} and newer processors.

Encoding ! Instruction ! Description ! Equivalent C expression{{cite web|url=https://gcc.gnu.org/viewcvs/gcc/branches/gcc-4_8-branch/gcc/config/i386/bmiintrin.h?revision=201047&view=markup|title=bmiintrin.h from GCC 4.8|access-date=2014-03-17}}{{Cite web|url=https://www.sandpile.org/x86/bits.htm|title = sandpile.org -- x86 architecture -- bits|access-date=2014-03-17}}{{Cite web|url=https://github.com/abseil/abseil-cpp/blob/ce4bc927755fdf0ed03d679d9c7fa041175bb3cb/absl/base/internal/bits.h#L188|title = Abseil - C++ Common Libraries|website = GitHub|date = 4 November 2021}}
`VEX.LZ.0F38 F2 /r` \| `ANDN` \| Logical and not \| `~x & y`
`VEX.LZ.0F38 F7 /r` \| `BEXTR` \| Bit field extract (with register) \| `(src >> start) & ((1 << len) - 1)`
`VEX.LZ.0F38 F3 /3` \| `BLSI` \| Extract lowest set isolated bit \| `x & -x`
`VEX.LZ.0F38 F3 /2` \| `BLSMSK` \| Get mask up to lowest set bit \| `x ^ (x - 1)`
`VEX.LZ.0F38 F3 /1` \| `BLSR` \| Reset lowest set bit \| `x & (x - 1)`
`F3 0F BC /r` \| `TZCNT` \| Count the number of trailing zero bits \| `31 + (!x) - (((x & -x) & 0x0000FFFF) ? 16 : 0) - (((x & -x) & 0x00FF00FF) ? 8 : 0) - (((x & -x) & 0x0F0F0F0F) ? 4 : 0) - (((x & -x) & 0x33333333) ? 2 : 0) - (((x & -x) & 0x55555555) ? 1 : 0)`

TZCNT is almost identical to the Bit Scan Forward (BSF) instruction, but it sets ZF if the result is zero and CF if the source is zero, whereas BSF sets ZF if the source is zero. For a non-zero argument, TZCNT and BSF give the same result.

As with LZCNT, the encoding of TZCNT is such that if BMI1 is not supported, then the BSF instruction is executed instead.{{rp|352}}
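The C expressions listed for the BMI1 instructions can be wrapped in small functions to see what each one computes. The following is a sketch for 32-bit operands; the function names are hypothetical, the range checks in `bextr32` are added here only to avoid undefined shifts in C, and `tzcnt32_soft` uses a simple loop rather than the branch-free expression from the table.

```c
#include <stdint.h>

/* ANDN: bits of y that are not set in x. */
static inline uint32_t andn32(uint32_t x, uint32_t y) {
    return ~x & y;
}

/* BEXTR: extract len bits of src starting at bit position start. */
static inline uint32_t bextr32(uint32_t src, unsigned start, unsigned len) {
    if (start >= 32) {
        return 0;                 /* nothing left after the shift */
    }
    uint32_t shifted = src >> start;
    if (len >= 32) {
        return shifted;           /* mask would cover the whole word */
    }
    return shifted & ((1u << len) - 1u);
}

/* BLSI: isolate the lowest set bit (0u - x is the unsigned form of -x). */
static inline uint32_t blsi32(uint32_t x) {
    return x & (0u - x);
}

/* BLSMSK: mask covering the lowest set bit and everything below it. */
static inline uint32_t blsmsk32(uint32_t x) {
    return x ^ (x - 1u);
}

/* BLSR: clear the lowest set bit. */
static inline uint32_t blsr32(uint32_t x) {
    return x & (x - 1u);
}

/* TZCNT: trailing zero count, returning 32 for a zero source. */
static inline unsigned tzcnt32_soft(uint32_t x) {
    unsigned n = 0;
    if (x == 0) {
        return 32;
    }
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}
```

With x = 0x58 (binary 0101 1000), `blsi32` gives 0x08, `blsmsk32` gives 0x0F, `blsr32` gives 0x50, and `tzcnt32_soft` gives 3.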

{{anchor|BMI2}}BMI2 (Bit Manipulation Instruction Set 2)

Intel introduced BMI2 together with BMI1 in its line of Haswell processors. Only AMD has produced processors supporting BMI1 without BMI2; BMI2 is supported by AMD's Excavator architecture and newer.{{cite web |url=http://www.xbitlabs.com/news/cpu/display/20131018224745_AMD_Excavator_Core_May_Dramatic_Performance_Increases.html |title=AMD Excavator Core May Bring Dramatic Performance Increases |publisher=X-bit labs |date=October 18, 2013 |access-date=November 24, 2013 |url-status=dead |archive-url=https://web.archive.org/web/20131023074809/http://www.xbitlabs.com/news/cpu/display/20131018224745_AMD_Excavator_Core_May_Dramatic_Performance_Increases.html |archive-date=October 23, 2013 }}

Encoding ! Instruction ! Description
`VEX.LZ.0F38 F5 /r` \| `BZHI` \| Zero high bits starting with specified bit position: `src & ((1 << inx) - 1)`
`VEX.LZ.F2.0F38 F6 /r` \| `MULX` \| Unsigned multiply without affecting flags, and arbitrary destination registers
`VEX.LZ.F2.0F38 F5 /r` \| `PDEP` \| Parallel bits deposit
`VEX.LZ.F3.0F38 F5 /r` \| `PEXT` \| Parallel bits extract
`VEX.LZ.F2.0F3A F0 /r ib` \| `RORX` \| Rotate right logical without affecting flags
`VEX.LZ.F3.0F38 F7 /r` \| `SARX` \| Shift arithmetic right without affecting flags
`VEX.LZ.F2.0F38 F7 /r` \| `SHRX` \| Shift logical right without affecting flags
`VEX.LZ.66.0F38 F7 /r` \| `SHLX` \| Shift logical left without affecting flags
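The result values of several BMI2 instructions can be modelled in portable C. This is a sketch for 32-bit operands only: the function names are hypothetical, the "without affecting flags" property has no C equivalent and is not modelled, and the handling of out-of-range indices in `bzhi32` (source returned unchanged) reflects the documented instruction behaviour as understood here and should be checked against the vendor manuals.

```c
#include <stdint.h>

/* BZHI: clear all bits of src at positions >= index.
   Only the low 8 bits of index are used; an index of 32 or more
   leaves src unchanged. */
static inline uint32_t bzhi32(uint32_t src, unsigned index) {
    index &= 0xFFu;
    if (index >= 32) {
        return src;
    }
    return src & ((1u << index) - 1u);
}

/* MULX: full 64-bit product of two unsigned 32-bit values.
   The instruction writes the high and low halves to two registers;
   here they are returned together in one 64-bit value. */
static inline uint64_t mulx32(uint32_t a, uint32_t b) {
    return (uint64_t)a * (uint64_t)b;
}

/* RORX: rotate right by n bit positions (n taken modulo 32). */
static inline uint32_t rorx32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* SHLX / SHRX: logical shifts with the count taken modulo 32. */
static inline uint32_t shlx32(uint32_t x, unsigned n) {
    return x << (n & 31u);
}

static inline uint32_t shrx32(uint32_t x, unsigned n) {
    return x >> (n & 31u);
}
```

For example, `bzhi32(0xFFFFFFFF, 8)` keeps only the low byte, and `rorx32(0x12345678, 8)` moves the low byte to the top of the word.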

= Parallel bit deposit and extract =

The PDEP and PEXT instructions are new generalized bit-level compress and expand instructions. They take two inputs; one is a source, and the other is a selector. The selector is a bitmap selecting the bits that are to be packed or unpacked. PEXT copies selected bits from the source to contiguous low-order bits of the destination; higher-order destination bits are cleared. PDEP does the opposite for the selected bits: contiguous low-order bits are copied to selected bits of the destination; other destination bits are cleared. This can be used to extract any bitfield of the input, and even do a lot of bit-level shuffling that previously would have been expensive. While what these instructions do is similar to bit level gather-scatter SIMD instructions, PDEP and PEXT instructions (like the rest of the BMI instruction sets) operate on general-purpose registers.{{Cite web | url = http://palms.princeton.edu/system/files/IEEE_TC09_NewBasisForShifters.pdf | title = A New Basis for Shifters in General-Purpose Processors for Existing and Advanced Bit Manipulations | date = August 2009 | access-date = 2014-02-10 | author1 = Yedidya Hilewitz | author2 = Ruby B. Lee | publisher = IEEE Transactions on Computers | work = palms.princeton.edu | volume = 58 | number = 8 | pages = 1035–1048 }}

The instructions are available in 32-bit and 64-bit versions. An example using arbitrary source and selector in 32-bit mode is:

Instruction ! Selector mask ! Source ! Destination
`PEXT` \| `0xff00fff0` \| `0x12345678` \| `0x00012567`
`PDEP` \| `0xff00fff0` \| `0x00012567` \| `0x12005670`
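The behaviour shown in this example can be reproduced with a straightforward bit-by-bit loop in C. The sketch below is a reference model only: `pext32_soft` and `pdep32_soft` are hypothetical names, the loops are far slower than the hardware instructions, and this is not the "zp7" method or any vendor implementation. Code targeting BMI2 hardware would normally use the compiler intrinsics (`_pext_u32` and `_pdep_u32` in `<immintrin.h>`) instead.

```c
#include <stdint.h>

/* PEXT model: gather the bits of src selected by mask into the
   contiguous low-order bits of the result. */
static inline uint32_t pext32_soft(uint32_t src, uint32_t mask) {
    uint32_t dst = 0;
    uint32_t out_bit = 1;                     /* next free low-order position */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask); /* lowest selected position */
        if (src & lowest) {
            dst |= out_bit;
        }
        out_bit <<= 1;
        mask &= mask - 1u;                    /* drop that position from the mask */
    }
    return dst;
}

/* PDEP model: scatter the contiguous low-order bits of src to the
   positions selected by mask; all other result bits are zero. */
static inline uint32_t pdep32_soft(uint32_t src, uint32_t mask) {
    uint32_t dst = 0;
    uint32_t in_bit = 1;                      /* next low-order source bit */
    while (mask != 0) {
        uint32_t lowest = mask & (0u - mask);
        if (src & in_bit) {
            dst |= lowest;
        }
        in_bit <<= 1;
        mask &= mask - 1u;
    }
    return dst;
}
```

With the selector 0xff00fff0, `pext32_soft(0x12345678, 0xff00fff0)` yields 0x00012567 and `pdep32_soft(0x00012567, 0xff00fff0)` yields 0x12005670, matching the example values.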

AMD processors before Zen 3{{Cite web|url=https://en.wikichip.org/wiki/amd/microarchitectures/zen_3#Key_changes_from_Zen_2|title = Zen 3 - Microarchitectures - AMD - WikiChip}} that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles{{cite web| url=https://www.agner.org/optimize/instruction_tables.pdf |title=Instruction tables |access-date=2023-09-09}} rather than the 3 cycles of Zen 3.{{Cite web |title=Software Optimization Guide for AMD Family 19h Processors |url=https://developer.amd.com/resources/developer-guides-manuals/ |access-date=2022-07-22 |website=AMD Developer Central}} As a result, it is often faster to use other instructions on these processors.{{Cite web|title=Saving Private Ryzen: PEXT/PDEP 32/64b replacement functions for #AMD CPUs (BR/#Zen/Zen+/#Zen2) based on @zwegner's zp7|url=https://twitter.com/instlatx64/status/1322503571288559617|access-date=2022-02-21|website=Twitter|language=en}}

{{anchor|TBM}}TBM (Trailing Bit Manipulation)

TBM consists of instructions complementary to the instruction set started by BMI1; their complementary nature means they do not necessarily need to be used directly but can be generated by an optimizing compiler when supported. AMD introduced TBM together with BMI1 in its Piledriver line of processors; later AMD Jaguar and Zen-based processors do not support TBM.{{cite web |url=http://support.amd.com/TechDocs/52169_KB_A_Series_Mobile.pdf |title=Family 16h AMD A-Series Data Sheet |date=October 2013 |access-date=2014-01-02 |publisher=AMD |work=amd.com }} No Intel processors (at least through Alder Lake) support TBM.

Encoding ! Instruction ! Description ! Equivalent C expression{{cite web|url=https://gcc.gnu.org/viewcvs/gcc/branches/gcc-4_8-branch/gcc/config/i386/tbmintrin.h?revision=196696&view=markup|title=tbmintrin.h from GCC 4.8|access-date=2014-03-17}}
`XOP.LZ.0A 10 /r id` \| `BEXTR` \| Bit field extract (with immediate) \| `(src >> start) & ((1 << len) - 1)`
`XOP.LZ.09 01 /1` \| `BLCFILL` \| Fill from lowest clear bit \| `x & (x + 1)`
`XOP.LZ.09 02 /6` \| `BLCI` \| Isolate lowest clear bit \| `x \| ~(x + 1)`
`XOP.LZ.09 01 /5` \| `BLCIC` \| Isolate lowest clear bit and complement \| `~x & (x + 1)`
`XOP.LZ.09 02 /1` \| `BLCMSK` \| Mask from lowest clear bit \| `x ^ (x + 1)`
`XOP.LZ.09 01 /3` \| `BLCS` \| Set lowest clear bit \| `x \| (x + 1)`
`XOP.LZ.09 01 /2` \| `BLSFILL` \| Fill from lowest set bit \| `x \| (x - 1)`
`XOP.LZ.09 01 /6` \| `BLSIC` \| Isolate lowest set bit and complement \| `~x \| (x - 1)`
`XOP.LZ.09 01 /7` \| `T1MSKC` \| Inverse mask from trailing ones \| `~x \| (x + 1)`
`XOP.LZ.09 01 /4` \| `TZMSK` \| Mask from trailing zeros \| `~x & (x - 1)`
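Because TBM is absent from most processors, the equivalent C expressions from the table are often the practical way to get the same results. The sketch below wraps each expression in a 32-bit function; the function names are hypothetical, and whether a compiler turns them into TBM instructions depends on the target options.

```c
#include <stdint.h>

/* Operations keyed on the lowest clear bit. */
static inline uint32_t blcfill32(uint32_t x) { return x & (x + 1u); }
static inline uint32_t blci32(uint32_t x)    { return x | ~(x + 1u); }
static inline uint32_t blcic32(uint32_t x)   { return ~x & (x + 1u); }
static inline uint32_t blcmsk32(uint32_t x)  { return x ^ (x + 1u); }
static inline uint32_t blcs32(uint32_t x)    { return x | (x + 1u); }

/* Operations keyed on the lowest set bit. */
static inline uint32_t blsfill32(uint32_t x) { return x | (x - 1u); }
static inline uint32_t blsic32(uint32_t x)   { return ~x | (x - 1u); }

/* Operations keyed on runs of trailing ones or zeros. */
static inline uint32_t t1mskc32(uint32_t x)  { return ~x | (x + 1u); }
static inline uint32_t tzmsk32(uint32_t x)   { return ~x & (x - 1u); }
```

For x = 0x57 (binary 0101 0111, lowest clear bit at position 3), `blcic32` returns 0x08 and `blcs32` returns 0x5F; for x = 0x58 (binary 0101 1000), `tzmsk32` returns 0x07.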

Supporting CPUs

* Intel
** Intel Nehalem processors and newer (like Sandy Bridge, Ivy Bridge) (POPCNT supported)
** Intel Silvermont processors (POPCNT supported)
** Intel Haswell processors and newer (like Skylake, Broadwell) (ABM, BMI1 and BMI2 supported)
* AMD
** K10-based processors (ABM supported)
** "Cat" low-power processors
*** Bobcat-based processors (ABM supported){{cite web|url=http://developer.amd.com/wordpress/media/2012/10/43170_14h_Mod_00h-0Fh_BKDG.pdf|title=BIOS and Kernel Developer's Guide for AMD Family 14h|access-date=2014-01-03}}
*** Jaguar-based processors and newer (ABM and BMI1 supported)
*** Puma-based processors and newer (ABM and BMI1 supported)
** "Heavy Equipment" processors
*** Bulldozer-based processors (ABM supported)
*** Piledriver-based processors (ABM, BMI1 and TBM supported)
*** Steamroller-based processors (ABM, BMI1 and TBM supported)
*** Excavator-based processors and newer (ABM, BMI1, BMI2 and TBM supported; microcoded PEXT and PDEP)
** Zen-based, Zen+-based, and Zen 2-based{{cite web|url=https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/6|title=AMD Zen 3 Ryzen Deep Dive Review: 5950X, 5900X, 5800X and 5600X Tested|access-date=2021-12-26}} processors (ABM, BMI1 and BMI2 supported; microcoded PEXT and PDEP)
** Zen 3 processors and newer (ABM, BMI1 and BMI2 supported; full hardware implementation)

Note that instruction extension support means only that the processor is capable of executing the supported instructions for software compatibility purposes; it might not perform well doing so. For example, Excavator through Zen 2 processors implement the PEXT and PDEP instructions in microcode, so the instructions execute significantly slower than the same behaviour recreated using other instructions.{{Cite web|url=https://dolphin-emu.org/blog/2020/02/07/dolphin-progress-report-dec-2019-and-jan-2020/|title=Dolphin Progress Report: December 2019 and January 2020|website=Dolphin Emulator|date=7 February 2020|language=en-us|access-date=2020-02-07}} (A software method called "zp7" is, in fact, faster on these machines.){{cite web |last1=Wegner |first1=Zach |title=zwegner/zp7 |website=GitHub |url=https://github.com/zwegner/zp7 |date=4 November 2020}} For optimum performance, it is recommended that compiler developers choose individual instructions in the extensions based on architecture-specific performance profiles rather than on extension availability.

References

External links

[https://software.intel.com/sites/landingpage/IntrinsicsGuide/ Intel Intrinsics Guide]

