CLMUL instruction set

{{Short description|Extension to the x86 instruction set}}

Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD. It was proposed by Intel in March 2008{{cite web | url=http://softwareprojects.intel.com/avx/ | title=Intel Software Network | publisher=Intel | accessdate=2008-04-05 | url-status=dead | archiveurl=https://web.archive.org/web/20080407095317/http://softwareprojects.intel.com/avx/ | archivedate=2008-04-07 }} and made available in the Intel Westmere processors announced in early 2010. Mathematically, the instruction implements multiplication of polynomials over the finite field GF(2), where the bitstring <math>a_0a_1\ldots a_{63}</math> represents the polynomial <math>a_0 + a_1X + a_2X^2 + \cdots + a_{63}X^{63}</math>. Compared with the traditional instruction set, the CLMUL instruction also allows a more efficient implementation of the closely related multiplication in larger finite fields GF(2<sup>k</sup>).{{cite web |url=http://software.intel.com/en-us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-computing-the-gcm-mode/ |archive-url=https://web.archive.org/web/20190806061845/https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf |title=Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode – Rev 2.02 |publisher=Intel |author1=Shay Gueron |author2=Michael E. Kounavis |date=2014-04-20 |archive-date=2019-08-06}}
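As a rough illustration of the arithmetic described above, the following C sketch multiplies two polynomials over GF(2) in software. It assumes that bit ''i'' of an integer, counting from the least significant bit, holds the coefficient of X<sup>i</sup>. The function name <code>clmul32</code> is hypothetical, and the loop is only a model of the operation, far slower than the hardware instruction.

```c
#include <stdint.h>

/* Carry-less product of two 32-bit polynomials over GF(2).
 * Bit i of each operand is taken as the coefficient of X^i (an assumption
 * of this sketch). Partial products are combined with XOR instead of
 * addition, so no carries propagate between bit positions. */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u) {
            result ^= (uint64_t)a << i;
        }
    }
    return result;
}
```

For example, <code>clmul32(3, 3)</code> yields 5, because (1 + X)<sup>2</sup> = 1 + X<sup>2</sup> over GF(2), whereas the ordinary integer product 3 × 3 is 9.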

One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on multiplication in the finite field GF(2<sup>k</sup>). Another application is the fast calculation of CRC values,{{cite web|url=http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf|title=Fast CRC Computation for Generic Polynomials Using PCLMULQDQ}} including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush.{{cite web|url=https://blog.cloudflare.com/cloudflare-fights-cancer/|title=Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code|author=Vlad Krasnov|publisher=CloudFlare|date=2015-07-08|accessdate=2016-09-04}}
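The link between carry-less multiplication and multiplication in GF(2<sup>k</sup>) can be sketched for a small field: a field product is a carry-less product followed by reduction modulo an irreducible polynomial. The example below uses GF(2<sup>8</sup>) with the polynomial X<sup>8</sup> + X<sup>4</sup> + X<sup>3</sup> + X + 1, chosen here only because it is small and well known from AES. Galois/Counter Mode works in GF(2<sup>128</sup>) with a different polynomial and bit ordering, so this is an illustration of the principle and not GCM code. The name <code>gf256_mul</code> is hypothetical.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8), reduced modulo X^8 + X^4 + X^3 + X + 1
 * (0x11B, the AES polynomial; chosen here only as a small example). */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    /* Step 1: carry-less multiplication, giving a polynomial of degree <= 14. */
    uint16_t product = 0;
    for (int i = 0; i < 8; i++) {
        if ((b >> i) & 1u) {
            product ^= (uint16_t)((uint16_t)a << i);
        }
    }
    /* Step 2: reduce modulo the field polynomial, clearing bits 14 down to 8. */
    for (int i = 14; i >= 8; i--) {
        if ((product >> i) & 1u) {
            product ^= (uint16_t)(0x11Bu << (i - 8));
        }
    }
    return (uint8_t)product;
}
```

With this polynomial, <code>gf256_mul(0x57, 0x83)</code> gives 0xC1, matching the worked example in the AES specification. A hardware CLMUL instruction replaces the first loop; the reduction step still has to be done separately.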

ARMv8 also has a version of CLMUL. SPARC calls their version XMULX, for "XOR multiplication".

New instructions

The PCLMULQDQ instruction computes the 128-bit carry-less product of two 64-bit values. The destination is a 128-bit XMM register. The source may be another XMM register or a memory operand. An immediate operand specifies which 64-bit half of each 128-bit operand is multiplied. Mnemonics for specific values of the immediate operand are also defined:

{| class="wikitable"
! Instruction
! Opcode
! Description
|-
| {{nowrap|PCLMULQDQ xmmreg,xmmrm,imm}}
| {{nowrap|[rmi: 66 0f 3a 44 /r ib]}}
| Perform a carry-less multiplication of two 64-bit polynomials over the finite field GF(2)[X].
|-
| PCLMULLQLQDQ xmmreg,xmmrm
| [rm: 66 0f 3a 44 /r 00]
| Multiply the low halves of the two registers.
|-
| PCLMULHQLQDQ xmmreg,xmmrm
| [rm: 66 0f 3a 44 /r 01]
| Multiply the high half of the destination register by the low half of the source register.
|-
| PCLMULLQHQDQ xmmreg,xmmrm
| [rm: 66 0f 3a 44 /r 10]
| Multiply the low half of the destination register by the high half of the source register.
|-
| PCLMULHQHQDQ xmmreg,xmmrm
| [rm: 66 0f 3a 44 /r 11]
| Multiply the high halves of the two registers.
|}
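The way the immediate byte picks the operand halves can be modelled in portable C. This is a sketch under stated assumptions: the <code>u128</code> struct and the function <code>pclmulqdq_model</code> are hypothetical names standing in for an XMM register and the instruction, and the bit-by-bit loop is for illustration only. In real code, compilers expose the instruction through the <code>_mm_clmulepi64_si128</code> intrinsic.

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves, standing in for an XMM register. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* Software model of PCLMULQDQ (hypothetical helper, not an intrinsic).
 * Bit 0 of imm selects the half of the first (destination) operand and
 * bit 4 selects the half of the second (source) operand; 0 = low, 1 = high. */
u128 pclmulqdq_model(u128 dst, u128 src, uint8_t imm)
{
    uint64_t a = (imm & 0x01u) ? dst.hi : dst.lo;
    uint64_t b = (imm & 0x10u) ? src.hi : src.lo;
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            /* Bits shifted out of the low word land in the high word.
             * The i == 0 case is skipped because a >> 64 is undefined in C. */
            if (i > 0) {
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}
```

With immediate values 0x00, 0x01, 0x10 and 0x11 the model corresponds to the four mnemonics PCLMULLQLQDQ, PCLMULHQLQDQ, PCLMULLQHQDQ and PCLMULHQHQDQ respectively.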

An EVEX-encoded vectorized version (VPCLMULQDQ) is available in AVX-512.

CPUs with CLMUL instruction set

  • Intel:
      • Westmere processor (March 2010)
      • Sandy Bridge processor
      • Ivy Bridge processor
      • Haswell processor
      • Broadwell processor (with increased throughput and lower latency{{cite web|url=http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-review |title=The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads |page=3 |author=Johan De Gelas |work=Anandtech |date=2016-03-31}})
      • Skylake (and later) processor
      • Goldmont processor
  • AMD:
      • Jaguar-based processors and newer{{cite web | title = Slide detailing improvements of Jaguar over Bobcat | date = 29 August 2012 | publisher = AMD | url = http://www.slideshare.net/AMDPhil/bobcat-to-jaguarv2 | accessdate = August 3, 2013 }}
      • Puma-based processors and newer
      • "Heavy Equipment" processors
          • Bulldozer-based processors{{cite web |url=http://developer.amd.com/2009/05/06/striking-a-balance/ |title=Striking a balance |date=6 May 2009 |author=Dave Christie |publisher=AMD Developer blogs |accessdate=2011-03-11 |url-status=dead |archiveurl=https://archive.today/20131109140737/http://developer.amd.com/2009/05/06/striking-a-balance/ |archivedate=9 November 2013 }}
          • Piledriver-based processors
          • Steamroller-based processors
          • Excavator-based processors and newer
      • Zen processors
      • Zen+ processors
      • Zen2 (and later) processors

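The presence of the CLMUL instruction set can be checked by testing one of the CPU feature bits: executing the CPUID instruction with EAX set to 1 reports PCLMULQDQ support in bit 1 of the ECX register.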
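A feature check along these lines might look as follows in C. The sketch assumes the feature flag is bit 1 of ECX for CPUID leaf 1 and relies on the <code>__get_cpuid</code> helper from the <code>&lt;cpuid.h&gt;</code> header shipped with GCC and Clang; other compilers need a different mechanism. The function names <code>ecx_has_pclmulqdq</code> and <code>cpu_has_pclmulqdq</code> are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* PCLMULQDQ support is reported in bit 1 of ECX for CPUID leaf 1. */
#define CPUID_ECX_PCLMULQDQ (UINT32_C(1) << 1)

/* Pure helper: test the feature bit in an ECX value from CPUID leaf 1. */
bool ecx_has_pclmulqdq(uint32_t ecx)
{
    return (ecx & CPUID_ECX_PCLMULQDQ) != 0;
}

/* Query the running CPU. Returns false on non-x86 targets or when the
 * compiler does not provide <cpuid.h> (GCC and Clang do). */
bool cpu_has_pclmulqdq(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return ecx_has_pclmulqdq(ecx);
    }
#endif
    return false;
}
```

A program would typically call <code>cpu_has_pclmulqdq()</code> once at start-up and fall back to a table-driven or bitwise software routine when it returns false.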

References

{{reflist|30em}}
