FMA instruction set

{{short description|X86 instruction set extension developed by Intel}}

The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations.{{cite web|last=Woltmann|first=George (Prime95)|title=Intel AVX and GIMPS|url=http://www.mersenneforum.org/showthread.php?t=14335&highlight=fused+multiply+add|work=mersenneforum.org|publisher=Great Internet Mersenne Prime Search (GIMPS) project|access-date=27 July 2011 |quote="FMA3 and FMA4 are not instruction sets, they are individual instructions -- fused multiply add. They could be quite useful depending on how Intel and AMD implement them"}} There are two variants:

FMA4 is supported in AMD processors starting with the Bulldozer architecture. FMA4 was implemented in hardware before FMA3 was. AMD dropped official support for FMA4 starting with Zen 1.
FMA3 is supported in AMD processors starting with the Piledriver architecture and in Intel processors starting with Haswell (2013) and Broadwell (2014).

Instructions

FMA3 and FMA4 instructions have almost identical functionality, but are not compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, but FMA3 instructions have three operands, while FMA4 ones have four. The FMA operation has the form d = round(a · b + c), where the round function rounds the result so that it fits in the destination register when it has too many significant bits.
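The effect of rounding only once can be illustrated with a small model. The following Python sketch does not use the hardware instructions; it assumes finite double-precision inputs and uses exact rational arithmetic to form a · b + c before a single rounding. The helper name `fma_model` and the input values are chosen for illustration only.

```python
from fractions import Fraction


def fma_model(a: float, b: float, c: float) -> float:
    """Model d = round(a*b + c) with one rounding (finite inputs only).

    Fraction represents each double exactly, so the product and the sum
    are exact; float() then rounds once to the nearest double.
    Note: an exact zero result always comes out as +0.0 in this model.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# Separate multiply and add: the product 1 - 2**-54 is rounded to 1.0
# before the addition, so the small term is lost.
unfused = a * b + c

# Fused model: the exact value -2**-54 is rounded only once.
fused = fma_model(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

With these inputs the separately rounded computation returns zero, while the single-rounding model keeps the small non-zero term.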

The four-operand form (FMA4) allows a, b, c and d to be four different registers, while the three-operand form (FMA3) requires that d be the same register as a, b or c. The three-operand form makes the code shorter and the hardware implementation slightly simpler, while the four-operand form provides more programming flexibility.
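The trade-off can be sketched with a toy register file. This Python model is only an illustration under simplifying assumptions: registers are dictionary entries, the function names `fma3_231` and `fma4` are hypothetical, and ordinary Python arithmetic stands in for the fused operation (the small values used here are exact either way).

```python
def fma3_231(regs, a, b, c):
    """Three-operand style: the destination a is also a source (the summand)."""
    regs[a] = regs[b] * regs[c] + regs[a]


def fma4(regs, d, a, b, c):
    """Four-operand style: separate destination, no source is overwritten."""
    regs[d] = regs[a] * regs[b] + regs[c]


# Goal: compute r3 = r0 * r1 + r2 while keeping r0, r1 and r2 intact.
init = {"r0": 2.0, "r1": 3.0, "r2": 1.0, "r3": 0.0}

# Four-operand form: a single instruction is enough.
regs4 = dict(init)
fma4(regs4, "r3", "r0", "r1", "r2")

# Three-operand form: copy the summand first, then overwrite the copy.
regs3 = dict(init)
regs3["r3"] = regs3["r2"]              # extra register-to-register move
fma3_231(regs3, "r3", "r0", "r1")

print(regs4)
print(regs3 == regs4)
```

When one of the inputs is no longer needed afterwards, the three-operand form needs no extra move, which is the common case in accumulation loops.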

See XOP instruction set for more discussion of compatibility issues between Intel and AMD.

FMA3 instruction set

=CPUs with FMA3=

AMD
Piledriver (2012) and newer microarchitectures{{cite web|last=Maffeo|first=Robin|title=AMD and the Visual Studio 11 Beta|url=http://developer.amd.com/community/blog/2012/03/01/amd-and-the-visual-studio-11-beta/|publisher=AMD|date=March 1, 2012|archive-url=https://archive.today/20131109140742/http://developer.amd.com/community/blog/2012/03/01/amd-and-the-visual-studio-11-beta/|archive-date=November 9, 2013|url-status=dead|access-date=2018-11-07}}
2nd gen APUs, "Trinity" (32nm), May 15, 2012
2nd gen "Bulldozer" (bdver2) with Piledriver cores, October 23, 2012
Intel
Haswell (2013) and newer processors, except Pentiums and Celerons{{cite web |title=CPU-Z - ID : y5z6gq |url=https://valid.x86.fr/cache/screenshot/y5z6gq.png |access-date=2022-05-01}}{{cite web |title=CPU-Z - ID : kr2mlx |url=https://valid.x86.fr/cache/screenshot/kr2mlx.png |access-date=2022-05-01}}

=Excerpt from FMA3=

Supported instructions include

{| class="wikitable"
! Mnemonic !! Operation !! Mnemonic !! Operation
|-
| VFMADD || `result = + a · b + c` || rowspan="2" | VFMADDSUB || rowspan="2" | `result = a · b + c` for i = 1, 3, ...; `result = a · b − c` for i = 0, 2, ...
|-
| VFNMADD || `result = − a · b + c`
|-
| VFMSUB || `result = + a · b − c` || rowspan="2" | VFMSUBADD || rowspan="2" | `result = a · b − c` for i = 1, 3, ...; `result = a · b + c` for i = 0, 2, ...
|-
| VFNMSUB || `result = − a · b − c`
|}

;Note:

VFNMADD is result = − a · b + c, not result = − (a · b + c).
VFNMSUB generates a −0 when all inputs are zero.
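The sign conventions and the alternating packed forms can be modelled as follows. This is a minimal Python sketch that mirrors the table and the notes above; it uses ordinary (separately rounded) arithmetic, which is exact for the small values shown, and the function names simply reuse the mnemonics in lower case. It assumes the default round-to-nearest mode.

```python
import math


def vfmadd(a, b, c):
    return a * b + c


def vfnmadd(a, b, c):
    return -(a * b) + c        # only the product is negated


def vfmsub(a, b, c):
    return a * b - c


def vfnmsub(a, b, c):
    return -(a * b) - c


def vfmaddsub(a, b, c):
    """Packed form: add in odd lanes, subtract in even lanes."""
    return [x * y + z if i % 2 else x * y - z
            for i, (x, y, z) in enumerate(zip(a, b, c))]


def vfmsubadd(a, b, c):
    """Packed form: subtract in odd lanes, add in even lanes."""
    return [x * y - z if i % 2 else x * y + z
            for i, (x, y, z) in enumerate(zip(a, b, c))]


print(vfnmadd(2.0, 3.0, 1.0))                        # -5.0, not -(2*3 + 1) = -7.0
print(math.copysign(1.0, vfnmsub(0.0, 0.0, 0.0)))    # -1.0: the result is -0.0
print(vfmaddsub([1.0] * 4, [2.0] * 4, [1.0] * 4))    # [1.0, 3.0, 1.0, 3.0]
print(vfmsubadd([1.0] * 4, [2.0] * 4, [1.0] * 4))    # [3.0, 1.0, 3.0, 1.0]
```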

Explicit order of operands is included in the mnemonic using numbers "132", "213", and "231":

{| class="wikitable"
! Postfix 1 !! Operation !! Possible memory operand !! Overwrites
|-
| style="text-align:center;" | 132 || `a = a · c + b` || `c` (factor) || `a` (other factor)
|-
| style="text-align:center;" | 213 || `a = b · a + c` || `c` (summand) || `a` (factor)
|-
| style="text-align:center;" | 231 || `a = b · c + a` || `c` (factor) || `a` (summand)
|}
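One way to read the three postfixes is that the digits x, y, z name the operand positions in "operand x times operand y plus operand z", with the result always written to the first operand. The following Python sketch applies that reading to a list `[a, b, c]`; the function name `fma3` is hypothetical and ordinary arithmetic stands in for the fused operation.

```python
def fma3(order, operands):
    """Apply an FMA3 operand order such as "132" to [a, b, c].

    Digits x, y, z mean: operand x times operand y plus operand z.
    The result always replaces the first operand (a); b and c are kept.
    """
    x, y, z = (int(d) - 1 for d in order)
    result = operands[x] * operands[y] + operands[z]
    return [result] + operands[1:]


a, b, c = 2.0, 3.0, 5.0
for order in ("132", "213", "231"):
    print(order, fma3(order, [a, b, c])[0])
# 132 -> a*c + b = 13.0
# 213 -> b*a + c = 11.0
# 231 -> b*c + a = 17.0
```

Because only the third operand may come from memory, the choice of postfix determines whether a factor or the summand can be a memory operand and which input is lost.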

as well as operand format (packed or scalar) and size (single or double).

{| class="wikitable"
! Postfix 2 !! Precision !! Size !! Postfix 2 !! Precision !! Size
|-
| SS || rowspan="4" | Single || {{0|00× }}32 bit || SD || rowspan="4" | Double || {{0|0× }}64 bit
|-
| PSx || {{0}}4× 32 bit || PDx || 2× 64 bit
|-
| PSy || {{0}}8× 32 bit || PDy || 4× 64 bit
|-
| PSz || 16× 32 bit || PDz || 8× 64 bit
|}

This results in

{| class="wikitable"
! Encoding !! Mnemonic !! Operands !! Operation
|-
| `VEX.256.66.0F38.W1 98 /r` || VFMADD132PDy || rowspan="2" | ymm, ymm, ymm/m256 || rowspan="6" | `a = a · c + b`
|-
| `VEX.256.66.0F38.W0 98 /r` || VFMADD132PSy
|-
| `VEX.128.66.0F38.W1 98 /r` || VFMADD132PDx || rowspan="2" | xmm, xmm, xmm/m128
|-
| `VEX.128.66.0F38.W0 98 /r` || VFMADD132PSx
|-
| `VEX.LIG.66.0F38.W1 99 /r` || VFMADD132SD || xmm, xmm, xmm/m64
|-
| `VEX.LIG.66.0F38.W0 99 /r` || VFMADD132SS || xmm, xmm, xmm/m32
|-
| `VEX.256.66.0F38.W1 A8 /r` || VFMADD213PDy || rowspan="2" | ymm, ymm, ymm/m256 || rowspan="6" | `a = b · a + c`
|-
| `VEX.256.66.0F38.W0 A8 /r` || VFMADD213PSy
|-
| `VEX.128.66.0F38.W1 A8 /r` || VFMADD213PDx || rowspan="2" | xmm, xmm, xmm/m128
|-
| `VEX.128.66.0F38.W0 A8 /r` || VFMADD213PSx
|-
| `VEX.LIG.66.0F38.W1 A9 /r` || VFMADD213SD || xmm, xmm, xmm/m64
|-
| `VEX.LIG.66.0F38.W0 A9 /r` || VFMADD213SS || xmm, xmm, xmm/m32
|-
| `VEX.256.66.0F38.W1 B8 /r` || VFMADD231PDy || rowspan="2" | ymm, ymm, ymm/m256 || rowspan="6" | `a = b · c + a`
|-
| `VEX.256.66.0F38.W0 B8 /r` || VFMADD231PSy
|-
| `VEX.128.66.0F38.W1 B8 /r` || VFMADD231PDx || rowspan="2" | xmm, xmm, xmm/m128
|-
| `VEX.128.66.0F38.W0 B8 /r` || VFMADD231PSx
|-
| `VEX.LIG.66.0F38.W1 B9 /r` || VFMADD231SD || xmm, xmm, xmm/m64
|-
| `VEX.LIG.66.0F38.W0 B9 /r` || VFMADD231SS || xmm, xmm, xmm/m32
|}
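The regularity of the table can be made explicit with a short generator. The following Python sketch only reproduces the pattern visible in the rows above (packed opcodes 98, A8 and B8; the scalar opcode is one higher; W1 selects double precision; scalar forms use VEX.LIG). The function name `vfmadd_encoding` and its parameters are hypothetical, and it covers only the x and y widths listed in the table.

```python
OPCODE_BASE = {"132": 0x98, "213": 0xA8, "231": 0xB8}   # packed opcodes
VECTOR_LENGTH = {"x": "128", "y": "256"}


def vfmadd_encoding(order, suffix, width=""):
    """Return (mnemonic, encoding) for one VFMADD form from the table.

    suffix is "PS", "PD", "SS" or "SD"; width is "x" or "y" for packed
    forms and empty for scalar forms.
    """
    scalar = suffix[0] == "S"
    opcode = OPCODE_BASE[order] + (1 if scalar else 0)   # scalar = packed + 1
    length = "LIG" if scalar else VECTOR_LENGTH[width]
    w_bit = 1 if suffix[1] == "D" else 0                 # W1 = double precision
    mnemonic = f"VFMADD{order}{suffix}{width}"
    return mnemonic, f"VEX.{length}.66.0F38.W{w_bit} {opcode:02X} /r"


# Print the 18 rows in the same order as the table.
for order in OPCODE_BASE:
    for suffix, width in (("PD", "y"), ("PS", "y"), ("PD", "x"),
                          ("PS", "x"), ("SD", ""), ("SS", "")):
        print(*vfmadd_encoding(order, suffix, width))
```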


FMA4 instruction set

=CPUs with FMA4=

AMD
"Heavy Equipment" processors
Bulldozer-based processors, October 12, 2011{{cite web | url=http://support.amd.com/TechDocs/43479.pdf | title=AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions | date=May 1, 2009 | publisher=AMD}}
Piledriver-based processors{{cite web | url=http://developer.amd.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf | title=New "Bulldozer" and "Piledriver" Instructions A step forward for high performance software development | date=October 2012 | publisher=AMD}}
Steamroller-based processors
Excavator-based processors (including "v2")
Zen: WikiChip's testing shows that FMA4 still appears to work (under the conditions of the tests) despite not being officially supported and not being reported by CPUID. This has also been confirmed by Agner Fog.{{Cite web|url=https://www.agner.org/optimize/blog/read.php?i=838|title=Agner's CPU blog - Test results for AMD Ryzen|date=2017-05-02}} However, other tests gave wrong results. AMD's official website lists FMA4 support for the following Zen CPUs: AMD ThreadRipper 1900x, R7 Pro 1800, 1700, R5 Pro 1600, 1500, R3 Pro 1300, 1200, R3 2200G, R5 2400G.{{cite web | url=https://products.amd.com/en-us/search/cpu#Default=%7B%22k%22%3A%22%22%2C%22r%22%3A%5B%7B%22n%22%3A%22FMAOWSCHCS%22%2C%22t%22%3A%5B%22%5C%22%C7%82%C7%82464d4134%5C%22%22%5D%2C%22o%22%3A%22OR%22%2C%22k%22%3Afalse%2C%22m%22%3A%7B%22%5C%22%C7%82%C7%82464d4134%5C%22%22%3A%22FMA4%22%7D%7D%5D%7D#2d521741-4cc8-44d2-aa87-874f9bb51787=%7B%22k%22%3A%22%22%7D | title=www.amd.com, FMA4 support model list }}{{cite web | url=https://products.amd.com/en-us/search/APU/AMD-Ryzen™-Processors/AMD-Ryzen™-5-Processor-with-Radeon™-Vega-Graphics/AMD-Ryzen™-5-2400G/243 | title=www.amd.com, FMA4 support model list }}{{cite web | url=https://products.amd.com/en-us/search/APU/AMD-Ryzen™-Processors/AMD-Ryzen™-3-Processor-with-Radeon™-Vega-Graphics/AMD-Ryzen™-3-2200G/244 | title=www.amd.com, FMA4 support model list }}
Intel
Intel has not released CPUs with support for FMA4.
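Software normally detects these extensions through the feature flags reported by CPUID. As a rough illustration, the following Python sketch reads the flag list that Linux exposes in `/proc/cpuinfo`, where FMA3 appears as `fma` and FMA4 as `fma4`. Reading this file is an assumption made for the example (it is Linux-specific), and the helper names `cpu_flags` and `fma_support` are hypothetical.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def fma_support(cpuinfo_text):
    """Report which FMA variants the flag list advertises."""
    flags = cpu_flags(cpuinfo_text)
    return {"FMA3": "fma" in flags, "FMA4": "fma4" in flags}


# Hypothetical sample line; real output lists many more flags.
sample = "flags\t\t: fpu sse2 avx fma fma4 xop\n"
print(fma_support(sample))

if os.path.exists("/proc/cpuinfo"):      # Linux only
    with open("/proc/cpuinfo") as f:
        print(fma_support(f.read()))
```

A check of this kind reflects only what the processor reports: on Zen, where FMA4 is not reported by CPUID, it would show FMA4 as absent even if the instructions happen to execute.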

=Excerpt from FMA4=

{| class="wikitable"
! Mnemonic (AT&T) !! Operands !! Operation
|-
| VFMADDPDx || xmm, xmm, xmm/m128, xmm/m128 || rowspan="6" | a = b·c + d
|-
| VFMADDPDy || ymm, ymm, ymm/m256, ymm/m256
|-
| VFMADDPSx || xmm, xmm, xmm/m128, xmm/m128
|-
| VFMADDPSy || ymm, ymm, ymm/m256, ymm/m256
|-
| VFMADDSD || xmm, xmm, xmm/m64, xmm/m64
|-
| VFMADDSS || xmm, xmm, xmm/m32, xmm/m32
|}


History

The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time. The history can be summarized as follows:

August 2007: AMD announces the SSE5 instruction set, which includes 3-operand FMA instructions. A new coding scheme (DREX) is introduced for allowing instructions to have three operands.{{cite web|url=http://developer.amd.com/SSE5 |title=128-Bit SSE5 Instruction Set |publisher=AMD Developer Central |access-date=2008-01-28 |archive-url=https://web.archive.org/web/20080115163416/http://developer.amd.com/SSE5 |archive-date=2008-01-15 |url-status=dead }}
April 2008: Intel announces their AVX and FMA instruction sets, including 4-operand FMA instructions. The coding of these instructions uses the new VEX coding scheme,{{cite web | url=http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVX-Programming-Reference-31943302.pdf | title=Intel Advanced Vector Extensions Programming Reference | publisher=Intel | access-date=2008-04-05 }}{{dead link|date=September 2017 |bot=InternetArchiveBot |fix-attempted=yes }} which is more flexible than AMD's DREX scheme.
December 2008: Intel changes the specification for their FMA instructions from 4-operand to 3-operand instructions. The VEX coding scheme is still used.{{cite web | url=http://software.intel.com/en-us/avx/ | title=Intel Advanced Vector Extensions Programming Reference | publisher=Intel | access-date=2009-05-06}}
May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.{{cite web | url=http://blogs.amd.com/developer/2009/05/06/striking-a-balance/ | title=Striking a balance | date=May 6, 2009 | publisher=Dave Christie, AMD Developer blogs | archive-url=https://archive.today/20120708101459/http://blogs.amd.com/developer/2009/05/06/striking-a-balance/ | archive-date=July 8, 2012 | url-status=dead | access-date=2018-11-07}}
October 2011: AMD Bulldozer processor supports FMA4.{{cite web|title=New Bulldozer and Piledriver Instructions |url=http://developer.amd.com/wordpress/media/2012/10/New-Bulldozer-and-Piledriver-Instructions.pdf|publisher=AMD|access-date=25 July 2013}}
January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.{{cite web|title=Software Optimization Guide for AMD Family 15h Processors|url=http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf|publisher=AMD|access-date=19 April 2012}}
May 2012: AMD Piledriver processor supports both FMA3 and FMA4.
June 2013: Intel Haswell processor supports FMA3.{{cite web|title=Intel Architecture Instruction Set Extensions Programming Reference|url=http://software.intel.com/sites/default/files/319433-015.pdf|publisher=Intel|access-date=25 July 2013}}
February 2017: The first generation of AMD Ryzen processors officially supports FMA3, but not FMA4 according to the CPUID instruction.{{cite web | url=http://www.agner.org/optimize/microarchitecture.pdf | title=The microarchitecture of Intel, AMD and VIA CPUs An optimization guide for assembly programmers and compiler makers | access-date=2017-05-02}} There has been confusion regarding whether FMA4 was implemented or not on this processor due to errata in the initial patch to the GNU Binutils package that has since been rectified.{{Cite web |last=Gopalasubramanian |first=Ganesh |url=https://sourceware.org/ml/binutils/2015-03/msg00078.html |title=[PATCH] add znver1 processor. |date=2015-03-10 |access-date=2022-05-01}}{{Cite web |last=Pawar |first=Amit |url=https://sourceware.org/ml/binutils/2015-08/msg00039.html |title=[PATCH] Remove CpuFMA4 from Znver1 CPU Flags |date=2015-08-07 |access-date=2022-05-01}} One unconfirmed report of wrong results{{cite web|url=https://www.reddit.com/r/Amd/comments/68s4bj/ryzen_has_undocumented_support_for_fma4/dh0y353/|title=Discussion – Ryzen has undocumented support for FMA4|access-date=2017-05-10}} led to some doubt, but Mysticial (Alexander Yee, developer of y-cruncher) debunked it:{{cite web|url=https://stackoverflow.com/questions/57055756/arbitrary-position-2-input-shuffling-using-sse|title=Stack Overflow comment by Mysticial|date=2019-07-16|access-date=2023-09-01|archive-date=2019-08-22|archive-url=https://web.archive.org/web/20190822063407/https://stackoverflow.com/questions/57055756/arbitrary-position-2-input-shuffling-using-sse|url-status=bot: unknown}} FMA4 worked for bit-exact bignum calculations on his Zen 1 system for years, and the one report on Reddit never had any followup investigation to rule out mistakes in the testing software before being widely repeated. 
The initial Ryzen CPUs could be crashed by a particular sequence of FMA3 instructions, but updated CPU microcode fixes the problem.{{cite web|url=https://www.techpowerup.com/231536/amd-ryzen-machine-crashes-to-a-sequence-of-fma3-instructions|title=AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions|date=16 March 2017 |access-date=2017-09-10}}
July 2019: AMD Zen 2 and later Ryzen processors do not support FMA4 at all.{{cite web|url=https://stackoverflow.com/questions/57055756/arbitrary-position-2-input-shuffling-using-sse#comment100648429_57057094|title=Stack Overflow comment by Mysticial|date=2019-07-16|access-date=2023-09-01}} They continue to support FMA3. Only Zen 1 and Zen+ have unofficial FMA4 support.

Compiler and assembler support

Different compilers provide different levels of support for FMA:

GCC supports FMA4 with -mfma4 since version 4.5.0{{cite web|url=http://www.theinquirer.net/inquirer/news/2124866/amd-bulldozer-fma4-xop-instructions-supported-gcc|archive-url=https://web.archive.org/web/20111117001441/http://www.theinquirer.net/inquirer/news/2124866/amd-bulldozer-fma4-xop-instructions-supported-gcc|url-status=unfit|archive-date=November 17, 2011| title=AMD Bulldozer only FMA4 and XOP instructions are supported by GCC Intel still mute|work=The Inquirer|first=Lawrence |last=Latif|date=Nov 14, 2011}} and FMA3 with -mfma since version 4.7.0.
Microsoft Visual C++ 2010 SP1 supports FMA4 instructions.{{cite web|url=http://msdn.microsoft.com/en-us/library/vstudio/gg445134(v=vs.100).aspx|title=FMA4 Intrinsics Added for Visual Studio 2010 SP1|date=4 February 2013 }}
Microsoft Visual C++ 2012 supports FMA3 instructions (if the processor also supports the AVX2 instruction set extension).
Microsoft Visual C++ since VC 2013
PathScale supports FMA4 with -mfma.{{cite web|url=http://www.pathscale.com/node/272|title=EKOPath man doc|access-date=2013-07-24|archive-url=https://web.archive.org/web/20160623224118/http://www.pathscale.com/node/272|archive-date=2016-06-23|url-status=dead}}
LLVM 3.1 adds FMA4 support,{{cite web|url=http://llvm.org/releases/3.1/docs/ReleaseNotes.html|title=LLVM 3.1 Release Notes}} along with preliminary FMA3 support.{{cite web|url=http://llvm.org/viewvc/llvm-project?view=revision&revision=155618|title=Enable detection of AVX and AVX2 support through CPUID|date=2012-04-26|work=LLVM}}
Open64 5.0 adds "limited support".
Intel compilers support only FMA3 instructions.
NASM supports FMA3 instructions since version 2.03 and FMA4 instructions since 2.06.
FASM supports both FMA3 and FMA4 instructions.

References

Category:X86 instructions

Category:SIMD computing

Category:AMD technologies