Half-precision floating-point format

{{short description|16-bit computer number format}}

{{Distinguish|text = bfloat16, a different 16-bit floating-point format}}

In computing, half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular image processing and neural networks.

Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits. This can express values in the range ±65,504, with the minimum value above 1 being 1 + 1/1024.

Depending on the computer, half-precision can be over an order of magnitude faster than double precision, e.g. 550 PFLOPS for half-precision vs 37 PFLOPS for double precision on one cloud provider.{{Cite web|url=https://abci.ai/en/about_abci/|title=About ABCI - About ABCI {{!}} ABCI|website=abci.ai|access-date=2019-10-06}}

{{Floating-point}}

== History ==

Several earlier 16-bit floating point formats have existed including that of Hitachi's HD61810 DSP of 1982 (a 4-bit exponent and a 12-bit mantissa),{{cite web|url=https://archive.org/details/bitsavers_hitachidatlSignalProcessorUsersManual_4735688 |title=hitachi :: dataBooks :: HD61810 Digital Signal Processor Users Manual |website=Archive.org |access-date=2017-07-14}} Thomas J. Scott's WIF of 1991 (5 exponent bits, 10 mantissa bits){{cite book|last1=Scott|first1=Thomas J.|title=Proceedings of the twenty-second SIGCSE technical symposium on Computer science education - SIGCSE '91 |chapter=Mathematics and computer science at odds over real numbers |date=March 1991|volume=23|issue=1|pages=130–139|doi=10.1145/107004.107029|isbn=0897913779|s2cid=16648394|doi-access=free}} and the 3dfx Voodoo Graphics processor of 1995 (same as Hitachi).{{cite web|url=http://www.gamers.org/dEngine/xf3D/glide/glidepgm.htm |title=/home/usr/bk/glide/docs2.3.1/GLIDEPGM.DOC |website=Gamers.org |access-date=2017-07-14}}

ILM was searching for an image format that could handle a wide dynamic range, but without the hard drive and memory cost of single or double precision floating point.{{cite web |url=http://www.openexr.com/about.html |title=OpenEXR |publisher=OpenEXR |access-date=2017-07-14 |archive-date=2013-05-08 |archive-url=https://web.archive.org/web/20130508221152/http://www.openexr.com/about.html |url-status=dead }} The hardware-accelerated programmable shading group led by John Airey at SGI (Silicon Graphics) used the s10e5 data type in 1997 as part of the 'bali' design effort. This is described in a SIGGRAPH 2000 paper{{cite web|url=https://people.csail.mit.edu/ericchan/bib/pdf/p425-peercy.pdf |title=Interactive Multi-Pass Programmable Shading |author1=Mark S. Peercy |author2=Marc Olano |author3=John Airey |author4=P. Jeffrey Ungar |website=People.csail.mit.edu |access-date=2017-07-14}} (see section 4.3) and further documented in US patent 7518615.{{cite web|url=https://patents.google.com/patent/US7518615 |title=Patent US7518615 - Display system having floating point rasterization and floating point ... - Google Patents |website=Google.com |access-date=2017-07-14}} It was popularized by its use in the open-source OpenEXR image format.

Nvidia and Microsoft defined the half datatype in the Cg language, released in early 2002, and implemented it in silicon in the GeForce FX, released in late 2002.{{cite web|title=vs_2_sw|url=https://developer.download.nvidia.com/cg/vs_2_sw.html|website=Cg 3.1 Toolkit Documentation|publisher=Nvidia|access-date=17 August 2016}} However, hardware support for accelerated 16-bit floating point was later dropped by Nvidia before being reintroduced in the Tegra X1 mobile GPU in 2015.

The F16C extension in 2012 allows x86 processors to convert half-precision floats to and from single-precision floats with a machine instruction.

== IEEE 754 half-precision binary floating-point format: binary16 ==

The IEEE 754 standard{{Cite book |title=IEEE Standard for Floating-Point Arithmetic |publisher=IEEE STD 754-2019 (Revision of IEEE 754-2008) |date=July 2019 |pages=1–84 |url=https://ieeexplore.ieee.org/document/8766229 |doi=10.1109/ieeestd.2019.8766229|isbn=978-1-5044-5924-2 }} specifies a binary16 as having the following format:

* Sign bit: 1 bit
* Exponent width: 5 bits
* Significand precision: 11 bits (10 explicitly stored)

The format is laid out as follows:

[[File:IEEE 754r Half Floating Point Format.svg]]

The format is assumed to have an implicit lead bit with value 1 unless the exponent field is stored with all zeros. Thus, only 10 bits of the significand appear in the memory format but the total precision is 11 bits. In IEEE 754 parlance, there are 10 bits of significand, but there are 11 bits of significand precision (log<sub>10</sub>(2<sup>11</sup>) ≈ 3.311 decimal digits, or 4 digits ± slightly less than 5 units in the last place).

=== Exponent encoding ===

The half-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 15, also known as the exponent bias in the IEEE 754 standard.

* E<sub>min</sub> = 00001<sub>2</sub> − 01111<sub>2</sub> = −14
* E<sub>max</sub> = 11110<sub>2</sub> − 01111<sub>2</sub> = 15
* Exponent bias = 01111<sub>2</sub> = 15

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 15 has to be subtracted from the stored exponent.

The stored exponents 00000<sub>2</sub> and 11111<sub>2</sub> are interpreted specially.

class="wikitable" style="text-align:center"
ExponentSignificand = zeroSignificand ≠ zeroEquation
000002zero, −0subnormal numbers(−1)signbit × 2−14 × 0.significantbits2
000012, ..., 111102colspan=2| normalized value(−1)signbit × 2exponent−15 × 1.significantbits2
111112±infinityNaN (quiet, signaling)
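For illustration, the decoding described by this table can be written as a short C++ sketch (the decode_half helper below is hypothetical, not part of any standard library):

<syntaxhighlight lang="cpp">
#include <cmath>    // std::ldexp, INFINITY, NAN
#include <cstdint>

// Illustrative decoder for the binary16 interpretation in the table above.
double decode_half(std::uint16_t bits) {
    int sign        = (bits >> 15) & 0x1;   // 1 sign bit
    int exponent    = (bits >> 10) & 0x1F;  // 5 exponent bits, biased by 15
    int significand =  bits        & 0x3FF; // 10 explicitly stored fraction bits

    double magnitude;
    if (exponent == 0) {          // 00000: zero or subnormal, no implicit leading 1
        magnitude = std::ldexp(significand / 1024.0, -14);
    } else if (exponent == 31) {  // 11111: infinity if the fraction is zero, else NaN
        magnitude = (significand == 0) ? INFINITY : NAN;
    } else {                      // normalized: implicit leading 1, exponent - 15
        magnitude = std::ldexp(1.0 + significand / 1024.0, exponent - 15);
    }
    return sign ? -magnitude : magnitude;
}
</syntaxhighlight>

For example, decode_half(0x3c00) yields 1 and decode_half(0x7c00) yields infinity, matching the examples in the next subsection.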

The minimum strictly positive (subnormal) value is 2<sup>−24</sup> ≈ 5.96 × 10<sup>−8</sup> (the subnormal scale 2<sup>−14</sup> times the smallest nonzero fraction 2<sup>−10</sup>).

The minimum positive normal value is 2<sup>−14</sup> ≈ 6.10 × 10<sup>−5</sup>.

The maximum representable value is (2 − 2<sup>−10</sup>) × 2<sup>15</sup> = 65504.

=== Half precision examples ===

These examples are given in bit representation of the floating-point value. This includes the sign bit, (biased) exponent, and significand.

class="wikitable"
BinaryHexValueNotes
{{mono|0 00000 0000000000}}{{mono|0000}}{{math|0}}
{{mono|0 00000 0000000001}}{{mono|0001}}{{math|2−14 × (0 + {{sfrac|1|1024}} ) ≈ 0.000000059604645}}smallest positive subnormal number
{{mono|0 00000 1111111111}}{{mono|03ff}}{{math|2−14 × (0 + {{sfrac|1023|1024}} ) ≈ 0.000060975552}}largest subnormal number
{{mono|0 00001 0000000000}}{{mono|0400}}{{math|2−14 × (1 + {{sfrac|0|1024}} ) ≈ 0.00006103515625}}smallest positive normal number
{{mono|0 01101 0101010101}}{{mono|3555}}{{math|2−2 × (1 + {{sfrac|341|1024}} ) ≈ 0.33325195}}nearest value to 1/3
{{mono|0 01110 1111111111}}{{mono|3bff}}{{math|2−1 × (1 + {{sfrac|1023|1024}} ) ≈ 0.99951172}}largest number less than one
{{mono|0 01111 0000000000}}{{mono|3c00}}{{math|1=20 × (1 + {{sfrac|0|1024}} ) = 1}}one
{{mono|0 01111 0000000001}}{{mono|3c01}}{{math|20 × (1 + {{sfrac|1|1024}} ) ≈ 1.00097656}}smallest number larger than one
{{mono|0 11110 1111111111}}{{mono|7bff}}{{math|1=215 × (1 + {{sfrac|1023|1024}} ) = 65504}}largest normal number
{{mono|0 11111 0000000000}}{{mono|7c00}}{{math|∞}}infinity
{{mono|1 00000 0000000000}}{{mono|8000}}{{math|−0}}
{{mono|1 10000 0000000000}}{{mono|c000}}{{math|1=(−1)1 × 21 × (1 + {{sfrac|0|1024}} ) = −2}}
{{mono|1 11111 0000000000}}{{mono|fc00}}{{math|−∞}}negative infinity

By default, 1/3 rounds down, as it does for double precision, because of the odd number of bits in the significand. The bits beyond the rounding point are {{mono|0101}}..., which is less than 1/2 of a unit in the last place.
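As a worked check of the 1/3 entry: in binary, 1/3 is 2<sup>−2</sup> × 1.0101010101 0101...<sub>2</sub>; keeping the first ten fraction bits gives

<math display="block">2^{-2} \times \left(1 + \tfrac{341}{1024}\right) = \tfrac{1365}{4096} = 0.333251953125,</math>

slightly below 1/3, consistent with rounding down.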

=== Precision limitations ===

class="wikitable" style="text-align:right"

! Min !! Max !! interval

02−132−24
2−132−122−23
2−122−112−22
2−112−102−21
2−102−92−20
2−92−82−19
2−82−72−18
2−72−62−17
2−62−52−16
2−52−42−15
2−4{{sfrac|1|8}}2−14
{{sfrac|1|8}}{{sfrac|1|4}}2−13
{{sfrac|1|4}}{{sfrac|1|2}}2−12
{{sfrac|1|2}}12−11
122−10
242−9
482−8
8162−7
16322−6
32642−5
641282−4
128256{{sfrac|1|8}}
256512{{sfrac|1|4}}
5121024{{sfrac|1|2}}
102420481
204840962
409681924
8192163848
163843276816
327686552032
65520

65520 and larger numbers round to infinity: 65520 is exactly halfway between the largest finite value, 65504, and 65536, the value the encoding would represent next if the exponent range continued, and the round-to-even tie rule resolves this upward, overflowing to infinity. This is for round-to-even; other rounding strategies will change this cut-off.
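The effect of these intervals can be illustrated with a minimal C++ sketch, assuming a C++23 toolchain that defines std::float16_t (such as recent GCC) and the usual round-to-nearest-even conversion behaviour:

<syntaxhighlight lang="cpp">
#include <stdfloat>   // std::float16_t (C++23, where supported)
#include <iostream>

int main() {
    // Above 2048 the spacing is 2, so 2049 is not representable: it is halfway
    // between 2048 and 2050 and ties to the even neighbour, 2048.
    auto a = static_cast<std::float16_t>(2049.0f);
    std::cout << static_cast<float>(a) << '\n';   // prints 2048

    // 65504 is the largest finite value; 65520 and above overflow to infinity.
    auto b = static_cast<std::float16_t>(65520.0f);
    std::cout << static_cast<float>(b) << '\n';   // prints inf
}
</syntaxhighlight>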

== ARM alternative half-precision ==

ARM processors support (via a floating-point control register bit) an "alternative half-precision" format, which does away with the special case for an exponent value of 31 (11111<sub>2</sub>).{{cite book |chapter-url=http://infocenter.arm.com/help/topic/com.arm.doc.dui0205j/CIHGAECI.html |title=RealView Compilation Tools Compiler User Guide |chapter= Half-precision floating-point number support |date=10 December 2010 |access-date=2015-05-05}} It is almost identical to the IEEE format, but there is no encoding for infinity or NaNs; instead, an exponent of 31 encodes normalized numbers in the range 65536 to 131008, that is, (−1)<sup>signbit</sup> × 2<sup>16</sup> × 1.significantbits<sub>2</sub>, whose largest magnitude is 2<sup>16</sup> × (2 − 2<sup>−10</sup>) = 131008.
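In terms of the illustrative decoder sketched earlier (again a hypothetical helper, not ARM's interface), the alternative format simply drops the special case for an exponent of 31:

<syntaxhighlight lang="cpp">
#include <cmath>    // std::ldexp
#include <cstdint>

// Alternative half-precision (illustrative): exponent 31 is an ordinary
// normalized exponent, so there is no infinity or NaN encoding and the
// largest magnitude is 2^16 * (2 - 2^-10) = 131008.
double decode_alt_half(std::uint16_t bits) {
    int sign        = (bits >> 15) & 0x1;
    int exponent    = (bits >> 10) & 0x1F;
    int significand =  bits        & 0x3FF;

    double magnitude = (exponent == 0)
        ? std::ldexp(significand / 1024.0, -14)                  // zero or subnormal
        : std::ldexp(1.0 + significand / 1024.0, exponent - 15); // 1..31 normalized
    return sign ? -magnitude : magnitude;
}
</syntaxhighlight>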

== Uses of half precision ==

Half precision is used in several computer graphics environments to store pixels, including MATLAB, OpenEXR, JPEG XR, GIMP, OpenGL, Vulkan,{{cite web |last1=Garrard |first1=Andrew |title=10.1. 16-bit floating-point numbers |url=https://registry.khronos.org/DataFormat/specs/1.2/dataformat.1.2.html#16bitfp |website=Khronos Data Format Specification v1.2 rev 1 |publisher=Khronos |access-date=2023-08-05}} Cg, Direct3D, and D3DX. The advantage over 8-bit or 16-bit integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images, and avoids gamma correction. The advantage over 32-bit single-precision floating point is that it requires half the storage and bandwidth (at the expense of precision and range).

Half precision can be useful for mesh quantization. Mesh data is usually stored using 32-bit single-precision floats for the vertices; however, in some situations it is acceptable to reduce the precision to only 16-bit half precision, requiring only half the storage at the expense of some precision. Mesh quantization can also be done with 8-bit or 16-bit fixed precision depending on the requirements.{{cite web |title=KHR_mesh_quantization |url=https://github.com/KhronosGroup/glTF/blob/main/extensions/2.0/Khronos/KHR_mesh_quantization/README.md |website=GitHub |publisher=Khronos Group |access-date=2023-07-02}}

Hardware and software for machine learning or neural networks tend to use half precision: such applications usually do a large amount of calculation, but don't require a high level of precision. Because hardware typically does not support 16-bit half-precision floats natively, neural networks often use the bfloat16 format, which is the single-precision float format truncated to 16 bits.
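As a simplified illustration of that relationship (truncation only; real converters usually round), converting single precision to bfloat16 can be done by keeping the top 16 bits of the 32-bit pattern, since bfloat16 retains the sign bit and the full 8-bit exponent of single precision (the helper name below is purely illustrative):

<syntaxhighlight lang="cpp">
#include <cstdint>
#include <cstring>

// Keep the upper 16 bits of the float32 pattern: 1 sign bit, 8 exponent bits
// and the top 7 fraction bits. The float32 exponent range is preserved; only
// fraction precision is lost. (Production code normally rounds to nearest.)
std::uint16_t float_to_bfloat16_truncate(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<std::uint16_t>(bits >> 16);
}
</syntaxhighlight>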

If the hardware has instructions to compute half-precision math, it is often faster than single or double precision. If the system has SIMD instructions that can handle multiple floating-point numbers within one instruction, half precision can be twice as fast by operating on twice as many numbers simultaneously.{{cite web |url=https://www.comp.nus.edu.sg/~wongwf/papers/HPEC2017.pdf |title=Exploiting half precision arithmetic in Nvidia GPUs |last1=Ho |first1=Nhut-Minh |last2=Wong |first2=Weng-Fai |date=September 1, 2017 |publisher=Department of Computer Science, National University of Singapore |access-date=July 13, 2020 |quote=Nvidia recently introduced native half precision floating point support (FP16) into their Pascal GPUs. This was mainly motivated by the possibility that this will speed up data intensive and error tolerant applications in GPUs.}}

== Support by programming languages ==

Zig provides support for half precision with its f16 type.{{cite web |title=Floats |url=https://ziglang.org/documentation/master/#Floats |website=ziglang.org |access-date=7 January 2024}}

.NET 5 introduced half precision floating point numbers with the System.Half standard library type.{{Cite web |title=Half Struct (System) |url=https://learn.microsoft.com/en-us/dotnet/api/system.half |access-date=2024-02-01 |website=learn.microsoft.com |language=en-us}}{{Cite web |last=Govindarajan |first=Prashanth |date=2020-08-31 |title=Introducing the Half type! |url=https://devblogs.microsoft.com/dotnet/introducing-the-half-type/ |access-date=2024-02-01 |website=.NET Blog |language=en-US}} {{As of|January 2024}}, no .NET language (C#, F#, Visual Basic, and C++/CLI and C++/CX) has literals (e.g. in C#, 1.0f has type System.Single or 1.0m has type System.Decimal) or a keyword for the type.{{Cite web |date=2022-09-29 |title=Floating-point numeric types ― C# reference |url=https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types |access-date=2024-02-01 |website=learn.microsoft.com |language=en-us}}{{Cite web |date=2022-06-15 |title=Literals ― F# language reference |url=https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/literals |access-date=2024-02-01 |website=learn.microsoft.com |language=en-us}}{{Cite web |date=2021-09-15 |title=Data Type Summary — Visual Basic language reference |url=https://learn.microsoft.com/en-us/dotnet/visual-basic/language-reference/data-types/ |access-date=2024-02-01 |website=learn.microsoft.com |language=en-us}}

Swift introduced half-precision floating point numbers in Swift 5.3 with the [https://developer.apple.com/documentation/swift/float16 Float16] type.{{cite web |title=swift-evolution/proposals/0277-float16.md at main · apple/swift-evolution |url=https://github.com/apple/swift-evolution/blob/main/proposals/0277-float16.md |website=github.com |access-date=13 May 2024}}

OpenCL also supports half-precision floating-point numbers with the half datatype, using the IEEE 754-2008 half-precision storage format.{{cite web|title=cl_khr_fp16 extension|url=https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#cl_khr_fp16|website=registry.khronos.org|access-date=31 May 2024}}

{{As of|2024}}, Rust is working on adding a new f16 type for IEEE half-precision 16-bit floats.{{cite web |last1=Cross |first1=Travis |title=Tracking Issue for f16 and f128 float types |url=https://github.com/rust-lang/rust/issues/116909 |website=GitHub |access-date=2024-07-05}}

Julia provides support for half-precision floating point numbers with the Float16 type.{{Cite web |title=Integers and Floating-Point Numbers · The Julia Language |url=https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/ |access-date=2024-07-11 |website=docs.julialang.org}}

C++ has supported half precision since C++23 with the std::float16_t type.{{Cite web |title=P1467R9: Extended floating-point types and standard names |url=https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1467r9.html |access-date=2024-10-18 |website=www.open-std.org}} GCC already implements support for it.{{Cite web |title=106652 – [C++23] P1467 - Extended floating-point types and standard names |url=https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106652 |access-date=2024-10-18 |website=gcc.gnu.org}}
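A minimal usage sketch, assuming a C++23 toolchain whose standard library defines std::float16_t (such as recent GCC):

<syntaxhighlight lang="cpp">
#include <stdfloat>   // std::float16_t (C++23, where supported)
#include <limits>
#include <iostream>

int main() {
    std::float16_t x = 0.333f16;                 // f16 literal suffix
    std::cout << static_cast<float>(x) << '\n';  // nearest binary16 value to 0.333

    using lim = std::numeric_limits<std::float16_t>;
    std::cout << lim::digits << '\n';                     // 11 bits of precision
    std::cout << static_cast<float>(lim::max()) << '\n';  // 65504
}
</syntaxhighlight>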

== Hardware support ==

Several versions of the ARM architecture have support for half precision.{{cite web |title=Half-precision floating-point number format |url=https://developer.arm.com/documentation/100067/0607/Other-Compiler-specific-Features/Half-precision-floating-point-number-format |website=ARM Compiler armclang Reference Guide Version 6.7 |publisher=ARM Developer |access-date=13 May 2022}}

Support for conversions with half-precision floats in the x86 instruction set is specified in the F16C instruction set extension, first introduced in 2009 by AMD and fairly broadly adopted by AMD and Intel CPUs by 2012. This was further extended by the AVX-512_FP16 instruction set extension implemented in the Intel Sapphire Rapids processor.{{cite web |last1=Towner |first1=Daniel |title=Intel® Advanced Vector Extensions 512 - FP16 Instruction Set for Intel® Xeon® Processor Based Products |url=https://builders.intel.com/docs/networkbuilders/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide-1651874188.pdf |website=Intel® Builders Programs |access-date=13 May 2022}}
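A sketch of such a conversion using the F16C compiler intrinsics (assumes an x86 compiler with F16C enabled, e.g. the -mf16c flag for GCC or Clang; the function name is illustrative):

<syntaxhighlight lang="cpp">
#include <immintrin.h>
#include <cstdint>

// Convert four floats to binary16 and back, using the VCVTPS2PH / VCVTPH2PS
// instructions exposed through the F16C intrinsics.
void roundtrip_f16c(const float in[4], std::uint16_t half[4], float out[4]) {
    __m128  f = _mm_loadu_ps(in);
    __m128i h = _mm_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(half), h);  // low 64 bits = 4 halfs

    __m128 back = _mm_cvtph_ps(h);   // widen back to single precision
    _mm_storeu_ps(out, back);
}
</syntaxhighlight>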

On RISC-V, the Zfh and Zfhmin extensions provide hardware support for 16-bit half precision floats. The Zfhmin extension is a minimal alternative to Zfh.{{cite web |title=RISC-V Instruction Set Manual, Volume I: RISC-V User-Level ISA |url=https://five-embeddev.com/riscv-isa-manual/latest/zfh.html |website=Five EmbedDev |access-date=2023-07-02}}

On Power ISA, VSX and the not-yet-approved SVP64 extension provide hardware support for 16-bit half-precision floats as of PowerISA v3.1B and later.{{cite web |title=OPF_PowerISA_v3.1B.pdf |url=https://files.openpower.foundation/s/dAYSdGzTfW4j2r2 |website=OpenPOWER Files |publisher=OpenPOWER Foundation |access-date=2023-07-02}}{{cite web |title=ls005.xlen.mdwn |url=https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/rfc/ls005.xlen.mdwn;hb=5e573680771f7a041d93d394003d6f9f08177a98#l131 |website=libre-soc.org Git |access-date=2023-07-02}}

Support for half precision on IBM Z is part of the Neural-network-processing-assist facility that IBM introduced with Telum. IBM refers to half precision floating point data as NNP-Data-Type 1 (16-bit).


== References ==

{{Reflist}}

== Further reading ==

* [https://www.khronos.org/registry/DataFormat/specs/1.2/dataformat.1.2.html#16bitfp Khronos Vulkan signed 16-bit floating point format]