CDNA (microarchitecture)

{{Short description|AMD compute-focused GPU microarchitecture}}

{{Use American English|date=April 2024}}

{{Use mdy dates|date=November 2022}}

{{Infobox graphics processing unit

| name = AMD CDNA

| image = AMD CDNA wordmark.png

| img_w =

| caption =

| date =

| created = {{start date and age|2020|Nov|16|p=y|br=y}}

| manufacturer =

| designfirm = AMD

| codename =

| process = {{ubl|TSMC N7|TSMC N6|TSMC N5{{Cite web |last=Smith |first=Ryan |date=June 9, 2022 |title=AMD: Combining CDNA 3 and Zen 4 for MI300 Data Center APU in 2023 |url=https://www.anandtech.com/show/17445/amd-combining-cdna-3-and-zen-4-for-mi300-data-center-apu-in-2023 |access-date=December 20, 2022 |website=AnandTech}}}}

| fab =

| predecessor = AMD FirePro

| variant = RDNA (consumer, professional)

| successor =

}}

CDNA (Compute DNA) is a compute-centered graphics processing unit (GPU) microarchitecture designed by AMD for data centers. Used primarily in the AMD Instinct line of data center accelerators, CDNA is one of two successors to the Graphics Core Next (GCN) microarchitecture, the other being RDNA (Radeon DNA), a consumer graphics-focused microarchitecture.

The first generation of CDNA was announced on March 5, 2020,{{Cite web |last=Smith |first=Ryan |title=AMD Unveils CDNA GPU Architecture: A Dedicated GPU Architecture for Data Centers |url=https://www.anandtech.com/show/15593/amd-unveils-cdna-gpu-architecture-a-dedicated-gpu-architecture-for-data-centers |access-date=2022-09-20 |website=www.anandtech.com}} and was featured in the AMD Instinct MI100, launched on November 16, 2020.{{Cite web |title=GPU Database: AMD Radeon Instinct MI100 |url=https://www.techpowerup.com/gpu-specs/radeon-instinct-mi100.c3496 |access-date=2022-09-20 |website=TechPowerUp}} The MI100 is the only product based on CDNA 1; it is manufactured on TSMC's N7 FinFET process.

The second iteration of the CDNA line moved to a multi-chip module (MCM) design, departing from its predecessor's monolithic approach. Featured in the AMD Instinct MI250X and MI250, this MCM design uses an elevated fanout bridge (EFB){{Cite web |last=Smith |first=Ryan |title=AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond |url=https://www.anandtech.com/show/17054/amd-announces-instinct-mi200-accelerator-family-cdna2-exacale-servers |access-date=2022-09-21 |website=www.anandtech.com}} to connect the dies. These two products were announced on November 8, 2021, and launched on November 11, 2021. The CDNA 2 line was later joined by the MI210, a PCIe card built around a single die.{{Cite web |last=Smith |first=Ryan |title=AMD Releases Instinct MI210 Accelerator: CDNA 2 On a PCIe Card |url=https://www.anandtech.com/show/17326/amd-releases-instinct-mi210-accelerator-cdna-2-on-a-pcie-card |access-date=2022-09-21 |website=www.anandtech.com}} The MI250X and MI250 were the first AMD products to use the Open Compute Project (OCP)'s OCP Accelerator Module (OAM) socket form factor; lower-wattage PCIe versions are also available.

The third iteration of CDNA uses an MCM design built from chiplets manufactured on multiple process nodes. The line currently consists of the MI300X and MI300A, which combine compute, I/O, and memory dies connected with advanced 3D packaging techniques. The MI300 series was announced on January 5, 2023, and launched on December 6, 2023.

CDNA 1

{{Infobox graphics processing unit

| name = AMD CDNA 1

| image =

| caption =

| codename =

| created = {{start date and age|2020|Nov|16|p=y|br=y}}

| fab =

| process = TSMC N7 (FinFET)

| predecessor = AMD FirePro

| successor = CDNA 2

}}

The first-generation CDNA family consists of a single die, codenamed Arcturus. The die measures 750 mm², contains 25.6 billion transistors, and is manufactured on TSMC's N7 node.{{Cite web |last=Kennedy |first=Patrick |date=2020-11-16 |title=AMD Instinct MI100 32GB CDNA GPU Launched |url=https://www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/ |access-date=2022-09-22 |website=ServeTheHome |language=en-US}} Arcturus has 120 compute units and a 4096-bit memory bus connected to four HBM2 stacks, giving the die 32 GB of memory and just over 1,200 GB/s of memory bandwidth. Compared with its predecessor, CDNA removes all hardware related to graphics acceleration, including graphics caches, tessellation hardware, render output units (ROPs), and the display engine. CDNA retains the VCN media engine for HEVC, H.264, and VP9 decoding.{{Cite web |date=2020-03-05 |title=AMD CDNA Whitepaper |url=https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf |access-date=2022-09-22 |website=amd.com}} CDNA also adds dedicated matrix compute hardware, similar to that introduced in Nvidia's Volta architecture.

= Architecture =

The 120 compute units (CUs) are organized into four asynchronous compute engines (ACEs), each maintaining its own independent command execution and dispatch. At the CU level, CDNA compute units are organized similarly to GCN compute units: each CU contains four SIMD16 units, and each SIMD executes a 64-thread wavefront (Wave64) over four cycles.
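This organization directly implies the MI100's peak vector FP32 rate: 120 CUs × 4 SIMD16 units × 16 lanes gives 7,680 lanes, each retiring one fused multiply-add (two FLOPs) per clock. A minimal sketch of the arithmetic, assuming the MI100's published 1,502 MHz peak engine clock:

<syntaxhighlight lang="python">
# Peak vector FP32 throughput implied by the CDNA 1 organization.
cus = 120            # compute units in Arcturus
simds_per_cu = 4     # SIMD16 units per CU
lanes_per_simd = 16  # FP32 lanes per SIMD16 unit
flops_per_fma = 2    # a fused multiply-add counts as two FLOPs
clock_hz = 1.502e9   # MI100 published peak engine clock

peak_fp32 = cus * simds_per_cu * lanes_per_simd * flops_per_fma * clock_hz
print(f"{peak_fp32 / 1e12:.1f} TFLOPS")  # ~23.1 TFLOPS, the MI100's listed peak
</syntaxhighlight>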

== Memory system ==

CDNA runs its HBM2 with a 20% clock bump, resulting in a bandwidth increase of roughly 200 GB/s over Vega 20 (GCN 5.0). The die has a shared 4 MB L2 cache that delivers 2 KB per clock to the CUs. Each CU has its own L1 cache and a 64 KB local data share (LDS), while a 4 KB global data share (GDS) is shared by all CUs. The GDS can be used to store control data, perform reduction operations, or act as a small global shared surface.{{Cite web |date=2020-12-14 |title="AMD Instinct MI100" Instruction Set Architecture, Reference Guide |url=https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf |access-date=2022-09-22 |website=developer.amd.com}}
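Both bandwidth figures follow from the bus width and per-pin data rate: the MI100 runs its HBM2 at 2.4 Gbit/s per pin, a 20% bump over Vega 20's 2.0 Gbit/s. A worked sketch using those published rates:

<syntaxhighlight lang="python">
# Memory bandwidth from bus width and per-pin data rate.
bus_bits = 4096         # four HBM2 stacks x 1024 bits each

vega20_rate = 2.0e9     # Vega 20 (GCN 5.0): 2.0 Gbit/s per pin
mi100_rate = 2.4e9      # MI100: 2.4 Gbit/s per pin (the 20% clock bump)

vega20_bw = bus_bits * vega20_rate / 8  # ~1,024 GB/s
mi100_bw = bus_bits * mi100_rate / 8    # ~1,228.8 GB/s ("just over 1,200 GB/s")
print(f"increase: {(mi100_bw - vega20_bw) / 1e9:.0f} GB/s")  # ~205 GB/s
</syntaxhighlight>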

=== Experimental PIM implementation ===

In October 2022, Samsung demonstrated a specialized processing-in-memory (PIM) version of the MI100. In December 2022, Samsung demonstrated a cluster of 96 modified MI100s, claiming large increases in processing throughput across various workloads along with significant reductions in power consumption.{{Cite web |author1=Aaron Klotz |date=2022-12-14 |title=Samsung Soups Up 96 AMD MI100 GPUs With Radical Computational Memory |url=https://www.tomshardware.com/news/samsung-modifies-amd-mi100-accelerator-gpus-with-pim |access-date=2022-12-23 |website=Tom's Hardware |language=en}}

= Changes from GCN =

The individual compute units remain highly similar to GCN's, but with the addition of four matrix units per CU. Support for additional datatypes was added, including BF16, INT8, and INT4. An extensive list of the operations that use the matrix units and new datatypes is given in the [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA ISA Reference Guide].
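BF16 keeps FP32's sign bit and 8-bit exponent but shortens the mantissa to 7 bits, preserving FP32's numeric range at half the storage. A generic illustration of the format (this sketch truncates for simplicity; hardware conversions typically round):

<syntaxhighlight lang="python">
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the float as its 32-bit pattern and keep the top 16 bits:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    # Zero-extend the discarded mantissa bits to recover a float.
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625
</syntaxhighlight>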

= Products =

{{AMD CDNA Products}}

CDNA 2

{{Infobox graphics processing unit

| name = AMD CDNA 2

| image =

| caption =

| codename =

| created = {{start date and age|2021|Nov|8|p=y|br=y}}

| fab =

| process = TSMC N6

| predecessor = CDNA 1

| successor = CDNA 3

}}

Like CDNA, CDNA 2 consists of a single die, named Aldebaran. The die is estimated at 790 mm² and contains 28 billion transistors, manufactured on TSMC's N6 node.{{Cite web |author1=Anton Shilov |date=2021-11-17 |title=AMD's Instinct MI250X OAM Card Pictured: Aldebaran's Massive Die Revealed |url=https://www.tomshardware.com/news/amd-instinct-mi250x-pictured |access-date=2022-11-20 |website=Tom's Hardware |language=en}} Aldebaran contains 112 compute units, a 6.67% decrease from Arcturus. Like the previous generation, the die has a 4096-bit memory bus, now using HBM2e with double the capacity, up to 64 GB. The largest change in CDNA 2 is the ability to place two dies on the same package: the MI250X combines two Aldebaran dies for 220 CUs (110 per die) and 128 GB of HBM2e. The dies are connected by four Infinity Fabric links and are addressed as independent GPUs by the host system.

= Architecture =

The 112 CUs are organized similarly to CDNA's, into four asynchronous compute engines, each now with 28 CUs instead of the prior generation's 30. Like CDNA, each CU contains four SIMD16 units executing a 64-thread wavefront over four cycles. The four matrix engines and the vector units add support for full-rate FP64, enabling a significant uplift over the prior generation. CDNA 2 also revises multiple internal caches, doubling bandwidth across the board.
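Full-rate FP64 means each lane retires one FP64 fused multiply-add per clock, which reproduces the MI250X's published peak. A minimal sketch, assuming the MI250X's published 1,700 MHz peak engine clock:

<syntaxhighlight lang="python">
# Peak FP64 throughput of the MI250X under full-rate FP64.
cus = 220          # total CUs across both Aldebaran dies (110 per die)
lanes_per_cu = 64  # 4 SIMD16 units x 16 lanes
flops_per_fma = 2  # a fused multiply-add counts as two FLOPs
clock_hz = 1.7e9   # MI250X published peak engine clock

vector_fp64 = cus * lanes_per_cu * flops_per_fma * clock_hz
print(f"{vector_fp64 / 1e12:.1f} TFLOPS")      # ~47.9 TFLOPS vector
print(f"{2 * vector_fp64 / 1e12:.1f} TFLOPS")  # ~95.7 TFLOPS via the matrix units
</syntaxhighlight>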

== Memory system ==

The memory system in CDNA 2 improves across the board, starting with the move to HBM2e, which doubles capacity to 64 GB and increases bandwidth by roughly one third (from roughly 1,200 GB/s to 1,600 GB/s). At the cache level, each graphics compute die (GCD) has a 16-way, 8 MB L2 cache partitioned into 32 slices, delivering 4 KB per clock (128 B per clock per slice), double the bandwidth of CDNA. The 4 KB global data share was removed. All caches, including the L2 and the LDS, add support for FP64 data.

== Interconnect ==

CDNA 2 brought the first AMD product with multiple GPU dies on the same package. The two dies are connected by four Infinity Fabric links with a total bidirectional bandwidth of 400 GB/s.{{Cite web |title=INTRODUCING AMD CDNA™ 2 ARCHITECTURE |url=https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf |access-date=November 20, 2022 |website=AMD.com}} Each die contains eight Infinity Fabric links, each physically implemented as a 16-lane Infinity Link. When paired with an AMD processor, these links run the Infinity Fabric protocol; when paired with any other x86 processor, they fall back to 16 lanes of PCIe 4.0.
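These figures imply 100 GB/s of bidirectional bandwidth per link; the PCIe 4.0 fallback is considerably slower. A back-of-the-envelope comparison (the PCIe figures are the standard's nominal rates, not AMD-specific numbers):

<syntaxhighlight lang="python">
# Per-link Infinity Fabric bandwidth vs. the PCIe 4.0 x16 fallback.
total_if_bidir = 400e9  # bytes/s across the 4 links between the two dies
if_links = 4
per_link_bidir = total_if_bidir / if_links  # 100 GB/s per link

# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding.
pcie4_lane = 16e9 * (128 / 130) / 8    # ~1.97 GB/s per lane, per direction
pcie4_x16_bidir = pcie4_lane * 16 * 2  # ~63 GB/s bidirectional
print(per_link_bidir / 1e9, round(pcie4_x16_bidir / 1e9))  # 100.0 vs ~63 GB/s
</syntaxhighlight>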

= Changes from CDNA =

The largest upfront change is the addition of full-rate FP64 support across all compute elements. This results in a 4x increase in FP64 matrix throughput, along with large increases in FP64 vector throughput.{{Cite web |date=2022-09-18 |title=Hot Chips 34 – AMD's Instinct MI200 Architecture |url=https://chipsandcheese.com/2022/09/18/hot-chips-34-amds-instinct-mi200-architecture/ |access-date=2022-11-10 |website=Chips and Cheese |language=en-US}} Support for packed FP32 operations was also added, with opcodes such as 'V_PK_FMA_F32' and 'V_PK_MUL_F32'.{{Cite web |date=2022-02-04 |title="AMD Instinct MI200" Instruction Set Architecture |url=https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf |access-date=2022-10-11 |website=developer.amd.com}} Packed FP32 operations can enable up to 2x throughput but require code modification. As with CDNA, further information on CDNA 2 operations is given in the [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA 2 ISA Reference Guide].
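Packed instructions such as V_PK_FMA_F32 operate on a pair of FP32 values per lane per issue, which is where the 2x figure comes from. A throughput sketch reusing the MI250X numbers from above:

<syntaxhighlight lang="python">
# Effect of packed FP32 (e.g. V_PK_FMA_F32) on peak throughput.
cus, lanes_per_cu, flops_per_fma, clock_hz = 220, 64, 2, 1.7e9

scalar_fp32 = cus * lanes_per_cu * flops_per_fma * clock_hz  # one FP32 per lane
packed_fp32 = scalar_fp32 * 2  # two FP32 values per lane per instruction
print(f"{scalar_fp32 / 1e12:.1f} -> {packed_fp32 / 1e12:.1f} TFLOPS")  # 47.9 -> 95.7
</syntaxhighlight>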

= Products =


class="wikitable"

|+AMD Instinct CDNA 2 GPU generations MI-2xx

! rowspan="2" |Accelerator

! rowspan="2" |Launch date

! rowspan="2" |Architecture

! rowspan="2" |Lithography

! rowspan="2" |Compute Units

! colspan="3" |Memory

! rowspan="2" |PCIe support

! rowspan="2" |Form factor

! colspan="8" |Processing power

! rowspan="2" |TBP

Size

! Type

!Bandwidth (GB/s)

! FP16

! BF16

! FP32

! FP32 matrix

! FP64 performance

! FP64 matrix

! INT8

! INT4

MI210

|2022-03-22{{Cite web |last=Smith |first=Ryan |title=AMD Releases Instinct MI210 Accelerator: CDNA 2 On a PCIe Card |url=https://www.anandtech.com/show/17326/amd-releases-instinct-mi210-accelerator-cdna-2-on-a-pcie-card |access-date=2024-06-03 |website=www.anandtech.com}}

| rowspan="3" |CDNA 2

| rowspan="3" |6 nm

| 104

| 64 GB

| rowspan="3" |HBM2E

|1600

|

|

| colspan="2" |181 TFLOPS

| 22.6 TFLOPS

| 45.3 TFLOPS

| 22.6 TFLOPS

| 45.3 TFLOPS

| colspan="2" |181 TOPS

| 300 W

MI250

| rowspan="2" |2021-11-08{{Cite web |last=Smith |first=Ryan |title=AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond |url=https://www.anandtech.com/show/17054/amd-announces-instinct-mi200-accelerator-family-cdna2-exacale-servers |access-date=2024-06-03 |website=www.anandtech.com}}

| 208

| rowspan="2" |128 GB

| rowspan="2" |3200

| rowspan="2" |OAM

|

| colspan="2" |362.1 TFLOPS

| 45.3 TFLOPS

| 90.5 TFLOPS

| 45.3 TFLOPS

| 90.5 TFLOPS

| colspan="2" |362.1 TOPS

| 560 W

MI250X

| 220

|

| colspan="2" |383 TFLOPS

| 47.92 TFLOPS

| 95.7 TFLOPS

| 47.9 TFLOPS

| 95.7 TFLOPS

| colspan="2" |383 TOPS

| 560 W

CDNA 3

{{Infobox graphics processing unit

| name = AMD CDNA 3

| image =

| caption =

| codename =

| created = {{start date and age|2023|Dec|6|p=y|br=y}}

| fab =

| process = TSMC N5 & N6

| predecessor = CDNA 2

| successor =

}}

Unlike its predecessors, CDNA 3 consists of multiple dies in a multi-chip package, similar to AMD's Zen 2, 3, and 4 lines of products. The MI300 package is comparatively massive, with nine chiplets produced on 5 nm placed on top of four 6 nm chiplets.{{Cite web |last=Smith |first=Ryan |title=CES 2023: AMD Instinct MI300 Data Center APU Silicon In Hand - 146B Transistors, Shipping H2'23 |url=https://www.anandtech.com/show/18721/ces-2023-amd-instinct-mi300-data-center-apu-silicon-in-hand-146b-transistors-shipping-h223 |access-date=2023-01-22 |website=www.anandtech.com}} This is combined with 128 GB of HBM3 in eight HBM stacks.{{Cite web |author1=Paul Alcorn |date=2023-01-05 |title=AMD Instinct MI300 Data Center APU Pictured Up Close: 13 Chiplets, 146 Billion Transistors |url=https://www.tomshardware.com/news/amd-instinct-mi300-data-center-apu-pictured-up-close-15-chiplets-146-billion-transistors |access-date=2023-01-22 |website=Tom's Hardware |language=en}} The package contains an estimated 146 billion transistors. CDNA 3 ships as the Instinct MI300X and MI300A, the latter being an APU. These products were launched on December 6, 2023.{{Cite web |last=Kennedy |first=Patrick |date=2023-12-06 |title=AMD Instinct MI300X GPU and MI300A APUs Launched for AI Era |url=https://www.servethehome.com/amd-instinct-mi300x-gpu-and-mi300a-apus-launched-for-ai-era/ |access-date=2024-04-15 |website=ServeTheHome |language=en-US}}
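With eight stacks on a 1,024-bit interface each, the aggregate memory bus is 8,192 bits wide. The roughly 5.3 TB/s quoted for the MI300 series follows from a per-pin rate of about 5.2 Gbit/s; that rate is an assumption here, chosen to be consistent with the published bandwidth:

<syntaxhighlight lang="python">
# MI300-series HBM3 bandwidth from stack count and per-pin rate.
stacks = 8
bits_per_stack = 1024  # HBM3 interface width per stack
pin_rate = 5.2e9       # assumed Gbit/s per pin, consistent with ~5.3 TB/s

bandwidth = stacks * bits_per_stack * pin_rate / 8
print(f"{bandwidth / 1e12:.2f} TB/s")              # ~5.32 TB/s
print(192 / stacks, "GB per stack on the MI300X")  # 24.0
</syntaxhighlight>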

= Products =

class="wikitable"

|+AMD Instinct CDNA 3 GPU generations - MI-3xx

! rowspan="2" |Accelerator

! rowspan="2" |Launch date

! rowspan="2" |Architecture

! rowspan="2" |Lithography

! rowspan="2" |Compute Units

! colspan="3" |Memory

! rowspan="2" |PCIe support

! rowspan="2" |Form factor

! colspan="8" |Processing power

! rowspan="2" |TBP

Size

! Type

!Bandwidth (GB/s)

! FP16

! BF16

! FP32

! FP32 matrix

! FP64 performance

! FP64 matrix

! INT8

! INT4

MI300A

| rowspan="2" |2023-12-06{{Cite web |last=Bonshor |first=Ryan Smith, Gavin |title=The AMD Advancing AI & Instinct MI300 Launch Live Blog (Starts at 10am PT/18:00 UTC) |url=https://www.anandtech.com/show/21181/the-amd-advancing-ai-live-blog-starts-at-10am-pt1800-utc |access-date=2024-06-03 |website=www.anandtech.com}}

| rowspan="3" |CDNA 3

| rowspan="3" |6 & 5 nm

| 228

| 128 GB

| rowspan="2" |HBM3

| rowspan="2" |5300

| rowspan="3" |5.0

| APU SH5 socket

| colspan="2" |980.6 TFLOPS
1961.2 TFLOPS (with Sparsity)

| colspan="2" |122.6 TFLOPS

| 61.3 TFLOPS

| 122.6 TFLOPS

| 1961.2 TOPS
3922.3 TOPS (with Sparsity)

| N/A

| 550 W
760 W (with liquid cooling)

MI300X

| rowspan="2" | 304

| 192 GB

| rowspan="2" | OAM

| colspan="2" rowspan="2" |1307.4 TFLOPS
2614.9 TFLOPS (with Sparsity)

| colspan="2" rowspan="2" |163.4 TFLOPS

| rowspan="2" | 81.7 TFLOPS

| rowspan="2" | 163.4 TFLOPS

| rowspan="2" | 2614.9 TOPS
5229.8 TOPS (with Sparsity)

| rowspan="2" | N/A

| rowspan="2" | 750 W

MI325X

|2024-10-10{{Cite web |last=Smith |first=Ryan |title=AMD Plans Massive Memory Instinct MI325X for Q4'24, Lays Out Accelerator Roadmap to 2026 |url=https://www.anandtech.com/show/21422/amd-instinct-mi325x-reveal-and-cdna-architecture-roadmap-computex |access-date=2024-06-03 |website=www.anandtech.com}}

|256 GB

|HBM3E

|6000

Product Comparisons

{{AMD Instinct Comparisons}}


References

{{reflist}}