Better, more capable than expected: RDNA 4 architecture deep dive

RDNA 4 architecture: New features and innovations in new Radeon GPUs detailed

Earlier unofficial leaks didn’t paint the RDNA 4 architecture as a major new design, suggesting it would be more akin to an RDNA 3 bugfix, except for new ray tracing units. But that turned out to be a big misconception: RDNA 4 is a significant upgrade that leaves no GPU subsystem untouched and goes far beyond just adding new ray tracing units. It also brings enhanced AI acceleration and redesigned compute units (shaders).

In the end, RDNA 4 appears to be a full-fledged new-generation architecture, much like Nvidia’s Blackwell (though AMD has not introduced support for next-generation memory). The generation will probably always have an image problem because it is a mainstream-only series (AMD cancelled the more powerful chiplet-based variants, for reasons that remain undisclosed), but that doesn’t really change this.

According to older leaked information (not officially confirmed), the original plan was for the RDNA 4 generation to use a monolithic die only for the smaller, cheaper Navi 44 GPU, while cards like the Radeon RX 9070 and 9070 XT were supposed to feature a chiplet-based GPU (essentially a successor to Navi 32). It seems that after scrapping the chiplet GPUs, AMD designed a monolithic GPU as a plan B replacement, apparently by effectively doubling the specifications of Navi 44, possibly to speed up development. It is this GPU, Navi 48, that is now hitting the market.

Navi 48

The Navi 48 chip is a monolithic piece of silicon manufactured on TSMC’s 4nm process node, though the exact variant of this node remains unspecified (Chips and Cheese reports it’s N4P). N4 nodes offer slight improvements over the 5nm process node used in RDNA 3 GPUs (except for the 6nm Navi 33), but their benefits are limited and cannot be compared to the advancements a 3nm process would bring.

According to AMD, the chip has a die size of 356.5 mm², making it smaller than Nvidia’s GB203, which powers the RTX 5080 (the fully enabled version) and RTX 5070 Ti (a cut-down version that allows for harvesting defective chips). The GB203 die is 6% larger at 378 mm², containing 45.6 billion transistors. However, somewhat unexpectedly, Navi 48 packs more transistors (53.9 billion), resulting in higher transistor density. This difference likely reflects design choices that differ from Nvidia’s. For example, lower density in certain areas can improve clock speeds, while higher density may be chosen with cost-effectiveness in mind. However, transistor counts may not be directly comparable across different companies, so it’s unclear what to really take from this information.
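The density comparison can be checked with a little arithmetic. The following is a minimal sketch that uses only the die sizes and transistor counts quoted above; the function and variable names are our own, and the result is an average over the whole die, which says nothing about how density varies between blocks.

```python
# Figures quoted in the text: die area in mm^2, transistor count in billions.
chips = {
    "Navi 48": {"area_mm2": 356.5, "transistors_bn": 53.9},
    "GB203": {"area_mm2": 378.0, "transistors_bn": 45.6},
}


def density_mtr_per_mm2(area_mm2, transistors_bn):
    """Average density in millions of transistors per mm^2."""
    return transistors_bn * 1000 / area_mm2


densities = {name: density_mtr_per_mm2(**spec) for name, spec in chips.items()}
area_ratio = chips["GB203"]["area_mm2"] / chips["Navi 48"]["area_mm2"]
density_ratio = densities["Navi 48"] / densities["GB203"]

for name, value in densities.items():
    print(f"{name}: {value:.1f} MTr/mm^2")
print(f"GB203 is {100 * (area_ratio - 1):.1f}% larger by area")
print(f"Navi 48 is {100 * (density_ratio - 1):.1f}% denser on average")
```

With these inputs, Navi 48 comes out at roughly 151 million transistors per mm² against roughly 121 million for GB203, subject to the caveat that the two companies may not count transistors the same way.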

CUs and shaders

The RDNA 4 architecture delivers significantly higher performance per compute unit (according to AMD, up to 40% uplift), largely due to much higher clock speeds. However, architectural improvements also play a significant role. These improvements start with a redesigned Compute Unit (CU). The RDNA 4 architecture retains the same CU structure as its predecessor, featuring 64 shaders (two SIMD32 engines) and maintaining the dual-issue capability from RDNA 3, allowing certain operations to be executed at a rate of two instructions per cycle instead of one.

Out of Order Memory

AMD has also overhauled the memory subsystem within the CU, which is basically the equivalent of load/store units found in CPU cores. The new system can now process memory requests from different shaders out of order. This improvement has a significant impact on performance, particularly in scenarios where shaders are waiting for data. In the previous RDNA 3 architecture, a cache miss (when data needs to be fetched from graphics memory) caused execution to stall until the data arrived.

RDNA 4 introduces out-of-order queues for memory requests, allowing one SIMD32 unit within a CU to continue executing instructions while the other waits for data. This should enhance real-world performance per CU. The improvements will likely be noticeable across a wide range of workloads, though some shaders may not see significant gains, if they are already optimized to minimize memory latency impact. However, AMD notes that ray tracing performance is particularly sensitive to memory latency, so this enhancement should lead to noticeable improvements in ray tracing workloads.
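To make the effect concrete, here is a toy model of in-order versus out-of-order return of memory requests. It is only a sketch of the general idea, not a description of AMD’s hardware: the function name, the two-wave scenario and the latencies (500 cycles for a miss, 40 for a hit) are all invented for illustration.

```python
def delivery_cycles(requests, in_order):
    """Return {wave: cycle at which its data is delivered}.

    requests: list of (wave, issue_cycle, latency_cycles) in issue order.
    in_order=True forces results to be returned in issue order.
    """
    delivered = {}
    previous = 0
    for wave, issue_cycle, latency in requests:
        ready = issue_cycle + latency
        if in_order:
            # A younger request cannot be returned before an older one.
            ready = max(ready, previous)
            previous = ready
        delivered[wave] = ready
    return delivered


# Hypothetical latencies: wave A misses the cache, wave B hits it.
requests = [("wave A", 0, 500), ("wave B", 1, 40)]

strict = delivery_cycles(requests, in_order=True)
relaxed = delivery_cycles(requests, in_order=False)

print("in order:    ", strict)
print("out of order:", relaxed)
```

In the in-order case, wave B’s fast request is held back behind wave A’s slow one, so both resume at cycle 500. In the out-of-order case, wave B resumes at cycle 41 while wave A is still waiting.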

Dynamic Registers

A similar optimization, which does not change theoretical peak performance but allows for better practical utilization of the existing resources, has happened in register allocation. Previous RDNA architectures used fixed register allocation for shader programs (which means any compute tasks running on the GPU), the allocation being “worst-case” based, reflecting the maximum number of registers required by a shader at any point in execution. This could result in inefficient utilization, as some registers might remain unused at times while preventing the GPU from processing additional shaders due to lack of free registers.

RDNA 4 introduces dynamic register allocation, allowing the GPU to process more shaders in parallel under certain conditions, increasing overall utilization. Instead of a fixed worst-case allocation, shaders receive a lower base register allocation and can dynamically request and release additional registers as needed during execution.

According to AMD, dynamic allocation can significantly improve compute unit efficiency. This applies not only to traditional shaders but also to workloads such as ray tracing. For example, during BVH traversal, fewer registers are required, but their demand increases when processing the final ray-tracing results. With fixed allocation, such shader tasks would block a large number of registers throughout their relatively long-lasting execution time, whereas RDNA 4’s dynamic allocation might allow other tasks to run simultaneously.
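The occupancy argument can be sketched with a simple calculation. Everything numeric below is hypothetical (register file size, base and peak register needs, and how many shaders sit in their register-hungry phase at once); the point is only to show why a lower base allocation lets more shaders stay resident.

```python
def fixed_occupancy(register_file, peak_regs):
    """Shaders resident when each one reserves its worst case up front."""
    return register_file // peak_regs


def dynamic_occupancy(register_file, base_regs, peak_regs, shaders_at_peak):
    """Shaders resident when each holds base_regs and only
    shaders_at_peak of them hold the full peak at the same time."""
    reserved_for_peaks = shaders_at_peak * (peak_regs - base_regs)
    return max((register_file - reserved_for_peaks) // base_regs, 0)


# Hypothetical numbers, chosen only for illustration.
REGISTER_FILE = 1024   # registers available to one SIMD
BASE, PEAK = 64, 128   # e.g. BVH traversal phase vs. final shading phase

fixed = fixed_occupancy(REGISTER_FILE, PEAK)
dynamic = dynamic_occupancy(REGISTER_FILE, BASE, PEAK, shaders_at_peak=2)

print(f"fixed worst-case allocation: {fixed} shaders in flight")
print(f"dynamic allocation:          {dynamic} shaders in flight")
```

Under these assumptions, fixed allocation fits 8 shaders while dynamic allocation fits 14. The gain depends entirely on how rarely shaders need their peak allocation at the same time.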

However, this optimization does not happen automatically for shaders that do not explicitly implement dynamic allocation, as requesting and releasing registers must be handled by the shader program itself. It is unclear whether drivers will be able to automatically introduce dynamic allocation during shader compilation (which GPU drivers always do in games).

The RDNA 4 architecture should also feature an improved scalar unit within the CU, which now supports FP32 data. The scheduling of operations to compute units is also said to be improved. RDNA 4 introduces new barrier ops, faster data loading into registers (fill) and writing back to memory (spill), as well as enhanced prefetching. Prefetching is a crucial component for both CPUs and GPUs, as it allows data to be loaded from memory in advance, ensuring that the running program has it available when requested, rather than stalling execution while waiting for it.

2× higher-performance, advanced ray tracing

The biggest highlight of the RDNA 4 architecture is the improved ray tracing acceleration through Ray Accelerators. This is something that was already somewhat teased by the PlayStation 5 Pro, whose semicustom APU (processor with integrated GPU) was partially based on a development version of RDNA 4.

The first change is that the performance of Ray Accelerators (specialized units for ray tracing, analogous to Nvidia’s “RT cores”) in testing intersections has been doubled, which is achieved by duplicating the engine used for these operations. RDNA 4 can test 8 intersections of a ray with the boxes of the bounding volume hierarchy (BVH) per cycle. The BVH envelops objects in the scene and serves as a tool for locating the triangle of the object that the ray actually collides with. After that, the intersection of the ray with the triangle is tested, and RDNA 4 can perform 2 such ray/triangle tests per cycle. In RDNA 2 and 3, the rates were 4× ray/box and 1× ray/triangle per cycle.

It’s interesting to compare this with competing architectures. Ada Lovelace GPUs tested (it seems) 4× boxes and 4× triangles per cycle, while Blackwell likely handles 8× box/ray and 8× triangle/ray intersections (though recently the company has only provided figures for the latter type of ray testing, so we can’t be certain). It seems Nvidia has chosen a significantly different balance. By the way, Intel’s previous Xe HPG “Alchemist” GPU architecture could already handle 12× box/ray and 1× triangle/ray intersection tests. The Xe2 architecture used in the Arc B580 and B570 “Battlemage” graphics cards ups it to 18× box/ray and 2× triangle/ray tests per cycle. It’s interesting that Intel’s balance of these units is more similar to AMD’s, or even goes further.

The Ray Accelerators also have other improvements. There is now specialized hardware for performing Instance Transform operations and managing the ray tracing operation stack. Additionally, ray tracing performance benefits from some general improvements in shader performance (shaders handle some portion of ray tracing work in all GPU architectures, with fixed-function hardware covering only part of it). As mentioned earlier, AMD highlights the positive impact of out-of-order memory request handling and dynamic register allocation in this context. The result is that Ray Accelerators can actually realise their full potential more often, thanks to fewer idle pipeline bubbles and less time lost to cache misses.

Oriented Bounding Boxes

Additionally, RDNA 4 introduces improvements when working with the BVH structure. It can now be built to handle eight intersections per cycle (BVH8), which is said to be more efficient than the currently used BVH4. Notably, RDNA 4 also allows the boxes around objects to be rotated. This feature (called Oriented Bounding Boxes) can reduce the number of boxes generated for certain objects in a scene if they are regular but tilted relative to the global axes. Instead of boxes with edges aligned to the global axes, RDNA 4 enables the use of tilted boxes, which store data about their rotation (the ray’s orientation is then transformed during intersection testing to match). This can reduce the number of boxes generated during BVH buildup, allowing BVH traversal to be completed with less work.

Rotated boxes also envelop objects more tightly, reducing false positives where a ray collides with the box but not the triangle the box surrounds. The usefulness of this feature will naturally vary depending on how objects in the scene are oriented – its impact is significant for tilted objects but smaller or even negligible for the rest of the scene, if it is all right-angle. Overall, AMD claims this feature can improve ray tracing performance by up to 10%.
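How much tighter a rotated box can be is easy to illustrate in two dimensions. The sketch below compares the area of an axis-aligned box with that of an oriented box around a thin, tilted rectangle; the object’s dimensions and the 45° angle are invented for the example, and real BVHs of course work with 3D volumes.

```python
import math


def rotated_corners(length, width, angle_deg):
    """Corners of a length x width rectangle rotated about its centre."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    half_l, half_w = length / 2, width / 2
    local = [(-half_l, -half_w), (half_l, -half_w),
             (half_l, half_w), (-half_l, half_w)]
    return [(x * c - y * s, x * s + y * c) for x, y in local]


def aabb_area(points):
    """Area of the axis-aligned box that encloses the points."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical object: a thin 10 x 1 plank tilted by 45 degrees.
LENGTH, WIDTH, ANGLE = 10.0, 1.0, 45.0

axis_aligned = aabb_area(rotated_corners(LENGTH, WIDTH, ANGLE))
oriented = LENGTH * WIDTH  # an oriented box can match the object exactly

print(f"axis-aligned box area: {axis_aligned:.1f}")
print(f"oriented box area:     {oriented:.1f}")
print(f"ratio: {axis_aligned / oriented:.2f}x")
```

For this plank the axis-aligned box covers about six times the area of the oriented one, and all of that extra area is space where a ray would hit the box but miss the object. At a rotation of 0° the two boxes are identical, which matches the point that right-angle scenes gain little.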

In the architecture presentation, AMD includes a graph illustrating how individual ray tracing acceleration improvements contribute to the overall performance of RDNA 4. Cumulatively, all these enhancements lead to roughly double the ray tracing performance compared to the RDNA 3 architecture (so simply doubling the intersection testing performance is only part of the recipe – the actual real-world performance won’t scale linearly).

Lower memory usage with ray tracing

Nvidia, with its Blackwell architecture, advertised BVH compression as one of the new features, referring to reduction of the in-memory footprint of the bounding volume box structure that games generate around scene objects as an aid for analysis. These data require memory space, and Blackwell adds the ability to compress them (Nvidia stated that in ray tracing games, this could reduce memory usage by a few hundred MB).

According to the architecture reveal, RDNA 4 also uses BVH compression (also referred to as primitive node compression), so these GPUs should also save some memory in games (freeing up some memory compared to gaming on older GPUs) – but only when ray tracing is enabled, just like with Blackwell. AMD doesn’t provide a specific figure in megabytes, only an approximate estimate suggesting that the BVH could occupy about 60% of memory compared to RDNA 3. However, this is a combined figure for using BVH compression and the effect of using the BVH8 hierarchy instead of BVH4, not just the isolated effect of compression.
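The BVH8 part of that figure can be given some intuition with an idealized node count. The sketch below assumes a perfectly packed tree and an arbitrary leaf count of 4096, both of which are simplifications of ours; it says nothing about bytes per node (wider nodes are larger), so it does not reproduce AMD’s 60% figure.

```python
import math


def bvh_shape(leaf_count, branching):
    """Return (internal_nodes, depth) for an idealized, fully packed tree."""
    internal, depth, level = 0, 0, leaf_count
    while level > 1:
        level = math.ceil(level / branching)  # nodes on the next level up
        internal += level
        depth += 1
    return internal, depth


LEAVES = 4096  # hypothetical number of leaf nodes

bvh4 = bvh_shape(LEAVES, 4)
bvh8 = bvh_shape(LEAVES, 8)

print(f"BVH4: {bvh4[0]} internal nodes, depth {bvh4[1]}")
print(f"BVH8: {bvh8[0]} internal nodes, depth {bvh8[1]}")
```

In this idealized case the wider tree needs 585 internal nodes over 4 levels instead of 1365 nodes over 6 levels, so there are fewer nodes to store and fewer steps per traversal.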

AI Acceleration

RDNA architectures are often said to lack dedicated AI accelerators similar to Nvidia’s tensor cores. However, it’s not entirely true that these GPUs have no hardware AI acceleration system. AMD has chosen a path where matrix operation acceleration is built into the same compute units that handle all programmable shaders, but they still achieve higher performance than using just general-purpose shaders.

With RDNA 4, the company states that these GPUs feature the second generation of this acceleration. The performance per compute unit has significantly increased, so labeling this built-in support as “AI Accelerators” (with the GPU having two per CU) isn’t totally unjustified.

The AI acceleration in RDNA 4 offers double the raw performance per CU compared to RDNA 3 for 16-bit data types (FP16, Bfloat16), and quadruple the performance for 8-bit and 4-bit data types. Among 8-bit ones, GPUs now support FP8 and Bfloat8 calculations. The FP4 type, for which Nvidia introduced support in its Blackwell architecture, is not supported by RDNA 4.

In addition to this raw increase in compute performance, RDNA 4 also adds support for 4:2 Structured Sparsity technology (something Nvidia introduced with its tensor cores in the Ampere architecture and still uses today). This feature eliminates some calculations in sparse matrices, resulting in double the effective AI performance. When factoring in Structured Sparsity, RDNA 4 thus offers quadruple the performance in FP16 and Bfloat16, and eight times the performance in INT8, FP8 (and Bfloat8), and INT4 compared to its predecessor. That is clearly far from lacking AI acceleration capabilities.
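The idea behind structured sparsity is that at most two values in every group of four weights are non-zero (the pattern Nvidia documents as 2:4), so the hardware can store only those two plus their positions and skip the other multiplications. Here is a minimal software sketch of that principle; the function names, weights and activations are made up, and real hardware operates on matrix tiles rather than Python lists.

```python
def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude values in every group of 4.

    Returns (values, indices): the kept weights and their original positions.
    """
    assert len(weights) % 4 == 0
    values, indices = [], []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(by_magnitude[:2]):
            values.append(group[i])
            indices.append(start + i)
    return values, indices


def sparse_dot(values, indices, activations):
    """Dot product that only touches the stored non-zero weights."""
    return sum(v * activations[i] for v, i in zip(values, indices))


# Hypothetical weights that already follow the 2-of-4 pattern.
weights = [0.9, 0.0, -0.7, 0.0, 0.0, 0.3, 0.0, -0.2]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

values, indices = prune_2_of_4(weights)
dense = sum(w * a for w, a in zip(weights, activations))
sparse = sparse_dot(values, indices, activations)

print(f"dense:  {len(weights)} multiplies, result {dense:.2f}")
print(f"sparse: {len(values)} multiplies, result {sparse:.2f}")
```

Both paths give the same result, but the sparse one performs half the multiplications, which is where the doubling of effective throughput comes from. The gain only applies to models whose weights have been pruned to this pattern.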

For orientation – the theoretical AI performance for the Radeon RX 9070 XT model is 779 TOPS in 8-bit precision and 1557 TOPS in 4-bit precision, while the Radeon RX 9070 achieves 578 TOPS and 1156 TOPS, respectively. The competing Nvidia GeForce RTX 5070 Ti is reported to deliver 703 TOPS in INT8/FP8 and 1406 TOPS in FP4 operations (these values also include the 2× performance gain from sparsity and are derived from the officially stated boost clock, so in reality the performance could be slightly higher, as Nvidia tends to understate boost clock speeds).
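These TOPS figures can be sanity-checked by working backwards to operations per clock per CU. The sketch below relies on two inputs that are not given in this article and should be treated as assumptions: the CU counts (64 and 56) and the official boost clocks (2.97 GHz and 2.52 GHz) of the two cards.

```python
# TOPS values are from the text; CU counts and boost clocks are assumptions
# taken from the cards' published specifications, not from this article.
cards = {
    "RX 9070 XT": {"cus": 64, "boost_ghz": 2.97, "tops_8bit": 779, "tops_4bit": 1557},
    "RX 9070": {"cus": 56, "boost_ghz": 2.52, "tops_8bit": 578, "tops_4bit": 1156},
}


def ops_per_clock_per_cu(tops, cus, boost_ghz):
    """Invert TOPS = ops/clock/CU * CUs * clock to get ops/clock/CU."""
    return tops * 1e12 / (cus * boost_ghz * 1e9)


per_cu = {
    name: ops_per_clock_per_cu(c["tops_8bit"], c["cus"], c["boost_ghz"])
    for name, c in cards.items()
}

for name, value in per_cu.items():
    ratio = cards[name]["tops_4bit"] / cards[name]["tops_8bit"]
    print(f"{name}: ~{value:.0f} 8-bit ops/clock/CU, 4-bit is {ratio:.2f}x of 8-bit")
```

Under those assumptions both cards land at roughly 4096 sparse 8-bit operations per clock per CU, with 4-bit at twice that rate, so the two TOPS figures are consistent with each other.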

GDDR6 memory remains

One area where AMD has not introduced any innovations is memory. The Navi 48 GPU still utilizes GDDR6 with a 256-bit bus width. The cards operate it at an effective data rate of 20 Gb/s per pin (often quoted as an effective 20.0 GHz clock), a specification already employed in the Radeon RX 7900 XTX two years ago. Fan speculation has emerged that the chip might secretly support GDDR7 as well, which could be utilized in a future refresh. However, there is currently no evidence to support this claim, so it is best to consider it wishful thinking for now. It should also be noted that AMD has historically rarely used dual-standard memory controllers in its GPUs.

The physical bandwidth, however, should be utilized more efficiently, as AMD has highlighted improvements in data compression for the RDNA 4 architecture (this should affect the framebuffer and “pixel data,” with textures already being separately compressed within the game itself, using various lossy schemes). While the physical bandwidth remains at 640 GB/s, almost the same as the Radeon RX 7800 XT (624 GB/s), the Radeon RX 9070 (XT) should benefit slightly more from it.
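The bandwidth figures follow directly from bus width and per-pin data rate, treating the quoted 20.0 GHz effective clock as 20 Gb/s per pin. A minimal sketch with our own function names; the 19.5 Gb/s rate for the RX 7800 XT is not stated above but is what its 624 GB/s implies on a 256-bit bus.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: pins * Gb/s per pin / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


def implied_data_rate_gbps(bandwidth, bus_width_bits):
    """Per-pin data rate implied by a quoted bandwidth figure."""
    return bandwidth * 8 / bus_width_bits


BUS_WIDTH = 256  # bits

rx_9070 = bandwidth_gb_s(BUS_WIDTH, 20.0)
rx_7800_xt_rate = implied_data_rate_gbps(624, BUS_WIDTH)

print(f"RX 9070 (XT): {rx_9070:.0f} GB/s")
print(f"RX 7800 XT:   624 GB/s implies {rx_7800_xt_rate} Gb/s per pin")
```

This is peak physical bandwidth only. The compression improvements mentioned above change how much useful data fits into it, not the figure itself.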

A similar effect could be attributed to the Infinity Cache, now in its third generation. Its capacity remains at 64 MB (the same as the Radeon RX 7800 XT with the full Navi 32 chip), but it is expected to feature architectural enhancements that could boost its effectiveness.

Interestingly, AMD has expanded the L2 cache, which sits one level closer to the execution units. Navi 31, the biggest RDNA 3 GPU, had just 6MB of L2 cache (of course, that was coupled with a large-capacity Infinity Cache). RDNA 4 uses 8MB of L2 cache for Navi 48, even though it is, relatively speaking, a GPU one category smaller. This could again improve the effective performance extracted from the given memory configuration.

New encoders and decoders, dual multimedia engine

Navi 48 and the RDNA 4 architecture also introduce an entirely new multimedia engine (or rather, two engines, as Navi 48 features a dual configuration of this block). This engine is said to include enhanced decoders and encoders, optimized for low-latency streaming. AV1 and VP9 decoding performance is expected to improve by 50%, which should significantly boost energy efficiency during battery-powered playback.

AV1 encoding is expected to deliver up to double the performance (likely measured in maximum FPS), while H.264 (AVC) encoding is said to offer up to 25% improved image quality. This is measured using the VMAF metric. The comparison is based on very low bitrates (1080p at 500 Kb/s) and low-latency encoding, a harsh scenario that really emphasizes the difference from an older, lower-quality implementation. With less starved bitrates, the difference might not be as drastic. However, AMD also demonstrated noticeable improvements in visual quality at 6000 Kb/s.

AMD also promises improved encoding quality for HEVC (+11% VMAF) and AV1, though no specific figure was provided for the latter. The improvements to AV1 encoding are said to stem from usage of B-frames (frames that use bidirectional motion prediction based on both past and future frames).

What the new architecture does not support is 4:2:2 chroma subsampling, which Nvidia added to its media engines in Blackwell.

DisplayPort 2.1a, HDMI 2.1b

The GPU also features a new display output block (referred to as the Radiance Display Engine). This includes an enhanced scaling block (used, for example, during video playback if the player lacks its own implementation) and video sharpening. The new block supports updated versions of standards – HDMI 2.1b and DisplayPort 2.1a – though these are only minor improvements over the RDNA 3 generation. DisplayPort 2.1b (announced by Nvidia) is not supported, but it seems that it only adds compatibility with more expensive active cables.

A limitation is that both the Radeon RX 9070 XT and RX 9070 still only support DisplayPort 2.1a UHBR 13.5 mode, the same as RDNA 3. This mode offers roughly double the data bandwidth of the older DisplayPort 1.4a. GeForce RTX 5000 graphics cards, however, support UHBR 20, providing approximately triple the bandwidth of DP 1.4a. That said, Nvidia’s documentation indicates that this mode always requires expensive active cables, without which the field seems to be even. AMD already supports UHBR 20, but only in its workstation RDNA 3 GPUs (Radeon Pro), which may also apply to RDNA 4-based workstation cards.
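The “double” and “triple” figures can be reproduced from the link rates. The sketch below uses the standard per-lane rates and line encodings (8b/10b for HBR3, 128b/132b for UHBR modes) over four lanes; it ignores further protocol overheads such as error correction, so the results are approximate payload rates, not guaranteed throughput.

```python
LANES = 4

# (per-lane rate in Gb/s, encoding efficiency)
modes = {
    "DP 1.4a HBR3": (8.1, 8 / 10),
    "DP 2.1 UHBR 13.5": (13.5, 128 / 132),
    "DP 2.1 UHBR 20": (20.0, 128 / 132),
}


def payload_gbps(lane_rate, efficiency, lanes=LANES):
    """Approximate usable bandwidth after line encoding."""
    return lane_rate * efficiency * lanes


payload = {name: payload_gbps(*spec) for name, spec in modes.items()}
baseline = payload["DP 1.4a HBR3"]

for name, value in payload.items():
    print(f"{name}: {value:.1f} Gb/s ({value / baseline:.2f}x of DP 1.4a)")
```

UHBR 13.5 works out to about 52 Gb/s against about 26 Gb/s for DP 1.4a, and UHBR 20 to about 78 Gb/s.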

Lower idle power consumption?

According to the RDNA 4 architecture presentation, the new GPUs should feature adjustments aimed at reducing idle power consumption in certain scenarios where there were likely inefficiencies before. AMD claims that the Radeon RX 9070 and 9070 XT will consume less power in most dual-monitor configurations and when FreeSync is enabled on the monitor.

RDNA 4 introduces support for hardware “Flip Queue,” which shifts the rendering of frames prepared by software entirely to the GPU. A side effect of this is that video playback may have lower CPU overhead in drivers, though this is unlikely to be a noticeable gamechanger for most users, as modern processors already have ample performance even for very high-resolution video playback (perhaps this could matter for very high refresh rate video?).

PCI Express 5.0

Navi 48 is AMD’s first gaming GPU to support PCI Express 5.0 ×16 connectivity. This means double the data bandwidth between the CPU and GPU compared to older Radeon generations, although in practice the benefits of faster PCIe lanes in gaming are often minimal (performance on a PCIe 4.0 ×16 board will likely remain practically unchanged).

This capability will likely be more significant for the lower-end Navi 44 chip, which is expected to feature only an 8-lane interface. However, on a PCIe 5.0-compatible motherboard, it will be able to achieve the same bandwidth as previous PCIe 4.0 cards with 16 lanes.
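The lane arithmetic behind this is simple. The sketch below uses the standard transfer rates (16 GT/s per lane for PCIe 4.0, 32 GT/s for PCIe 5.0) and 128b/130b encoding; it gives theoretical one-direction bandwidth and ignores packet overhead.

```python
GT_PER_LANE = {"4.0": 16.0, "5.0": 32.0}  # gigatransfers per second per lane
ENCODING = 128 / 130                      # 128b/130b line encoding


def pcie_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s."""
    return GT_PER_LANE[generation] * ENCODING * lanes / 8


gen4_x16 = pcie_gb_s("4.0", 16)
gen5_x8 = pcie_gb_s("5.0", 8)
gen5_x16 = pcie_gb_s("5.0", 16)
gen4_x8 = pcie_gb_s("4.0", 8)

print(f"PCIe 4.0 x16: {gen4_x16:.1f} GB/s")
print(f"PCIe 5.0 x8:  {gen5_x8:.1f} GB/s")
print(f"PCIe 5.0 x16: {gen5_x16:.1f} GB/s")
print(f"PCIe 4.0 x8:  {gen4_x8:.1f} GB/s")
```

The last line shows the other side of the coin: in an older PCIe 4.0 board, an 8-lane card falls back to about 15.8 GB/s, half of what a 16-lane card gets in the same slot.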

Outlook for other models: Navi 44?

For now, the Radeon RX 9070 and RX 9070 XT are the only graphics cards officially launching with this architecture. According to unofficial leaks, a second, more affordable chip labelled Navi 44 is expected to follow at some point. There are no officially confirmed details about it yet, but it appears that Navi 44 could essentially be a halved version of Navi 48, featuring 32 CUs, 2048 shaders, 32 Ray Accelerators, 64 AI Accelerators, 32MB of Infinity Cache, and a 128-bit memory interface. It’s still unclear which cards will be based on this chip – perhaps something like the Radeon RX 9060 (XT)?

To conclude, we’ve included the full presentation on the new architecture below, in case you’d like to review the individual slides.

NAVI 4 Architecture Deep Dive

Source: AMD

English translation and edit by Jozef Dudáš
