Oryon, the Nuvia ARM core of Snapdragon X: Architecture analysis

Second (third?) coming of ARM to Windows: The CPU core

At Computex 2024, Intel introduced the new Lion Cove and Skymont architectures, which we covered in detail. AMD also shared a peek at their competing Zen 5 core, but with little detail, so we'll have to wait before analyzing that architecture. But there's a new ARM-based challenger entering the fray – the Snapdragon X Elite, currently arriving in laptops. And Qualcomm has now also finally detailed its "Nuvia" Oryon architecture.

When Qualcomm formally unveiled these processors last October (quite a long time before real availability, which began this week), it announced they would bear the Snapdragon X Elite brand, while the cheaper SKUs are called Snapdragon X Plus. The CPU core architecture itself is referred to as Oryon, and there is just one performance core architecture in this family – these Qualcomm processors don't use the big.LITTLE hybrid scheme.

Read more: “Nuvia” Snapdragon X Elite unveiled: The CPU beating both Intel and Apple

This core is the very architecture that has been “hyped” since 2019 as a design that would completely knock out x86 processors (often by people forgetting that it would not be ready for several years, during which time the competition would also necessarily raise the bar). Back then, it was called the “Phoenix” core and was developed by Nuvia, a company founded by engineers who previously designed Apple’s ARM cores. Nuvia, including its team, was later acquired by Qualcomm, and the project was retargeted to Snapdragon mobile SoCs, producing today’s Oryon architecture.

Nuvia Phoenix / Qualcomm Oryon: Architecture analysis

Qualcomm kept the design’s features under wraps until now, but the details were revealed last week, so the design can finally be compared with the competition. The core implements the ARMv8.7 instruction set, which makes its ISA level a bit older than that of the current licensable Cortex cores with the ARMv9 architecture. For example, Oryon does not support the SVE and SVE2 SIMD instructions; their function is handled by the older and still widely used Neon SIMD extension.

In practice, this may not matter too much – unless Microsoft decides to bump up Windows requirements in the future and make ARMv9 the new minimum (or make SVE2 mandatory). Something like this has happened recently, but hopefully ARMv8.7 will keep being supported for many years to come.

It is an open question whether Qualcomm is even planning to switch to ARMv9 at all, as ARM may associate this technology with more restrictive licensing conditions and higher fees. Lately, ARM seems to frown upon companies designing their own architectures instead of licensing the full-package Cortex core solution, and has been engaged in a rather hostile lawsuit with Qualcomm, in which it actually demanded that Qualcomm cease Oryon core development and destroy all of the already developed IP.

Quad-core basic block with shared L2 cache

Oryon cores are combined in the processor into quad-core clusters that form a basic building block potentially portable to other chips (Qualcomm is reportedly cooking up a cheaper eight-core “Purwa” SoC in addition to the current 12-core “Hamoa” SoC). Each quad-core cluster can be independently put to sleep and has independently adjustable clock speeds (the clock speeds of the individual cores within a cluster appear to be tied together).

The four Oryon cores share an L2 cache, similar to what Intel does with its E-Cores. However, this L2 cache is significantly larger, with a 12 MB capacity and 12-way associativity. The latency is said to be 17 cycles on average, similar to the L2 cache of Intel’s Lion Cove (which also works out to 3 MB of L2 per core, but private per core instead of shared and combined). The L2 cache runs at the same clock speed as the cores in the cluster. The L2 TLB is eight-way and has more than 8000 entries.
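From the stated capacity and associativity, the rest of the cache geometry follows directly. A minimal sketch of the arithmetic – note that the 64-byte cache line size is an assumption (Qualcomm doesn’t state it, but it’s the usual value in both ARM and x86 cores):

```python
# Deriving the L2 geometry from the figures above.
CAPACITY = 12 * 1024 * 1024   # 12 MB shared L2
WAYS = 12                     # 12-way set associative
LINE = 64                     # assumed cache line size in bytes

sets = CAPACITY // (WAYS * LINE)
index_bits = sets.bit_length() - 1
print(f"{sets} sets, {index_bits} index bits")  # 16384 sets, 14 index bits
```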

This cache should be inclusive, meaning the L1 cache contents of all cores in the cluster are always duplicated in it (this should help with power efficiency, since it simplifies coherency handling). The cache is said to be optimized for communication and data sharing between cores (and it also services requests from cores in adjacent clusters). It supports up to 50 concurrent memory requests per core and up to 220 in total.

Conversely, when address translations miss in the TLBs, the page walkers support up to 16 concurrent requests per core – up to 192 for all 12 processor cores of the SoC.

This cluster is reportedly already prepared for the integration of eight cores instead of four, which may have been the plan for the original Phoenix cores that Nuvia officially wanted to deploy in servers first (though that was likely largely dictated by the fact that server CPUs don’t need other technologies like GPUs, Wi-Fi or modems, making server SoCs a more viable path for startups, with lower barriers to entry). It’s possible that Qualcomm will still utilize this option in future processors with higher core counts.

Only SLC between the L2 and memory, no L3 cache

There is no shared L3 cache above the clusters’ L2 caches to unify the memory subsystem. However, the SoC does contain a so-called System Level Cache (SLC), which takes over this role but also caches data for other components such as the integrated GPU, NPU, or camera image processing unit.

This SLC cache has a capacity of 6 MB and, according to Qualcomm, achieves a latency of around 26–29 ns and a bandwidth of 135 GB/s in each direction (i.e. 135 GB/s read and 135 GB/s write simultaneously). The SLC bandwidth is the same as the company quotes for DRAM (also 135 GB/s), but serving requests from the SLC’s contents should be more power efficient and lower latency than going all the way to RAM.

The memory used by the Snapdragon X Elite and Plus processors is LPDDR5X running at an effective speed of 8448 MT/s. Eight 16-bit channels are used, giving a standard bus width of 128 bits – the same as the base line of Apple processors (M1/M2/M3/M4) and as common PC processors with “dual-channel” memory. Qualcomm states that RAM latency is typically 102–104 ns (notably higher than typical DDR5 systems). The maximum amount of RAM available in PCs with these processors is 64 GB.
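The 135 GB/s figure quoted earlier follows directly from these memory parameters. A quick back-of-the-envelope check:

```python
# LPDDR5X-8448 on a 128-bit bus: peak theoretical bandwidth.
transfers_per_s = 8448e6      # effective transfer rate (MT/s)
bus_bytes = 128 // 8          # eight 16-bit channels = 16 bytes wide

bandwidth_gb_s = transfers_per_s * bus_bytes / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # 135.2 GB/s, matching Qualcomm's figure
```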

The core and L1 cache

The individual cores have only the L1 cache as their fully private (dedicated) cache level. The L1 instruction cache (for program code) has a capacity of 192 kB with 6-way associativity, and the fetch stage reads up to 16 instructions (64 bytes) from it each cycle. The instruction side has a first-level iTLB with 256 entries and 8-way associativity. The L1 data cache has a capacity of 96 kB, also with 6-way associativity; unfortunately, Qualcomm does not specify its latency. The L1 data TLB has 224 entries with 7-way associativity. These L1 capacities are much larger than what Intel and AMD x86 processors provide (32 to 48 kB), but Oryon still has a smaller L1 data cache than Apple’s cores (128 kB).

4 kB and 64 kB memory pages are supported, which differs from Apple, which currently uses 16 kB pages. Windows on ARM should be using the 4 kB size for compatibility reasons, and small pages also lead to better RAM utilization. Larger pages can waste RAM, similar to how large clusters in a file system lead to a higher percentage of wasted (slack) space. On the other hand, larger pages use TLB capacity more effectively, which can improve performance. It’s possible that 64 kB pages are already too big, and Apple’s 16 kB pages would be a better compromise. However, that’s a non-standard size in terms of the ARM instruction set, which Apple can enforce only because of how much control it can exert over its platform and over independent software developers. Windows is a much more open platform that leaves far more liberty to software developers, so it’s probably more appropriate for it to sacrifice a few percent of potential extra performance and stick with the old 4 kB pages instead of breaking everyone’s software by changing the memory page size.
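The slack-space trade-off mentioned above is easy to quantify: the tail of each allocation wastes, on average, about half a page. A small sketch (the allocation sizes are made up purely for illustration):

```python
# How much memory the partially filled last page of each
# allocation wastes, for the three page sizes discussed.
def slack(alloc_bytes, page):
    """Bytes wasted in the last, partially filled page."""
    rem = alloc_bytes % page
    return (page - rem) if rem else 0

allocs = [5_000, 70_000, 1_000_000, 3_000]  # illustrative sizes
for page in (4096, 16384, 65536):
    wasted = sum(slack(a, page) for a in allocs)
    print(f"{page // 1024:>2} kB pages: {wasted} bytes of slack")
```

Larger pages amortize TLB entries over more memory, but as the output shows, the per-allocation waste grows with page size, which is exactly the compromise the paragraph describes.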

Of course, the core features advanced prefetchers for the L1 and L2 caches and an advanced, multi-level branch predictor. Qualcomm states that the penalty for a branch misprediction is 13 cycles, which gives a rough indication of how deep the core’s pipeline is. Both of these components are critical to keeping the execution units utilised and preventing stalls and bubbles, which in the end means code is processed faster (i.e. performance per 1 MHz is higher).
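To see why predictor accuracy matters so much, the 13-cycle penalty can be put into a rough cost model. The branch frequency and accuracy below are assumed illustrative values, not Qualcomm figures:

```python
# Amortized cost of branch mispredictions per instruction.
PENALTY = 13          # cycles lost per mispredicted branch (Qualcomm)
branch_freq = 0.20    # assume 1 in 5 instructions is a branch
accuracy = 0.97       # assume a 97% predictor hit rate

extra_cpi = branch_freq * (1 - accuracy) * PENALTY
print(f"~{extra_cpi:.3f} extra cycles per instruction")
```

Even at 97% accuracy, mispredictions add a measurable tax on every instruction, which is why modern predictors push accuracy well beyond that.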

8 decoders, massive RoB

The decoding phase is handled by 8 parallel decoders, which is expected from a very large ARM core (Intel has already matched this in Lion Cove, and Apple and ARM have gone even higher with 10 decoders in recent architectures). The decoders translate ARM instructions into micro-ops (uOps) that the core works with internally; a single instruction can be translated into one or more uOps.

In this respect, contrary to the popular association of the ARM architecture with the RISC concept, there is little to no difference from x86 processors (and there isn’t even that big a difference in the number of instructions the ISAs contain – ARMv8 comprises close to a thousand instructions, or even more). The CISC versus RISC distinction has almost no practical relevance for modern processors; the most important remaining difference between the ARMv8/v9 and x86 (x86-64) concepts is that all ARM instructions have a fixed 32-bit length, while x86 instruction lengths are variable.
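The fixed 32-bit length is precisely what makes very wide decode easier on ARM: the byte offset of every instruction in a fetch block is known up front, whereas on x86 each instruction’s start depends on the length of the one before it. A toy sketch of the difference (the length values are hypothetical, standing in for x86’s length-decode step):

```python
# With fixed-width instructions, all boundaries are computable
# in parallel from the start address alone.
def arm_offsets(start, n):
    return [start + 4 * i for i in range(n)]

# With variable-length instructions, each boundary depends on
# the decoded length of the previous instruction – a serial chain.
def x86_offsets(start, lengths):
    offs, pos = [], start
    for ln in lengths:
        offs.append(pos)
        pos += ln
    return offs

print(arm_offsets(0, 8))             # [0, 4, 8, 12, 16, 20, 24, 28]
print(x86_offsets(0, [1, 3, 2, 5]))  # [0, 1, 4, 6]
```

x86 designs work around this serial dependency with predecode bits and uOp caches, but it remains a real cost that a fixed-width ISA simply avoids.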

The Reorder Buffer – the window of instructions that the core can see at once and within which it can perform optimizations and issue ready-to-run operations out of order – has a capacity of “more than 650 operations”, making it larger than the RoB of the big Intel Lion Cove core (576). The core has over 400 physical general-purpose registers, as well as over 400 physical SIMD/FPU registers for renaming the architectural registers.

How many ALUs & the execution backend

The integer part is six pipelines wide and contains six ALUs. Four ALUs are apparently simple and two are complex, as Qualcomm states the core can perform two multiplications per cycle. Two of these pipelines also have a branch unit alongside the ALU. Each pipeline has its own 20-entry reservation queue in front of it.

Six ALUs is a pretty standard count now, or soon will be – it was also chosen by AMD for Zen 5 and by Intel for Lion Cove. On the other hand, Apple’s cores and ARM’s Cortex-X4 and Cortex-X925 are already adding even more ALUs.

But the more ALUs a core has, the harder it is to find work for them all. The seventh and eighth ALUs are likely utilised only during a small percentage of runtime, so having “just” six ALUs instead of eight is very likely only a small real-world disadvantage, even though 33% more backend units may look like a massive difference at first glance.

Still 128-bit SIMD units

The FPU, or the vector/SIMD unit, has four pipelines, the same as Apple and Intel (the Cortex-X925 is more ambitious with six pipelines, however). Each pipeline has a 48-entry reservation station. The core can handle integer and floating-point additions and multiplications on all four pipelines. The vector instructions can use INT8, INT16 and wider integer types, and FP16, FP32 and FP64 floating-point types. Newer AI formats such as BFloat16 or the recently introduced FP8 are not supported.

However, as already mentioned, neither SVE nor SVE2 instructions are supported; the unit works with classic Neon SIMD instructions. This also fixes the vector and physical unit width at 128 bits. Even Cortex cores with SVE2 support don’t have wider units at this time, though, so Neon-only support actually isn’t a big liability, given that SVE2’s main advantage (potentially longer vectors) is not exploited by any of these cores.

Overall, ARM cores are at a disadvantage here compared to high-performance Intel and AMD cores, which have 256-bit or even 512-bit SIMD units (in the case of Zen 5 and Intel’s server cores), so their theoretical compute throughput per cycle is double or even quadruple for each individual pipeline.
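The per-core theoretical throughput implied by these numbers can be sketched as follows. Note the FMA assumption: Qualcomm doesn’t explicitly state it, but Neon’s standard FMLA instruction makes fused multiply-add on all four pipelines a reasonable guess:

```python
# Theoretical per-core FP32 throughput of the Oryon SIMD unit,
# assuming all four 128-bit pipelines can issue FMAs
# (FMA counted as 2 FLOPs; this capability is an assumption).
PIPES = 4
VECTOR_BITS = 128
lanes = VECTOR_BITS // 32            # 4 FP32 lanes per pipeline
flops_per_cycle = PIPES * lanes * 2  # 32 FLOPs per cycle

gflops = flops_per_cycle * 4.25e9 / 1e9  # at the 4.25 GHz peak clock
print(f"{flops_per_cycle} FLOPs/cycle, ~{gflops:.0f} GFLOPS per core")
```

Doubling the vector width to 256 bits (as in Zen 5) would double `lanes` and hence the per-pipeline throughput, which is exactly the gap the paragraph above describes.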

The load/store part also has four pipelines/AGUs. Each has a 16-entry reservation station, and the read and write pipelines also operate on 128-bit data (so a 128-bit Neon vector can be stored or retrieved in a single operation). All of these pipelines are universal, handling both reads and writes – here Nuvia/Qualcomm went in the opposite direction from Intel, which made all units single-purpose (dedicated to either loads or stores) in its latest architecture.

The Oryon core can thus perform up to four writes or four reads per cycle, or any combination totalling four operations. Store-to-load forwarding is supported, which bypasses memory when reading from an address to which the CPU has recently written. The load queue has 192 entries and the store queue has 56 entries.

Virtualization, emulation?

The processor supports virtualization, including nested virtualization, meaning you can run another hypervisor inside an operating system that is itself already virtualized. According to Qualcomm, care was also taken during the design to make the core robust against various side-channel attacks like Spectre, for which the core has “mitigations” (not necessarily 100% effective protections, but measures that make such attacks more difficult). In some parts of the processor, data is supposedly block-encrypted to “obfuscate” it against side-channel attacks.

Qualcomm has also revealed that the Oryon core includes special modifications aimed at running programs compiled for x86 processors faster – more precisely, at providing better performance when such programs are translated and emulated at runtime. For example, there is hardware support for x86 emulation in the FPU: the ARM FPU behaves a bit differently than the x86 (x87) FPU, and handling these differences in software would hurt performance considerably, so during emulation the core can be switched into a mode matching x86 behavior.

Similarly, Oryon supports a mode that makes the cores respect the stricter memory model of the x86 architecture, known as total store ordering (this affects only the performance of multithreaded programs, not single-threaded ones). As far as is known, Apple’s Rosetta 2 technology uses essentially the same techniques (described in this article, for example), also integrated into the cores of Apple’s processors. Qualcomm basically seems to be using the same recipes for speeding up emulated apps.
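What total store ordering guarantees can be illustrated with the classic message-passing litmus test. The sketch below enumerates the in-order interleavings of two threads – for this particular test that matches the set of outcomes x86 TSO allows – and shows that the “impossible” result never occurs; ARM’s native weak model, by contrast, additionally permits it unless barriers are inserted, which is why a TSO mode helps emulation:

```python
from itertools import permutations

# Message-passing litmus test:
#   thread 0 (writer): x = 1, then y = 1
#   thread 1 (reader): r1 = y, then r2 = x
# Under x86 TSO (and Oryon's x86-compatibility mode), seeing
# r1 == 1 guarantees r2 == 1, so (r1, r2) == (1, 0) is forbidden.
T0 = [("st", "x"), ("st", "y")]
T1 = [("ld", "y"), ("ld", "x")]

outcomes = set()
# Each interleaving = a choice of which slots thread 0's ops occupy,
# with both threads' ops kept in program order.
for order in set(permutations([0, 0, 1, 1])):
    mem = {"x": 0, "y": 0}
    regs = []
    it0, it1 = iter(T0), iter(T1)
    for who in order:
        op, var = next(it0 if who == 0 else it1)
        if op == "st":
            mem[var] = 1
        else:
            regs.append(mem[var])
    outcomes.add(tuple(regs))  # (r1, r2)

print(sorted(outcomes))  # (1, 0) never appears
```

On a weakly ordered ARM core running without barriers, (1, 0) is a legal additional outcome, so an emulator without hardware TSO support would have to insert barriers around nearly every memory access – the performance cost Oryon’s mode avoids.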

This means that the performance penalty of an x86 application running on Snapdragon X Elite, compared to running natively on x86 cores, should not be as large as what has been seen so far on ARM Windows PCs relying on older Cortex cores, which were also generally slower themselves. In other words, x86 application performance – generally considered poor on the ARM Windows platform – should be significantly better with Qualcomm’s new processors (though it has to be noted that this reputation often isn’t backed by real numbers, so it’s frequently more sentiment and assumption than measured experience). But of course, full performance will only be available in native applications compiled directly for the ARM instruction set.

These hardware features that aid emulation will be used by the Prism translation layer in Windows 11, which Microsoft says has itself been improved over previous versions. As a reminder, Microsoft has had this technology for translating x86 apps on ARM computers since 2017 (support for 64-bit apps came a bit later) – actually longer than Apple has had its Rosetta 2 equivalent.

Concluding remarks on Oryon

The Oryon architecture should be quite powerful, and it seems to back up the performance numbers Qualcomm has been promising. The core contains no big surprises; the Nuvia team apparently went about things quite conservatively, eschewing risks and repeating the successful recipes its leaders and founders used before on Apple’s cores: a wide core with large L1 caches. The large shared L2 caches instead of a shared L3 are also typical of that lineage.

At the same time, the engineers didn’t try to one-up Apple in their first design in things like decoder or ALU counts (which appears to be ARM’s strategy with the Cortex-X line of performance cores), so Qualcomm arguably has a more balanced design, though the core may not have as much brute power as the wider Apple M4.

So there is no exotic brand-new technique or unique, potentially groundbreaking (or, in case of failure, fatal) idea in this core – or if there is, Qualcomm is keeping silent about it. Some unique differentiator of this sort may still come in future generations, though: Qualcomm’s CPU core program is ongoing, and the company has confirmed that Oryon 2, 3 and subsequent architectures are coming in future years.

Tip: ARM believes it will push x86 processors out of most PCs within five years

Last year’s x86 processors defeated, but we’ll see about the current ones

The Oryon core has a higher IPC (performance per 1 MHz) than current competitors like AMD’s Ryzen 8000 (Zen 4) and Intel’s Meteor Lake (Redwood Cove) processors, which Qualcomm says the Oryon-based Snapdragon X Elite beats despite lower clock speeds (a maximum of 4.25 GHz versus ~5.1 GHz in AMD and Intel laptops). At the same time, Oryon is also supposed to have lower single-core power consumption. According to Qualcomm, the Snapdragon X Elite’s maximum power draw in single-threaded applications is somewhere around 15 W, versus some 23 W for AMD (Ryzen 9 7940HS) and 27 W for Intel (Core Ultra 9 185H) – all these numbers come straight from Qualcomm, so take them with a grain of salt and wait for independent reviews before drawing conclusions.
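The clock figures alone imply how large the IPC advantage would have to be. A quick calculation, assuming for illustration that the chips deliver equal performance at their respective peak clocks (real scores vary by workload):

```python
# Implied IPC advantage if Oryon matches x86 competitors'
# performance while running at a lower clock speed.
oryon_clock = 4.25  # GHz, Snapdragon X Elite peak
x86_clock = 5.1     # GHz, roughly the competing laptop parts

ipc_ratio = x86_clock / oryon_clock
print(f"Oryon needs ~{(ipc_ratio - 1) * 100:.0f}% higher IPC to match")
```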

Oryon core power consumption and performance against Zen 4 and Meteor Lake, Qualcomm’s official benchmarks (Author: Qualcomm, via Anandtech)

But both AMD and Intel will launch their new processor architecture generations this summer, which may understandably change these takeaways (Apple has also released the new M4 processor in the time between Oryon’s announcement and physical availability). So in the end, the “Nuvia” Snapdragon X Elite may not squarely beat all competitors as was tentatively promised last year. But it should certainly be a competitive processor nevertheless. With all that said, it’s now up to Qualcomm’s ability to push the chip to laptop manufacturers and win designs with them, and up to how capable those manufacturers are at convincing customers to choose these laptops over x86 ones.

Source: AnandTech

English translation and edit by Jozef Dudáš

