Intel’s new P-Core: Lion Cove is the biggest change since Nehalem

Lion Cove: New P-Core architecture for Lunar Lake and Arrow Lake processors (analysis)

Intel revealed its next-gen Lunar Lake mobile processor at Computex 2024, to be released this summer. It will power Copilot+ PCs with its fast NPU and is supposed to be highly power efficient, but it’s also extremely interesting because its new CPU core architectures will also appear in future Arrow Lake desktop CPUs. First up, we’ll take a look at the big P-Core architecture, which brings the biggest changes in many years.

The latest information on Lunar Lake is important for gamers too, even if they’re not interested in mobile devices (though these chips could easily find use in gaming handhelds like the Steam Deck). That’s because these processors share the same CPU core architecture as the upcoming higher-performance Arrow Lake processors, which will also come in a desktop form factor with the new LGA 1851 platform.

The “big” core (aka P-Core) architecture is called Lion Cove, and it is apparently a microarchitecture that is either brand new or significantly revised from the previous Golden Cove, Raptor Cove and Redwood Cove cores found in Alder Lake, Raptor Lake and Meteor Lake processors. The extent of the changes should be on par with AMD’s Zen 5. But since we don’t know all the details on Zen 5 yet, Lion Cove could perhaps turn out to be the even bigger upheaval – it’s possible that Zen 5 will end up being rearchitected in fewer ways, compared to its predecessor, than Intel’s core.

The overall layout of Lion Cove differs markedly from its predecessors. For example, Intel has abandoned its characteristic unified scheduler (the scheduler that distributes instructions among execution units), behind which all execution units – ALUs, AGUs and SIMD units (FPU) – were connected; this held true for cores from the P6 architecture line through Conroe and Sandy Bridge up to now. In contrast, AMD’s cores, and even Intel’s small E-Cores, use various split and distributed schedulers for different types of execution units. Lion Cove now switches to this concept as well.

6× ALU

The integer part of the execution backend contains six ALUs (one more than Golden Cove and probably the same number as Zen 5), and it now has its own integer scheduler. These six ALUs sit behind ports P0 to P5, and all of them can handle basic arithmetic-logic (ALU) instructions.

The even-numbered ports can also process branches (three “JMP” units are available for this). The odd-numbered ports, on the other hand, can process operations like shifts (SHIFT) and integer multiplication (MUL) in addition to simple ALU operations. The core can therefore perform each of these operation types at a rate of three per cycle. Of course, one port can only accept one operation per cycle; it cannot perform both an ALU operation and, say, a branch at once.
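To make the three-per-cycle multiply throughput concrete, here is a minimal sketch of our own (not Intel code, and the function name and values are hypothetical): three independent multiplication chains that an out-of-order scheduler can spread across the three MUL-capable odd ports.

```c
/* Illustrative sketch only: three independent multiply chains that an
 * out-of-order core can distribute across its three MUL-capable ports
 * (the odd ports on Lion Cove). A real measurement would also need a
 * cycle-accurate timing harness, which is omitted here. */
#include <stdint.h>

uint64_t mul_chains(uint64_t a, uint64_t b, uint64_t c,
                    uint64_t m, int iters)
{
    for (int i = 0; i < iters; i++) {
        a *= m;   /* chain 1 - independent of chains 2 and 3 */
        b *= m;   /* chain 2 */
        c *= m;   /* chain 3 */
    }
    return a ^ b ^ c;   /* keep all results live so the compiler keeps them */
}
```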

6× AGU

The load-store part, used for reading from and writing to memory, is also very powerful. The processor has a separate memory scheduler for load-store operations and six AGUs (ports P20, P21, P22, P25, P26, P27), three of which handle loads and three handle stores. Processors often make a portion of these units flexibly capable of both operations, resulting in better unit utilization across different workloads, as the ratio of reads to writes can fluctuate between programs. Intel, however, seems to have decided otherwise.

Single-purpose store or load units are probably smaller and lower-power, but this means that despite the presence of six pipelines, a hard maximum of three stores and a maximum of three loads can be served at any given time (more than three memory operations can only happen in one cycle when reads and writes are mixed).

Then there are two separate pipelines for Store Data operations on ports P10 and P11. These pipelines have their own store data scheduler.
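As a concrete illustration of the 3-load/3-store split (our own sketch, not an Intel example): a loop body with three independent load+store pairs could in principle keep all six AGU pipelines busy in the same cycle, whereas a loop issuing six loads could not.

```c
/* Illustrative sketch: three independent load+store streams. With three
 * load AGUs and three store AGUs, up to three reads plus three writes can
 * issue per cycle; six reads (or six writes) in one cycle cannot.
 * The function name and the scale factor are hypothetical. */
void scale_three_streams(float *restrict d0, float *restrict d1,
                         float *restrict d2, const float *restrict s0,
                         const float *restrict s1, const float *restrict s2,
                         int n)
{
    for (int i = 0; i < n; i++) {
        d0[i] = s0[i] * 2.0f;   /* load + store pair 1 */
        d1[i] = s1[i] * 2.0f;   /* load + store pair 2 */
        d2[i] = s2[i] * 2.0f;   /* load + store pair 3 */
    }
}
```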

Standalone FPU/SIMD unit for the first time

The FPU part in particular has its own scheduler (the vector scheduler), which is separated, together with the FPU execution units, into a distinct domain from the rest of the core, as is usual in the aforementioned AMD cores. This allows it to be put to sleep to save power, but likely also allows the core to be developed further more easily and flexibly, as these blocks are separated. For example, Intel has widened the ALU part while not adding any pipeline to the SIMD part.

However, it is more accurate to talk about a SIMD unit rather than an FPU, as this is its main use today, and SIMD instructions operate on both integer and floating-point data types. In the Lion Cove core, both types of SIMD operations are directed to this unit, whereas in previous big Intel cores, the vector units were hidden behind the same ports as the regular scalar integer ALUs. This had the disadvantage that using ALU ops blocked SIMD units on the same port and vice versa.

Now the SIMD, or vector, part has four dedicated pipelines with ports V0, V1, V2 and V3, which are usable in parallel with the ALUs. The separation of the SIMD unit also makes sense because it operates on its own registers (SSE provides XMM registers, AVX2 YMM registers and AVX-512 ZMM registers; the legacy x87 and MMX instructions also use their own registers, originally introduced for x87 FPUs), while the ALUs work with the so-called general-purpose registers.

Both the generic register set and the SIMD register set have their own physical register file on the die to store their contents. Accordingly, the ALU part and the SIMD unit also each have their own register renaming mechanism.

The four pipelines are designed so that half of them (V0 and V1) can handle FMA operations (i.e. fused multiply-add, though they can also perform a standalone addition or multiplication) and vector SHIFT operations. The other half of the pipes (V2, V3) can only perform floating-point addition (FADD) and not FMAs, but these pipelines/ports additionally provide the divider units (FPDIV) and the shuffle units. Thus, the core is always able to execute these kinds of instructions at a throughput of two per cycle (except for FADDs, where four per cycle will probably be possible). Integer SIMD operations, however, are supported on all four pipelines, so the core can handle four of those per cycle.
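For illustration, here is a minimal AVX2/FMA kernel of the sort that maps onto these pipelines – our own sketch, not Intel sample code. The fused multiply-add in the loop body is the kind of operation the V0/V1 ports accept at two per cycle.

```c
/* Sketch: a SAXPY-style loop using 256-bit FMA intrinsics (compile with
 * e.g. -mavx2 -mfma). Each _mm256_fmadd_ps is a fused multiply-add of the
 * kind described above; the function name is our own. */
#include <immintrin.h>

void saxpy8(float *y, const float *x, float a, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);  /* y = a*x + y, fused */
        _mm256_storeu_ps(y + i, vy);
    }
    /* the remaining n % 8 elements would be handled by a scalar tail loop */
}
```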

The SIMD unit in Lunar Lake processors should only be 256 bits wide, which means support for SSE–SSE4, AVX and AVX2 operations, but not for AVX-512 – yet again due to compatibility with the small cores in big.LITTLE processors. However, the server version of the core could support AVX-512 in the future, meaning it could probably have a 512-bit unit width.
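Software that wants to use the widest available vector path therefore still has to check at runtime. A minimal sketch using the GCC/Clang builtin __builtin_cpu_supports (the printed labels are our own):

```c
/* Sketch: runtime ISA dispatch with a GCC/Clang builtin. On Lunar Lake's
 * Lion Cove cores the AVX2 branch is the widest one expected to be taken. */
#include <stdio.h>

int main(void)
{
    if (__builtin_cpu_supports("avx512f"))
        puts("using AVX-512 path");   /* not expected on Lunar Lake */
    else if (__builtin_cpu_supports("avx2"))
        puts("using AVX2 path");      /* the 256-bit SIMD described above */
    else
        puts("using SSE/scalar fallback");
    return 0;
}
```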

Interestingly, there are still just four SIMD pipelines as in the previous core, even though the core has been widened with regard to the other units. However, separating SIMD and ALU operations could itself lead to interesting performance improvements (thanks to the elimination of ALU and SIMD port conflicts) even though the width of the SIMD units has not changed.

Very wide frontend with eight decoders

However, by jumping straight to the execution units (the so-called backend) at the start, we skipped the usual order of things. So let’s go back to the frontend, where instruction processing starts and whose role is to feed the execution units with work from the executing program code as efficiently as possible. That is not a trivial thing at all; on the contrary, it is literally a core problem in today’s processors, extremely complex and critical for performance.

Intel states that the fetch stage of the processor has been widened – the phase that fetches a chunk of code from the L1 instruction cache – but we don’t have exact numbers. Branch prediction has also been strengthened; the prediction block is reportedly 8× larger.

The x86 instruction set is dead because…

Lion Cove has eight parallel instruction decoders, so it can decode up to eight instructions in one cycle. This somewhat demolishes one of the mantras often espoused by critics of the x86 architecture: namely, that its variable-length instructions mean x86 processors will never have a large number of parallel decoders, which – according to this line of argument – makes it impossible for them to achieve the high IPC levels achievable with ARM processors (for example, Apple’s cores).

The Lion Cove core obviously overcomes this “impossibility” (although note that the previous Golden Cove already had six decoders), and it is apparently not prevented from doing so by the fact that these decoders are likely more complex and larger in area (and power consumption) than ARM architecture decoders. Of course, it remains to be seen from detailed analyses how well the core is able to employ so many decoders when running practical code.

In addition to the decoders, the processor can receive instructions from the uOP cache, which stores already decoded instructions so that they do not need to be decoded again, saving power. The Lion Cove core can pull up to 12 instructions per cycle from the uOP cache, so it can reach even higher performance in this mode. Typically, x86 processors should run from the uOP cache most of the time. When handling microcoded instructions, up to four operations per cycle can be sent onward for further processing.

RoB has 576 entries

The following allocate/rename stage, which renames registers to eliminate false dependencies in the code and improve performance by allowing operations to execute independently, has also been expanded. Up to eight operations per cycle can pass through it, compared to six in the previous core. Intel has also increased the uOP queue in both capacity and bandwidth, and in particular enlarged the Reorder Buffer (RoB), the queue that forms the “window” of instructions within which the core can perform out-of-order code execution.

This means that instructions are executed out of order as long as they do not depend on each other. So, for example, when the CPU has a free ALU, it will use it to execute an independent instruction that lies somewhere further down in the code. This increases IPC (performance per 1 MHz), and it works better the more of the code the CPU can see while looking for instructions it can execute in this manner.
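A small example of why this matters (our own sketch): a reduction written as a single dependency chain leaves the extra ALUs idle, while splitting it into independent accumulators gives the scheduler parallel work – the same kind of independence the out-of-order machinery hunts for automatically across the RoB window.

```c
/* Sketch: dependent vs. independent work. In sum_dependent every addition
 * waits for the previous one; in sum_split the four accumulators form
 * independent chains the scheduler can execute in parallel. */
long sum_dependent(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];               /* one serial dependency chain */
    return s;
}

long sum_split(const long *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];              /* four independent chains */
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)           /* scalar tail */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```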

In Golden Cove, the RoB was already quite large with a capacity of 512 instructions (or rather uOPs, since these are already decoded instructions). Lion Cove does not increase the depth radically: its RoB has a capacity of 576 operations. The core has also widened the final retirement stage, which can handle up to 12 operations per cycle instead of the previous eight.

Hyper-Threading: No longer needed?

The major new feature of Lion Cove, which we have known about for a while, is the removal of Hyper-Threading, or HT (HT is Intel’s company-specific branding; the “generic” designation is SMT). This technique processes two threads on one core at the same time. The goal is that with two threads, in multi-threaded workloads, the core can take advantage of resources (such as ALUs) that a single thread fails to utilise, recovering extra performance that would otherwise be left unused.

The total performance extracted from the processor increases – by up to 30%, Intel claims – meaning that each of the two threads delivers 65% of the performance a single thread would (the power efficiency of this processing is slightly better, even though the power consumption is 20% higher). When the core is processing a single-threaded application, the second thread is idle and ideally should not take away from the performance, which should thus be virtually 100%.
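Putting Intel’s numbers together (assuming the +30% throughput and +20% power figures describe the same workload): two threads delivering 1.30× the single-thread performance means 1.30 / 2 = 0.65× per thread, and 1.30 / 1.20 ≈ 1.08, i.e. roughly 8% better performance per watt than running one thread on the core.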

Intel says Hyper-Threading usually adds about 30% performance (Author: Intel, via: ComputerBase)

However, SMT requires the duplication of a number of structures in the core, which must constantly keep track of which instructions (and cached data) belong to which thread. This technique therefore makes it very difficult to ensure and verify correct processor operation (it is also the target of many timing attacks, so SMT is considered problematic for security), while also increasing the core’s die footprint.

Why isn’t HT worth it anymore?

Usually, the investment in this extra area is considered relatively small compared to the benefits. However, Intel has decided to remove SMT (HT) from the core and opt for the advantages of simplification instead. According to the company, this optimization has shrunk the core by 15%, and performance at a given power consumption can be 5% better (note that this figure possibly applies only to single-threaded performance). The reasoning is that Intel now makes hybrid processors whose E-Cores serve a similar purpose, making HT a somewhat duplicative solution.

Intel’s processor policy is that a multithreaded task (HT is not relevant for tasks using just one or a few threads) first occupies the P-Cores with one thread per core, then allocates further threads to the E-Cores, and only in third place does it start occupying the second HT threads on the P-Cores. This means that in less scalable tasks, SMT/HT would be used less often than the E-Cores anyway, which again contributed to the decision to ditch HT. However, in easily scaling tasks that can take advantage of all available threads, taking HT away from Arrow Lake and Lunar Lake processors will hurt performance a bit – this is a trade-off.
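To illustrate the described ordering, here is a sketch with hypothetical core counts – not Intel’s actual Thread Director logic, which is far more dynamic.

```c
/* Sketch: the thread-allocation order described above, with made-up core
 * counts. Real scheduling is done by the OS using Thread Director hints. */
#include <stdio.h>

int main(void)
{
    const int p_cores = 6, e_cores = 8;   /* hypothetical hybrid config */
    int t = 1;
    for (int i = 0; i < p_cores; i++)
        printf("thread %2d -> P-core %d (first thread)\n", t++, i);
    for (int i = 0; i < e_cores; i++)
        printf("thread %2d -> E-core %d\n", t++, i);
    for (int i = 0; i < p_cores; i++)     /* only on cores that still have HT */
        printf("thread %2d -> P-core %d (second HT thread)\n", t++, i);
    return 0;
}
```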

Cache memory completely redesigned, L0/L1 added

Intel has also redesigned the cache system, which – except for specific capacities – has worked quite similarly in client processors since the Nehalem architecture of 2008. The processor now has L0, L1, L2 and L3 caches, plus an SLC cache for the entire SoC. It’s not a completely from-scratch design, though. The L0 data cache has a capacity of 48 kB and a latency of 4 cycles. It seems to have more or less adopted the role and architecture of the previous L1 cache, but it is not quite the same: the L1 cache in the previous core had the same capacity, but a latency of 5 cycles for most operations.

But what is new is the cache now called the L1 data cache – a memory with a capacity of 192 kB and a latency of 9 cycles, which is quite low (a small L2 cache typically had a latency of around 12 cycles, for example in Zen cores), so it could perhaps alternatively be called an L1.5 cache.

This memory is probably meant to make up for a disadvantage the big Intel cores have had lately. Intel kept increasing the L2 capacity while its latency got worse (Skylake still had just a fast 256kB L2), which negated part of the advantage of the better hit rate. In Lion Cove, the L2 cache is 3MB (though it seems it might only be 2.5MB in Lunar Lake) with a latency of 17 cycles. So it can fit fairly large program working sets, but accesses take longer. The new L1 cache compensates for this by providing less space, but at even higher speed than the earlier fast L2 caches in cores like Skylake.
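Latencies like these are typically measured with a pointer-chasing loop. A minimal sketch (the buffer setup and cycle-accurate timing harness are omitted, and the function is our own illustration):

```c
/* Sketch: serial pointer chasing. Every load depends on the previous one,
 * so (elapsed cycles / hops) approximates the load-to-use latency of the
 * cache level the buffer fits in: ~4 cycles inside the 48 kB L0, ~9 in
 * the 192 kB L1, ~17 in the 3 MB L2, per the figures above. */
#include <stddef.h>

size_t chase(const size_t *next, size_t start, long hops)
{
    size_t i = start;
    while (hops--)
        i = next[i];   /* next[] holds a randomized cycle over the buffer */
    return i;
}
```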

In addition to the data cache itself, Intel also increased the TLB for the L0 data cache from 96 to 128 entries for the Lion Cove core.

The L3 cache in Lunar Lake has a capacity of 12 MB, and it apparently only serves the four big cores, not the efficient cores. This is so that in power saving mode, when the P-Cores are “asleep”, the L3 cache and interconnect logic of the big cores can be powered down too. However, this should not be the case in Arrow Lake processors, where the E-Cores will probably share the L3 cache as usual in order to reach better performance.

Clock speeds will no longer jump in 100 MHz increments, smarter boost

Lion Cove differs in one more thing, which again probably demonstrates the depth of Intel’s changes to the architecture. For a very long time (since the FSB was discontinued), the clock speeds of Intel’s large cores have moved in 100 MHz increments. Similar to how AMD went to finer 25MHz steps with the Zen architecture, Intel has now moved to a frequency granularity of 16.66 MHz for the Lion Cove core (i.e. the 100 MHz step has been split into six slices).

This allows finer tuning of the clock speed. You may not see it directly in the specs, which may still be aligned to round hundreds of MHz, but the processor will be able to adjust its boost clock more smoothly, and thanks to the greater granularity it should usually be able to clock a bit higher than when it was bound to 100MHz steps. This should therefore improve performance a bit, and also power efficiency.
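A toy calculation shows the effect (the headroom figure is a made-up example, our own sketch): with a power/thermal budget allowing, say, 5480 MHz, a core bound to 100 MHz steps has to fall back to 5400 MHz, while 16.66 MHz steps let it run at about 5467 MHz.

```c
/* Sketch: how finer frequency steps waste less headroom. The 5480 MHz
 * budget is a hypothetical example value. */
#include <stdio.h>

static double quantize_down(double f_max_mhz, double step_mhz)
{
    return step_mhz * (long)(f_max_mhz / step_mhz);
}

int main(void)
{
    double budget = 5480.0;                       /* hypothetical limit */
    printf("100 MHz steps:   %.2f MHz\n", quantize_down(budget, 100.0));
    printf("16.66 MHz steps: %.2f MHz\n", quantize_down(budget, 100.0 / 6.0));
    return 0;
}
```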

The automatic clock speed and power management should also be more advanced. Until now, the behavior has been predefined and quite rigid, but the new boost should adapt automatically to current conditions and hopefully achieve better performance; the algorithm is reportedly optimized using AI.

14% better IPC compared to Meteor Lake

The purpose of these major changes and the broad design is both to increase performance per 1 MHz (the popular so-called IPC figure) and to provide a basis for future growth. Intel states that in the current version of the architecture, Lion Cove, the core has roughly 14% higher IPC than the previous Redwood Cove core.

Note that the comparison is against the Redwood Cove core used in the 4nm Meteor Lake processors, not against the Golden Cove / Raptor Cove cores used in Alder Lake and Raptor Lake processors. According to some tests, Redwood Cove may be a little worse per 1 MHz, but even so, Lion Cove should still show a double-digit IPC improvement over Raptor Lake processors at the same clock speed.

This figure is slightly worse than the one reported by AMD for Zen 5 (+16%). However, numbers like this tend to be a bit arbitrary, as they depend on which tests are included in the IPC measurement. So it is too early to discuss whether Zen 5 or Lion Cove will demonstrate a bigger jump in performance in practice.

For Lion Cove, of course, the final performance will also depend on clock speeds, the other crucial factor, just as important as IPC. While Zen 5 is expected to reach the same maximum clock speeds as the previous-generation Zen 4 architecture in both desktop (5.7 GHz) and laptop (5.1 GHz) processors, we don’t know the numbers yet for Lunar Lake and Arrow Lake processors.

These Intel CPUs use a TSMC manufacturing node instead of Intel’s own for the first time, which is a radical change. It represents a jump to a 3nm node, which should be an advantage in power efficiency, but it’s not certain that this won’t have an adverse effect on achievable clock speeds, as Intel is reportedly using only the basic N3B version for now, not the improved N3P node that recently allowed Apple to significantly increase the clock speeds of its M4 processor.

Lion Cove core schematic (Author: Intel, via: AnandTech)

Regardless of how well Lion Cove performs in benchmarks, though, its architectural features look quite remarkable from what we know now. Intel should make significant strides in performance and power efficiency with this microarchitecture, and it should help the company’s competitiveness quite a bit (unless the processor ends up hampered by low maximum clock speeds – that remains to be seen).

Various preliminary obituaries and judgments calling Intel dysfunctional or hopelessly behind might need to be reconsidered in the future. At the same time, this architecture, like AMD’s Zen 5, should also open up potential for further development beyond the processors directly built on Lion Cove, so it could be an important step in the long-term development of the Intel CPU story.

Sources: Intel, AnandTech, Tom’s Hardware

English translation and edit by Jozef Dudáš

