Nvidia’s new fastest AI GPU: H200 with 141GB of HBM3E memory

Hopper receives faster memory and a performance increase

Last year, Nvidia launched the 4nm H100 accelerator based on the Hopper architecture, and it has since been the company’s fastest GPU for AI. Now the company is launching its successor, dubbed H200. It isn’t quite a new generation, but rather a refresh that will lead Nvidia’s lineup until the next generation, based on the Blackwell architecture, is released. The H200’s main upgrade is faster memory, but that should also lift overall performance.

The H200 accelerator should use the same 4nm Hopper chip with 80 billion transistors as the H100, and probably also the same mezzanine form factor. What is new, however, is the HBM3E memory; the H200 should apparently be the first GPU on the market to use it. This memory provides a capacity of 141GB, an unusually irregular number. The physical capacity should apparently be 144GB, made up of six 24GB HBM3E packages, but 3GB are unavailable for an undisclosed reason.

The question is whether the GPU keeps the missing space reserved for some dedicated purpose, or whether Nvidia, in cooperation with the manufacturers of this memory, can disable individual DRAM layers in HBM3E packages (if these 24GB packages are eight-layer stacks, one DRAM layer corresponds exactly to 3GB of capacity). That would make it possible to salvage an HBM3E package with a defect found only after it was mounted on the GPU, whereas normally the entire package would have to be deactivated, at the cost of a significant loss of performance and memory capacity.
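The capacity arithmetic behind this theory can be sketched in a few lines of Python. The eight-layer stack height and the idea of exactly one disabled layer are assumptions taken from the speculation above, not figures confirmed by Nvidia; the variable names are purely illustrative.

```python
# Capacity arithmetic under the article's hypothesis (assumptions, not Nvidia data):
# six 24GB HBM3E stacks, each eight DRAM layers high, with one layer left unused.
STACKS = 6
STACK_CAPACITY_GB = 24
LAYERS_PER_STACK = 8  # assumed stack height

physical_gb = STACKS * STACK_CAPACITY_GB              # 144 GB installed
layer_gb = STACK_CAPACITY_GB / LAYERS_PER_STACK       # 3.0 GB per DRAM layer
one_layer_off_gb = physical_gb - layer_gb             # 141.0 GB, matches the spec
one_stack_off_gb = physical_gb - STACK_CAPACITY_GB    # 120 GB if a whole stack is disabled

print(f"installed: {physical_gb} GB, per layer: {layer_gb} GB")
print(f"one layer disabled: {one_layer_off_gb} GB, one stack disabled: {one_stack_off_gb} GB")
```

Under these assumptions, losing a single layer costs 3GB, while disabling a whole stack would cost 24GB and, as with the H100, one sixth of the memory bus.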

Memory bandwidth reaches up to 4.8 TB/s, which, with a 6144-bit bus spread across six packages, means the memory should run at roughly 6.25 Gb/s per bit of bus width (about 6250 MHz effective). For the H100, Nvidia claimed a bandwidth of 3 TB/s, so this is an increase of up to 60%. That is due not only to the higher HBM3E clock speed, but also to the full 6144-bit interface being used, whereas the H100 used only 5120 bits because just five of its six HBM3 packages were active.
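As a quick check of the bandwidth figures, here is a minimal Python sketch that derives the per-pin data rate from total bandwidth and bus width. It assumes decimal units (1 TB/s = 10^12 bytes/s) and 1024 bits per HBM package; `per_pin_rate_gbps` is a hypothetical helper written for this illustration.

```python
def per_pin_rate_gbps(bandwidth_tb_s: float, bus_width_bits: int) -> float:
    """Data rate per bit of bus width (Gb/s) implied by a total bandwidth.

    Uses decimal units: 1 TB/s = 1e12 bytes/s, 1 Gb/s = 1e9 bits/s.
    """
    bits_per_second = bandwidth_tb_s * 1e12 * 8
    return bits_per_second / bus_width_bits / 1e9


h200_rate = per_pin_rate_gbps(4.8, 6 * 1024)  # six active 1024-bit stacks
h100_rate = per_pin_rate_gbps(3.0, 5 * 1024)  # five active stacks out of six
bandwidth_gain = 4.8 / 3.0 - 1.0              # relative increase in total bandwidth

print(f"H200: {h200_rate:.2f} Gb/s per pin")
print(f"H100: {h100_rate:.2f} Gb/s per pin")
print(f"bandwidth gain: {bandwidth_gain:.0%}")
```

This prints 6.25 Gb/s for the H200, about 4.69 Gb/s for the H100 and a 60% gain, showing that the increase comes from both a faster per-pin rate and a wider bus.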

Nvidia H200

We don’t know whether the clock speeds or the number of compute units have increased. In the H100, the chip had 16,896 shaders (132 SMs) and 528 tensor cores enabled, with a boost clock of around 1.98 GHz, giving a raw performance of 66.9 TFLOPS in FP32 operations and 33.5 TFLOPS in FP64. Using the tensor cores and 8-bit precision, theoretical performance approaches 2000 TOPS. The TDP of the original version was 700 W; again, we don’t know yet whether it has stayed the same.
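The peak-throughput figures follow from a simple formula: shaders × 2 FLOPs per clock (one fused multiply-add) × clock. The sketch below assumes that counting convention and Hopper’s half-rate FP64 vector throughput; it also works backwards from the quoted 66.9 TFLOPS to the boost clock that figure implies. `peak_tflops` is a hypothetical helper.

```python
def peak_tflops(shaders: int, boost_clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak: each shader retires one FMA (2 FLOPs) per clock."""
    return shaders * flops_per_clock * boost_clock_ghz / 1000.0


H100_SHADERS = 16_896  # 132 SMs x 128 FP32 units per SM

fp32 = peak_tflops(H100_SHADERS, 1.98)
fp64 = fp32 / 2  # assumed: FP64 vector rate is half the FP32 rate on Hopper
implied_clock_ghz = 66.9 * 1000.0 / (H100_SHADERS * 2)

print(f"FP32 ~ {fp32:.1f} TFLOPS, FP64 ~ {fp64:.1f} TFLOPS")
print(f"clock implied by 66.9 TFLOPS ~ {implied_clock_ghz:.2f} GHz")
```

The same function can be reused once Nvidia publishes the H200’s shader count and clocks.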

Nvidia states that the new product delivers up to 60% higher performance than the H100 in GPT-3 inference with 175 billion parameters and up to 90% higher performance in Llama 2 inference with 70 billion parameters. In HPC simulation workloads it can be up to 2x faster, but this last figure compares it only against the 7nm Ampere A100, not the H100. Bear in mind, though, that these are vendor-provided benchmarks, which may be selective and thus misleading. For example, if the company picked tasks that were previously slowed down severely because they didn’t fit into the available memory (a bottleneck the H200 removes), the resulting speedups will not be representative of tasks that were never limited by memory capacity.
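To illustrate the capacity effect described above, here is a deliberately simplified Python check of whether a model’s weights alone fit into a single GPU’s memory. The helpers are hypothetical, 2 bytes per parameter (16-bit weights) is an assumption, and real inference also needs room for the KV cache and activations. Nvidia has not said which precision, batch size or GPU count its benchmarks used, so this does not reproduce the vendor figures.

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory needed for model weights alone, in decimal GB."""
    return params_billions * bytes_per_param


def fits(params_billions: float, bytes_per_param: float, capacity_gb: float) -> bool:
    """True if the weights alone fit; ignores KV cache, activations and overheads."""
    return weights_gb(params_billions, bytes_per_param) <= capacity_gb


# Llama 2 70B with 16-bit (2-byte) weights on a single GPU, weights only
for name, capacity in (("H100", 80.0), ("H200", 141.0)):
    need = weights_gb(70, 2)
    print(f"{name}: need {need:.0f} GB, have {capacity:.0f} GB, fits: {fits(70, 2, capacity)}")
```

With 16-bit weights, Llama 2 70B needs about 140GB, which cannot fit in an 80GB H100 but just fits in the H200’s 141GB: the kind of case where removing a capacity bottleneck could inflate a measured speedup.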

HGX system board with four CPUs and GH200 Grace Hopper Superchip accelerators

The H200 will be produced as a standalone mezzanine accelerator (which needs a special carrier motherboard; there is no information about a standard PCI Express version yet). Nvidia will also offer a version combined on a single module with an Arm-based Grace processor, named the GH200 Grace Hopper Superchip.

The Jupiter supercomputer at the Jülich Supercomputing Centre in Germany is currently being built on these chips. It will be an Eviden BullSequana XH3000 cluster with just under 24,000 Grace Hopper Superchips. Its power draw is to reach up to 18.2 MW, and its performance should be 90 EFLOPS in AI operations and up to 1 EFLOPS in scientific computing (FP64), which could put the system in the “exascale” club.
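For a sense of scale, the system-level figures can be divided into rough per-chip and per-watt averages. This is a naive sketch: it rounds “just under 24,000” up to 24,000, treats the quoted “up to” numbers as exact, and attributes the entire 18.2 MW to compute, so the results are only ballpark figures.

```python
# Ballpark averages derived from the quoted Jupiter system totals.
# Assumptions: 24,000 superchips (rounded), quoted peak figures taken at face value,
# and all 18.2 MW counted against FP64 throughput.
SUPERCHIPS = 24_000
POWER_W = 18.2e6       # 18.2 MW
FP64_FLOPS = 1.0e18    # 1 EFLOPS in FP64
AI_FLOPS = 90.0e18     # 90 EFLOPS in low-precision AI operations

fp64_per_chip_tflops = FP64_FLOPS / SUPERCHIPS / 1e12   # ~41.7 TFLOPS
ai_per_chip_pflops = AI_FLOPS / SUPERCHIPS / 1e15       # 3.75 PFLOPS
fp64_gflops_per_watt = FP64_FLOPS / POWER_W / 1e9       # ~54.9 GFLOPS/W

print(f"FP64 per superchip: {fp64_per_chip_tflops:.1f} TFLOPS")
print(f"AI per superchip:   {ai_per_chip_pflops:.2f} PFLOPS")
print(f"FP64 efficiency:    {fp64_gflops_per_watt:.1f} GFLOPS/W")
```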

The Jupiter supercomputer using the GH200 Grace Hopper Superchip

Available in Q2 2024

As is usual with Nvidia’s compute GPUs (and other companies’ server products), this unveiling is preliminary, and actual availability will come much later. The H200 should arrive in the second quarter of 2024, when these accelerators will become available from server manufacturers and in cloud services. Nvidia itself will offer these GPUs in four- and eight-GPU configurations on its HGX platform.

Sources: Nvidia (1, 2), AnandTech

Jan Olšan, editor @ Cnews.cz

