It’s AWS re:Invent this week, Amazon’s annual cloud computing extravaganza in Las Vegas, and as is tradition, the company has so much to announce, it can’t fit everything into its five (!) keynotes. Ahead of the show’s official opening, AWS on Monday detailed a number of updates to its overall data center strategy that are worth paying attention to.
The most important of these is that AWS will soon start using liquid cooling for its AI servers and other machines, whether those are based on its homegrown Trainium chips or Nvidia’s accelerators. Specifically, AWS notes that its Trainium2 chips (which are still in preview) and “rack-scale AI supercomputing solutions like NVIDIA GB200 NVL72” will be cooled this way.
It’s worth highlighting that AWS stresses that these updated cooling systems can integrate both air and liquid cooling. After all, there are still plenty of other servers in the data centers that handle networking and storage, for example, that don’t require liquid cooling. “This flexible, multimodal cooling design allows AWS to provide maximum performance and efficiency at the lowest cost, whether running traditional workloads or AI models,” AWS explains.
The company also announced that it is moving to simpler electrical and mechanical designs for its servers and server racks.
“AWS’s latest data center design improvements include simplified electrical distribution and mechanical systems, which enable infrastructure availability of 99.9999%. The simplified systems also reduce the potential number of racks that can be impacted by electrical issues by 89%,” the company notes in its announcement. In part, AWS is doing this by reducing the number of times the electricity gets converted on its way from the electrical network to the server.
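To put the 99.9999% figure in perspective, the short calculation below converts an availability percentage into the downtime it permits per year. It is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes a 365-day year and that availability is measured over a full year, neither of which AWS specifies.

```python
def annual_downtime_seconds(availability_percent: float,
                            days_per_year: float = 365.0) -> float:
    """Return the downtime (in seconds per year) allowed by an availability figure.

    Assumption: availability is averaged over one year of `days_per_year` days.
    """
    seconds_per_year = days_per_year * 24 * 60 * 60
    unavailable_fraction = 1 - availability_percent / 100
    return seconds_per_year * unavailable_fraction


if __name__ == "__main__":
    # "Six nines" works out to roughly half a minute of downtime per year.
    print(f"{annual_downtime_seconds(99.9999):.1f} seconds per year")
```

Under those assumptions, six nines leaves room for roughly 31.5 seconds of infrastructure downtime a year.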
AWS didn’t provide many more details than that, but this likely means using DC power to run the servers and/or HVAC system and avoiding many of the AC-DC-AC conversion steps (with their inherent losses) otherwise necessary.
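Why fewer conversion steps matter comes down to simple multiplication: each stage passes along only a fraction of the power it receives, so the losses compound. The sketch below illustrates the idea. The 97% per-stage efficiency and the stage counts are purely hypothetical placeholders, not figures from AWS, and the function name is invented for this example.

```python
from math import prod


def chain_efficiency(stage_efficiencies: list[float]) -> float:
    """Overall efficiency of power conversion stages connected in series.

    Each value is the fraction of input power a stage delivers to the next one,
    so the end-to-end efficiency is the product of all stages.
    """
    return prod(stage_efficiencies)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (not AWS data).
    four_stages = [0.97, 0.97, 0.97, 0.97]
    two_stages = [0.97, 0.97]

    print(f"four conversion stages: {chain_efficiency(four_stages):.1%} delivered")
    print(f"two conversion stages:  {chain_efficiency(two_stages):.1%} delivered")
```

Whatever the real per-stage numbers are, removing a stage from the chain removes its loss entirely, which is why cutting conversions is a common route to better efficiency.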
“AWS continues to relentlessly innovate its infrastructure to build the most performant, resilient, secure, and sustainable cloud for customers worldwide,” said Prasad Kalyanaraman, vice president of Infrastructure Services at AWS, in Monday’s announcement. “These data center capabilities represent an important step forward with increased energy efficiency and flexible support for emerging workloads. But what is even more exciting is that they are designed to be modular, so that we are able to retrofit our existing infrastructure for liquid cooling and energy efficiency to power generative AI applications and lower our carbon footprint.”
In total, AWS says, the new multimodal cooling system and upgraded power delivery system will let the organization “support a 6x increase in rack power density over the next two years, and another 3x increase in the future.”
In this context, AWS also notes that it is now using AI to predict the most efficient way to position racks in the data center to reduce the amount of unused or underutilized power. AWS will also roll out its own control system across its electrical and mechanical devices in the data center, which will come with built-in telemetry services for real-time diagnostics and troubleshooting.
“Data centers must evolve to meet AI’s transformative demands,” said Ian Buck, vice president of hyperscale and HPC at NVIDIA. “By enabling advanced liquid cooling solutions, AI infrastructure can be efficiently cooled while minimizing energy use. Our work with AWS on their liquid cooling rack design will allow customers to run demanding AI workloads with exceptional performance and efficiency.”