Computing Under Pressure: Key Insights From Hot Chips 2025

Tyler Kvochick | November 20, 2025 | 3 Min Read

Artificial intelligence is stretching the limits of what our data centers can deliver. That was the clear message at Hot Chips 2025, the annual conference where chip designers, system architects, and researchers reveal what is next in compute hardware. This year, the takeaway was simple and unavoidable: physics has become the primary constraint on progress. If organizations want smarter and faster AI, they must reimagine how bits move, how networks scale, and how efficiently power turns into intelligence.

Our team attended the event and returned energized and excited for the next generation of compute and networking. The work happening in silicon, optics, memory, and interconnect technology is about to reshape how data centers are built. Here are the insights we believe matter most for owners, developers, and operators of AI infrastructure.

The Data Center Is the New Unit of Compute

In multiple presentations, speakers stated that the data center is now the true computer. Not the CPU. Not the GPU. Not the rack. The entire building. AI workloads rely on massive clusters that behave like a single machine, distributing memory, compute, and cooling through the entire space. To make those clusters useful, every piece of the environment must scale together. Networking, storage, cooling, and power delivery are becoming programmable components in one tightly coordinated system.

AI models are also not shrinking. They are ballooning because larger context windows make them more useful, and the attention mechanism that underpins transformer-based models scales quadratically with the size of the context window. More useful models therefore consume more memory, but there is only so much physical area on the die to place that memory. The only place to get more memory is to add more compute nodes to a cluster, which requires far more interconnect bandwidth. The organizations driving advances in compute capability are limited by their ability to move information between chips. This is why networking is now the bottleneck under the most intense engineering scrutiny.
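
A rough sense of that scaling, as a minimal Python sketch; the layer count, head counts, head size, and fp16 precision below are illustrative assumptions, not figures presented at the conference:

```python
# Illustrative only: how attention-related memory scales with context length.
# The layer count, head counts, head size, and fp16 precision are assumptions
# for this sketch, not figures from any Hot Chips presentation.

BYTES_FP16 = 2

def kv_cache_bytes(context_len: int, n_layers: int = 80, n_kv_heads: int = 8, head_dim: int = 128) -> int:
    """Per-sequence KV cache: grows linearly with context length."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * BYTES_FP16

def attention_scores_bytes(context_len: int, n_heads: int = 64) -> int:
    """A fully materialized attention score matrix: grows quadratically with context length.
    (Fused kernels avoid materializing it, but the compute still scales the same way.)"""
    return n_heads * context_len * context_len * BYTES_FP16

for ctx in (8_192, 131_072, 1_048_576):
    print(f"context {ctx:>9,}: "
          f"KV cache ~{kv_cache_bytes(ctx) / 2**30:7.1f} GiB, "
          f"score matrix ~{attention_scores_bytes(ctx) / 2**40:9.2f} TiB")
```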

Photonics Steps into the Spotlight

The most exciting theme was photonic networking: using photons rather than electrons to move data inside and between compute systems. Photonics has long been seen as a future technology, but Hot Chips made one thing as clear as day: light is faster.

Photonic interconnects transmit data using dramatically less energy per bit than electrical signaling. Even small improvements are worth celebrating at hyperscale, but proponents demonstrated power draw an order of magnitude lower than today's systems.
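
As a back-of-the-envelope illustration of what energy per bit means at this scale, here is a short Python sketch; the picojoule-per-bit values and the aggregate switch bandwidth are assumptions chosen for the example, not numbers quoted at Hot Chips:

```python
# Back-of-the-envelope: what energy per bit means at switch scale.
# The pJ/bit values and the 51.2 Tb/s aggregate bandwidth are assumptions
# for this sketch, not figures quoted by any vendor at the conference.

SWITCH_BANDWIDTH_BPS = 51.2e12      # assumed aggregate switch bandwidth, bits per second
CASES = {
    "electrical / pluggable optics": 10.0,  # assumed picojoules per bit
    "co-packaged photonics":          1.0,  # assumed picojoules per bit
}

def interconnect_watts(pj_per_bit: float, bandwidth_bps: float) -> float:
    """Power = energy per bit x bits per second."""
    return pj_per_bit * 1e-12 * bandwidth_bps

for label, pj in CASES.items():
    watts = interconnect_watts(pj, SWITCH_BANDWIDTH_BPS)
    print(f"{label:>30}: ~{watts:5.0f} W of I/O power per switch")
```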

Nvidia showcased new Spectrum-X hardware that uses micro-ring modulators to control light at the chip level. The result is large-scale switching fabrics with significantly lower transceiver power consumption, thanks in part to the use of external laser sources. Several startups also showed breakthrough developments, including prototype silicon that supports tens of terabits per second of optical bandwidth right at the chip interface.

What does this mean for data center design? Fiber everywhere. The next generation of GPU clusters will require 10 to 30 times more fiber density per device than existing deployments. Cable management, pathway planning, and connector strategy will become central architectural challenges. At the same time, the higher density will mean that enterprises with less available space can install AI training clusters with compute capacity that would require a dedicated site today.

AI Is Driving Every Decision

There was no ambiguity about the killer application for the GPUs and switches. Every session, every roadmap, and every forward-looking concept revolved around artificial intelligence. Generative AI and agent-based AI require enormous computational resources, which depend heavily on high-speed parallel memory access.

Remote direct memory access (RDMA) is now the backbone technology for large AI training clusters. RDMA allows accelerators to read and write to each other’s memory directly over the network. It reduces overhead and increases throughput, but it also creates a highly coupled compute environment. Training clusters are latency intolerant. Failures propagate instantly. Any slow link delays the entire system.

LLM architectures make this challenge even more intense. Multi-head attention layers require all outputs to be collected before a model can progress. If one GPU finishes late, every GPU waits. If one link drops packets, throughput collapses while the fabric recovers. The overall system is only as strong as its weakest interconnect. Up-front capital expenditure on redundant compute and connectivity also de-risks operational expenditure against hardware failure.
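
The synchronization effect can be seen in miniature with a collective operation. The sketch below uses PyTorch's torch.distributed all_gather, which runs over RDMA-capable backends such as NCCL on real clusters; the group size, backend, and tensor shapes are arbitrary choices for illustration:

```python
# Minimal sketch of the "every GPU waits" effect, using torch.distributed
# collectives (which run over RDMA-capable backends such as NCCL on real
# clusters). Group size, backend, and tensor shapes are arbitrary choices.

import torch
import torch.distributed as dist

def gather_attention_shards(local_shard: torch.Tensor) -> torch.Tensor:
    """Every rank contributes its shard; no rank proceeds until all have arrived."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard)  # blocks until the slowest rank contributes
    return torch.cat(shards, dim=-1)

if __name__ == "__main__":
    # Typically launched with: torchrun --nproc-per-node=4 this_script.py
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU clusters
    rank = dist.get_rank()
    local = torch.full((4, 8), float(rank))   # stand-in for this rank's attention output
    full = gather_attention_shards(local)
    print(f"rank {rank}: gathered tensor with shape {tuple(full.shape)}")
    dist.destroy_process_group()
```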

This shows a simple truth: connectivity is compute.

Bigger Loads, Higher Densities, and New Power Strategies

Performance requires power. Lots of it. AI clusters are already pushing rack densities toward the 100 kilowatt class. Liquid cooling is not optional anymore. Rack weights are increasing as well, which means structural coordination becomes more critical.

Major cloud operators are now considering distributing higher-voltage direct current power directly to racks rather than relying on in-row conversion. This would free up valuable rack units for compute and reduce the thermal waste of power electronics. When every cubic inch of the rack must support training or inference, operators want fewer overhead components consuming space.
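
A quick calculation shows why the voltage level matters; the 100 kilowatt rack load and the voltages compared below are assumptions for illustration, not a specific operator's design:

```python
# Rough illustration of why higher-voltage DC distribution is attractive.
# The 100 kW rack load and the voltage levels compared are assumptions for
# this sketch, not a specific operator's design.

RACK_LOAD_W = 100_000  # assumed AI rack load in watts

def bus_current_amps(load_watts: float, bus_voltage: float) -> float:
    """I = P / V: the current the distribution path must carry to feed the rack."""
    return load_watts / bus_voltage

for volts in (48, 400, 800):
    amps = bus_current_amps(RACK_LOAD_W, volts)
    # Conductor sizing and resistive (I^2 * R) losses both track this current.
    print(f"{volts:>4} V DC bus: ~{amps:6.0f} A delivered to the rack")
```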

Standardization vs Customization

Turnkey training clusters are rapidly becoming the default for organizations that want to move quickly. Nvidia reference architectures such as NVL72 provide a proven baseline, which reduces integration risk and deployment time. However, large-scale operators with unique goals continue to tweak reference designs or build custom architectures. Meta's Catalina cluster diverges from Nvidia's defaults. Google remains committed to its in-house TPU strategy.

The takeaway is that the industry will remain split. Most data centers will follow tightly integrated vendor ecosystems. The biggest platforms will pursue optimization advantages with bespoke systems. Designers and integrators must support both routes.

What This Means for TEECOM Clients

The conference highlighted specific areas where owners will increasingly rely on expert partners:

  • Fiber layout and distance planning: Switches supporting RDMA have strict limits on buffer sizes and optical distance budgets. A cable route that is a few meters too long could create bottlenecks under load (see the sketch after this list).
  • Topology and interconnect strategy: All-to-all architectures are more applicable to AI data centers than multi-stage Clos networks. This changes rack adjacency, cable tray sizing, and space planning.
  • Legacy integration: AI pods will be added to existing facilities. Properly bridging new high-speed fabrics into spine networks will be a high-stakes design decision.
  • Cooling coordination: Liquid cooling technologies vary. Integration requires mechanical, electrical, and network design alignment from day one.
  • Power delivery transformation: High-voltage DC distribution will require new safety practices and equipment layouts.
  • Site design: Converting AC to DC on a data center campus will require coordinating additional structures with their own telecom and security requirements.
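
As referenced in the first bullet, here is a small sketch of how fiber length eats into a latency budget; the round-trip budget is an assumed figure, while the roughly 4.9 nanoseconds per meter follows from light traveling at about c/1.47 in standard single-mode fiber:

```python
# Sketch of the fiber-length sensitivity from the first bullet above.
# ~4.9 ns/m follows from light traveling at roughly c / 1.47 in standard
# single-mode fiber; the round-trip latency budget is an assumed figure.

SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_REFRACTIVE_INDEX = 1.47                                         # typical single-mode fiber
NS_PER_METER = 1e9 * FIBER_REFRACTIVE_INDEX / SPEED_OF_LIGHT_M_PER_S  # ~4.9 ns per meter

BASELINE_RTT_NS = 2_000  # assumed switch-to-switch round-trip budget

def added_round_trip_ns(extra_meters: float) -> float:
    """Extra propagation delay when a link is routed `extra_meters` longer (both directions)."""
    return 2 * extra_meters * NS_PER_METER

for extra in (5, 20, 50):
    added = added_round_trip_ns(extra)
    print(f"+{extra:>3} m of fiber: +{added:4.0f} ns round trip "
          f"(~{100 * added / BASELINE_RTT_NS:4.1f}% of a {BASELINE_RTT_NS} ns budget)")
```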

AI infrastructure continues to evolve rapidly. Early adopters are already designing around photonics, extreme RDMA performance, and unprecedented power densities. Many operators will follow within the next few refresh cycles.

Data centers are becoming giant cohesive computers built to train and deploy intelligence at scale. The winners in this next era will be the organizations that embrace these shifts early and design facilities that match the pace of innovation. TEECOM is ready to help them get there. Contact Tyler Kvochick today to discuss these insights further.

About the Author

Tyler Kvochick is Director of Research at TEECOM, where he leads the development of novel system-design platforms to enhance quality and efficiency in TEECOM’s projects. With a master’s degree in architecture from Princeton University and experience as a software developer in construction-tech and environmental-analysis companies, Tyler combines computational and design thinking to eliminate errors, information loss, and ambiguity for our clients.