Microsoft's first "AI super factory" is put into operation: connecting two data centers to build a distributed network

Wallstreetcn
2025.11.13 00:10

Microsoft's first "AI super factory" has officially commenced operations, connecting data centers in Atlanta and Wisconsin through a dedicated high-speed network to build a cross-state collaborative distributed computing cluster. This innovative architecture integrates dispersed computing resources into a virtual supercomputer, marking a significant shift in AI infrastructure from independent site construction to a new era of networked collaboration

Microsoft is opening a new chapter in its AI infrastructure by connecting large data centers across different states into a collaborative, distributed "AI super factory." The strategy aims to train AI models at unprecedented scale and speed, and marks a shift in the industry's competition from building standalone sites to deploying networked clusters to meet explosive demand for computing power.

According to Microsoft, its next-generation AI data center in Atlanta officially began operations in October this year. It is the second facility in Microsoft's "Fairwater" series and has been linked via a dedicated high-speed network to the previously announced Fairwater site in Wisconsin. Microsoft's first interstate collaborative AI computing cluster is therefore now operational, capable of cutting complex AI training runs that would normally take months down to weeks.

This move comes amid an escalating "AI arms race" among tech giants. According to The Wall Street Journal, Microsoft plans to double its total data center footprint over the next two years to meet surging demand for computing power. The new "AI super factory" network will support not only core workloads such as OpenAI, Microsoft's own AI superintelligence team, and Copilot, but also key clients such as France's Mistral AI and Elon Musk's xAI, underscoring its central position in AI infrastructure.

Behind this massive construction plan is significant capital expenditure. Microsoft's capital expenditures exceeded $34 billion in the recently concluded fiscal quarter and are expected to keep rising over the coming year. Across the industry, tech companies' total AI-related investments are projected to reach $400 billion this year. Against this backdrop, Microsoft's distributed network strategy is not only a technological innovation but also a crucial step in cementing its leadership in a fiercely competitive market.

“AI Super Factory”: From Independent Sites to Distributed Networks

The core of Microsoft's "AI super factory" concept lies in integrating multiple geographically dispersed data centers into a single virtual supercomputer, a fundamentally different approach from traditional data center design.

Alistair Speirs, General Manager of Microsoft Azure Infrastructure, explains: "Traditional data centers are designed to run millions of independent applications for many customers. We call this an 'AI super factory' because it runs one complex job across millions of pieces of hardware." In this model, it is no longer a single site that trains an AI model, but a network of sites jointly supporting the same training task.

This distributed network will connect multiple sites, pooling hundreds of thousands of state-of-the-art GPUs, exabytes of storage, and millions of CPU cores. Its design goal is to support the training of future AI models with trillions of parameters. As AI training pipelines grow more complex, spanning stages such as pre-training, fine-tuning, reinforcement learning, and evaluation, this cross-site collaborative capability becomes essential.
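Microsoft has not disclosed the software stack that coordinates a Fairwater training job across sites. As a rough illustration of the general technique, the sketch below uses PyTorch's standard DistributedDataParallel, in which every GPU process, wherever it physically sits, joins one process group and synchronizes gradients each step; the toy model, batch, and launcher settings are hypothetical stand-ins, not Microsoft's actual system.

```python
# Minimal sketch of one training job spanning machines in two sites,
# using PyTorch DistributedDataParallel (DDP). Illustrative only:
# hostnames, ranks, and the toy model are hypothetical.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Every process (one per GPU, across both sites) joins the same
    # process group; RANK/WORLD_SIZE/MASTER_ADDR come from the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).to(f"cuda:{local_rank}")  # toy stand-in for a large model
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 4096, device=f"cuda:{local_rank}")  # placeholder batch
        loss = ddp_model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across every rank,
                         # including ranks living in the other data center
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice each process would be started with a launcher such as torchrun, with the rendezvous address reachable from both sites; a production system would also partition the job so that only latency-tolerant synchronization crosses the long-haul link.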

Purpose-Built for AI: Design and Technology of the Next Generation Data Center

To realize the "super factory" vision, Microsoft designed the "Fairwater" series of data centers from the ground up. The Atlanta facility occupies 85 acres with more than 1 million square feet of floor space, and its design is fully optimized for AI workloads.

Its key technical features include:

High-Density Architecture: An innovative two-story building design packs more GPUs into a smaller physical footprint, reducing communication latency inside the facility.

Cutting-Edge Chip Systems: Deployment of NVIDIA's GB200 NVL72 rack-scale system, scalable to hundreds of thousands of NVIDIA Blackwell architecture GPUs.

Efficient Liquid Cooling System: To handle the intense heat generated by dense GPU clusters, Microsoft designed a closed-loop liquid cooling system. It consumes almost no water in operation; its one-time initial fill is roughly equivalent to the annual water usage of 20 American households (a back-of-envelope estimate follows this list).

Internal High-Speed Interconnect: Within the data center, all GPUs are closely connected through a high-speed network, ensuring rapid information flow between chips.
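The article does not give the cooling loop's fill volume in absolute terms. As a rough check, assuming the commonly cited U.S. average of about 300 gallons of household water use per day (an assumption, not a Microsoft figure), the "20 households for a year" comparison works out to roughly 2.2 million gallons:

```python
# Back-of-envelope estimate of the closed loop's one-time water fill,
# based on the article's "20 households for a year" comparison.
# The 300 gal/day figure is an assumed U.S. average, not from Microsoft.
GALLONS_PER_HOUSEHOLD_PER_DAY = 300
HOUSEHOLDS = 20
DAYS_PER_YEAR = 365

fill_gallons = GALLONS_PER_HOUSEHOLD_PER_DAY * HOUSEHOLDS * DAYS_PER_YEAR
print(f"Estimated initial fill: {fill_gallons:,} gallons "
      f"(~{fill_gallons * 3.785 / 1e6:.1f} million liters)")
# -> Estimated initial fill: 2,190,000 gallons (~8.3 million liters)
```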

"Achieving leadership in artificial intelligence is not just about adding more GPUs, but about building the infrastructure that allows them to work together as a system," said Scott Guthrie, Executive Vice President of Microsoft's Cloud and AI division. He emphasized that the design of Fairwater embodies Microsoft's years of end-to-end engineering experience, aimed at meeting the growing demand with real-world performance.

Connecting Multiple States: AI Wide Area Network and Computing Power Allocation Strategy

Connecting multiple far-apart data centers into a cohesive whole relies on Microsoft's purpose-built AI Wide Area Network (AI WAN). Microsoft has deployed 120,000 miles of dedicated fiber optic cable to create a "highway" reserved exclusively for AI traffic, allowing data to move between sites without congestion at close to the speed of light in fiber.

Mark Russinovich, Chief Technology Officer of Microsoft Azure, pointed out that as model sizes grow, the computing power required for training has long exceeded the limits of a single data center. If any part of the network experiences a bottleneck, the entire training task will come to a halt. The goal of the Fairwater network is to keep all GPUs continuously busy.
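Neither Microsoft nor the article states the length of the fiber route between the Atlanta and Wisconsin sites. Assuming, for illustration, a route of roughly 1,500 km (the straight-line distance is near 1,000 km, and real fiber paths run longer) and light propagating through silica glass at about two-thirds of its vacuum speed, a quick calculation shows the latency floor that cross-site training traffic must tolerate:

```python
# Rough one-way and round-trip propagation delay between the two sites.
# Route length and refractive index are assumptions for illustration;
# Microsoft has not published the actual fiber path.
C_KM_PER_S = 299_792            # speed of light in vacuum
FIBER_INDEX = 1.468             # typical refractive index of silica fiber
ROUTE_KM = 1_500                # assumed fiber route, Atlanta <-> Wisconsin

v = C_KM_PER_S / FIBER_INDEX    # ~204,000 km/s in glass
one_way_ms = ROUTE_KM / v * 1_000
print(f"one-way ~{one_way_ms:.1f} ms, round trip ~{2 * one_way_ms:.1f} ms")
# -> one-way ~7.3 ms, round trip ~14.7 ms
```

A round trip on the order of 15 ms is an eternity at GPU timescales, which is one reason frameworks that span sites typically overlap communication with computation and reserve the WAN for synchronization steps that can tolerate the delay.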

The choice to build across states, rather than concentrating all computing power in one location, is primarily due to considerations of land and power supply. Alistair Speirs stated in an interview with The Wall Street Journal that distributing power demand across different regions can avoid overburdening any single power grid or community. He admitted, "You have to be able to train across multiple regions because no one has reached our current scale, so no one has really encountered this problem."

"Arms Race" Under Surging Demand

Microsoft's "AI super factory" is a core asset in its response to the surge in AI computing power demand and competition with rivals. Although Microsoft has previously adjusted some data center leasing plans, Alistair Speirs clarified that this is merely a "shift in capacity planning," and the demand the company currently faces far exceeds its supply capacity In this computing power competition, Microsoft is not alone. Its main competitor, Amazon, recently launched the 1,200-acre Project Rainier data center cluster in Indiana, which is expected to consume 2.2 gigawatts of electricity. Additionally, companies like Meta Platforms and Oracle have also announced large-scale construction plans, while AI startup Anthropic has announced plans to invest $50 billion in computing infrastructure in the United States.

By connecting data centers into a unified distributed system, Microsoft has not only opened a new technological path but also positioned itself commercially to meet the enormous demand from leading AI companies. As Scott Guthrie put it, "We operate AI sites as a whole, which can help our customers turn breakthrough models into reality."