What is the Wafer-Scale Engine (WSE)?

Abstract:

To meet the growing demand from deep learning applications for computing resources, ASIC-based accelerators are necessary. A wafer-scale engine (WSE) was recently proposed [1] that can simultaneously accelerate multiple layers of a neural network (NN). However, without a high-quality placement that properly maps NN layers onto the WSE, this acceleration cannot be realized. The WSE placement resembles the traditional ASIC floorplanning problem of placing blocks onto a chip region, but the two are fundamentally different. Since the slowest layer determines the compute time of the whole NN on the WSE, a layer with a heavier workload needs more computing resources. Besides, the locations of layers and the protocol-adapter cost of internal I/O connections influence inter-layer communication overhead. In this paper, we propose GigaPlacer to handle this new challenge. A binary-search-based framework is developed to obtain the minimum compute time of the NN. Two dynamic-programming-based algorithms with different optimization strategies are integrated to produce legal placements. The distance and adapter cost between connected layers are further minimized by refinement steps. Compared with the first-place winner of the ISPD 2020 Contest, GigaPlacer reduces the contest metric by up to 6.89% (2.09% on average) while running 7.23X faster.
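The binary-search framework described in this abstract can be pictured with a short sketch. Everything below is a simplification under stated assumptions: the workload values and core count are made up, a layer's compute time is assumed to scale inversely with the cores assigned to it, and a naive total-core test stands in for GigaPlacer's dynamic-programming legality check, which the paper uses to produce an actual legal layout.

```python
# A minimal sketch of a binary-search-based framework for minimum compute
# time, NOT the paper's actual algorithm. Assumption: a layer finishing by
# `deadline` needs workload/deadline cores (compute time inversely
# proportional to assigned cores).
import math

def min_cores_for_deadline(workload, deadline):
    """Cores a layer needs so that workload / cores <= deadline."""
    return math.ceil(workload / deadline)

def fits_on_wafer(workloads, deadline, total_cores):
    """Placeholder legality check: total demanded cores must fit the wafer.
    GigaPlacer instead runs DP-based placement to legalize the layout."""
    return sum(min_cores_for_deadline(w, deadline) for w in workloads) <= total_cores

def min_compute_time(workloads, total_cores, iters=50):
    """Binary-search the smallest per-layer deadline (the NN's compute time,
    set by its slowest layer) for which a legal placement exists."""
    lo, hi = 0.0, max(workloads)  # at hi, every layer fits with one core
    for _ in range(iters):
        mid = (lo + hi) / 2
        if fits_on_wafer(workloads, mid, total_cores):
            hi = mid  # feasible: try a tighter deadline
        else:
            lo = mid  # infeasible: relax the deadline
    return hi

# Toy example: three layers of differing workload on a 1,000-core wafer.
print(min_compute_time([400.0, 250.0, 120.0], total_cores=1000))
```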

The Wafer-Scale Engine (WSE) is the revolutionary central processor for our deep learning computer system. The second-generation WSE (WSE-2) powers our CS-2 system: it is the largest computer chip ever built and the fastest AI processor on Earth.

Unlike legacy, general-purpose processors, the WSE was built from the ground up to accelerate deep learning: 850,000 cores for sparse tensor operations, massive high-bandwidth on-chip memory, and interconnect orders of magnitude faster than a traditional cluster could possibly achieve. Altogether, it gives you the deep learning compute resources of a cluster of legacy machines in a single device that is as easy to program as a single node, radically reducing programming complexity, wall-clock compute time, and time to solution.

Compute Designed for AI

     Each core on the WSE is independently programmable and optimized for the tensor-based, sparse linear algebra operations that underpin neural network training and inference for deep learning, enabling it to deliver maximum performance, efficiency, and flexibility. 
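As a generic illustration of the sparse, tensor-based operations described above (this is ordinary SciPy code, not a Cerebras API), the sketch below runs a sparse matrix-vector product in which only the nonzero weights contribute work:

```python
# Generic sparse linear algebra example; matrix size and density are
# arbitrary choices for illustration.
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
W = sparse_random(1024, 1024, density=0.05, format="csr", random_state=rng)
x = rng.standard_normal(1024)

y = W @ x  # only the ~5% nonzero weights are multiplied and accumulated
print(f"nonzeros: {W.nnz} of {1024 * 1024} ({W.nnz / 1024**2:.1%})")
```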

The WSE-2 packs 850,000 of these cores onto a single processor. With that, any data scientist can run state-of-the-art AI models and explore innovative algorithmic techniques at record speed and scale, without ever touching distributed scaling complexities.

Memory Capacity and Bandwidth 

Unlike traditional devices, in which the working cache memory is tiny, the WSE-2 carries 40 GB of super-fast on-chip SRAM and spreads it evenly across the entire surface of the chip. This gives every core single-clock-cycle access to fast memory at extremely high bandwidth: 20 PB/s. That is 1,000x more capacity and 9,800x greater bandwidth than the leading GPU.
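A quick back-of-the-envelope division shows what these aggregate numbers imply per core, taking the text at its word that memory and bandwidth are spread evenly across all 850,000 cores:

```python
# Per-core figures implied by the aggregate numbers above, assuming an even
# split across cores (decimal units: 1 GB = 1e9 bytes).
CORES = 850_000
SRAM_BYTES = 40e9            # 40 GB on-chip SRAM
MEM_BW_BYTES_PER_S = 20e15   # 20 PB/s aggregate memory bandwidth

per_core_sram_kb = SRAM_BYTES / CORES / 1e3
per_core_bw_gb_s = MEM_BW_BYTES_PER_S / CORES / 1e9

print(f"SRAM per core:      ~{per_core_sram_kb:.0f} kB")    # ~47 kB
print(f"Bandwidth per core: ~{per_core_bw_gb_s:.1f} GB/s")  # ~23.5 GB/s
```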

This means no trade-off is required. You can run large, state-of-the-art models and real-world datasets entirely on a single chip: minimize wall-clock training time and achieve real-time inference within latency budgets, even for large models and datasets.

High Bandwidth. Low Latency. 

    Deep learning requires massive communication bandwidth between the layers of a neural network. The WSE uses an innovative high bandwidth, low latency communication fabric that connects processing elements on the wafer at tremendous speed and power efficiency. Dataflow traffic patterns between cores and across the wafer are fully configurable in software. 

   The WSE-2 on-wafer interconnect eliminates the communication slowdown and inefficiencies of connecting hundreds of small devices via wires and cables. It delivers an incredible 220 Pb/s processor-processor interconnect bandwidth. That’s more than 45,000x the bandwidth delivered between graphics processors. 
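As a sanity check on the 45,000x figure, the sketch below compares 220 Pb/s against a single GPU-to-GPU link. The 600 GB/s baseline is an assumed NVLink-class number, not a figure from the text:

```python
# Rough ratio of the WSE-2 fabric bandwidth to one assumed GPU-to-GPU link.
WSE2_FABRIC_BITS_PER_S = 220e15   # 220 Pb/s on-wafer interconnect
GPU_LINK_BYTES_PER_S = 600e9      # assumed ~600 GB/s NVLink-class link
GPU_LINK_BITS_PER_S = GPU_LINK_BYTES_PER_S * 8

ratio = WSE2_FABRIC_BITS_PER_S / GPU_LINK_BITS_PER_S
print(f"WSE-2 fabric / GPU link: ~{ratio:,.0f}x")  # ~45,833x
```

At that baseline the ratio lands right around the quoted "more than 45,000x", which is why the even a different (but similarly sized) link assumption would not change the order of magnitude.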

 
