- guidelines for domain specific architectures (DSAs)
- example DSAs
  - Google's Tensor Processing Unit
  - Microsoft's Catapult
  - Google's Pixel Visual Core
- crosscutting issues

## Reasons for Adoption of Domain Specific Architectures

- The end of Dennard scaling and the slowing of Moore's Law mean we need to lower the energy per operation to improve performance.
- If order-of-magnitude performance improvements are to be obtained, we need to increase the number of arithmetic operations per instruction from one to hundreds.
- Conventional architecture innovations may not be a good match to some domains.
- Many believe that future computers will consist of standard processors that can run conventional programs along with domain-specific processors that can very efficiently perform only a narrow range of tasks.



- Find a domain with enough application demand, and therefore enough chip volume, to justify the nonrecurring engineering (NRE) costs.
- DSAs are best applied to small compute-intensive kernels of larger systems, where those kernels account for a significant fraction of the computation in some applications.
- This means that future computers may be much more heterogeneous than current homogeneous multicore chips.
- Architects must learn the application domain and algorithms.
- There must also be support for porting software to exploit these DSAs.

## DSA Guidelines

- Use dedicated memories to minimize the distance over which data is moved.
  - Multi-level caches are expensive in area and energy for moving data.
  - Many domains have predictable memory access patterns with little data reuse.
  - DSA programmers understand their domain.
  - Data movement can be more efficient with software controlled memories.
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic units and/or larger on-chip memories.

## DSA Guidelines (cont.)

- Use the easiest form of parallelism that matches the domain.
  - Target domains for DSAs will have inherent parallelism.
  - Need to exploit that parallelism and expose it to the software so it does not need to be automatically found by the hardware (no OoO execution).
- Reduce data size and type to the simplest needed for the domain.
  - Many domains are memory bound.
  - Can increase memory bandwidth and utilization by using narrower data types.
- Use a domain-specific programming language to port code to the DSA.
  - Allow for easier exploitation of parallelism.
  - Simplify porting of code to a DSA.
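
The bandwidth argument for narrower data types can be made concrete with a small sketch. The Python snippet below is illustrative only: the symmetric linear quantization scheme, the `quantize_int8` helper, and the sample values are assumptions and are not taken from any of the DSAs discussed here. It packs the same values as 32-bit floats and as 8-bit integers and compares how many bytes would have to be moved.

```python
import struct


def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale.

    Hypothetical scheme for illustration; real DSAs may quantize differently.
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


values = [0.5, -1.25, 3.0, 0.0, 2.75, -3.0, 1.5, -0.75]

# Same data in two encodings: 4 bytes per value versus 1 byte per value.
as_fp32 = struct.pack(f"{len(values)}f", *values)
quantized, scale = quantize_int8(values)
as_int8 = struct.pack(f"{len(quantized)}b", *quantized)

bytes_fp32 = len(as_fp32)
bytes_int8 = len(as_int8)

# Worst-case rounding error introduced by the narrower type.
max_error = max(abs(a - b) for a, b in zip(values, dequantize(quantized, scale)))

print(f"fp32: {bytes_fp32} bytes, int8: {bytes_int8} bytes, max error: {max_error:.4f}")
```

For a fixed memory bandwidth, the 8-bit encoding delivers four times as many values per transfer as the 32-bit one, at the cost of a bounded rounding error.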

| Guideline                | TPU                                       | Catapult                              | Crest            | Pixel Visual Core                                 |
|--------------------------|-------------------------------------------|---------------------------------------|------------------|---------------------------------------------------|
| Design target            | Data center ASIC                          | Data center FPGA                      | Data center ASIC | PMD ASIC/SOC IP                                   |
| 1. Dedicated memories    | 24 MiB Unified Buffer, 4 MiB Accumulators | Varies                                | N.A.             | Per core: 128 KiB line buffer, 64 KiB P.E. memory |
| 2. Larger arithmetic unit | 65,536 multiply-accumulators             | Varies                                | N.A.             | Per core: 256 multiply-accumulators (512 ALUs)    |
| 3. Easy parallelism      | Single-threaded, SIMD, in-order           | SIMD, MISD                            | N.A.             | MPMD, SIMD, VLIW                                  |
| 4. Smaller data size     | 8-bit, 16-bit integer                     | 8-bit, 16-bit integer; 32-bit Fl. Pt. | 21-bit Fl. Pt.   | 8-bit, 16-bit, 32-bit integer                     |
| 5. Domain-specific lang. | TensorFlow                                | Verilog                               | TensorFlow       | Halide/TensorFlow                                 |

## Google's Tensor Processing Unit (TPU)

- Google's first ASIC DSA for its warehouse-scale computers (WSCs).
- Domain is for deep neural networks (DNNs).
- Programmed using TensorFlow.
- Goal is to improve cost-performance by a factor of 10 over GPUs.
- Deployed in Google data centers since 2015.

## TPU Architecture

- The TPU is a coprocessor on the PCIe I/O bus, from which it receives instructions.
- Has a large software managed on-chip memory, which consists of a 24 MiB Unified Buffer.
- Has off-chip 8 GiB DRAM for Weight Memory.
- Matrix Multiply Unit (MMU) contains 256x256 (65,536) ALUs that can perform 8-bit multiply-and-adds on integers.
  - Reads and writes 256 values per clock cycle.
  - Used for matrix multiplications and convolutions.
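
To make the multiply-and-add behavior concrete, here is a minimal software sketch of what an array of 8-bit multiply-accumulate units computes. It is not a model of the TPU's actual dataflow or timing: the `matmul_int8` helper and the tiny 2x2 operands are hypothetical, and Python's unbounded integers stand in for accumulators that are assumed to be wider than 8 bits.

```python
def matmul_int8(a, b):
    """Multiply two matrices of signed 8-bit integers.

    Each output element is built from 8-bit multiply-and-add steps, with the
    running sum kept in a wider accumulator (a plain Python int here).
    """
    for matrix in (a, b):
        for row in matrix:
            if not all(-128 <= x <= 127 for x in row):
                raise ValueError("operands must fit in a signed 8-bit integer")

    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # wide accumulator for one output element
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-and-add
            out[i][j] = acc
    return out


a = [[127, -128], [1, 2]]
b = [[127, 3], [-128, 4]]
result = matmul_int8(a, b)
print(result)  # [[32513, -131], [-129, 11]]
```

The first output element, 32513, does not fit in 8 bits even though every input does, which is why the sums are held in separate, wider accumulators rather than in the 8-bit operand format.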





- … in accumulators and stores results in unified buffer.
- Write Host Memory - writes data from the unified buffer into the CPU host memory.







- Use dedicated memories to minimize the distance over which data is moved. Has an on-chip 24 MiB unified buffer that can be accessed at 256 bytes per cycle, a weight FIFO, and 4 MiB of accumulators.
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic units or bigger memories. Has 28 MiB of on-chip memory and 64K 8-bit ALUs.
- Use the easiest form of parallelism that matches the domain. Exploits 2D SIMD parallelism with the 256x256 MMU.
- Reduce data size and type to the simplest needed for the domain. Primarily does computation on 8-bit integers.
- Use a domain-specific programming language to port code to the DSA. Programs to control the TPU are written using the TensorFlow framework.
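
The totals quoted in these bullets follow directly from the component sizes given earlier. The short Python check below only redoes that arithmetic; the variable names are illustrative, and the last figure (cycles to stream the whole buffer once) is a derived back-of-the-envelope number, not a quoted specification.

```python
MIB = 1024 * 1024

unified_buffer_bytes = 24 * MIB
accumulator_bytes = 4 * MIB
on_chip_mib = (unified_buffer_bytes + accumulator_bytes) // MIB

mmu_dim = 256
alu_count = mmu_dim * mmu_dim   # 256 x 256 array of 8-bit ALUs
alu_count_k = alu_count // 1024

# At 256 bytes per cycle, streaming the full unified buffer once takes:
cycles_to_stream_buffer = unified_buffer_bytes // 256

print(f"{on_chip_mib} MiB on chip, {alu_count} ALUs ({alu_count_k}K)")
print(f"{cycles_to_stream_buffer} cycles to stream the unified buffer once")
```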

## Microsoft's Catapult

- Microsoft placed an FPGA on a PCIe bus board in data center servers.
- Used FPGA flexibility to tailor use for varying applications.
- FPGAs have lower NRE costs than ASICs.
- FPGAs are slower than ASICs.
- Key applications were to provide a CNN accelerator and to improve the performance of the Microsoft Bing search engine.











- Use dedicated memories to minimize the distance over which data is moved. Has 5 MiB of on-chip memory.
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic units or bigger memories. Has 3926 18-bit ALUs.
- Use the easiest form of parallelism that matches the domain. Exploits 2D SIMD parallelism for the CNN application and MISD parallelism for search ranking.
- Reduce data size and type to the simplest needed for the domain. Does computation on 8-bit integer to 64-bit FP values.
- Use a domain-specific programming language to port code to the DSA. Programming is done in Verilog.





## Google's Pixel Visual Core

- Uses a 2D SIMD architecture of independent processing elements (PEs), each containing 2 16-bit ALUs, 1 16-bit MAC unit, 10 16-bit registers, and 10 1-bit predicate registers.
- PE memory is a compiler managed scratchpad containing 128 16-bit entries (256 bytes).
- Each PE collects inputs from nearest neighbors.
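
Collecting inputs from nearest neighbors is the communication pattern behind stencil computations. As a rough illustration, the Python sketch below applies one step of a 5-point average over a 2D grid, with one grid cell standing in for one PE. The `stencil_step` function, the averaging kernel, and the edge handling (edge cells simply use the neighbors they have) are all assumptions made for this example; the source does not describe the actual kernels or border treatment.

```python
def stencil_step(grid):
    """Return a new grid where each cell is the integer average of itself
    and its up/down/left/right neighbors (hypothetical 5-point kernel)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = grid[r][c]
            count = 1
            # Gather from the four nearest neighbors that exist.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    total += grid[nr][nc]
                    count += 1
            out[r][c] = total // count
    return out


image = [
    [0, 0, 0],
    [0, 90, 0],
    [0, 0, 0],
]
blurred = stencil_step(image)
print(blurred)  # [[0, 22, 0], [22, 18, 22], [0, 22, 0]]
```

Every cell performs the same operation on purely local data, which is why this style of computation maps naturally onto a 2D SIMD array with neighbor-to-neighbor links.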









- Use dedicated memories to minimize the distance over which data is moved. Has 128 KiB of line buffers per core. Also has 64 KiB of software controlled PE memory.
- Invest resources saved from dropping advanced microarchitectural optimizations into more arithmetic units or bigger memories. Has 16x16 2D array of PEs per core.
- Use the easiest form of parallelism that matches the domain. Exploits 2D SIMD parallelism in its PE array, VLIW instructions for ILP, and MPMD for utilizing multiple cores.
- Reduce data size and type to the simplest needed for the domain. Does computation mostly on 8-bit and 16-bit integers.
- Use a domain-specific programming language to port code to the DSA. Programming is done in Halide for image processing and TensorFlow for CNNs.
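
The per-core figures above can be cross-checked against the per-PE figures (2 ALUs, 1 MAC unit, and a 128-entry 16-bit scratchpad per PE). The following Python snippet is just that arithmetic; the variable names are illustrative.

```python
pe_rows, pe_cols = 16, 16
pe_count = pe_rows * pe_cols                 # PEs per core

pe_memory_bytes = 128 * 16 // 8              # 128 entries x 16 bits each
total_pe_memory_kib = pe_count * pe_memory_bytes // 1024

mac_units = pe_count * 1                     # 1 MAC unit per PE
alus = pe_count * 2                          # 2 ALUs per PE

print(f"{pe_count} PEs, {total_pe_memory_kib} KiB PE memory, "
      f"{mac_units} MACs, {alus} ALUs per core")
```

The results (256 PEs, 64 KiB of PE memory, 256 MAC units, and 512 ALUs per core) agree with the values listed for Pixel Visual Core in the guideline comparison table.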



