## **mNPUsim**: Evaluating the Effect of Sharing Resources in Multi-core NPUs

Soojin Hwang\*, Sunho Lee\*, Jungwoo Kim, Hongbeen Kim, and Jaehyuk Huh





#### Emergence of Multi-core NPU



1) Norrie et al., "Google's Training Chips Revealed: TPUv2 and TPUv3", Hot Chips Symposium, 2020

#### Multi-core NPU Architecture

- Computation
  - Per-core: Systolic array
- Memory
  - Per-core: On-chip scratchpad memory
  - Shared by all cores: IOMMU, memory controller/channel, off-chip memory





#### **NPU Execution Model**



Tile & Double buffering → Memory Burstiness!
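A minimal sketch of why tiling plus double buffering yields bursty traffic. It assumes a toy schedule in which the next tile is prefetched at peak bandwidth as soon as the current tile starts computing. The function name, parameters, and numbers are hypothetical and do not come from mNPUsim.

```python
def double_buffer_trace(num_tiles, tile_bytes, compute_cycles, peak_bw):
    """Bytes requested per cycle under a toy double-buffering schedule."""
    full, rem = divmod(tile_bytes, peak_bw)
    # One tile load issued back-to-back at peak bandwidth.
    burst = [peak_bw] * full + ([rem] if rem else [])
    trace = list(burst)  # prologue: tile 0 is loaded before any compute
    for tile in range(num_tiles):
        # While tile `tile` computes, prefetch the next tile (if there is one).
        prefetch = burst if tile + 1 < num_tiles else []
        phase_len = max(compute_cycles, len(prefetch))
        trace.extend(prefetch + [0] * (phase_len - len(prefetch)))
    return trace


trace = double_buffer_trace(num_tiles=3, tile_bytes=64, compute_cycles=8, peak_bw=32)
peak = max(trace)
average = sum(trace) / len(trace)
print(f"peak={peak} B/cycle, average={average:.2f} B/cycle")
```

In this toy trace the memory system is either saturated or idle, so the peak demand is several times the average demand.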



#### **NPU Memory Requests Burstiness**





## **Existing NPU Simulators**

- Weakness: Ignore interference between cores
- Single-core iteration
  - Multi-core result = <u>Naïve cycle sum</u> of single-core results (<u>w/o interference</u>)
- Fixed-cycle based memory simulation
  - Adopting analytical modeling in off-chip memory access simulation



| Component              | Fixed Latency                |
|------------------------|------------------------------|
| Transit Latency        | Traffic / Bandwidth (cycles) |
| Page Table Walk        | 100 (cycles/level)           |
| Off-chip Memory Access | 100 (cycles)                 |

Single-core Iteration

**Fixed-cycle based Memory Simulation**
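The fixed-cycle approach can be sketched as follows. The constants come from the table above. The way the three components are added together, and all function and parameter names, are assumptions made for illustration only.

```python
PTW_CYCLES_PER_LEVEL = 100  # fixed page table walk cost per level (from the table)
DRAM_ACCESS_CYCLES = 100    # fixed off-chip memory access cost (from the table)


def fixed_latency(traffic_bytes, bandwidth_bytes_per_cycle, page_table_levels):
    """Analytical latency of one transfer; summing the parts is an assumption."""
    transit = traffic_bytes / bandwidth_bytes_per_cycle
    walk = PTW_CYCLES_PER_LEVEL * page_table_levels
    return transit + walk + DRAM_ACCESS_CYCLES


def naive_multicore_cycles(per_core_cycles):
    """Multi-core result as a plain sum of single-core results (no interference)."""
    return sum(per_core_cycles)


print(fixed_latency(6400, 64, 4))           # 100 transit + 400 walk + 100 access
print(naive_multicore_cycles([600, 900]))
```

Neither function depends on what the other cores are doing, which is exactly the interference such models cannot capture.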



#### We propose the <u>dynamic multi-core</u> NPU simulator!




## mNPUsim: A Cycle-accurate Multi-core NPU Simulator



#### mNPUsim



#### DRAMsim3<sup>1)</sup>-integrated dynamic off-chip memory simulation



**Multi-core simulation** 



#### Open-sourced & Artifact-evaluated

#### Various configuration inputs

1. Architecture
2. Network
3. Off-chip Memory
4. On-chip Memory
5. Execution Mode





1) Generates memory & compute requests

- 1-1) Memory request: Sequence of virtual addresses
- 1-2) Compute request: Sequence of tile computation times





2) Simulates out-of-order requests





3) Off-chip memory simulation using DRAMsim3

- 3-1) Simulates address translation (optional)
- 3-2) Sends transactions to DRAMsim3
- 3-3) Ticks until all read/write requests have finished
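The three sub-steps can be sketched as a loop around a cycle-level memory model. `ToyMemory` below is a hypothetical stand-in with a fixed latency and a limited issue rate. It is not the DRAMsim3 API and it is not mNPUsim code. `simulate`, `translate`, and the default values are likewise assumptions.

```python
from collections import deque


class ToyMemory:
    """Hypothetical stand-in for a cycle-level DRAM model (NOT the DRAMsim3 API)."""

    def __init__(self, latency, issue_per_cycle):
        self.latency = latency
        self.issue_per_cycle = issue_per_cycle
        self.queue = deque()   # transactions waiting to be issued
        self.in_flight = []    # completion cycle of each issued transaction
        self.cycle = 0

    def add_transaction(self, addr, is_write):
        self.queue.append((addr, is_write))

    def tick(self):
        # Issue a bounded number of queued transactions, then advance one cycle.
        for _ in range(min(self.issue_per_cycle, len(self.queue))):
            self.queue.popleft()
            self.in_flight.append(self.cycle + self.latency)
        self.cycle += 1
        self.in_flight = [c for c in self.in_flight if c > self.cycle]

    def busy(self):
        return bool(self.queue or self.in_flight)


def simulate(requests_per_core, translate=None, latency=20, issue_per_cycle=1):
    mem = ToyMemory(latency, issue_per_cycle)
    for core_id, requests in enumerate(requests_per_core):
        for vaddr, is_write in requests:
            # 3-1) optional address translation (identity mapping when disabled)
            paddr = translate(core_id, vaddr) if translate else vaddr
            # 3-2) send the transaction to the memory model
            mem.add_transaction(paddr, is_write)
    # 3-3) tick until every read/write request has finished
    while mem.busy():
        mem.tick()
    return mem.cycle


single = simulate([[(0, False), (64, False)]])
dual = simulate([[(0, False), (64, False)], [(0, True), (64, True)]])
print(single, dual)
```

In this toy model the dual-core run takes longer than one single-core run yet less than the sum of two, an effect a naive cycle sum cannot express.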

4) Generates simulation outputs

- 4-1) Elapsed cycles
- 4-2) PE utilization
- 4-3) Request logs of shared resources


# Shared Resource Analysis with mNPUsim



## Methodology

- Benchmarks: 8 machine learning workloads
- Simulator configuration
  - NPU: TPUv4 configuration<sup>1)</sup>
  - Off-chip memory: HBM2

| Type           | Model                                              |
|----------------|----------------------------------------------------|
| CNN            | ResNet50 (res)<br>Yolo-tiny (yt)<br>AlexNet (alex) |
| RNN            | Selfish-RNN (sfrnn)<br>DeepSpeech2 (ds2)           |
| Recommendation | DLRM (dlrm)<br>NCF (ncf)                           |
| Attention      | GPT-2 (gpt2)                                       |

Benchmarks



Simulator Configuration



1) Google, "CloudTPU", https://cloud.google.com/tpu/docs/system-architecture-tpu-vm

## Methodology

- Two metrics
  - Performance: relative speedup
  - Fairness<sup>1)</sup>: balance of speedup between cores

$$\text{speedup}_k = \frac{\text{exec. time of baseline}}{\text{exec. time of } k^{\text{th}} \text{ core}}$$

$\text{performance}_i$ = geometric mean of $\text{speedup}_k$ for the $i^{\text{th}}$ workload mix

$$\text{slowdown}_k = \frac{1}{\text{speedup}_k}$$

$\mu_i$ = average of $\text{slowdown}_k$ for the $i^{\text{th}}$ workload mix

$\sigma_i$ = standard deviation of $\text{slowdown}_k$ for the $i^{\text{th}}$ workload mix

$$\text{fairness}_i = 1 - \frac{\sigma_i}{\mu_i}$$
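The two metrics can be computed as in the sketch below. It assumes that each core has its own baseline execution time and that the population standard deviation is used, neither of which the slides specify. The execution times are made-up illustrative values.

```python
import statistics


def speedups(baseline_times, shared_times):
    """Per-core speedup: baseline execution time over shared execution time."""
    return [base / shared for base, shared in zip(baseline_times, shared_times)]


def performance(speedup_list):
    """Geometric mean of the per-core speedups of one workload mix."""
    return statistics.geometric_mean(speedup_list)


def fairness(speedup_list):
    """1 - sigma/mu over per-core slowdowns (population std is an assumption)."""
    slowdowns = [1.0 / s for s in speedup_list]
    mu = statistics.fmean(slowdowns)
    sigma = statistics.pstdev(slowdowns)
    return 1.0 - sigma / mu


# Hypothetical dual-core mix: both cores need 100 cycles when running alone.
mix = speedups(baseline_times=[100, 100], shared_times=[125, 200])
print(performance(mix), fairness(mix))
```

When every core slows down by the same factor, the standard deviation is zero and fairness is exactly 1.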



- Three levels of resource sharing
  - DRAM-only sharing: Only the DRAM components are shared
  - DRAM & PTW sharing: The page table walker (PTW) is also shared
  - DRAM & PTW & TLB sharing: The TLB is also shared





#### Shared DRAM Bandwidth: Experiment

- Setup: <u>DRAM-only</u> sharing
  - No address translation, dual-core NPU
- Compares three sharing schemes
  - Baseline: Ideal (each core monopolizes the resources)
  - Static: Half of the resources for NPU 0, half for NPU 1
  - Dynamic: Cores dynamically share all resources
- Dynamic sharing improves performance by 14.5% over static sharing
  - At the cost of a 10.3% drop in fairness
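A toy comparison of the static and dynamic schemes under bursty demand. It assumes an open-loop model in which demand arrives on a fixed schedule regardless of service, and a dynamic arbiter that splits bandwidth equally among cores with pending traffic. All names and numbers are hypothetical; this is not mNPUsim's arbitration logic.

```python
def grant(pending, total_bw, dynamic):
    """Bytes granted to each core in one cycle."""
    n = len(pending)
    if not dynamic:
        # Static: each core may use only its fixed 1/n slice.
        return [min(p, total_bw // n) for p in pending]
    grants = [0] * n
    budget = total_bw
    while budget > 0:
        active = [k for k in range(n) if pending[k] > grants[k]]
        if not active:
            break
        share = max(1, budget // len(active))
        for k in active:
            g = min(share, pending[k] - grants[k], budget)
            grants[k] += g
            budget -= g
    return grants


def finish_cycles(demands, total_bw, dynamic):
    """Cycle at which each core's last byte is served."""
    n = len(demands)
    pending, finish = [0] * n, [0] * n
    horizon = max(len(d) for d in demands)
    cycle = 0
    while cycle < horizon or any(pending):
        for k in range(n):
            if cycle < len(demands[k]):
                pending[k] += demands[k][cycle]
        for k, g in enumerate(grant(pending, total_bw, dynamic)):
            if g:
                pending[k] -= g
                finish[k] = cycle + 1
        cycle += 1
    return finish


# Two cores with out-of-phase 16-byte bursts sharing 8 bytes/cycle.
demands = [[16, 0, 0, 0, 16, 0, 0, 0], [0, 0, 16, 0, 0, 0, 16, 0]]
static_finish = finish_cycles(demands, total_bw=8, dynamic=False)
dynamic_finish = finish_cycles(demands, total_bw=8, dynamic=True)
print(static_finish, dynamic_finish)
```

Because the bursts rarely overlap, the dynamic arbiter lets each burst use the idle core's share, so both cores finish earlier than under the static split.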





#### Shared DRAM Bandwidth: Analysis

- Setup: <u>DRAM-only</u> sharing
  - No address translation, single-core NPU
- Result:
  - Higher bandwidth, better performance (geomean: x4.31)
  - Different workload, different sensitivity (sfrnn: x3.53, ncf: x5.30)
- Analysis:
  - Lack of bandwidth due to burstiness
  - Dynamic bandwidth sharing increases peak bandwidth



Performance with respect to DRAM bandwidth



#### 1. <u>DRAM bandwidth</u> is a crucial resource

#### 2. <u>Dynamic</u> DRAM bandwidth sharing is better





## Shared IOMMU

- Setup: Dual-core NPU
- Experiment 1. DRAM & PTW sharing
  - Performance improvement: +13.2%
  - Fairness drop: -4.30%
  - Reason: Burstiness of requests
- Experiment 2. DRAM & PTW & TLB sharing
  - <u>Negligible</u> difference due to <u>negligible TLB capacity contention</u>





#### Scalable Page Size

- Page size candidates from ARM64<sup>1)</sup>
  - 4KB (Baseline), 64KB, 1MB
- Single-core: +19.5% performance improvement (dlrm: +30.1%, gpt2: +5.8%)
- Multi-core: Only +12.5% improvement in quad-core
  - Other contention between cores reduces PTW effect
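A back-of-the-envelope sketch of why larger pages reduce page table walker pressure: a contiguous buffer spans fewer pages, so fewer translations can miss. The 8 MiB buffer and the helper name are hypothetical, and the count is an upper bound on walks that assumes a cold TLB.

```python
PAGE_SIZES = {"4KB": 4 * 1024, "64KB": 64 * 1024, "1MB": 1024 * 1024}


def pages_touched(base_addr, num_bytes, page_size):
    """Number of distinct pages covered by a contiguous buffer."""
    first = base_addr // page_size
    last = (base_addr + num_bytes - 1) // page_size
    return last - first + 1


buffer_bytes = 8 * 1024 * 1024  # hypothetical 8 MiB tensor starting at address 0
counts = {name: pages_touched(0, buffer_bytes, size) for name, size in PAGE_SIZES.items()}
print(counts)
```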



#### Huge pages are the better choice (especially with fewer cores)





#### More Information in the Paper

- Quad-core NPU experiment
- Off-chip memory utilization in single/dual-core evaluation
- Contention sensitivity
- Performance distribution affected by co-runners
- Workload mapping in multi-NPU
- Guidelines for using mNPUsim



#### Conclusion

- Propose a cycle-accurate multi-core NPU simulator: mNPUsim
- Evaluate and compare shared resource management techniques

| Management<br>Techniques                | Performance                                | Fairness   |
|-----------------------------------------|--------------------------------------------|------------|
| Dynamic sharing of<br>DRAM bandwidth    | +14.5%                                     | -10.3%     |
| Dynamic sharing of<br>page table walker | +13.2%                                     | -4.3%      |
| Dynamic sharing of<br>TLB               | Negligible                                 | Negligible |
| Scalable page size                      | +19.5% (single-core)<br>+12.5% (quad-core) | Negligible |

Visit and enjoy <u>https://github.com/casys-kaist/mNPUsim</u>

