

34<sup>th</sup> IEEE International Conference on Computer Design *ICCD 2016* 

### Dynamic Prefetcher Reconfiguration for Diverse Memory Architectures

Junghoon Lee\*, Taehoon Kim, and Jaehyuk Huh



SAMSUNG ADVANCED INSTITUTE OF TECHNOLOGY



- Stream prefetcher
  - Stream: a sequence of consecutive memory blocks
  - If any demand request accesses a block in a *stream* (from A to A+P), generate prefetch requests A+P, A+P+1, ..., A+P+N
- Parameters
  - Distance (P): how far ahead of the demand access the prefetcher predicts
  - Degree (N): how many prefetch requests are generated
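
As a rough illustration of the trigger rule above, the sketch below computes which blocks a single stream would prefetch on one demand access. It is a minimal sketch, not the authors' implementation: the function name and the window-advance rule are assumptions, and it issues exactly N requests (A+P+1 through A+P+N) so that the count matches the definition of degree, whereas the slide's list starts at A+P.

```python
def prefetch_addresses(accessed, start, distance, degree):
    """Return (blocks to prefetch, new window start) for one demand access.

    accessed : block address of the demand request
    start    : A, the first block of the monitored stream window
    distance : P, the window covers blocks A .. A+P
    degree   : N, the number of prefetch requests to issue (assumed exact)
    """
    # A demand access outside the window does not trigger this stream.
    if not (start <= accessed <= start + distance):
        return [], start
    # Assumption: issue N blocks just past the window, then slide it by N.
    first = start + distance + 1
    return list(range(first, first + degree)), start + degree
```

For a stream starting at block 100 with P = 8 and N = 2, a demand access to block 102 yields prefetches for blocks 109 and 110, and the window start moves to 102.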



KAIST







- Traditional memory architecture
  - DDR: one dominant memory type
  - Relatively predictable bandwidth
- Memory heterogeneity
  - DDR, HBM<sup>[1]</sup>, non-volatile memory<sup>[2]</sup>, hybrid memory<sup>[3]</sup>
  - Wide range of bandwidth/latency

The prefetcher should take these varied memory characteristics into account

[1] Loh et al., ISCA 2008

[2] Qureshi et al., ISCA 2009

[3] Chou et al., MICRO 2014

#### **Prior Work**


| Distance | Degree |
|----------|--------|
| 4        | 1      |
| 8        | 1      |
| 16       | 2      |
| 32       | 4      |
| 64       | 4      |

- Feedback-directed prefetching [4]
  - Uses a stream prefetcher: distance & degree
  - Chooses one of five aggressiveness levels
  - Considers the application's memory bandwidth requirement
- Limitation
  - Only five levels of pre-selected prefetch configurations
  - Considers DDR memory only


A small number of pre-selected configurations is not enough to cover the diversity of memory architectures

[4] Srinath et al., HPCA 2007

#### **Dynamic Prefetcher**








#### Outline



- Motivation: the effect of available memory bandwidth on prefetcher design
  - Effect on the aggressiveness of prefetcher
  - Dominant factor: distance vs degree
  - Cache pollution by prefetcher
- Contributions
  - Propose a prefetcher reconfiguration mechanism
  - Propose a pollution mitigation mechanism



| Memory | Conservative\* | Aggressive\* |
|--------|----------------|--------------|
| DDR    | 10%            | -1%          |
| HBM    | 20%            | 28%          |

\* Conservative: (distance, degree) = (8, 1); aggressive: (8, 64)

Observation 1: The best prefetcher aggressiveness differs for each memory type
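
Observation 1 can be read straight off the table. The short sketch below does so, assuming the percentages are performance changes (the slide does not label the metric) and that the footnote pairs are (distance, degree); the dictionary layout and function name are illustrative only.

```python
# Percentages copied from the table; keys are the footnote's pairs,
# read here as (distance, degree).
RESULTS = {
    "DDR": {(8, 1): 10.0, (8, 64): -1.0},
    "HBM": {(8, 1): 20.0, (8, 64): 28.0},
}


def best_config(memory):
    """Return the configuration with the highest reported figure."""
    per_config = RESULTS[memory]
    return max(per_config, key=per_config.get)
```

With these numbers the conservative setting wins on DDR and the aggressive setting wins on HBM, so no single fixed setting is best for both.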



- Distance vs. degree
  - Performance varies more with degree than with distance







#### Dynamic Prefetcher Reconfiguration

- Search by Random Profiling (RP)
  - Execute trial runs with randomly selected parameters
  - Adopt a hill-climbing algorithm
  - Use a direct performance metric (IPC: instructions per cycle)
  - Profiling phase : Execution phase = 1 : 4
- Optimizations
  - Two-step profiling (decision order: distance $\rightarrow$ degree)
  - Start the profiling phase with the previously used best parameters
[Figure: timeline of one interval (N<sub>interval</sub>), split into a profiling phase (distance step, then degree step) followed by an execution phase]
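
The search procedure can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the candidate value lists, the `measure_ipc` callback, the one-step neighbor rule, and the toy IPC surface are all assumptions. Only the random starting point (or the previous best), hill climbing on IPC, and the distance-then-degree order come from the slides.

```python
import random

DISTANCES = [4, 8, 16, 32, 64]       # assumed candidate values
DEGREES = [1, 2, 4, 8, 16, 32, 64]   # assumed candidate values


def hill_climb(values, start_idx, score):
    """Climb from start_idx to a local maximum of score(values[i]).

    Returns (best_index, number_of_trial_runs).
    """
    cache = {}

    def trial(i):
        if i not in cache:
            cache[i] = score(values[i])  # one trial run per new setting
        return cache[i]

    best = start_idx
    trial(best)
    improved = True
    while improved:
        improved = False
        for nxt in (best - 1, best + 1):
            if 0 <= nxt < len(values) and trial(nxt) > trial(best):
                best, improved = nxt, True
                break
    return best, len(cache)


def profile(measure_ipc, prev_best=None, rng=random):
    """Two-step profiling: settle distance first, then degree."""
    if prev_best is None:
        dist, deg = rng.choice(DISTANCES), rng.choice(DEGREES)
    else:
        dist, deg = prev_best            # reuse the previous best setting
    i, t1 = hill_climb(DISTANCES, DISTANCES.index(dist),
                       lambda d: measure_ipc(d, deg))
    dist = DISTANCES[i]
    j, t2 = hill_climb(DEGREES, DEGREES.index(deg),
                       lambda n: measure_ipc(dist, n))
    return (dist, DEGREES[j]), t1 + t2


def toy_ipc(distance, degree):
    # Hypothetical single-peaked surface with its maximum at (16, 8).
    return 2.0 - 0.01 * abs(distance - 16) - 0.02 * abs(degree - 8)
```

Calling `profile(toy_ipc)` climbs to the peak of the toy surface from any random start, and starting from the previous best needs only the trial runs that confirm both neighbors are worse. With the 1 : 4 profiling-to-execution ratio, profiling occupies one fifth of each interval.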



- The performance curves share a common shape
- A curve rarely exhibits multiple local maxima
- The average number of trial runs is 3.77




- Insertion-only (prior work<sup>[4]</sup>)
  - Adjust the insertion location of prefetch data
  - Promote to MRU directly

[Figure: LRU stack, from MRU to LRU, showing where demand and prefetch blocks are inserted]

[4] Srinath et al., HPCA 2007

[5] Xie et al., ISCA 2009

#### Prefetch Partition (PP)

- Insight: prefetch data are often not reused after the initial demand hit
- Soft partition: adopt simple pseudo-partitioning from PIPP<sup>[5]</sup>
- Optimization: use only the top two policies
  - (MRU : LRU-4), (MRU : LRU)
  - Captures most of the benefit



#### **Evaluation**



- Full-system OoO simulation on *McSim + Gems + DRAMSim2*; *8 cores*, 128-entry ROB, 4 ways
- Three-level cache hierarchy
- Memory configuration

| Memory   | Configuration                        |
|----------|--------------------------------------|
| DDR      | 2 channels, DDR3-1600 (800 MHz)      |
| HBM      | 16 channels, HBM-1600 (800 MHz)      |
| fast HBM | HBM at 2x frequency (1600 MHz)       |
| half HBM | HBM with half the channels (8)       |

| Hybrid memory | Composition    |
|---------------|----------------|
| hybrid 1      | HBM + DDR      |
| hybrid 2      | fast HBM + DDR |
| hybrid 3      | half HBM + DDR |

- *Stream prefetchers* with 8 streams
- Benchmarks: SPEC CPU, mixed workloads
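
For a sense of the bandwidth range these configurations span, the sketch below estimates peak bandwidth from channel count and transfer rate. The bus widths (64 bits per DDR3 channel, 128 bits per HBM channel) and the reading of "1600 (800 MHz)" as 1600 MT/s are assumptions, since the slides give only channel counts and frequencies. The results are back-of-the-envelope figures, not numbers reported by the authors.

```python
def peak_bandwidth_gbs(channels, transfers_per_s, bus_bits):
    """Peak bandwidth in GB/s = channels x transfers/s x bytes per transfer."""
    return channels * transfers_per_s * (bus_bits / 8) / 1e9


# (channels, transfers per second, assumed bus width in bits)
CONFIGS = {
    "DDR": (2, 1600e6, 64),
    "HBM": (16, 1600e6, 128),
    "fast HBM": (16, 3200e6, 128),   # 2x frequency
    "half HBM": (8, 1600e6, 128),    # half the channels
}

BANDWIDTH = {name: peak_bandwidth_gbs(*cfg) for name, cfg in CONFIGS.items()}
```

Under these assumptions the HBM setup offers 16 times the peak bandwidth of the DDR setup (409.6 GB/s versus 25.6 GB/s), with fast HBM at twice and half HBM at half the HBM figure.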
















Average 10.0% performance improvement (avg. 12.4% on HBM) on diverse memory architectures compared to the prior approach


#### Summary



- Investigate how differences in memory architecture affect the optimal prefetching scheme
- Study how prefetching parameters can be adjusted dynamically and effectively
- Dynamic Prefetcher Reconfiguration
  - Effective search by random profiling
  - Prefetcher design on hybrid memory
  - Simple soft-partition mechanism to mitigate pollution
  - Average 10.0% performance improvement (avg. 12.4% on HBM) compared to the prior approach




