STORAGE DEVELOPER CONFERENCE



Andy Banta Storage Janitor, Magnition Inc @andybanta

# Optimizing Complex Hierarchical Memory Systems Through Simulations

Andy

Banta

## Magnition.io (Consultant) SolidFire (VMware development) DataGravity (Container exploitation lead) VMware (iSCSI Tech Lead) Sun Microsystems (Initial Fibre Channel development) Patent, early distributed network projects, data acquisition @andybanta









# The Challenge

Modern compute and storage system use multiple layers interacting in multiple ways

# HOW CAN CURRENT TECHNOLOGY ACHIEVE...

- Latency control
- Multi-tenant thrash remediation
- Correct tier sizing
- Workload-awareness
- Hot working set management
- Latency and throughput SLAs
- Memory capacity planning

## AS MORE HARDWARE LAYERS ADD COMPLEXITY?



## **ABOUT MAGNITION**





STORAGE PERFORMANCE, REINVENTED



## World's First Real-Time Data Placement Optimization Patented technology is a first for the industry.

## Proven At-Scale, with Production Workloads

Use customer traces to fully test diverse workloads in real-time.

## Peer-Reviewed and Published in Leading Journals

Multiple industry articles published and reviewed.

Award-Winning, Patented Technology 3-time award winner for innovative technology.



• • • • • • • • • • • • ۲ •

# **Stimulating Simulations**

Our Approach to Simulations



5 | ©2023 SNIA. All Rights Reserved.

# A different approach to optimization

- Compose simulations of complex memory and storage
- +Break the simulation into components
- Allows the components to be assembled like building blocks
- Provide reasonable but constrained set of variables
- Run simulations with synthetic data or actual IO traces





# Value of simulations

Faster and easier to prototype
Minimal up-front hardware spend
Great opportunities for optimizations
Loads of simulations are done at ASIC level
The same practices should apply to

The same practices should apply to component and software levels

## Choose three

- 1. Lower cost
- 2. Higher speed
- 3. More flexibility







## Provide a framework to connect components

Lingua Franca provides this

✦Reactors represent system pieces

+Library of components ready to use

Allows clients to build their own modules

## Basic set of building blocks

+Cache

Media

+Wire



## Provide a framework to connect components

Lingua Franca provides this

Reactors represent system pieces

+Library of ready components for use

+Allows clients to build their own modules

## Basic set of building blocks

+Cache

Media

✦Wire





## Provide a framework to connect components

Lingua Franca provides this

Reactors represent system pieces

Library of ready components for use

Allows clients to build their own modules

Basic set of building blocks





## Provide a framework to connect components

- Lingua Franca provides this
- Reactors represent system pieces
- +Library of ready components for use
- Allows clients to build their own modules
- Basic set of building blocks





## Provide a framework to connect components

- Lingua Franca provides this
- Reactors represent system pieces
- +Library of ready components for use
- Allows clients to build their own modules
- Basic set of building blocks





## Provide a framework to connect components

- Lingua Franca provides this
- Reactors represent system pieces
- +Library of ready components for use
- Allows clients to build their own modules
- Basic set of building blocks



Memory, disk, cloud storage
Introduce distinct delays

**✦**MQSim

## ✦Parallel access

✦Contention delays

✦Queueing

## Only need to simulate delay





Memory, disk, cloud storage
Introduce distinct delays

**✦**MQSim

## ✦Parallel access

✦Contention delays

✦Queueing

## Only need to simulate delay



Memory, disk, cloud storage
Introduce distinct delays

**✦**MQSim

## ✦Parallel access

✦Contention delays

✦Queueing

## Only need to simulate delay



Memory, disk, cloud storage
Introduce distinct delays

**✦**MQSim

## ✦Parallel access

✦Contention delays

✦Queueing

## Only need to simulate delay



23

Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology





Memory bus, disk controller, network
 Can multiplex and change form of IO request
 Even type of wire can be variable

✦Type of memory bus

✦Hops in network topology







# Gas, Grass or Cache

No free ride with hierarchical memory



# Cache component

Easily build basics like lookups, allocation, and eviction
One (or more) hit path
One (or more) miss path
Many choices for variability



# Cache allocation

- Another building block
- Variable vs fixed
  - Object or Block
- Memory schemes
  - Slab, memalloc, persistent memory
- SSD fill buffers



# Lookup

## Hashing and locating algorithm

## Miss algorithms

- Trigger allocation or eviction
- Trigger speculative fill
- Pending misses
- Ageing
  - Dependent on eviction
- Element invalidation



## 

## . . . . . . . . . . . . .

# **Evictions**

- Who to evict
- Culling invalidations
- SSD eviction



• • • • ۲ • • ۲

# Duplicity is tricky

Two-tier cache issues



30 | ©2023 SNIA. All Rights Reserved.

# Multi-layer cache complexity

- All of these variables become a huge matrix with multiple layers
  - Cache algorithms
  - Media types
  - Interconnect type
  - Shared vs distributed







# **Simulation Summation**

Modeling and collecting results for a two-layer cache



32 | ©2023 SNIA. All Rights Reserved.

# GUI demo

## Show the graphical representation of the two case configs

Video to be added





## 34 | ©2023 SNIA. All Rights Reserved.





cdn\_cache

## Cache drilldown

## UI demo

## 

# Simulation code

 UI generated from code

 Code simulates component eactor cdn\_cache (bank\_index:int(0), pop\_tier\_id:int(0), pop\_id:int(0), n\_ports:int(1), cache\_level:string("L1"), cache\_size:uint32\_t(4096), page\_size:uint32\_t( input io\_request:cdn\_cache\_request\_t; output io\_resp:cdn\_cache\_response\_t; output hit\_fetch\_request:cdn\_cache\_request\_t; input hit\_fetch\_response:cdn\_cache\_response\_t; output miss\_fetch\_request:cdn\_cache\_request\_t; input miss\_fetch\_response:cdn\_cache\_response\_t; output miss\_store\_request:p\_cdn\_cache\_entry\_t; input miss\_store\_response:p\_cdn\_cache\_entry\_t; output allocation\_request:uint32\_t; input allocation\_success:uint64\_t; input allocation\_failure:int; input end\_trigger:uint32\_t; logical action sch\_response(0):cdn\_cache\_response\_t; logical action sch\_eviction(0):cdn\_cache\_request\_t; logical action sch\_insertions(0):cdn\_cache\_entry\_t\*; state free\_space:uint32\_t(0); CacheCtrl = new controller<int> (cache\_level = cache\_level, name = "cache\_controller", pop\_tier\_id = pop\_tier\_id, pop\_id = pop\_id, cache\_id = bank\_index, log LookUp = new lookup<cdn cache\_request\_t, p\_cdn\_cache\_entry\_t, cdn\_response\_tuple\_t> (n\_ports = 1, log\_level = {=LOG\_DEBUG\_LEVEL=}); Eviction = new eviction<p\_cdn\_cache\_entry\_t, cdn\_cache\_request\_t, cdn\_response\_tuple\_t> (evict\_methods = {=&lru\_eviction\_methods=}, n\_ports = 1, eviction\_typ Allocator = new allocator<cdn\_cache\_request\_t, p\_cdn\_cache\_entry\_t, cdn\_response\_tuple\_t> (log\_level = {=LOG\_DEBUG\_LEVEL=});



# Workloads matter

- No artificial workloads
- Content delivery
- Multiple sources
- (Need a CDN trace, not a syscall trace)

| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
|-------------------------------------------|-----|-----|------|---------|-------|
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     |     | 0   |      |         |       |
|                                           |     |     |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   | =   | 0   | (Tim | eout)   |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   | =   | 0   | (Tim | eout)   |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    |     |     |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   |     | 0   | (Tim | eout)   |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   | =   | 0   | (Tim | eout)   |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
|                                           |     |     |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    |     |     |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   | =   | 0   | (Tim | eout)   |       |
| poll([{fd=61, events=POLLIN}], 1, 3000)   | =   | 0   | (Tim | eout)   |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    | =   | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    |     |     |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| semop(8126470, [{0, -1, SEM_UNDO}], 1)    |     | 0   |      |         |       |
| semop(8126470, [{0, 1, SEM_UNDO}], 1)     | =   | 0   |      |         |       |
| [poll([{fd=61, events=POLLIN}], 1, 3000^C | Cst | tra | ace: | Process | 14046 |
|                                           |     |     |      |         |       |



detached

# Matrix of results

- Number of PoPs
- Different L1 vs L2 algorithms

| Miss Ratio                        |         |       |       |          |              |                       |           |       |       |       |         |  |  |  |
|-----------------------------------|---------|-------|-------|----------|--------------|-----------------------|-----------|-------|-------|-------|---------|--|--|--|
| L1:randLB, LRU; L2:urlhashLB, LRU |         |       |       |          |              |                       |           |       |       |       |         |  |  |  |
| 10                                | - 0.816 | 0.809 | 0.807 | 0.806    | 0.802        | 0.799                 | 0.796     | 0.795 | 0.792 | 0.790 | - 0.975 |  |  |  |
| 6                                 | - 0.821 | 0.815 | 0.812 | 0.811    | 0.806        | 0.804                 | 0.803     | 0.803 | 0.797 | 0.796 | - 0.950 |  |  |  |
| ω                                 | - 0.832 | 0.823 | 0.822 | 0.821    | 0.817        | 0.815                 | 0.812     | 0.811 | 0.808 | 0.804 | - 0.950 |  |  |  |
| caches                            | - 0.846 | 0.837 | 0.834 | 0.832    | 0.828        | 0.827                 | 0.825     | 0.822 | 0.818 | 0.816 | - 0.925 |  |  |  |
| ° 2                               | - 0.853 | 0.843 | 0.844 | 0.843    | 0.837        | 0.833                 | 0.830     | 0.831 | 0.827 | 0.824 | - 0.900 |  |  |  |
| Number of<br>4 5                  | - 0.868 | 0.861 | 0.858 | 0.858    | 0.853        | 0.851                 | 0.846     | 0.845 | 0.840 | 0.838 | - 0.875 |  |  |  |
| Num<br>4                          | - 0.886 | 0.881 | 0.879 | 0.874    | 0.871        | 0.867                 | 0.864     | 0.862 | 0.856 | 0.854 | - 0.850 |  |  |  |
| m                                 | - 0.912 | 0.905 | 0.901 | 0.897    | 0.890        | 0.885                 | 0.882     | 0.881 | 0.876 | 0.871 |         |  |  |  |
| 7                                 | - 0.950 | 0.939 | 0.931 | 0.924    | 0.918        | 0.914                 | 0.910     | 0.908 | 0.905 | 0.900 | - 0.825 |  |  |  |
| Ч                                 | - 0.990 | 0.970 | 0.961 | 0.956    | 0.952        | 0.949                 | 0.946     | 0.944 | 0.942 | 0.940 | - 0.800 |  |  |  |
|                                   | 1       | 2     | 3     | 4<br>Num | ہٰ<br>ber of | <sup>6</sup><br>L1 ca | ל<br>ches | 8     | 9     | 10    |         |  |  |  |



**SD@** 

# Matrix of results

- Number of PoPs
- Different L1 vs L2 algorithms

| Miss Ratio<br>L1:urlhashLB, LRU; L2:urlhashLB, LRU |         |       |       |       |       |       |       |       |       |       |  |         |
|----------------------------------------------------|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--|---------|
| 10                                                 | - 0.816 | 0.867 | 0.906 | 0.928 | 0.945 | 0.959 | 0.964 | 0.976 | 0.980 | 0.987 |  | - 1.000 |
| ი                                                  | - 0.821 | 0.876 | 0.914 | 0.936 | 0.952 | 0.966 | 0.970 | 0.979 | 0.987 | 0.986 |  | - 0.975 |
| ω                                                  | - 0.832 | 0.887 | 0.924 | 0.946 | 0.961 | 0.973 | 0.976 | 0.988 | 0.988 | 0.991 |  | - 0.950 |
| caches                                             | - 0.846 | 0.900 | 0.937 | 0.957 | 0.970 | 0.979 | 0.987 | 0.989 | 0.992 | 0.993 |  |         |
| L2<br>6                                            | - 0.853 | 0.910 | 0.947 | 0.966 | 0.977 | 0.989 | 0.987 | 0.993 | 0.995 | 0.996 |  | - 0.925 |
| of                                                 | - 0.868 | 0.925 | 0.960 | 0.976 | 0.989 | 0.991 | 0.992 | 0.996 | 0.997 | 0.998 |  | - 0.900 |
| Numk<br>4                                          | - 0.886 | 0.945 | 0.973 | 0.989 | 0.992 | 0.995 | 0.995 | 0.999 | 0.998 | 0.999 |  | - 0.875 |
| m                                                  | - 0.912 | 0.965 | 0.991 | 0.993 | 0.996 | 0.998 | 0.998 | 0.999 | 0.999 | 0.999 |  |         |
| 7                                                  | - 0.950 | 0.990 | 0.996 | 0.999 | 0.999 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 |  | - 0.850 |
| Ч                                                  | - 0.993 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |  | - 0.825 |
| 1 2 3 4 5 6 7 8 9 10<br>Number of L1 caches        |         |       |       |       |       |       |       |       |       |       |  |         |

# Heat map

Ability to compare sets of results

| Miss Ratio Difference (L1 randLB vs. urlhashLB) |        |        |        |        |        |        |        |        |        |        |  |   |       |
|-------------------------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--|---|-------|
| 10                                              | 0.000  | -0.059 | -0.100 | -0.122 | -0.143 | -0.159 | -0.168 | -0.181 | -0.188 | -0.198 |  |   | 0.000 |
| o -                                             | 0.000  | -0.061 | -0.102 | -0.125 | -0.146 | -0.161 | -0.167 | -0.176 | -0.190 | -0.190 |  | - | 0.025 |
| - ∞                                             | 0.000  | -0.063 | -0.102 | -0.125 | -0.144 | -0.158 | -0.165 | -0.178 | -0.180 | -0.186 |  |   | 0.050 |
| ches                                            | 0.000  | -0.062 | -0.102 | -0.124 | -0.142 | -0.152 | -0.162 | -0.167 | -0.174 | -0.177 |  |   | 0.075 |
| Number of L2 caches                             | 0.000  | -0.066 | -0.103 | -0.123 | -0.140 | -0.156 | -0.157 | -0.162 | -0.168 | -0.172 |  |   |       |
| ber of<br>5                                     | 0.000  | -0.063 | -0.102 | -0.118 | -0.137 | -0.140 | -0.146 | -0.150 | -0.156 | -0.160 |  |   | 0.100 |
| Num<br>4                                        | 0.000  | -0.064 | -0.094 | -0.115 | -0.121 | -0.128 | -0.132 | -0.137 | -0.142 | -0.144 |  |   | 0.125 |
| m -                                             | 0.000  | -0.060 | -0.090 | -0.096 | -0.107 | -0.114 | -0.116 | -0.118 | -0.123 | -0.128 |  |   | 0.150 |
| - 7                                             | 0.000  | -0.051 | -0.066 | -0.075 | -0.081 | -0.086 | -0.089 | -0.092 | -0.095 | -0.100 |  |   | 0.175 |
|                                                 | -0.003 | -0.028 | -0.038 | -0.044 | -0.048 | -0.051 | -0.054 | -0.056 | -0.058 | -0.060 |  |   |       |
| 1 2 3 4 5 6 7 8 9 10<br>Number of L1 caches     |        |        |        |        |        |        |        |        |        |        |  |   |       |



• • • • • • • • • • •

# Solving the Halting Problem

Wrap up and conclusions



40 | ©2023 SNIA. All Rights Reserved.

# RESULTS WITH MAGNITION

As an example, a current customer has achieved the following measurable outcomes with Magnition:

## Experiments per day per engineer:

- Without Magnition: **2**
- With Magnition: 50,000+

Parameter variations tested **before prod release**:

- Without Magnition: 50
- With Magnition: **1,000,000+**

Workload performance improvement using our products to find **optimal out-of-the-box settings**: **10-50%+** 







# Today I Learned

1.Complex systems can be modularized rapidly into simulated components

2.Real-world problems can be analyzed efficiently using simulations

3.Modern simulators allow faster and more thorough cost and performance analysis than direct experimentation





# Please take a moment to rate this session.

Your feedback is important to us.



# Section Title

Section Subtitle



44 | ©2021 Storage Developer Conference ©. Insert Company Name Here. All Rights Reserved.

• ۲ ۲ • • • • • • • • • • ۲ • 🕘 • . ۲ •

# **Section Title**

Section Subtitle



45 | ©2023 SNIA. All Rights Reserved.

## 

# Light Slide Title

## Bullets 1

- Bullets 2
  - Bullets 3
    - Bullets 4
      - Bullets 5



# Dark Slide Title

## Bullets 1

- Bullets 2
  - Bullets 3
    - Bullets 4
      - Bullets 5

