STORAGE DEVELOPER CONFERENCE



BY Developers FOR Developers

## CXL and the Art of Hierarchical Memories

Their Management and Use

Andy Banta, Storage Janitor, Magnition Inc.

Kin-Yip Liu, Senior Director Solutions Architecture, AMD

#### Andy Banta

- Magnition.io (Consultant)
- SolidFire (VMware development, acq. by NetApp)
- DataGravity (Container exploitation lead)
- VMware (iSCSI Tech Lead, IPO)
- Sun Microsystems (Initial Fibre Channel development)
- Patent, early distributed network projects, data acquisition
- @andybanta







#### Kin-Yip Liu

- AMD (Sr. Director, Solutions Architecture; Networking & Storage)
- Intel (Sr. Director, Architecture; Persistent Memories)
- Marvell/Cavium Networks (Sr. Director, Solutions Architecture; Networking, Security, 3G/LTE/5G Infrastructure, Telco NFV)
- Intel (Architect, Designer; Server/Network/Mobile Processors)
- kin-yip.liu@amd.com






#### NUMA overview

- Non-Uniform Memory Access
- Not all memory access is created equal

Output of `numactl -H`: 4 nodes available (0-3).

| Node | CPUs  | Size (MB) | Free (MB) |
|------|-------|-----------|-----------|
| 0    | 0-15  | 64057     | 48756     |
| 1    | 16-31 | 64003     | 50473     |
| 2    | none  | 255921    | 255918    |
| 3    | none  | 255623    | 255631    |

Node distances:

| Node | 0  | 1  | 2  | 3  |
|------|----|----|----|----|
| 0    | 10 | 21 | 17 | 28 |
| 1    | 21 | 10 | 28 | 17 |
| 2    | 17 | 28 | 10 | 28 |
| 3    | 28 | 17 | 28 | 10 |



#### **CXL Roadmap Drives Memory Hierarchy Innovation**

#### CXL 1.1

- Coherent memory expansion
- Boost capacity and/or bandwidth

#### CXL 2.0

- Pooling, CXL switches
- Improve overall TCO and memory utilization

#### CXL 3.0

- Coherent memory sharing
- New and fast sharing of data

\* Only a subset of CXL features and benefits are highlighted here.



#### **CXL Enables More Memory and Hierarchy Options**





#### Workload Performance Tuning Considerations

#### Memory Performance

- Characteristics: data rate, latency, read vs. write performance, access granularity, persistency. CXL adds variety, abstraction
- Memory channels, DIMMs per channel, module slots
- Capacity vs. Bandwidth boost. Interleaving options

#### NUMA, Affinity, Latency Optimization

- NUMA within socket, across socket, beyond compute node
- Compute and memory bandwidth allocation per NUMA node
- Scheduling processes to NUMA nodes. Dynamic realignment

#### Data Management

- Efficient tracking of hot/cold data, and migration among tiers
- Telemetry and workload profiles
- Accelerator, compute-in-storage



- Understand that not all memory access is equal
- Developers need to understand and deal with the differences
  - Otherwise they risk inconsistent performance results
- Segregation
- Tiering
- Dynamic re-alignment



#### Segregation

#### System level assignment

- Assign VMs memory from a specific NUMA node
- Spread VMs across NUMA nodes and assign memory
- Assign processes memory from specific NUMA nodes

#### Application-based selection

Use libraries inside applications to tier memory



#### Assigning NUMA affinity in VMware

- Advanced settings on a VM configuration
- Without an affinity setting, VMware chooses the memory placement, which can lead to unpredictable performance

Modify or add configuration parameters as needed for experimental features or as instructed by technical support. Empty values will be removed (supported on ESXi 6.0 and later).

Add New Configuration Params:

| Name                   | Value                    |
|------------------------|--------------------------|
| sched.mem.lpage.enable | TRUE                     |
| numa.nodeAffinity      | (not legible in source)  |

Existing parameters:

| Name                  | Value                 |
|-----------------------|-----------------------|
| nvram                 | FGT-TIGER-14-39.nvram |
| svga.present          | TRUE                  |
| pciBridge0.present    | TRUE                  |
| pciBridge4.present    | TRUE                  |
| pciBridge4.virtualDev | pcieRootPort          |

**Configuration Parameters** 

Figure courtesy of Fortinet




#### Configuring VMs across NUMA Nodes

- Large VMs can be split across NUMA nodes
- Memory affinity for each virtual CPU stays on its node

CPU Topology:

| Setting           | Value                                                                                                     |
|-------------------|-----------------------------------------------------------------------------------------------------------|
| CPU               | 12                                                                                                        |
| Cores per Socket  | 6 (Sockets: 2). The manual configuration for cores per socket might result in reduced performance.        |
| CPU Hot Plug      | Enable CPU Hot Add                                                                                        |
| NUMA Nodes        | 2 (Cores per NUMA node: 6). The manual configuration for NUMA nodes might result in reduced performance.  |
| Device Assignment | Manually assign devices to NUMA nodes.                                                                    |

| Device Name         | NUMA Node  |
|---------------------|------------|
| SCSI controller 0   | Unassigned |
| Network adapter 1   | Unassigned |
| USB xHCI controller | Unassigned |



### Node-picking in Linux

#### numactl(8)

- Process level
- --cpunodebind
- --membind
- --localalloc
- --preferred
- --interleave





#### **Application-level segregation**

#### SNIA PMDK

- For persistent memory
- VMware presents pmem resources to VMs

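```c
/* Excerpt. Assumes cfg and src were created earlier (pmem2_config_new(),
 * pmem2_source_from_fd()) and that map and persist are declared. */

/* Accept any store granularity up to page granularity
 * (the least strict requirement). */
if (pmem2_config_set_required_store_granularity(cfg,
		PMEM2_GRANULARITY_PAGE)) {
	pmem2_perror("pmem2_config_set_required_store_granularity");
	exit(1);
}

/* Map the source into the address space. */
if (pmem2_map_new(&map, cfg, src)) {
	pmem2_perror("pmem2_map_new");
	exit(1);
}

char *addr = pmem2_map_get_address(map);
size_t size = pmem2_map_get_size(map);

strcpy(addr, "hello, persistent memory");

/* Flush the written data using the persist function
 * libpmem2 selects for this mapping. */
persist = pmem2_get_persist_fn(map);
persist(addr, size);
```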



- Hit-n-miss
- Promotion and demotion
- Complexity of tiering
  - How rapidly this becomes an unmanageable problem



#### Hit or miss

- Populate based on use or prediction
- Variety of algorithms for lookup, allocation, eviction and aging
- Can be tuned for workload






#### Promotion and demotion

- Less frequently used
- Uses less space
- Requires more data movement

#### SDXI (Smart Data Accelerator Interface)

- Rapid memory to memory data mover
- SNIA working group
- "Most Innovative" at Flash Memory Summit in June 2023
- Tuesday's SDXI talk at SDC





#### **Built-in caching**

- VMware automatic memory tiering
- Linux: numactl(8) and the kernel allow built-in tiering
  - Promote/demote
  - Based on node distance

Host Summary tab:

| Field              | Value     |
|--------------------|-----------|
| Logical Processors | 96        |
| NICs               | 3         |
| Virtual Machines   | 1         |
| Memory Tiering     | Hardware  |
| State              | Connected |
| Uptime             | 6 hours   |

You can also view the size of DRAM and PMEM under Configure > Hardware > Overview > Memory.

Configure > Hardware > Overview > Memory:

| Field            | Value                   |
|------------------|-------------------------|
| Total            | 503.68 GB               |
| System           | 385.17 MB               |
| Virtual machines | 503.3 GB                |
| Memory Tiering   | Hardware                |
| Tier 0           | 256 GB DRAM (Cache)     |
| Tier 1           | 503.67 GB PMem (Memory) |



#### Multi-level caching

- Currently used in Content Delivery Networks
- Useful for new HPC and large-scale hosts for main memory





#### Multi-level caching within CXL hosts









24 | ©2023 SNIA. All Rights Reserved.






 



 

7 different NUMA nodes plus contention



## Modeling and optimizing

- Workload dependent
- Static vs. dynamic reallocation



## Workloads matter

- No artificial workloads
- Content delivery
- Inference

- Learning
- File serving

semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UND0}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UND0}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) poll([{fd=61, events=POLLIN}], 1, 3000) = 0 (Timeout) semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UND0}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, -1, SEM\_UNDO}], 1) = 0 semop(8126470, [{0, 1, SEM\_UNDO}], 1) = 0 [poll([{fd=61, events=POLLIN}], 1, 3000^Cstrace: Process 14046 detached



## Static Analysis with Simulations

- Determine initial configurations
- Build behavioral simulators
- Mix pre-built components and custom as needed













































## Code as System Simulation

- Cheaper, faster, more flexible than system building
- Engineering design uses simulations. Why not software?





## Details, details

- Each component can be modeled
- Variables are easy to introduce





## Results are easy to compare

- Run millions of runs
- More variables = more options

AVG RTT (microseconds). L1: urlhashLB, LRU; L2: urlhashLB, LRU. Rows: number of L2 caches. Columns: number of L1 caches.

| L2 \ L1 | 1         | 2         | 3         | 4         | 5         | 6         | 7         | 8         | 9         | 10        |
|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 1       | 76318.700 | 74428.500 | 72901.600 | 70929.100 | 69798.700 |           | 67696.800 | 66969.100 | 66322.400 | 65513.000 |
| 2       | 76550.700 | 74428.500 | 72901.600 | 70929.100 | 69798.700 |           | 67696.800 | 66969.100 | 66322.400 | 65513.000 |
| 3       | 77460.700 | 74076.400 | 72901.600 | 70929.100 | 69798.700 |           | 67734.000 | 66969.100 | 66322.400 | 65513.000 |
| 4       | 78335.600 | 74152.200 | 73351.500 | 70929.100 | 69798.700 |           | 67786.600 | 66969.100 | 66322.400 | 65513.000 |
| 5       | 78335.600 | 74152.200 | 73351.500 | 70967.800 | 69798.700 |           | 67786.600 | 66969.100 | 66314.800 | 65513.000 |
| 6       | 78599.100 | 74134.600 | 73316.300 | 70965.800 | 69705.000 |           | 67786.600 | 66969.100 | 66322.400 | 65513.000 |
| 7       | 78423.900 | 74372.400 | 73521.300 | 70909.600 | 69709.500 |           | 67696.800 | 66969.100 | 66314.800 | 65513.000 |
| 8       | 78423.900 | 74398.400 | 73336.100 | 70945.800 | 69708.300 |           | 67786.600 | 66969.100 | 66318.500 | 65513.000 |
| 9       | 78492.200 | 74341.000 | 73483.100 | 71068.300 | 69840.000 |           | 67760.800 | 67041.600 | 66322.400 | 65513.000 |
| 10      | 78433.700 | 74424.900 | 73339.200 | 70878.100 | 69759.200 | 68541.900 | 67765.800 | 67036.000 | 66314.300 | 65513.000 |



## Dynamic reconfiguration in running environment

- Useful for running varied workloads
- Make the most of existing hardware
- Limited to software and sizing changes







## Working at enterprise scale

- 100+ cores
- 100+ PCIe lanes
  - CXL capabilities
  - Network capabilities
  - Not limited to memory or IO bound loads
- Clusters of numerous nodes



## What the future holds



7 different NUMA nodes plus contention










- CXL offers a lot of flexibility and complexity
- OS vendors are helping
- More flexibility requires more system design
- Optimized system design requires simulations



## RESULTS WITH MAGNITION

As an example, a current customer has achieved the following measurable outcomes with Magnition:

## Experiments per day per engineer

- Without Magnition: **2**
- With Magnition: 50,000+

Parameter variations tested before prod release

- Without Magnition: 50
- With Magnition: **1,000,000+**

Workload performance improvement using our products to find **optimal out-of-the-box settings**: **10-50%+** 







## AMD Summary


- CXL is a high-performance interconnect standard with strong industry support and a roadmap for driving innovation in system architecture and memory hierarchy
- AMD is a member of the CXL Consortium Board of Directors and supports CXL in current and roadmap products
- AMD works with a rich set of CXL ecosystem partners to drive innovative solutions for a variety of applications, especially in the storage segment

















## **Considerations for Hierarchy Options**

- Performance, read vs. write, granularity
- Latencies
- TCO
- Software tier management
- Flexibility for dynamic changes
- Mix of different parts with different performance characteristics
- Interleaving
- Optimization focus, capacity vs. bandwidth
- Persistency
- NUMA

