Stateful Multi-Pipelined Programmable Switches

Vishal Shrivastav
Purdue University
Motivating Example

Consider a packet processing program:

- Switch maintains packet counters for each destination IP
- If the counter value for destination d exceeds a threshold
  - Switch drops all subsequent packets destined to d

```
if C > threshold
  Mark packet "to drop"
```
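The program above can be sketched in a few lines of Python. This is only an illustration on a single pipeline: the array size of 4 matches the indices shown on the slides, while `THRESHOLD`, `slot_for`, `SinglePipeline`, and the use of CRC32 as the hash are assumptions made for the example.

```python
import zlib

NUM_SLOTS = 4   # register array size; the slides show indices 0..3
THRESHOLD = 2   # hypothetical threshold, chosen only for this example


def slot_for(dst: str) -> int:
    """Map a destination IP to a counter index (CRC32 is an assumed hash)."""
    return zlib.crc32(dst.encode()) % NUM_SLOTS


class SinglePipeline:
    def __init__(self) -> None:
        self.counters = [0] * NUM_SLOTS

    def process(self, dst: str) -> str:
        """Count the packet, then decide whether to mark it "to drop"."""
        i = slot_for(dst)
        self.counters[i] += 1
        return "drop" if self.counters[i] > THRESHOLD else "forward"


pipeline = SinglePipeline()
verdicts = [pipeline.process("10.0.0.1") for _ in range(4)]
print(verdicts)
```

With this threshold the first two packets are forwarded and every later packet to the same destination is marked "to drop".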

[Figure: switch pipeline. Packets from Port 0 and Port 1 index a counter array (indices 0 to 3, all initially 0) using hash(dst); the next stage marks the packet “to drop” if the counter C > threshold.]

[Figure, next step: packets with destination X arrive on Port 0 and Port 1; hash(D: X) = 1, and counter 1 is shown incremented from 0 to 1.]
Reality of Today’s Switch Hardware

- Clock speed of a single pipeline has saturated
  - Limits the line rate

- Employ multiple **parallel pipelines** to sustain multi-Tbps line rates
  - Each pipeline processes packets **independently**, with no coordination
Goal

[Figure: a program written for a logical single large pipeline running at rate R is mapped onto parallel pipelines that each run at rate R/4.]

Functional Equivalence
Runtime behavior of program same as on a single large pipeline

Performance Equivalence
Program runs as close as possible to the rate of a single large pipeline, i.e., R, without violating functional equivalence

Our Contribution

We present **MP5**, a new switch design that extends the **architecture, compiler, and runtime** of current programmable switches to guarantee **functional equivalence** with **high performance**.
Naive Approaches

Consider a *stateless* packet processing program:

- Switch increments the ttl value in packet header by 1
- If ttl value exceeds a threshold
  - Switch drops the packet

**Try 1: Replicate stateless processing on all pipelines**
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Replicate stateless processing</td>
<td>✓ (stateless)</td>
<td>✓ (stateless)</td>
</tr>
</tbody>
</table>
Naive Approaches

Consider a *stateful* packet processing program:

- Switch maintains packet counters for each destination IP
- If the counter value for destination $d$ exceeds a threshold
  - Switch drops all subsequent packets destined to $d$


**Try 1: Replicate stateful processing on all pipelines**

Violates functional equivalence!
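
A small sketch of why replication breaks the counter program: if every pipeline keeps a private copy of the counters, packets to the same destination that arrive on different ports are counted separately. The function `run`, the threshold of 2, and the alternating arrival pattern are assumptions for illustration only.

```python
THRESHOLD = 2  # hypothetical threshold


def run(packets, num_pipelines):
    """Process (port, dst) packets with one private counter table per pipeline."""
    counters = [dict() for _ in range(num_pipelines)]
    verdicts = []
    for port, dst in packets:
        table = counters[port % num_pipelines]  # the packet's ingress pipeline
        table[dst] = table.get(dst, 0) + 1
        verdicts.append("drop" if table[dst] > THRESHOLD else "forward")
    return verdicts


# Four packets to destination X, alternating between Port 0 and Port 1.
packets = [(0, "X"), (1, "X"), (0, "X"), (1, "X")]

single = run(packets, num_pipelines=1)      # reference: one large pipeline
replicated = run(packets, num_pipelines=2)  # counters replicated per pipeline

print(single)
print(replicated)
```

The single pipeline marks the third and fourth packets "to drop", while the replicated version forwards all four because neither private counter ever exceeds the threshold.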

**Try 2: Limit stateful processing to a single “shared” pipeline**
Steer all packets to the “shared” pipeline

[Figure: packets with destination X arrive on Port 0 and Port 1 and are both steered to the shared pipeline, where hash(D: X) = 1, counter 1 reads 2, and the check “if C > threshold, mark packet ‘to drop’” is applied.]

Limits speed of stateful processing!
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Stateless</td>
<td>Stateful</td>
</tr>
<tr>
<td>Replicate stateless processing</td>
<td>✔</td>
<td>✔</td>
</tr>
<tr>
<td>+ Limit stateful processing to single pipeline</td>
<td>✔</td>
<td>✔</td>
</tr>
</tbody>
</table>
Question

How to improve performance?
(without violating functional equivalence)
Problem

How to store shared state in a way that enables high packet-processing throughput?
Solution

How to store shared state in a way that enables high packet-processing throughput?

**Shard** the shared state across pipelines

[Figure: the counter array is sharded across two pipelines. A packet with destination X arrives on Port 0 with hash(D: X) = 0, and a packet with destination Y arrives on Port 1 with hash(D: Y) = 3; each pipeline applies “if C > threshold, mark packet ‘to drop’”.]

...but what is the optimal sharding strategy?

[Figure: a sharding that is labeled optimal when Port 0 sees hash(D: X) = 0 and Port 1 sees hash(D: Y) = 3 is contrasted with a different arrival pattern, hash(D: Z) = 2 on Port 0 and hash(D: Y) = 3 on Port 1, for which counters 0 to 2 are shown on one pipeline and counter 3 on the other.]

Ensure state accesses are uniformly distributed across pipelines

...depends upon the packet arrival pattern (hard to predict)
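
To make the dependence on arrival pattern concrete, the sketch below counts state accesses per pipeline for one fixed sharding under two access patterns. The particular mapping in `static_map` (indices 0 and 1 on pipeline 0, indices 2 and 3 on pipeline 1) and both patterns are assumptions loosely modeled on the hash values shown on the slides.

```python
NUM_PIPELINES = 2


def pipeline_load(shard_map, accesses):
    """Count state accesses per pipeline for a given index -> pipeline map."""
    load = [0] * NUM_PIPELINES
    for index in accesses:
        load[shard_map[index]] += 1
    return load


# Assumed static sharding: indices 0, 1 on pipeline 0 and indices 2, 3 on pipeline 1.
static_map = {0: 0, 1: 0, 2: 1, 3: 1}

pattern_a = [0, 3, 0, 3]  # e.g. hash(D: X) = 0 and hash(D: Y) = 3
pattern_b = [2, 3, 2, 3]  # e.g. hash(D: Z) = 2 and hash(D: Y) = 3

load_a = pipeline_load(static_map, pattern_a)
load_b = pipeline_load(static_map, pattern_b)
print(load_a, load_b)
```

The same sharding is balanced for the first pattern and sends every access to one pipeline for the second, which is why a fixed choice cannot be optimal in general.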

**Dynamically shard** the shared state across pipelines by **monitoring** the state access patterns at runtime

This reduces to a variant of the **bin packing** problem (NP-hard!)

MP5 uses a heuristic to approximate bin packing that is amenable to fast hardware implementation.
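
The slides do not spell out MP5's heuristic. As a stand-in, the sketch below uses a textbook greedy load-balancing rule (heaviest state first, always onto the currently least-loaded pipeline) to show what re-sharding from monitored access counts could look like. The function `greedy_shard` and the access counts are hypothetical and are not claimed to match the actual hardware design.

```python
def greedy_shard(access_counts, num_pipelines):
    """Assign each state index to the least-loaded pipeline, heaviest first."""
    load = [0] * num_pipelines
    shard_map = {}
    # Sort by descending access count; break ties by index for determinism.
    for index, count in sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_pipelines), key=lambda p: load[p])
        shard_map[index] = target
        load[target] += count
    return shard_map, load


# Hypothetical access counts observed over one monitoring window.
access_counts = {0: 1, 1: 1, 2: 6, 3: 8}
shard_map, load = greedy_shard(access_counts, num_pipelines=2)
print(shard_map, load)
```

For these counts the greedy rule places index 3 alone on one pipeline and indices 0, 1, and 2 on the other, giving each pipeline a load of 8.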
One Missing Detail

A packet and its corresponding shared state may be on different pipelines!

A packet may need to go back and forth between pipelines to access its shared states!

How to steer packets to a shared state in a remote pipeline?
Existing Solution

How to steer packets to a shared state in a remote pipeline?

Packet Re-circulation

Packet Re-circulation results in **throughput penalty** and **increased latency**

...because packets revisit the same stages multiple times!

Need a **feed-forward-only** packet steering design

**Current switch design**

A packet in stage \( i \) of pipeline \( j \) can move only to stage \( i+1 \) of the same pipeline \( j \).
Our Solution

How to steer packets to a shared state in a remote pipeline?

**Feed-forward-only packet steering design**

A packet in stage $i$ of pipeline $j$ can move to stage $i+1$ of any pipeline.
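
The feed-forward rule can be illustrated by tracing which pipeline a packet occupies at each stage. In this sketch the packet crosses to whichever pipeline owns the state it needs at the next stage and otherwise stays put, so every stage is visited exactly once. The function `feed_forward_path`, the `owner` layout, and the stay-in-place rule for stateless stages are assumptions for illustration.

```python
def feed_forward_path(ingress_pipeline, needed_state, owner, num_stages):
    """Return the pipeline a packet occupies at each stage, visiting each stage once.

    needed_state: stage -> state index the packet accesses there (if any)
    owner: (stage, index) -> pipeline that holds that state shard
    """
    path = []
    current = ingress_pipeline
    for stage in range(num_stages):
        if stage in needed_state:
            # Cross over to whichever pipeline owns the state for this stage.
            current = owner[(stage, needed_state[stage])]
        path.append(current)
    return path


# Hypothetical layout: two stateful stages whose shards live on different pipelines.
owner = {(1, 3): 1, (2, 0): 0}
path = feed_forward_path(ingress_pipeline=0, needed_state={1: 3, 2: 0},
                         owner=owner, num_stages=4)
print(path)
```

The path has one entry per stage, in contrast to re-circulation, where a packet would pass through the same stages more than once.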
Question Re-visited

How to improve performance? (without violating functional equivalence)
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Stateless</td>
<td>Stateful</td>
</tr>
<tr>
<td>Replicate stateless processing</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Limit stateful processing to single pipeline</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Dynamic state sharding</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Feed-forward pkt steering</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Replicate stateless processing</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>+ Limit stateful processing to single pipeline</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>+ Dynamic state sharding</td>
<td>✓</td>
<td>?</td>
</tr>
<tr>
<td>&amp; Feed-forward pkt steering</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Stateless</td>
<td>Stateful</td>
</tr>
<tr>
<td>Replicate stateless processing</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Limit stateful processing to single pipeline</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Dynamic state sharding</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<td>Feed-forward pkt steering</td>
<td>✓</td>
<td>✗</td>
</tr>
</tbody>
</table>
Problem

Each pipeline can process 1 packet per time unit

On a single-pipelined switch, D will always access register index 1 in stage 2 before E

With multiple pipelines, E may access index 1 in stage 2 before D!
(may violate functional equivalence)

Packet re-ordering can also impact application performance,
e.g., if D and E belong to the same TCP flow
Problem

How to avoid packet re-ordering and out-of-order state access?
Solution

How to avoid packet re-ordering and out-of-order state access?

Enforce ordering **preemptively** (i.e., before a packet reaches a stateful stage)

Step 1: Preemptively figure out all states a packet would access

Hard in general (even impossible in some cases)

**Insight:** Most packet processing programs compute the register index as a hash of a subset of packet header fields

...which can be known as soon as a packet arrives at the switch

The compiler adds a new stage, placed before any stateful stage, that computes the state index.

[Figure: on each pipeline (Port 0 and Port 1), the new stage computes state index = hash(p.hdr) before the packet reaches the stateful stages that hold the sharded register arrays.]
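
A minimal sketch of such an index-resolution step, assuming the index depends only on a hash of selected header fields. The function `resolve_state_index`, the choice of `dst` as the hashed field, and CRC32 as the hash are illustrative assumptions; the point is that no register state is read to compute the index.

```python
import zlib

NUM_SLOTS = 4  # register array size, matching indices 0..3 on the slides


def resolve_state_index(hdr, fields=("dst",)):
    """Hash a subset of header fields to a register index, without touching state."""
    key = "|".join(str(hdr[f]) for f in fields)
    return zlib.crc32(key.encode()) % NUM_SLOTS


pkt_a = {"src": "10.0.0.1", "dst": "10.0.0.9", "ttl": 64}
pkt_b = {"src": "10.0.0.2", "dst": "10.0.0.9", "ttl": 12}

# Both packets map to the same index because only `dst` feeds the hash.
same_index = resolve_state_index(pkt_a) == resolve_state_index(pkt_b)
print(same_index)
```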

Step 2: Enforce ordering in the stateful stages

Timestamp packets in the new stage? Won't work!

Generate "placeholders" for data packets

[Figure: on each pipeline (Port 0 and Port 1), the new stage computes state index = hash(p.hdr) and generates a phantom packet for each data packet.]

Order enforced
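
One way to picture how placeholders can enforce order is a per-stage FIFO: the placeholder reserves a slot in switch-arrival order, and the stateful stage only processes a data packet once it is at the head of that queue. The slides do not detail the mechanism, so the class `StatefulStage`, its `reserve` and `deliver` methods, and the queueing rule are hypothetical and should be read as one plausible interpretation.

```python
from collections import deque


class StatefulStage:
    def __init__(self):
        self.queue = deque()   # placeholders, in switch-arrival order
        self.arrived = set()   # data packets that have reached this stage
        self.processed = []    # order in which state was actually accessed

    def reserve(self, pkt_id):
        """A phantom packet reserves the data packet's place in line."""
        self.queue.append(pkt_id)

    def deliver(self, pkt_id):
        """A data packet arrives; process as far as the queue head allows."""
        self.arrived.add(pkt_id)
        while self.queue and self.queue[0] in self.arrived:
            head = self.queue.popleft()
            self.arrived.discard(head)
            self.processed.append(head)


stage = StatefulStage()
stage.reserve("D")   # D reached the switch first
stage.reserve("E")
stage.deliver("E")   # E reaches the stateful stage first and must wait
waiting = list(stage.processed)
stage.deliver("D")
print(waiting, stage.processed)
```

Even though E reaches the stage before D, state is accessed in the order D, E, matching the single-pipeline behavior.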
Goals and Techniques

<table>
<thead>
<tr>
<th>Techniques</th>
<th>Functional Equivalence</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Stateless</td>
<td>Stateful</td>
</tr>
<tr>
<td>Replicate stateless processing</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<td>Limit stateful processing to single pipeline</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<td>Dynamic state sharding</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<td>Feed-forward pkt steering</td>
<td>✓</td>
<td>✗</td>
</tr>
<tr>
<td>Preemptive state access order enforcement</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>
Realistic Workloads & Applications

**Flowlet Switching**

**CONGA load balancing**

**Priority calculation for WFQ**

**Network Sequencer**
Summary

**Functional Equivalence**
Runtime behavior of program same as on a single large pipeline

**Performance Equivalence**
Program runs as close as possible to the rate of a single large pipeline, i.e., R, without violating functional equivalence

Dynamically shard shared state based on runtime state access pattern

Preemptively enforce state access order
Thank you!