When Idling is Ideal: Optimizing Tail-Latency for Highly-Dispersed Datacenter Workloads with Perséphone

Introduction

Request service times can be highly dispersed: some requests finish in about 2 µs, while others, such as SCAN and EVAL, can take 100+ µs or even 1+ ms. Under first-come, first-served scheduling, a short request can get stuck behind long ones (head-of-line blocking), which inflates its tail latency. Meanwhile, servers are rarely saturated: Google reports that its machines spend most of their time in the 10-50% utilization range. Related work focuses on multiplexing shared resources, for example through congestion control schemes and CPU schedulers.
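To make head-of-line blocking concrete, the arithmetic below assumes a single core serving one FCFS queue, with every request arriving at time 0 (a simplifying assumption, not a setup from the paper). Slowdown is latency divided by service time, and `fcfs_slowdowns` is a hypothetical helper written for this example.

```python
def fcfs_slowdowns(service_times_us):
    """Slowdown (latency / service time) of each request on one FCFS core,
    assuming every request arrives at t = 0 in the given order."""
    clock = 0.0
    slowdowns = []
    for s in service_times_us:
        clock += s  # a request completes only after everything ahead of it
        slowdowns.append(clock / s)
    return slowdowns

# One 1 ms request (e.g. a SCAN) followed by two 2 µs requests.
print(fcfs_slowdowns([1000, 2, 2]))  # [1.0, 501.0, 502.0]
```

The long request barely notices the queue, while each short request is slowed down roughly 500x purely by waiting.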

Key insight: Perséphone exploits the parallelism and abundance of cores on modern servers to tackle the tail-latency problem. It lets applications define request classifiers and uses these classifiers to profile the workload dynamically. It then reserves some CPU cores for short requests so that they do not queue behind long ones.
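As an illustration of an application-defined classifier, the sketch below assumes a hypothetical text protocol in which the first token of each request names its command. The names `classify` and `KNOWN_TYPES`, and the GET/PUT types, are invented for this example and are not Perséphone's API; SCAN and EVAL are the long types mentioned above. Unrecognized requests fall into a catch-all `UNKNOWN` type, which is also an assumption of this sketch.

```python
KNOWN_TYPES = {"GET", "PUT", "SCAN", "EVAL"}  # hypothetical request types

def classify(payload: bytes) -> str:
    """Hypothetical classifier: the request type is the first
    whitespace-separated token of the payload, e.g. b"SCAN user:1 100"."""
    parts = payload.split(maxsplit=1)
    if not parts:
        return "UNKNOWN"  # empty or all-whitespace payload
    rtype = parts[0].decode("ascii", errors="replace").upper()
    return rtype if rtype in KNOWN_TYPES else "UNKNOWN"

print(classify(b"scan user:1 100"))  # SCAN
```

Keeping the classifier this cheap matters because it runs on every incoming packet before any scheduling decision is made.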

DARC Scheduling

Goal: protect short requests at backend servers by identifying each request's type, estimating the CPU demand of each type, and dedicating enough cores to satisfy that demand.

Challenges:

Predicting how long each request type occupies a CPU
- low-overhead workload profiling
- queuing delay monitoring
Partitioning CPU resources among types while retaining the ability to absorb bursts of arrivals and minimizing CPU waste
- enabling cycle stealing: shorter types can use idle cores reserved for longer types
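The profiling and queuing-delay monitoring challenges can be sketched with a per-type accumulator that costs a few additions per completed request. Everything here is an assumption for illustration: the class name `WindowProfiler`, the per-window layout, and especially the `needs_update` trigger (a type's mean queuing delay exceeding a fixed multiple of its mean service time) are hypothetical, not the rule Perséphone uses.

```python
from collections import defaultdict

class WindowProfiler:
    """Hypothetical per-type profiler for one measurement window."""

    def __init__(self):
        self.count = defaultdict(int)      # completed requests per type
        self.busy_us = defaultdict(float)  # summed service time per type
        self.wait_us = defaultdict(float)  # summed queuing delay per type

    def record(self, rtype, queued_us, service_us):
        """Called once per completed request (three additions)."""
        self.count[rtype] += 1
        self.busy_us[rtype] += service_us
        self.wait_us[rtype] += queued_us

    def profile(self):
        """Mean service time, share of requests and mean queuing delay per type."""
        total = sum(self.count.values())
        return {
            t: {
                "mean_service_us": self.busy_us[t] / n,
                "ratio": n / total,
                "mean_wait_us": self.wait_us[t] / n,
            }
            for t, n in self.count.items()
        }

    def needs_update(self, factor=10.0):
        """Hypothetical trigger: some type waits, on average, more than
        `factor` times its own mean service time."""
        return any(p["mean_wait_us"] > factor * p["mean_service_us"]
                   for p in self.profile().values())
```

A runtime could call `profile()` at the end of each window and recompute core reservations only when `needs_update()` fires, keeping the common path cheap.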

Scheduling model

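DARC reserves cores for each request type. Each type has its own queue, served in first-come, first-served (FCFS) order by the cores reserved for that type. Short requests may also steal cycles from idle cores reserved for longer types, but long requests never run on cores reserved for shorter types.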
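A single dispatcher pass under this model might look like the sketch below. It is an illustration, not Perséphone's code: the function name `dispatch`, the list-of-deques layout, and the choice to scan types shortest first are assumptions. Workers reserved for type `t` are tried first, then workers reserved for longer types (cycle stealing); cores reserved for shorter types are never borrowed.

```python
from collections import deque

def dispatch(queues, reserved, idle):
    """One dispatcher pass (sketch).
    queues:   list of deques of pending requests, index 0 = shortest type;
              each deque is served first-come, first-served.
    reserved: reserved[t] lists the worker ids reserved for type t.
    idle:     set of currently idle worker ids (updated in place).
    Returns the (worker, request) assignments made in this pass."""
    assignments = []
    for t, queue in enumerate(queues):  # shortest type first
        while queue:
            # Own reserved workers first, then workers of longer types.
            candidates = [w for group in reserved[t:] for w in group if w in idle]
            if not candidates:
                break  # stay queued rather than borrow a shorter type's core
            worker = candidates[0]
            idle.discard(worker)
            assignments.append((worker, queue.popleft()))
    return assignments

# Short requests steal both cores reserved for the long type.
print(dispatch([deque(["s1", "s2", "s3"]), deque(["l1"])], [[0], [1, 2]], {0, 1, 2}))
# [(0, 's1'), (1, 's2'), (2, 's3')]

# A long request waits even though core 0 (reserved for short requests) is idle.
print(dispatch([deque(), deque(["l1", "l2"])], [[0], [1]], {0, 1}))
# [(1, 'l1')]
```

The asymmetry is the point of the design: stealing in one direction keeps short requests fast under bursts, while refusing the other direction guarantees a short request never finds every core busy with a long one.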

Worker reservation

Group request types with similar service times. Compute the number of cores each group needs from its CPU demand, and assign at least one worker to every group. The fractional part of a group's demand can be handled by the “spillway” core or by stealing cycles from cores reserved for longer groups.
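The reservation step can be sketched as below, assuming a group's CPU demand is its mean service time multiplied by its share of arrivals, normalized over all groups and scaled by the worker count. The names `Group` and `reserve_workers`, the round-down rule, and giving rounding leftovers to the longest group are assumptions of this sketch; the spillway core is not modeled.

```python
import math
from dataclasses import dataclass

@dataclass
class Group:
    name: str
    mean_service_us: float  # profiled mean service time of the group
    ratio: float            # fraction of arrivals belonging to the group

def reserve_workers(groups, n_workers):
    """Split n_workers among groups in proportion to CPU demand
    (mean service time x arrival ratio). Every group gets at least one
    worker; demand is rounded down and leftover workers go to the longest
    group, whose cores shorter groups may steal.
    Assumes n_workers >= len(groups)."""
    ordered = sorted(groups, key=lambda g: g.mean_service_us)
    total = sum(g.mean_service_us * g.ratio for g in ordered)
    plan = {g.name: 1 for g in ordered}      # guaranteed minimum of one worker
    remaining = n_workers - len(ordered)
    for g in ordered:
        demand = g.mean_service_us * g.ratio / total * n_workers
        extra = min(max(0, math.floor(demand) - 1), remaining)
        plan[g.name] += extra
        remaining -= extra
    if ordered:
        plan[ordered[-1].name] += remaining  # leftovers from rounding down
    return plan

print(reserve_workers([Group("short", 1.0, 0.5), Group("long", 100.0, 0.5)], 14))
# {'short': 1, 'long': 13}
```

In this example the short group needs only about 0.14 of a core, yet it still gets a whole reserved worker; that rounding-up is the CPU waste DARC trades for protecting short requests.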

Perséphone architecture

Steps to dispatch requests:

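1. A networking worker takes packets from the network.
2. It applies the user-defined classifier to determine each request's type.
3. It stores requests in per-type queues.
4. The dispatcher assigns a queued request to a worker.
5. The worker processes the request.
6. The worker sends the response to the NIC.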
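The request path above can be walked through with a toy, single-threaded sketch. The function `run_pipeline`, its arguments, and the one-list-of-idle-workers-per-type layout are hypothetical; cycle stealing is omitted here to keep the data flow visible, and the real system runs these stages concurrently on separate cores.

```python
from collections import deque

def run_pipeline(packets, classify, type_order, idle_workers):
    """Toy walk through the request path:
    network -> classifier -> typed queues -> dispatcher -> worker -> NIC.
    type_order lists request types from shortest to longest;
    idle_workers maps each type to a list of its idle reserved workers.
    Assumes classify always returns a type listed in type_order."""
    # Steps 1-3: take packets off the network, classify, enqueue by type.
    queues = {t: deque() for t in type_order}
    for pkt in packets:
        queues[classify(pkt)].append(pkt)
    responses = []
    # Step 4: the dispatcher serves types shortest first, FCFS within a type.
    for t in type_order:
        while queues[t] and idle_workers[t]:
            worker = idle_workers[t].pop()
            req = queues[t].popleft()
            # Steps 5-6: the worker processes the request and replies via the NIC.
            responses.append((worker, b"OK " + req))
    return responses, queues
```

With one idle worker per type, a second short request simply stays in its queue until a worker frees up; it is never placed behind the long request.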

Evaluation

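Compared with d-FCFS (decentralized first-come, first-served) and c-FCFS (centralized first-come, first-served), DARC substantially reduces both the overall tail latency and the tail latency of short requests.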

Concept of work-conserving schedulers

In computing and communication systems, a work-conserving scheduler always tries to keep the scheduled resource(s) busy whenever there are submitted jobs ready to be scheduled. In contrast, a non-work-conserving scheduler may, in some cases, leave the scheduled resource(s) idle even though jobs are ready to run. DARC is deliberately non-work-conserving: a core reserved for short requests stays idle rather than run a long request, hence the title "When Idling is Ideal".
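The trade-off can be seen in a toy, non-preemptive simulation with two cores, two 100 µs requests arriving at t = 0 and one 2 µs request arriving at t = 1. The scenario and the helper `simulate` are invented for illustration and are not taken from the paper's evaluation. Each request, in arrival order, goes to the eligible core that frees up first, which for fully shared cores behaves like c-FCFS.

```python
def simulate(requests, worker_types):
    """Toy non-preemptive scheduler.
    requests: (arrival_us, type, service_us) tuples in arrival order.
    worker_types[i]: set of request types core i is allowed to run.
    Each request goes to the eligible core that becomes free first.
    Returns each request's latency in µs."""
    free_at = [0] * len(worker_types)
    latencies = []
    for arrival, rtype, service in requests:
        eligible = [i for i, types in enumerate(worker_types) if rtype in types]
        core = min(eligible, key=lambda i: free_at[i])
        start = max(arrival, free_at[core])
        free_at[core] = start + service
        latencies.append(free_at[core] - arrival)
    return latencies

reqs = [(0, "long", 100), (0, "long", 100), (1, "short", 2)]
both = {"short", "long"}
print(simulate(reqs, [both, both]))           # work-conserving: [100, 100, 101]
print(simulate(reqs, [{"short"}, {"long"}]))  # core 0 reserved: [100, 200, 2]
```

With shared cores, the short request waits behind both long ones. Reserving core 0 leaves it idle while the second long request waits, which doubles that request's latency but lets the short request finish in 2 µs instead of 101 µs.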