R-Pingmesh

R-Pingmesh is still under heavy development. Please do not use it in production.
A service-aware RoCE network monitoring and diagnostic system based on end-to-end active probing.
R-Pingmesh is a monitoring system for RDMA over Converged Ethernet (RoCE) networks. Based on research published at SIGCOMM 2024, it measures RoCE network performance end to end, so that network problems affecting distributed services can be detected quickly and localized precisely.
Why R-Pingmesh?
Modern data centers rely heavily on RoCE networks for high-performance computing workloads like distributed machine learning and storage systems. As these networks scale to tens of thousands of RNICs, traditional monitoring approaches fall short:
- Single-point failures can devastate entire training clusters
- Performance bottlenecks masquerade as network issues
- Troubleshooting becomes time-consuming and error-prone
- Service impact assessment remains largely guesswork
R-Pingmesh solves these challenges with active probing, precise measurements, and service-aware monitoring.
Key Capabilities
- Accurate RTT measurement using commodity RDMA NICs
- End-host processing delay separation from network latency
- Sub-microsecond precision with CQE timestamps
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'fontSize':'14px'}}}%%
sequenceDiagram
participant P as Prober
participant PN as Prober RNIC
participant N as RoCE Network
participant RN as Responder RNIC
participant R as Responder
Note over P,R: RTT Measurement Process
P->>P: T1: Application post send
P->>PN: Post probe packet
PN->>PN: T2: CQE send completion (HW timestamp)
PN->>N: Probe packet transmission
N->>RN: Network delivery
RN->>RN: T3: CQE receive (HW timestamp)
RN->>R: Deliver to application
R->>R: Process probe packet
R->>RN: Post ACK packet
RN->>RN: T4: CQE ACK send (HW timestamp)
RN->>N: ACK transmission
N->>PN: Network delivery
PN->>PN: T5: CQE ACK receive (HW timestamp)
PN->>P: Completion notification
P->>P: T6: Application poll complete
Note over P,R: Calculations
Note over P: Network RTT = (T5-T2) - (T4-T3)
Note over P: Prober Delay = (T6-T1) - (T5-T2)
Note over R: Responder Delay = T4-T3
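The three formulas in the diagram can be sketched in Go as follows. This is a minimal illustration, not the project's implementation: the `ProbeTimestamps` and `ProbeDelays` types and the `ComputeDelays` function are hypothetical names, and timestamps are modelled as offsets on whichever clock produced them. Each formula only subtracts timestamps taken on the same clock, so the prober host, prober RNIC, and responder RNIC clocks do not need to be synchronized.

```go
package main

import (
	"fmt"
	"time"
)

// ProbeTimestamps holds T1..T6 from the diagram (hypothetical type).
// T1/T6 come from the prober host clock, T2/T5 from the prober RNIC clock,
// and T3/T4 from the responder RNIC clock.
type ProbeTimestamps struct {
	T1, T2, T3, T4, T5, T6 time.Duration
}

// ProbeDelays is the decomposition of one probe's end-to-end latency.
type ProbeDelays struct {
	NetworkRTT     time.Duration
	ProberDelay    time.Duration
	ResponderDelay time.Duration
}

// ComputeDelays applies the three formulas from the diagram.
func ComputeDelays(ts ProbeTimestamps) ProbeDelays {
	responder := ts.T4 - ts.T3 // time the probe spent on the responder
	wire := ts.T5 - ts.T2      // send completion to ACK arrival on the prober RNIC
	return ProbeDelays{
		NetworkRTT:     wire - responder,
		ProberDelay:    (ts.T6 - ts.T1) - wire,
		ResponderDelay: responder,
	}
}

func main() {
	us := time.Microsecond
	d := ComputeDelays(ProbeTimestamps{
		T1: 100 * us, T6: 117 * us, // prober host clock
		T2: 902 * us, T5: 914 * us, // prober RNIC clock
		T3: 5000 * us, T4: 5004 * us, // responder RNIC clock
	})
	fmt.Println("network RTT:    ", d.NetworkRTT)
	fmt.Println("prober delay:   ", d.ProberDelay)
	fmt.Println("responder delay:", d.ResponderDelay)
}
```

With the sample values in `main`, the probe spends 12 µs between T2 and T5, of which 4 µs is responder processing, leaving 8 µs of network RTT and 5 µs of prober-side delay.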
Intelligent Problem Detection (WIP)
- RNIC vs. network failure distinction through ToR-mesh probing
- Real-time anomaly detection with minimal false positives
- Service impact assessment to prioritize critical issues
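One way to picture the RNIC-versus-network distinction is sketched below. It assumes, as a simplification, that an RNIC is suspect when most of its peers under the same ToR time out while probing it, whereas an isolated timeout more likely points elsewhere. The `ProbeOutcome` type, the `SuspectRNICs` function, and the 0.8 threshold are hypothetical and not taken from the project.

```go
package main

import (
	"fmt"
	"sort"
)

// ProbeOutcome is one ToR-mesh probe result (hypothetical shape).
type ProbeOutcome struct {
	Prober  string // identifier of the probing RNIC
	Target  string // identifier of the probed RNIC
	Timeout bool   // true if the probe was not answered in time
}

// SuspectRNICs returns, in sorted order, every target for which at least
// minFraction of its distinct probers observed a timeout.
func SuspectRNICs(results []ProbeOutcome, minFraction float64) []string {
	// target -> prober -> whether that prober saw any timeout
	seen := map[string]map[string]bool{}
	for _, r := range results {
		if seen[r.Target] == nil {
			seen[r.Target] = map[string]bool{}
		}
		seen[r.Target][r.Prober] = seen[r.Target][r.Prober] || r.Timeout
	}
	var suspects []string
	for target, probers := range seen {
		failed := 0
		for _, timedOut := range probers {
			if timedOut {
				failed++
			}
		}
		if float64(failed)/float64(len(probers)) >= minFraction {
			suspects = append(suspects, target)
		}
	}
	sort.Strings(suspects)
	return suspects
}

func main() {
	results := []ProbeOutcome{
		{"rnic-a", "rnic-b", false}, {"rnic-b", "rnic-a", false},
		{"rnic-a", "rnic-c", true}, {"rnic-b", "rnic-c", true},
		{"rnic-c", "rnic-a", true}, {"rnic-c", "rnic-b", true},
	}
	fmt.Println(SuspectRNICs(results, 0.8))
}
```

In the sample data only `rnic-c` is unreachable from every peer, so it is the only suspect; `rnic-a` and `rnic-b` each see a timeout from a single prober and stay below the threshold.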
Service-Aware Monitoring (WIP)
- Automatic service flow discovery using eBPF tracing
- Path-specific probing following actual service traffic
- 5-tuple aware measurements for ECMP environments
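The reason 5-tuple awareness matters under ECMP can be illustrated with a toy path selector. Switches commonly pick among equal-cost next hops by hashing packet header fields, so a probe whose UDP source port differs from the service flow's may take a different path. The sketch below uses FNV-1a purely as a stand-in (real switch hash functions are vendor-specific), and `FiveTuple`, `NewRoCEFlow`, and `ECMPIndex` are hypothetical names. The destination port 4791 is the UDP port assigned to RoCEv2.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"net/netip"
)

// FiveTuple identifies a flow as seen by an ECMP-hashing switch.
type FiveTuple struct {
	SrcIP, DstIP     netip.Addr
	SrcPort, DstPort uint16
	Proto            uint8
}

// NewRoCEFlow builds a RoCEv2 flow: UDP (protocol 17) to port 4791.
// It panics on malformed addresses, which is acceptable for a sketch.
func NewRoCEFlow(src, dst string, srcPort uint16) FiveTuple {
	return FiveTuple{
		SrcIP:   netip.MustParseAddr(src),
		DstIP:   netip.MustParseAddr(dst),
		SrcPort: srcPort,
		DstPort: 4791,
		Proto:   17,
	}
}

// ECMPIndex maps a flow to one of nPaths equal-cost paths.
// The hash is an illustrative stand-in, not any switch's actual algorithm.
func ECMPIndex(t FiveTuple, nPaths int) int {
	h := fnv.New32a()
	h.Write(t.SrcIP.AsSlice())
	h.Write(t.DstIP.AsSlice())
	var b [5]byte
	binary.BigEndian.PutUint16(b[0:2], t.SrcPort)
	binary.BigEndian.PutUint16(b[2:4], t.DstPort)
	b[4] = t.Proto
	h.Write(b[:])
	return int(h.Sum32() % uint32(nPaths))
}

func main() {
	service := NewRoCEFlow("10.0.0.1", "10.0.1.1", 50000)
	sameTuple := NewRoCEFlow("10.0.0.1", "10.0.1.1", 50000)
	otherPort := NewRoCEFlow("10.0.0.1", "10.0.1.1", 50001)
	fmt.Println("service flow path:", ECMPIndex(service, 4))
	fmt.Println("probe, same tuple:", ECMPIndex(sameTuple, 4))
	fmt.Println("probe, other port:", ECMPIndex(otherPort, 4))
}
```

A probe that reuses the service flow's tuple always hashes to the same path index, while a probe with another source port may or may not, which is why path-specific probing has to match the tuple of the traffic it follows.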
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'fontSize':'14px'}}}%%
flowchart TD
subgraph "Service Discovery Process"
A[Application creates RDMA connection] --> B[eBPF hooks modify_qp call]
B --> C{QP State = RTR?}
C -->|Yes| D[Extract 5-tuple:<br/>Src/Dst GID, Src/Dst QPN]
C -->|No| E[Ignore event]
D --> F[Send event to userspace via ring buffer]
F --> G[Agent receives connection event]
G --> H[Query Controller for target RNIC info]
H --> I[Start service-specific probing]
I --> J[Monitor actual service path]
end
subgraph "Monitoring Modes Comparison"
direction LR
K[Cluster Monitoring<br/>• Always-on<br/>• ToR-mesh coverage<br/>• Network health]
L[Service Tracing<br/>• Dynamic<br/>• Follows real traffic<br/>• Service-aware]
end
style A fill:#E8F5E8
style D fill:#FFF3E0
style I fill:#E3F2FD
style J fill:#F3E5F5
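The filtering step in the flowchart (act only on transitions to RTR, then extract the connection identifiers) might look like the following on the Agent side. This is a sketch under assumptions: the `QPEvent` layout is a hypothetical stand-in for whatever the eBPF program writes to the ring buffer, and `FlowKey` and `FlowFromEvent` are illustrative names. The constant 2 is the value of `IBV_QPS_RTR` in libibverbs' `enum ibv_qp_state`; for a connected QP, the remote QPN and destination GID have been supplied by the time it enters that state.

```go
package main

import (
	"fmt"
	"net/netip"
)

// qpStateRTR mirrors IBV_QPS_RTR from libibverbs' enum ibv_qp_state.
const qpStateRTR = 2

// QPEvent is a hypothetical record emitted by the eBPF program on modify_qp.
type QPEvent struct {
	State          uint32
	SrcGID, DstGID [16]byte
	SrcQPN, DstQPN uint32
}

// FlowKey identifies a service connection to be traced.
type FlowKey struct {
	SrcGID, DstGID netip.Addr
	SrcQPN, DstQPN uint32
}

// FlowFromEvent returns the flow described by ev, or false if the event is
// not a transition to RTR and should be ignored.
func FlowFromEvent(ev QPEvent) (FlowKey, bool) {
	if ev.State != qpStateRTR {
		return FlowKey{}, false
	}
	return FlowKey{
		// RoCEv2 GIDs for IPv4 addresses are IPv4-mapped IPv6; Unmap yields the IPv4 form.
		SrcGID: netip.AddrFrom16(ev.SrcGID).Unmap(),
		DstGID: netip.AddrFrom16(ev.DstGID).Unmap(),
		SrcQPN: ev.SrcQPN,
		DstQPN: ev.DstQPN,
	}, true
}

func main() {
	ev := QPEvent{
		State:  qpStateRTR,
		SrcGID: [16]byte{10: 0xff, 11: 0xff, 12: 10, 15: 1}, // ::ffff:10.0.0.1
		DstGID: [16]byte{10: 0xff, 11: 0xff, 12: 10, 15: 2}, // ::ffff:10.0.0.2
		SrcQPN: 100,
		DstQPN: 200,
	}
	if key, ok := FlowFromEvent(ev); ok {
		fmt.Printf("trace %s (QPN %d) -> %s (QPN %d)\n", key.SrcGID, key.SrcQPN, key.DstGID, key.DstQPN)
	}
}
```

The destination GID in the resulting key is what the Agent would then hand to the Controller to resolve the target RNIC.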
Architecture
R-Pingmesh consists of three core components.
%%{init: {'theme':'base', 'themeVariables': {'primaryTextColor':'#333333', 'lineColor':'#666666', 'fontSize':'16px'}}}%%
flowchart TD
%% Agent Layer
A1["Agent (Host 1)<br/>────────────────<br/>RDMA Manager<br/>eBPF Service Tracer<br/>Active Probing Engine<br/>Path Tracer<br/>Controller Client<br/>Upload Client"]
A1_HW["Hardware Layer<br/>────────────────<br/>RDMA Hardware<br/>UD Queue Pairs<br/>CQE Timestamps"]
A1_KERNEL["Kernel Layer<br/>────────────────<br/>eBPF Programs<br/>modify_qp/destroy_qp<br/>Ring Buffer Events"]
AN["Agent (Host N)<br/>────────────────<br/>Core Modules<br/>RDMA Hardware<br/>eBPF Programs"]
%% Network Infrastructure
NET["RoCE Network<br/>────────────────<br/>RoCE Fabric<br/>ToR Switches<br/>Spine Switches<br/>Active Probing Paths"]
%% Controller
C["Controller<br/>────────────────<br/>RNIC Registry<br/>Pinglist Generator<br/>gRPC Server<br/>Configuration Manager"]
C_DB["Controller Storage<br/>────────────────<br/>RNIC Database<br/>GID → RNIC Info<br/>ToR ID → RNIC List"]
%% Analyzer
AZ["Analyzer<br/>────────────────<br/>Data Ingestion API<br/>Anomaly Detection<br/>Root Cause Analysis<br/>SLA Tracker"]
%% Monitoring Capabilities
MONITORING["Monitoring Modes<br/>────────────────<br/>• Cluster Monitoring<br/> (ToR-mesh, Inter-ToR)<br/>• Service Tracing<br/> (eBPF Flow Discovery)<br/>• Path Tracing<br/> (Network Topology)<br/>• Anomaly Detection<br/> (RNIC vs Network)"]
%% OpenTelemetry Integration
OTLP["OpenTelemetry (OTLP)<br/>────────────────<br/>RTT Metrics Export"]
%% Vertical Flow
A1 --> A1_HW
A1 --> A1_KERNEL
A1_HW --> NET
A1_KERNEL --> NET
AN --> NET
NET --> C
C --> C_DB
C --> AZ
AZ --> MONITORING
A1 -.-> MONITORING
AN -.-> MONITORING
%% OpenTelemetry Integration
A1 -->|"OTLP Export<br/>RTT Metrics"| OTLP
AN -->|"OTLP Export"| OTLP
OTLP --> MONITORING
%% Communication Labels
A1 -.->|"Active Probing<br/>RTT Measurement"| NET
AN -.->|"Active Probing"| NET
A1 <-.->|"gRPC Registration<br/>Pinglists"| C
AN <-.->|"gRPC"| C
A1 -->|"gRPC Upload<br/>Probe Results"| AZ
AN -->|"Data Upload"| AZ
%% Styling
classDef agentClass fill:#4CAF50,stroke:#2E7D32,stroke-width:3px,color:#fff,font-weight:bold
classDef controllerClass fill:#2196F3,stroke:#1565C0,stroke-width:3px,color:#fff,font-weight:bold
classDef analyzerClass fill:#FF9800,stroke:#E65100,stroke-width:3px,color:#fff,font-weight:bold
classDef networkClass fill:#E3F2FD,stroke:#1976D2,stroke-width:3px,color:#1976D2,font-weight:bold
classDef storageClass fill:#9E9E9E,stroke:#424242,stroke-width:3px,color:#fff,font-weight:bold
classDef monitoringClass fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px,color:#7B1FA2,font-weight:bold
classDef otlpClass fill:#E8F5E8,stroke:#4CAF50,stroke-width:3px,color:#2E7D32,font-weight:bold
class A1,AN agentClass
class C controllerClass
class AZ analyzerClass
class NET networkClass
class A1_HW,A1_KERNEL,C_DB storageClass
class MONITORING monitoringClass
class OTLP otlpClass
Agent
Deployed on every RoCE host, the Agent performs:
- Active probing using UD Queue Pairs
- Service flow monitoring via eBPF programs
- Path tracing for network topology discovery
- Real-time measurements with hardware timestamps
Controller
Centralized coordination service providing:
- RNIC registry management
- Pinglist generation (ToR-mesh and Inter-ToR)
- Target resolution for service tracing
- Configuration distribution
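Pinglist generation can be pictured as two filters over the RNIC registry. The sketch below is a simplification under stated assumptions: the `RNIC` type, `TorMeshPinglist`, `InterTorPinglist`, and the `perTor` sampling parameter are all hypothetical, and the Controller's actual selection strategy may differ. ToR-mesh targets are taken to be every other RNIC under the same ToR; Inter-ToR targets are a bounded sample from each other ToR.

```go
package main

import "fmt"

// RNIC is a minimal registry entry (hypothetical shape).
type RNIC struct {
	GID   string
	TorID string
}

// TorMeshPinglist returns every other RNIC under the same ToR as self.
func TorMeshPinglist(self RNIC, all []RNIC) []RNIC {
	var targets []RNIC
	for _, r := range all {
		if r.TorID == self.TorID && r.GID != self.GID {
			targets = append(targets, r)
		}
	}
	return targets
}

// InterTorPinglist returns up to perTor RNICs from each ToR other than
// self's, taking them in registry order.
func InterTorPinglist(self RNIC, all []RNIC, perTor int) []RNIC {
	picked := map[string]int{} // ToR ID -> number of targets chosen so far
	var targets []RNIC
	for _, r := range all {
		if r.TorID == self.TorID || picked[r.TorID] >= perTor {
			continue
		}
		picked[r.TorID]++
		targets = append(targets, r)
	}
	return targets
}

func main() {
	registry := []RNIC{
		{"gid-a1", "tor-1"}, {"gid-a2", "tor-1"},
		{"gid-b1", "tor-2"}, {"gid-b2", "tor-2"},
		{"gid-c1", "tor-3"},
	}
	self := registry[0]
	fmt.Println("ToR-mesh: ", TorMeshPinglist(self, registry))
	fmt.Println("Inter-ToR:", InterTorPinglist(self, registry, 1))
}
```

Sampling by registry order keeps the example deterministic; a real generator would more likely randomize or rotate the Inter-ToR sample so that coverage spreads across links over time.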
Analyzer
Advanced analytics engine delivering:
- Anomaly detection and root cause analysis
- SLA tracking and performance trending
- Service impact assessment
- Alert generation and escalation
Communication Flow
- Agent → Controller (gRPC):
  - Agent registers RNICs with Controller on startup
  - Agent requests Pinglists for Cluster Monitoring (ToR-mesh, Inter-ToR)
  - Agent requests target RNIC information for Service Tracing
- Agent → Analyzer (gRPC):
  - Agent uploads probe results (RTT, delays, timeouts)
  - Agent uploads path trace information
  - Agent uploads aggregated local statistics
- Controller → Agent (gRPC responses):
  - Controller provides Pinglists and RNIC information based on Agent requests
Technical Stack
- Go: with Cgo for RDMA integration
- RDMA Verbs: libibverbs Cgo wrapper for low-level RDMA operations
- gRPC: for communication between components
- RQLite: database for the Controller (https://rqlite.io/)
- OpenTelemetry: for probe metrics instrumentation
- eBPF: cilium/ebpf library for service flow monitoring
Quick Start
Prerequisites
- Linux kernel 5.8+ with eBPF support
- RDMA-capable network interfaces
- Docker (recommended) or native Go 1.25+ environment
- Root privileges or appropriate capabilities
TBD
Monitoring Modes
Cluster Monitoring
Continuous network health assessment across the entire RoCE cluster:
- ToR-mesh probing: Detects faulty RNICs and local issues
- Inter-ToR probing: Monitors switch and link health
- Always-on operation: Independent of running services
- Comprehensive SLA tracking: RTT, packet loss, and processing delays
Service Tracing (WIP)
Dynamic monitoring of active service communications:
- Automatic flow discovery: eBPF-based connection tracking
- Path-specific measurements: Follows actual service traffic
- Service impact correlation: Links network issues to service performance
- Real-time adaptation: Adjusts to changing service patterns
Configuration
Agent Configuration
# agent.yaml
controller:
  address: "controller.example.com:8080"
analyzer:
  address: "analyzer.example.com:8081"
probing:
  interval: "1s"
  timeout: "5s"
ebpf:
  enabled: true
  buffer_size: 1024
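For readers wiring this file into Go, the structure above could map onto types like the following. The type names and the `Validate` method are hypothetical, the struct tags only mirror the keys shown in `agent.yaml`, and decoding the file with a YAML library is deliberately left out so the sketch stays within the standard library. The validation rules (non-empty addresses, positive durations, positive buffer size when eBPF is enabled) are assumptions, not documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Endpoint is a host:port address of another component.
type Endpoint struct {
	Address string `yaml:"address"`
}

// ProbingConfig keeps durations as strings, as they appear in the file.
type ProbingConfig struct {
	Interval string `yaml:"interval"`
	Timeout  string `yaml:"timeout"`
}

// EBPFConfig controls the service tracer.
type EBPFConfig struct {
	Enabled    bool `yaml:"enabled"`
	BufferSize int  `yaml:"buffer_size"`
}

// AgentConfig mirrors the keys of agent.yaml (hypothetical type).
type AgentConfig struct {
	Controller Endpoint      `yaml:"controller"`
	Analyzer   Endpoint      `yaml:"analyzer"`
	Probing    ProbingConfig `yaml:"probing"`
	EBPF       EBPFConfig    `yaml:"ebpf"`
}

// Validate checks the assumed constraints described in the lead-in.
func (c AgentConfig) Validate() error {
	if c.Controller.Address == "" || c.Analyzer.Address == "" {
		return errors.New("controller and analyzer addresses are required")
	}
	fields := []struct{ name, value string }{
		{"probing.interval", c.Probing.Interval},
		{"probing.timeout", c.Probing.Timeout},
	}
	for _, f := range fields {
		d, err := time.ParseDuration(f.value)
		if err != nil {
			return fmt.Errorf("%s: %w", f.name, err)
		}
		if d <= 0 {
			return fmt.Errorf("%s must be positive, got %q", f.name, f.value)
		}
	}
	if c.EBPF.Enabled && c.EBPF.BufferSize <= 0 {
		return errors.New("ebpf.buffer_size must be positive when ebpf is enabled")
	}
	return nil
}

func main() {
	cfg := AgentConfig{
		Controller: Endpoint{Address: "controller.example.com:8080"},
		Analyzer:   Endpoint{Address: "analyzer.example.com:8081"},
		Probing:    ProbingConfig{Interval: "1s", Timeout: "5s"},
		EBPF:       EBPFConfig{Enabled: true, BufferSize: 1024},
	}
	fmt.Println("validation error:", cfg.Validate())
}
```

Keeping `interval` and `timeout` as strings and parsing them with `time.ParseDuration` means values such as "1s" or "500ms" are rejected early if mistyped, rather than surfacing as a zero duration at probe time.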
Controller Configuration
# controller.yaml
server:
  address: ":8080"
database:
  type: "sqlite"
  path: "/data/controller.db"
pinglist:
  tor_mesh_size: 10
  inter_tor_coverage: 0.1
R-Pingmesh is designed for production environments with minimal overhead:
- CPU Usage: <1% per RNIC under normal load
- Memory Footprint: ~50MB per Agent instance
- Network Overhead: <0.1% of link capacity
- Measurement Accuracy: Sub-microsecond precision
- Scalability: Tested with 10,000+ RNICs
Research Foundation
R-Pingmesh is based on the research paper:
Kefei Liu, Zhuo Jiang, Jiao Zhang, Shixian Guo, Xuan Zhang, Yangyang Bai, Yongbin Dong, Feng Luo, Zhang Zhang, Lei Wang, Xiang Shi, Haohan Xu, Yang Bai, Dongyang Song, Haoran Wei, Bo Li, Yongchen Pan, Tian Pan, Tao Huang, "R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System", the 38th annual conference of the ACM Special Interest Group on Data Communication (SIGCOMM), 2024.
Key innovations include:
- Novel timestamp-based RTT measurement using CQE events
- ToR-mesh probing for RNIC anomaly detection
- eBPF-based service flow discovery with minimal overhead
- Service-aware impact assessment methodology
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone repository
git clone https://github.com/yuuki/rpingmesh.git
cd rpingmesh
# Run tests
make test
# Build and test locally
make build-local
make test-local
Documentation
License
This project is licensed under the MIT License - see the LICENSE file for details.
The eBPF programs in internal/ebpf/bpf/ are dual-licensed under MIT and GPLv2.
Acknowledgments
- The original R-Pingmesh research team
- The Go, RDMA, eBPF, and Linux communities