metadata-center

Published: Nov 7, 2025 License: Apache-2.0

README

Metadata Center

A near real-time load metric collection component, designed for intelligent inference schedulers in large-scale inference services.


English | 中文

Status

Early stage, under rapid development

Background

Load metrics are very important for an LLM inference scheduler.

Typically, the following four load metrics matter most (at the per-engine level):

  1. Total number of requests
  2. Token usage (KVCache usage)
  3. Number of requests in Prefill
  4. Prompt length in Prefill
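As a sketch, these four per-engine metrics could be tracked with atomic counters. The struct and field names below are illustrative assumptions, not this component's actual API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// EngineMetrics holds the four per-engine load metrics as atomic counters.
// (Illustrative sketch; names are assumptions, not the real API.)
type EngineMetrics struct {
	TotalRequests    atomic.Int64 // 1. total number of in-flight requests
	TokenUsage       atomic.Int64 // 2. token usage (KVCache usage)
	PrefillRequests  atomic.Int64 // 3. number of requests in prefill
	PrefillPromptLen atomic.Int64 // 4. total prompt length in prefill
}

func main() {
	var m EngineMetrics
	// A new request with a 128-token prompt arrives at this engine.
	m.TotalRequests.Add(1)
	m.PrefillRequests.Add(1)
	m.PrefillPromptLen.Add(128)
	fmt.Println(m.TotalRequests.Load(), m.PrefillRequests.Load(), m.PrefillPromptLen.Load())
	// → 1 1 128
}
```

Atomic counters let many request-handling goroutines update the metrics without a lock.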

Timeliness is critical in large-scale services. Stale metrics lead to races: the scheduler may keep choosing the same inference engine before its load metrics are updated.

Polling metrics from engines introduces a fixed periodic delay. Especially in large-scale scenarios, these races grow significantly as QPS (throughput) increases.

Architecture


Cooperating with an Inference Gateway (e.g. AIGW), we can achieve near real-time load metric collection via the following steps:

  1. Request proxied to the Inference Engine:

    a. prefill & total request number: +1

    b. prefill prompt length: +prompt-length

  2. First token responded:

    a. prefill request number: -1

    b. prefill prompt length: -prompt-length

  3. Request done:

    a. total request number: -1
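The three steps above can be sketched as lifecycle hooks that adjust the counters on each event, so the metrics stay current without polling. This is a minimal illustration; the method names are assumptions, not the component's real API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// EngineMetrics mirrors the per-engine counters described above (sketch).
type EngineMetrics struct {
	TotalRequests    atomic.Int64
	PrefillRequests  atomic.Int64
	PrefillPromptLen atomic.Int64
}

// Step 1: request proxied to the Inference Engine.
func (m *EngineMetrics) OnRequestStart(promptLen int64) {
	m.TotalRequests.Add(1)
	m.PrefillRequests.Add(1)
	m.PrefillPromptLen.Add(promptLen)
}

// Step 2: first token responded; the request has left the prefill phase.
func (m *EngineMetrics) OnFirstToken(promptLen int64) {
	m.PrefillRequests.Add(-1)
	m.PrefillPromptLen.Add(-promptLen)
}

// Step 3: request done.
func (m *EngineMetrics) OnRequestDone() {
	m.TotalRequests.Add(-1)
}

func main() {
	var m EngineMetrics
	m.OnRequestStart(256)
	m.OnFirstToken(256)
	m.OnRequestDone()
	fmt.Println(m.TotalRequests.Load(), m.PrefillRequests.Load(), m.PrefillPromptLen.Load())
	// → 0 0 0: every counter returns to zero after a full request lifecycle
}
```

Because each step's increments are exactly undone by the later steps, the counters are self-balancing: a completed request leaves no residue in the metrics.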

Furthermore, a CAS (compare-and-swap) API could be introduced to reduce races further, if required in the future.

๐Ÿ“š Documentation

๐Ÿ“œ License

This project is licensed under Apache 2.0.
