collector

package
v0.11.0
Published: Sep 2, 2025 License: GPL-3.0 Imports: 80 Imported by: 0

Documentation

Overview

Package collector implements the different collectors of the exporter.

Constants

const CEEMSExporterAppName = "ceems_exporter"

CEEMSExporterAppName is kingpin app name.

const Namespace = "ceems"

Namespace defines the common namespace to be used by all metrics.

Variables

var CEEMSExporterApp = *kingpin.New(
	CEEMSExporterAppName,
	"Prometheus Exporter and Pyroscope client to export compute (job, VM, pod) resource usage and ebpf based profiling metrics.",
)

CEEMSExporterApp is kingpin CLI app.

var (
	ErrIPMIUnavailable = errors.New("ipmi dcmi power readings are not active")
)

Custom errors.

var ErrNoData = errors.New("collector returned no data")

ErrNoData indicates the collector found no data to collect, but had no other error.

Functions

func DisableDefaultCollectors

func DisableDefaultCollectors()

DisableDefaultCollectors sets the collector state to false for all collectors which have not been explicitly enabled on the command line.

func IsNoDataError

func IsNoDataError(err error) bool

IsNoDataError returns true if error is ErrNoData.
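
The sketch below illustrates how a caller might treat "no data" differently from a real failure. It is self-contained and uses local stand-ins (`errNoData`, `isNoData`, `update`) rather than the package's own symbols; matching wrapped errors via `errors.Is` is an assumption, not a statement about how IsNoDataError is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoData is a local stand-in for collector.ErrNoData.
var errNoData = errors.New("collector returned no data")

// isNoData is a hypothetical equivalent of collector.IsNoDataError.
// Using errors.Is (which also matches wrapped errors) is an assumption.
func isNoData(err error) bool {
	return errors.Is(err, errNoData)
}

// update simulates a collector update that may have nothing to report.
func update(haveData bool) error {
	if !haveData {
		return fmt.Errorf("example collector: %w", errNoData)
	}
	return nil
}

func main() {
	for _, haveData := range []bool{true, false} {
		err := update(haveData)
		switch {
		case err == nil:
			fmt.Println("update succeeded")
		case isNoData(err):
			fmt.Println("nothing to collect, not a failure:", err)
		default:
			fmt.Println("update failed:", err)
		}
	}
}
```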

func KernelStringToNumeric

func KernelStringToNumeric(ver string) int64

KernelStringToNumeric converts the kernel version string into a numerical value that can be used for comparisons.
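
To illustrate why a numeric form is useful, here is a minimal sketch of one possible encoding. The helper `kernelToNumeric` is hypothetical and is not the package's implementation; the layout `major<<16 | minor<<8 | patch` (with the patch level capped at 255) is an assumption modelled on the Linux kernel's `KERNEL_VERSION` macro.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelToNumeric is a hypothetical stand-in for KernelStringToNumeric.
// It packs major.minor.patch as major<<16 | minor<<8 | patch so that
// versions can be compared as plain integers.
func kernelToNumeric(ver string) int64 {
	// Drop any distribution suffix such as "-427.el9.x86_64" or "+".
	if i := strings.IndexAny(ver, "-+ "); i >= 0 {
		ver = ver[:i]
	}
	parts := strings.SplitN(ver, ".", 4)

	var nums [3]int64
	for i := 0; i < len(parts) && i < 3; i++ {
		n, err := strconv.ParseInt(parts[i], 10, 64)
		if err != nil {
			break // leave the remaining components at zero
		}
		nums[i] = n
	}
	// Cap the patch level so that it cannot overflow into the minor field.
	if nums[2] > 255 {
		nums[2] = 255
	}
	return nums[0]<<16 | nums[1]<<8 | nums[2]
}

func main() {
	// A string comparison would order "5.4.0" after "5.14.0";
	// the numeric form orders them correctly.
	fmt.Println(kernelToNumeric("5.4.0") < kernelToNumeric("5.14.0-427.el9.x86_64")) // prints "true"
}
```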

func KernelVersion

func KernelVersion() (int64, error)

KernelVersion returns kernel version of current host.

func NewCgroupCollector

func NewCgroupCollector(logger *slog.Logger, cgManager *cgroupManager, opts cgroupOpts) (*cgroupCollector, error)

NewCgroupCollector returns a new cgroupCollector exposing a summary of cgroups.

func NewCgroupManager

func NewCgroupManager(name manager, logger *slog.Logger) (*cgroupManager, error)

NewCgroupManager returns an instance of cgroupManager based on resource manager.

func NewEbpfCollector

func NewEbpfCollector(logger *slog.Logger, cgManager *cgroupManager) (*ebpfCollector, error)

NewEbpfCollector returns a new instance of ebpf collector.

func NewGokitLogger

func NewGokitLogger(lvl string, logger *slog.Logger) log.Logger

NewGokitLogger creates a new Go-kit logger from slog.Logger.

func NewPerfCollector

func NewPerfCollector(logger *slog.Logger, cgManager *cgroupManager) (*perfCollector, error)

NewPerfCollector returns a new perf-based collector that creates a profiler per compute unit.

func NewRDMACollector

func NewRDMACollector(logger *slog.Logger, cgManager *cgroupManager) (*rdmaCollector, error)

NewRDMACollector returns a new Collector exposing RDMA metrics.

func RegisterCollector

func RegisterCollector(
	collector string,
	isDefaultEnabled bool,
	factory func(logger *slog.Logger) (Collector, error),
)

RegisterCollector registers collector into collector factory.
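
The registration pattern can be sketched as a name-to-factory map with a default-enabled flag. Everything below is a simplified, self-contained illustration: the `collector` interface, the `register` and `build` helpers and the registry maps are hypothetical stand-ins, and the real package may store and instantiate collectors differently.

```go
package main

import (
	"fmt"
	"log/slog"
	"sort"
)

// collector is a simplified stand-in for the package's Collector interface.
type collector interface {
	Name() string
}

// factoryFn mirrors the shape of the factory argument of RegisterCollector.
type factoryFn func(logger *slog.Logger) (collector, error)

// Hypothetical registry state: factories by name plus an enabled flag.
var (
	factories = map[string]factoryFn{}
	enabled   = map[string]bool{}
)

// register records a factory and its default state without building anything.
func register(name string, isDefaultEnabled bool, f factoryFn) {
	factories[name] = f
	enabled[name] = isDefaultEnabled
}

// named is a trivial collector used only for this illustration.
type named string

func (n named) Name() string { return string(n) }

// build instantiates every enabled collector in a stable (sorted) order.
func build(logger *slog.Logger) ([]collector, error) {
	names := make([]string, 0, len(factories))
	for name := range factories {
		if enabled[name] {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	out := make([]collector, 0, len(names))
	for _, name := range names {
		c, err := factories[name](logger)
		if err != nil {
			return nil, fmt.Errorf("creating collector %q: %w", name, err)
		}
		out = append(out, c)
	}
	return out, nil
}

func init() {
	register("cpu", true, func(*slog.Logger) (collector, error) { return named("cpu"), nil })
	register("rapl", false, func(*slog.Logger) (collector, error) { return named("rapl"), nil })
}

func main() {
	collectors, err := build(slog.Default())
	if err != nil {
		panic(err)
	}
	for _, c := range collectors {
		fmt.Println("enabled:", c.Name())
	}
}
```

Deferring construction to a factory means that disabled collectors never touch the system resources they would otherwise open.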

func SanitizeMetricName

func SanitizeMetricName(metricName string) string

SanitizeMetricName sanitizes the given metric name by replacing invalid characters with underscores.

OpenMetrics and the Prometheus exposition format require metric names to consist only of alphanumeric characters, "_" and ":", and to not start with a digit. Since colons in a MetricFamily name are reserved to signal that the MetricFamily is the result of a calculation or aggregation by a general-purpose monitoring system, colons are replaced as well.

Note: If not subsequently prepending a namespace and/or subsystem (e.g., with prometheus.BuildFQName), the caller must ensure that the supplied metricName does not begin with a digit.
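
A minimal sketch of the replacement rule described above, assuming a plain ASCII character class. The `sanitize` helper and the `invalidChars` pattern are hypothetical and only approximate what SanitizeMetricName does.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars matches every character other than ASCII letters, digits
// and "_". Colons are matched as well, following the description above.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitize is a hypothetical stand-in for SanitizeMetricName.
// It does not guard against a leading digit; see the note above.
func sanitize(name string) string {
	return invalidChars.ReplaceAllString(name, "_")
}

func main() {
	fmt.Println(sanitize("gpu:power-usage.watts")) // prints "gpu_power_usage_watts"
}
```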

func TargetsHandlerFor

func TargetsHandlerFor(discoverer Discoverer, opts promhttp.HandlerOpts) http.Handler

TargetsHandlerFor returns http.Handler for Alloy targets.

Types

type AMDASIC

type AMDASIC struct {
	NumCUs uint64 `json:"num_compute_units"`
}

type AMDBoard

type AMDBoard struct {
	Serial string `json:"product_serial"`
	Name   string `json:"product_name"`
}

type AMDBus

type AMDBus struct {
	BDF string `json:"bdf"`
}

type AMDGPU

type AMDGPU struct {
	ID        int64         `json:"gpu"`
	ASIC      *AMDASIC      `json:"asic"`
	Bus       *AMDBus       `json:"bus"`
	Board     *AMDBoard     `json:"board"`
	Partition *AMDPartition `json:"partition"`
}

type AMDNodeProperties

type AMDNodeProperties struct {
	DevID          uint64 // Unique ID for each physical GPU
	RenderID       uint64 // renderDx for each GPU (physical or partition)
	DevicePluginID string // ID used in k8s device plugin
	NumCUs         uint64
}

type AMDPartition

type AMDPartition struct {
	Compute string `json:"compute_partition"`
	Memory  string `json:"memory_partition"`
	ID      uint64 `json:"partition_id"`
}
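
The AMD types above carry JSON tags, so they can be filled with `encoding/json`. The sketch below uses local copies of the types and an illustrative payload shaped after those tags; the sample values and the `parseAMDGPUs` helper are assumptions, and the real tool output may contain more fields or differ in layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the exported types, so that the sketch is self-contained.
type AMDASIC struct {
	NumCUs uint64 `json:"num_compute_units"`
}

type AMDBus struct {
	BDF string `json:"bdf"`
}

type AMDPartition struct {
	Compute string `json:"compute_partition"`
	Memory  string `json:"memory_partition"`
	ID      uint64 `json:"partition_id"`
}

type AMDGPU struct {
	ID        int64         `json:"gpu"`
	ASIC      *AMDASIC      `json:"asic"`
	Bus       *AMDBus       `json:"bus"`
	Partition *AMDPartition `json:"partition"`
}

// sampleAMD is an illustrative payload, not captured tool output.
const sampleAMD = `[
  {"gpu": 0, "asic": {"num_compute_units": 304}, "bus": {"bdf": "0000:05:00.0"},
   "partition": {"compute_partition": "SPX", "memory_partition": "NPS1", "partition_id": 0}},
  {"gpu": 1, "asic": {"num_compute_units": 304}, "bus": {"bdf": "0000:26:00.0"}}
]`

// parseAMDGPUs is a hypothetical helper that decodes a JSON array of GPUs.
func parseAMDGPUs(data []byte) ([]AMDGPU, error) {
	var gpus []AMDGPU
	if err := json.Unmarshal(data, &gpus); err != nil {
		return nil, fmt.Errorf("decoding GPU list: %w", err)
	}
	return gpus, nil
}

func main() {
	gpus, err := parseAMDGPUs([]byte(sampleAMD))
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		// Sub-objects are pointers, so a missing key leaves them nil.
		if g.Bus == nil || g.ASIC == nil {
			continue
		}
		fmt.Printf("gpu %d at %s with %d CUs, partitioned: %t\n",
			g.ID, g.Bus.BDF, g.ASIC.NumCUs, g.Partition != nil)
	}
}
```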

type Address

type Address struct {
	XMLName  xml.Name `xml:"address"`
	UUID     string   `xml:"uuid,attr"`
	Type     string   `xml:"type,attr"`
	Domain   string   `xml:"domain,attr"`
	Bus      string   `xml:"bus,attr"`
	Slot     string   `xml:"slot,attr"`
	Function string   `xml:"function,attr"`
}

type BusID

type BusID struct {
	// contains filtered or unexported fields
}

BusID is a struct that contains PCI bus address of GPU device.

func (*BusID) Compare

func (b *BusID) Compare(bTest BusID) bool

Compare compares the provided bus ID with current bus ID and returns true if they match and false in all other cases.
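
BusID keeps its fields unexported, so the following is only a sketch of how PCI addresses might be compared. It assumes the textual form `[domain:]bus:device.function` in hexadecimal; `pciAddr`, `parsePCIAddr` and `sameDevice` are hypothetical names. Comparing parsed numbers rather than raw strings lets addresses with differently padded domains (for example 8 versus 4 hex digits) match.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pciAddr is a hypothetical parsed form of a PCI bus address.
type pciAddr struct {
	domain, bus, device, function uint64
}

// parsePCIAddr parses "[domain:]bus:device.function" with hexadecimal fields.
func parsePCIAddr(s string) (pciAddr, error) {
	var a pciAddr
	var err error

	parts := strings.Split(s, ":")
	if len(parts) < 2 || len(parts) > 3 {
		return a, fmt.Errorf("malformed PCI address %q", s)
	}
	if len(parts) == 3 {
		if a.domain, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
			return a, fmt.Errorf("domain of %q: %w", s, err)
		}
		parts = parts[1:]
	}
	if a.bus, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return a, fmt.Errorf("bus of %q: %w", s, err)
	}
	devFn := strings.SplitN(parts[1], ".", 2)
	if len(devFn) != 2 {
		return a, fmt.Errorf("malformed device.function in %q", s)
	}
	if a.device, err = strconv.ParseUint(devFn[0], 16, 64); err != nil {
		return a, fmt.Errorf("device of %q: %w", s, err)
	}
	if a.function, err = strconv.ParseUint(devFn[1], 16, 64); err != nil {
		return a, fmt.Errorf("function of %q: %w", s, err)
	}
	return a, nil
}

// sameDevice reports whether both strings parse to the same address.
// Any parse failure yields false.
func sameDevice(x, y string) bool {
	a, errA := parsePCIAddr(x)
	b, errB := parsePCIAddr(y)
	return errA == nil && errB == nil && a == b
}

func main() {
	fmt.Println(sameDevice("00000000:15:00.0", "0000:15:00.0")) // prints "true"
	fmt.Println(sameDevice("0000:15:00.0", "0000:16:00.0"))     // prints "false"
}
```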

func (BusID) String

func (b BusID) String() string

String implements Stringer interface of the BusID struct.

type CEEMSCollector

type CEEMSCollector struct {
	Collectors map[string]Collector
	// contains filtered or unexported fields
}

CEEMSCollector implements the prometheus.Collector interface.

func NewCEEMSCollector

func NewCEEMSCollector(logger *slog.Logger) (*CEEMSCollector, error)

NewCEEMSCollector creates a new CEEMSCollector.

func (CEEMSCollector) Close

func (n CEEMSCollector) Close(ctx context.Context) error

Close stops all the collectors and releases system resources.

func (CEEMSCollector) Collect

func (n CEEMSCollector) Collect(ch chan<- prometheus.Metric)

Collect implements the prometheus.Collector interface.

func (CEEMSCollector) Describe

func (n CEEMSCollector) Describe(ch chan<- *prometheus.Desc)

Describe implements the prometheus.Collector interface.

type CEEMSExporter

type CEEMSExporter struct {
	App kingpin.Application
	// contains filtered or unexported fields
}

CEEMSExporter represents the `ceems_exporter` cli.

func NewCEEMSExporter

func NewCEEMSExporter() (*CEEMSExporter, error)

NewCEEMSExporter returns a new CEEMSExporter instance.

func (*CEEMSExporter) Main

func (b *CEEMSExporter) Main() error

Main is the entry point of the `ceems_exporter` command.

type CEEMSExporterServer

type CEEMSExporterServer struct {
	// contains filtered or unexported fields
}

CEEMSExporterServer struct implements HTTP server for exporter.

func NewCEEMSExporterServer

func NewCEEMSExporterServer(c *Config) (*CEEMSExporterServer, error)

NewCEEMSExporterServer creates new CEEMSExporterServer struct instance.

func (*CEEMSExporterServer) Shutdown

func (s *CEEMSExporterServer) Shutdown(ctx context.Context) error

Shutdown stops CEEMS exporter HTTP server.

func (*CEEMSExporterServer) Start

func (s *CEEMSExporterServer) Start() error

Start launches CEEMS exporter HTTP server.

type CEEMSProfilerConfig

type CEEMSProfilerConfig struct {
	Session   SessionConfig   `yaml:"ebpf"`
	Pyroscope PyroscopeConfig `yaml:"pyroscope"`
}

type Collector

type Collector interface {
	// Get new metrics and expose them via prometheus registry.
	Update(ch chan<- prometheus.Metric) error
	// Stops each collector and cleans up system resources
	Stop(ctx context.Context) error
}

Collector is the interface a collector has to implement.

func NewCPUCollector

func NewCPUCollector(logger *slog.Logger) (Collector, error)

NewCPUCollector returns a new Collector exposing kernel/system statistics.

func NewCrayPMCCollector

func NewCrayPMCCollector(logger *slog.Logger) (Collector, error)

NewCrayPMCCollector returns a new Collector exposing Cray's `pm_counters` metrics.

func NewEmissionsCollector

func NewEmissionsCollector(logger *slog.Logger) (Collector, error)

NewEmissionsCollector returns a new Collector exposing emission factor metrics.

func NewHwmonCollector

func NewHwmonCollector(logger *slog.Logger) (Collector, error)

NewHwmonCollector returns a new Collector exposing /sys/class/hwmon stats (similar to lm-sensors).

func NewIPMICollector

func NewIPMICollector(logger *slog.Logger) (Collector, error)

NewIPMICollector returns a new Collector exposing IPMI DCMI power metrics.

func NewInfiniBandCollector

func NewInfiniBandCollector(logger *slog.Logger) (Collector, error)

NewInfiniBandCollector returns a new Collector exposing InfiniBand stats.

func NewK8sCollector

func NewK8sCollector(logger *slog.Logger) (Collector, error)

NewK8sCollector returns a new Collector exposing a summary of cgroups.

func NewLibvirtCollector

func NewLibvirtCollector(logger *slog.Logger) (Collector, error)

NewLibvirtCollector returns a new libvirt collector exposing a summary of cgroups.

func NewMeminfoCollector

func NewMeminfoCollector(logger *slog.Logger) (Collector, error)

NewMeminfoCollector returns a new Collector exposing memory stats.

func NewNetdevCollector

func NewNetdevCollector(logger *slog.Logger) (Collector, error)

NewNetdevCollector returns a new Collector exposing node network stats.

func NewRaplCollector

func NewRaplCollector(logger *slog.Logger) (Collector, error)

NewRaplCollector returns a new Collector exposing RAPL metrics.

func NewRedfishCollector

func NewRedfishCollector(logger *slog.Logger) (Collector, error)

NewRedfishCollector returns a new Collector to fetch power usage from redfish API.

func NewSlurmCollector

func NewSlurmCollector(logger *slog.Logger) (Collector, error)

NewSlurmCollector returns a new Collector exposing a summary of cgroups.

type ComputeUnit

type ComputeUnit struct {
	UUID      string
	Hostname  string // Only applicable to SLURM when multiple daemons are enabled on same physical host
	NumShares uint64 // In case of time slicing/shards/MPS
}

ComputeUnit contains the unit details that will be associated with each GPU.

type Config

type Config struct {
	Logger     *slog.Logger
	Collector  *CEEMSCollector
	Discoverer Discoverer
	Web        WebConfig
}

Config holds the server configuration.

type Device

type Device struct {
	Minor            string
	Index            string
	Name             string
	UUID             string
	BusID            BusID
	NumSMs           uint64
	ComputeUnits     []ComputeUnit
	CurrentShares    uint64 // A share can be time slicing or MPS
	AvailableShares  uint64
	MdevUUIDs        []string
	Instances        []GPUInstance
	InstancesEnabled bool
	VGPUEnabled      bool
	// contains filtered or unexported fields
}

Device contains the details of physical GPU devices.

func (*Device) CompareBusID

func (d *Device) CompareBusID(id string) bool

CompareBusID compares the provided bus ID with device bus ID and returns true if they match and false in all other cases.

func (Device) ID

func (d Device) ID() string

ID returns the device ID that will be used by k8s requests.

func (*Device) ResetUnits

func (d *Device) ResetUnits()

ResetUnits will remove existing compute unit UUIDs.

func (Device) String

func (d Device) String() string

String implements Stringer interface of the Device struct.

type DeviceAttrs

type DeviceAttrs struct {
	XMLName xml.Name          `xml:"device_attributes"`
	Shared  DeviceAttrsShared `xml:"shared"`
}

type DeviceAttrsShared

type DeviceAttrsShared struct {
	XMLName  xml.Name `xml:"shared"`
	SMCount  uint64   `xml:"multiprocessor_count"`
	CECount  uint64   `xml:"copy_engine_count"`
	EncCount uint64   `xml:"encoder_count"`
	DecCount uint64   `xml:"decoder_count"`
}

func (*DeviceAttrsShared) UnmarshalXML added in v0.10.1

func (p *DeviceAttrsShared) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error

UnmarshalXML implements the xml.Unmarshaler interface.

type Devices

type Devices struct {
	HostDevs []HostDev `xml:"hostdev"`
}

type Discoverer

type Discoverer interface {
	Discover() ([]Target, error)
	Enabled() bool
}

func NewTargetDiscoverer

func NewTargetDiscoverer(c *discovererConfig) (Discoverer, error)

NewTargetDiscoverer returns a new profiling target discoverer.

type DomStatus added in v0.11.0

type DomStatus struct {
	Domain Domain `xml:"domain"`
}

DomStatus is the top level XML field for runtime XML files.

type Domain

type Domain struct {
	Devices Devices `xml:"devices"`
	Name    string  `xml:"name"`
	UUID    string  `xml:"uuid"`
}

Domain is the top level XML field for persistent XML files.
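
Domain, Devices, HostDev, Source and Address together describe the host devices attached to a libvirt domain. The sketch below decodes an illustrative domain XML with `encoding/xml`, using local copies of the types (only a subset of the HostDev and Address attributes is reproduced). The sample document and the `parseDomain` helper are assumptions for demonstration, not output captured from a real host.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the exported types (subset of attributes).
type Address struct {
	XMLName  xml.Name `xml:"address"`
	Type     string   `xml:"type,attr"`
	Domain   string   `xml:"domain,attr"`
	Bus      string   `xml:"bus,attr"`
	Slot     string   `xml:"slot,attr"`
	Function string   `xml:"function,attr"`
}

type Source struct {
	XMLName xml.Name `xml:"source"`
	Address Address  `xml:"address"`
}

type HostDev struct {
	XMLName xml.Name `xml:"hostdev"`
	Mode    string   `xml:"mode,attr"`
	Type    string   `xml:"type,attr"`
	Source  Source   `xml:"source"`
	Address Address  `xml:"address"`
}

type Devices struct {
	HostDevs []HostDev `xml:"hostdev"`
}

type Domain struct {
	Devices Devices `xml:"devices"`
	Name    string  `xml:"name"`
	UUID    string  `xml:"uuid"`
}

// sampleDomain is an illustrative document, not captured from a real host.
const sampleDomain = `
<domain type='kvm'>
  <name>instance-0001</name>
  <uuid>11111111-2222-3333-4444-555555555555</uuid>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x15' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
  </devices>
</domain>`

// parseDomain is a hypothetical helper that decodes a domain definition.
func parseDomain(data []byte) (*Domain, error) {
	var d Domain
	if err := xml.Unmarshal(data, &d); err != nil {
		return nil, fmt.Errorf("decoding domain XML: %w", err)
	}
	return &d, nil
}

func main() {
	d, err := parseDomain([]byte(sampleDomain))
	if err != nil {
		panic(err)
	}
	for _, hd := range d.Devices.HostDevs {
		// Source.Address is the device on the host; Address is the guest side.
		src := hd.Source.Address
		fmt.Printf("%s: host device %s:%s:%s.%s\n", d.Name, src.Domain, src.Bus, src.Slot, src.Function)
	}
}
```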

type GPUInstance

type GPUInstance struct {
	InstanceIndex   uint64
	Index           string
	UUID            string
	ComputeInstID   uint64
	GPUInstID       uint64
	SMFraction      float64
	NumSMs          uint64
	ComputeUnits    []ComputeUnit
	CurrentShares   uint64 // A share can be time slicing or MPS
	AvailableShares uint64
	MdevUUIDs       []string
}

GPUInstance is an abstraction for an NVIDIA MIG instance or an AMD GPU partition.

func (GPUInstance) ID

func (d GPUInstance) ID() string

ID returns the instance ID that will be used by k8s requests.

func (*GPUInstance) ResetUnits

func (d *GPUInstance) ResetUnits()

ResetUnits will remove existing compute unit UUIDs.

func (GPUInstance) String

func (d GPUInstance) String() string

String implements Stringer interface of the GPUInstance struct.

type GPUSMI

type GPUSMI struct {
	Devices []Device
	// contains filtered or unexported fields
}

GPUSMI is a vendor neutral SMI interface for GPUs.

func NewGPUSMI

func NewGPUSMI(k8sClient *ceems_k8s.Client, logger *slog.Logger) (*GPUSMI, error)

NewGPUSMI returns a new instance of GPUSMI struct to query GPUs.

func (*GPUSMI) Discover

func (g *GPUSMI) Discover() error

Discover finds devices on the host.

func (*GPUSMI) ReindexGPUs

func (g *GPUSMI) ReindexGPUs(orderMap string)

ReindexGPUs reindexes GPU globalIndex based on orderMap string.

func (*GPUSMI) UpdateGPUMdevs

func (g *GPUSMI) UpdateGPUMdevs() error

UpdateGPUMdevs updates GPU devices slice with mdev UUIDs.

type HostDev

type HostDev struct {
	XMLName xml.Name `xml:"hostdev"`
	Mode    string   `xml:"mode,attr"`
	Type    string   `xml:"type,attr"`
	Managed string   `xml:"managed,attr"`
	Model   string   `xml:"model,attr"`
	Display string   `xml:"display,attr"`
	Source  Source   `xml:"source"`
	Address Address  `xml:"address"`
}

type Ksyms

type Ksyms struct {
	// contains filtered or unexported fields
}

Ksyms is a structure for kernel symbols.

func NewKsyms

func NewKsyms() (*Ksyms, error)

NewKsyms creates a new Ksyms structure (by reading procfs/kallsyms).

func (*Ksyms) GetArchSpecificName

func (k *Ksyms) GetArchSpecificName(name string) (string, error)

GetArchSpecificName returns the architecture-specific symbol (if it exists) of a given kernel symbol.

func (*Ksyms) IsAvailable

func (k *Ksyms) IsAvailable(name string) bool

IsAvailable returns true if the given name is available on current kernel.

type MIGDevice

type MIGDevice struct {
	XMLName       xml.Name    `xml:"mig_device"`
	Index         uint64      `xml:"index"`
	GPUInstID     uint64      `xml:"gpu_instance_id"`
	ComputeInstID uint64      `xml:"compute_instance_id"`
	DeviceAttrs   DeviceAttrs `xml:"device_attributes"`
	FBMemory      Memory      `xml:"fb_memory_usage"`
	Bar1Memory    Memory      `xml:"bar1_memory_usage"`
	UUID          string
}

func (*MIGDevice) UnmarshalXML added in v0.10.1

func (p *MIGDevice) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error

UnmarshalXML implements the xml.Unmarshaler interface.

type MIGDevices

type MIGDevices struct {
	XMLName xml.Name    `xml:"mig_devices"`
	Devices []MIGDevice `xml:"mig_device"`
}

type MIGMode

type MIGMode struct {
	XMLName    xml.Name `xml:"mig_mode"`
	CurrentMIG string   `xml:"current_mig"`
}

type Memory

type Memory struct {
	Total string `xml:"total"`
}

type NVIDIASMILog

type NVIDIASMILog struct {
	XMLName xml.Name    `xml:"nvidia_smi_log"`
	GPUs    []NvidiaGPU `xml:"gpu"`
}

type NvidiaGPU

type NvidiaGPU struct {
	XMLName      xml.Name   `xml:"gpu"`
	ID           string     `xml:"id,attr"`
	ProductName  string     `xml:"product_name"`
	ProductBrand string     `xml:"product_brand"`
	ProductArch  string     `xml:"product_architecture"`
	MIGMode      MIGMode    `xml:"mig_mode"`
	VirtMode     VirtMode   `xml:"gpu_virtualization_mode"`
	MIGDevices   MIGDevices `xml:"mig_devices"`
	UUID         string     `xml:"uuid"`
	MinorNumber  string     `xml:"minor_number"`
}
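
NVIDIASMILog and NvidiaGPU map onto an XML report of GPUs. As a minimal sketch, the example below decodes an illustrative report into local copies of these types. MIGDevices is left out because MIGDevice has a custom UnmarshalXML whose behaviour is not documented here; the sample document and the `parseSMILog` helper are assumptions.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the exported types, without the MIG device list.
type MIGMode struct {
	XMLName    xml.Name `xml:"mig_mode"`
	CurrentMIG string   `xml:"current_mig"`
}

type NvidiaGPU struct {
	XMLName     xml.Name `xml:"gpu"`
	ID          string   `xml:"id,attr"`
	ProductName string   `xml:"product_name"`
	MIGMode     MIGMode  `xml:"mig_mode"`
	UUID        string   `xml:"uuid"`
	MinorNumber string   `xml:"minor_number"`
}

type NVIDIASMILog struct {
	XMLName xml.Name    `xml:"nvidia_smi_log"`
	GPUs    []NvidiaGPU `xml:"gpu"`
}

// sampleSMI is an illustrative report, not captured tool output.
const sampleSMI = `
<nvidia_smi_log>
  <gpu id="00000000:15:00.0">
    <product_name>Example GPU</product_name>
    <mig_mode><current_mig>Enabled</current_mig></mig_mode>
    <uuid>GPU-00000000-0000-0000-0000-000000000000</uuid>
    <minor_number>0</minor_number>
  </gpu>
</nvidia_smi_log>`

// parseSMILog is a hypothetical helper that decodes the report.
func parseSMILog(data []byte) (*NVIDIASMILog, error) {
	var log NVIDIASMILog
	if err := xml.Unmarshal(data, &log); err != nil {
		return nil, fmt.Errorf("decoding SMI log: %w", err)
	}
	return &log, nil
}

func main() {
	log, err := parseSMILog([]byte(sampleSMI))
	if err != nil {
		panic(err)
	}
	for _, gpu := range log.GPUs {
		fmt.Printf("%s (%s) minor=%s mig=%s\n",
			gpu.ProductName, gpu.ID, gpu.MinorNumber, gpu.MIGMode.CurrentMIG)
	}
}
```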

type PMCDomain

type PMCDomain struct {
	Name string // name of PM counter domain zone from filename
	Path string // filesystem path of PM counters
}

PMCDomain stores the information for one Cray PM counter domain.

func GetCrayPMCDomains

func GetCrayPMCDomains(sysFSPath string) ([]PMCDomain, error)

GetCrayPMCDomains returns a slice of Cray's `pm_counters` domains. See https://cray-hpe.github.io/docs-csm/en-10/operations/power_management/user_access_to_compute_node_power_data/ for details.

func (PMCDomain) GetEnergyJoules

func (pd PMCDomain) GetEnergyJoules() (uint64, error)

GetEnergyJoules returns the current joule value from the domain counter.

func (PMCDomain) GetPowerLimitWatts

func (pd PMCDomain) GetPowerLimitWatts() (uint64, error)

GetPowerLimitWatts returns the current power limit watt value from the domain counter.

func (PMCDomain) GetPowerWatts

func (pd PMCDomain) GetPowerWatts() (uint64, error)

GetPowerWatts returns the current watt value from the domain counter.

func (PMCDomain) GetTempCelsius

func (pd PMCDomain) GetTempCelsius() (uint64, error)

GetTempCelsius returns the current node temperature in degrees Celsius from the domain counter.
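
The getters above all return a single unsigned value read from a counter file. The following sketch shows one way such a value could be extracted, assuming each counter file holds a line of the form `<value> <unit> <timestamp> us`. Both that layout and the `parseCounter` helper are assumptions; the package's own parsing is not shown in this documentation.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseCounter extracts the leading integer from a counter line such as
// "215 W 1700000000000000 us". The line layout is an assumption.
func parseCounter(line string) (uint64, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return 0, errors.New("empty counter content")
	}
	value, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parsing counter value %q: %w", fields[0], err)
	}
	return value, nil
}

func main() {
	watts, err := parseCounter("215 W 1700000000000000 us\n")
	if err != nil {
		panic(err)
	}
	fmt.Println("power in watts:", watts)
}
```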

type Profiler

type Profiler interface {
	Start(ctx context.Context) error
	Stop()
	Enabled() bool
}

Profiler is the interface different profilers must implement.
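
As an illustration of the contract, the sketch below defines a local copy of the interface, a no-op implementation that reports itself as disabled, and a caller that only starts enabled profilers. The `noopProfiler`, `recorder` and `runOnce` names are hypothetical, and whether Start blocks in the real profiler is not specified here; the sketch assumes it returns immediately.

```go
package main

import (
	"context"
	"fmt"
)

// Profiler is a local copy of the interface documented above.
type Profiler interface {
	Start(ctx context.Context) error
	Stop()
	Enabled() bool
}

// noopProfiler is a hypothetical implementation used when profiling is off.
type noopProfiler struct{}

func (noopProfiler) Start(context.Context) error { return nil }
func (noopProfiler) Stop()                       {}
func (noopProfiler) Enabled() bool               { return false }

// recorder is a hypothetical enabled profiler that counts calls.
type recorder struct{ starts, stops int }

func (r *recorder) Start(context.Context) error { r.starts++; return nil }
func (r *recorder) Stop()                       { r.stops++ }
func (r *recorder) Enabled() bool               { return true }

// runOnce starts p only when it is enabled and always stops what it started.
// It reports whether the profiler actually ran.
func runOnce(p Profiler) (bool, error) {
	if !p.Enabled() {
		return false, nil
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	if err := p.Start(ctx); err != nil {
		return false, fmt.Errorf("starting profiler: %w", err)
	}
	p.Stop()
	return true, nil
}

func main() {
	for _, p := range []Profiler{noopProfiler{}, &recorder{}} {
		ran, err := runOnce(p)
		fmt.Printf("%T ran=%t err=%v\n", p, ran, err)
	}
}
```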

func NewProfiler

func NewProfiler(c *profilerConfig) (Profiler, error)

NewProfiler returns a new instance of continuous profiler based on eBPF.

type ProfilerConfig

type ProfilerConfig struct {
	Profiler CEEMSProfilerConfig `yaml:"ceems_profiler"`
}

type PyroscopeConfig

type PyroscopeConfig struct {
	URL              string                  `yaml:"url"`
	ExternalLabels   map[string]string       `yaml:"external_labels"`
	HTTPClientConfig config.HTTPClientConfig `yaml:",inline"`
}

func (*PyroscopeConfig) UnmarshalYAML

func (c *PyroscopeConfig) UnmarshalYAML(unmarshal func(any) error) error

UnmarshalYAML implements the yaml.Unmarshaler interface.

func (*PyroscopeConfig) Validate

func (c *PyroscopeConfig) Validate() error

Validate validates the config.

type ROCMSMI

type ROCMSMI struct {
	Bus              string `json:"PCI Bus"`
	Serial           string `json:"Serial Number"`
	Name             string `json:"Card Vendor"`
	Node             string `json:"Node ID"`
	ComputePartition string `json:"Compute Partition"`
	MemoryPartition  string `json:"Memory Partition"`
}

type SessionConfig

type SessionConfig struct {
	CollectInterval   model.Duration `yaml:"collect_interval"`
	DiscoverInterval  model.Duration `yaml:"discover_interval"`
	CollectUser       bool           `yaml:"collect_user_profile"`
	CollectKernel     bool           `yaml:"collect_kernel_profile"`
	PythonEnabled     bool           `yaml:"python_enabled"`
	SampleRate        int            `yaml:"sample_rate"`
	Demangle          string         `yaml:"demangle"`
	BuildIDCacheSize  int            `yaml:"build_id_cache_size"`
	PIDCacheSize      int            `yaml:"pid_cache_size"`
	PIDMapSize        uint32         `yaml:"pid_map_size"`
	SameFileCacheSize int            `yaml:"same_file_cache_size"`
	SymbolsMapSize    uint32         `yaml:"symbols_map_size"`
	CacheRounds       int            `yaml:"cache_rounds"`
}

func (*SessionConfig) UnmarshalYAML

func (c *SessionConfig) UnmarshalYAML(unmarshal func(any) error) error

UnmarshalYAML implements the yaml.Unmarshaler interface.

func (*SessionConfig) Validate

func (c *SessionConfig) Validate() error

Validate validates the config.

type Source

type Source struct {
	XMLName xml.Name `xml:"source"`
	Address Address  `xml:"address"`
}

type Target

type Target struct {
	Targets []string           `json:"targets"`
	Labels  sd.DiscoveryTarget `json:"labels"`
}
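
Given its JSON tags, a slice of Target values serialises to a list of `targets`/`labels` objects. The sketch below shows that shape with a local copy of the type; `map[string]string` stands in for `sd.DiscoveryTarget`, which comes from an external package, and the target and label values as well as the `encodeTargets` helper are purely illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Target is a local copy of the type above. The Labels field uses
// map[string]string as a stand-in for sd.DiscoveryTarget (an assumption).
type Target struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// encodeTargets is a hypothetical helper that renders targets as JSON.
func encodeTargets(targets []Target) (string, error) {
	out, err := json.Marshal(targets)
	if err != nil {
		return "", fmt.Errorf("encoding targets: %w", err)
	}
	return string(out), nil
}

func main() {
	body, err := encodeTargets([]Target{
		{
			Targets: []string{"example-target"},
			Labels:  map[string]string{"service_name": "example-unit"},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```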

type VirtMode

type VirtMode struct {
	XMLName  xml.Name `xml:"gpu_virtualization_mode"`
	Mode     string   `xml:"virtualization_mode"`
	HostMode string   `xml:"host_vgpu_mode"`
}

type WebConfig

type WebConfig struct {
	Addresses              []string
	WebSystemdSocket       bool
	WebConfigFile          string
	MetricsPath            string
	TargetsPath            string
	MaxRequests            int
	IncludeExporterMetrics bool
	EnableDebugServer      bool
	LandingConfig          *web.LandingConfig
}

WebConfig holds the HTTP web configuration built from CLI args.
