gang

package
v0.0.0-...-d09a64d Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 12, 2024 License: Apache-2.0 Imports: 26 Imported by: 0

README

Gang

This is Gang plugin, aiming to achieve an efficient scheduling for groups of your Pods.

Gang is still in the early stage, subject to get a breaking change for some reason.

Motivation

In PFN, we need to schedule a group of Pods at the same time for machine learning.

In in-tree scheduler plugins, there's no plugin that can ensure that all Pods are scheduled at the same time. In kubernetes-sigs/scheduler-plugins, they have coscheduling plugin, which is very similar idea to this gang plugin.

However, they've not improved much in recent years while there's many parts to be improved/fixed.

Therefore, we founded gang plugin here:

  • Simple configuration: It doesn't require any CRD, simply configured by Pod annotations.
  • Enhanced requeueing: The gang plugin tracks why each Pod in a gang is rejected and requeue Pods only when all Pods in gang are ready. It contributes to the efficiency of scheduling and preventing many Pods from reserving resources in vain.

Usage

Gang is made of quite simple configuration; all the configuration is done within Pod's anntoations and doesn't require any CRD installed.

metadata:
  generateName: gangpod-
  namespace: default
  annotations:
    "gang-scheduling.preferred.jp/gang-name": "awesome_pfn"
    "gang-scheduling.preferred.jp/gang-size": "2"
    "gang-scheduling.preferred.jp/gang-schedule-timeout-seconds": "100"
spec:
  containers:
    - image: registry.k8s.io/pause:3.9
      name: pause
      ports:
        - containerPort: 80
      resources:
        limits:
          cpu: 100m
          memory: 500Mi
        requests:
          cpu: 100m
          memory: 500Mi
  • gang-scheduling.preferred.jp/gang-name(gangAnnotationPrefix+-name): The name of gang which should be unique within the namespace.
  • gang-scheduling.preferred.jp/gang-size(gangAnnotationPrefix+-size): The number of Pods belongs to this gang, that is, this Pod won't be scheduled successfully until this number of Pods is created and all can be scheduled to some Nodes.
  • gang-scheduling.preferred.jp/gang-schedule-timeout-seconds(gangAnnotationPrefix+-schedule-timeout-seconds): This is the timeout configuration. e.g., let's say one Pod is ready to get scheduled and waiting at WaitOnPermit, while another Pod cannot be scheduled at the moment, the waiting Pod is rejected once this timeout pass. Meaning, all the Pods in gang is put back to the queue and have to go through the scheduling cycle again.

KubeSchedulerConfiguration

The cluster admin can configure gang plugin like following:

kind: KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
profiles:
  - schedulerName: awesome-pfn-scheduler
    plugins:
      multiPoint:
        enabled:
          - name: gang
    pluginConfig:
    - name: gang
      args:
        # GangAnnotationPrefix is the prefix of all gang annotations.
        # This configuration is required; if not set, the plugin will return an error during its initialization.
        gangAnnotationPrefix: "gang-scheduling.preferred.jp/gang"
        # SchedulerName is the name of the scheduler.
        # This field is optional; if not set, the default scheduler name will be used.
        schedulerName: "awesome-pfn-scheduler"
        # GangScheduleTimeoutSecondsLimit is the maximum timeout in seconds for gang scheduling.
        # If the timeout configured in the pod annotation exceeds this limit, the timeout will be set to this limit.
        # This field is optional; if not set, 100 will be used as a default value.
        gangScheduleTimeoutSecondsLimit: 100
        # GangScheduleDefaultTimeoutSeconds is the default timeout in seconds,
        # which will be used if the timeout is not set in the pod annotation.
        # This field is optional; if not set, 30 will be used as a default value.
        gangScheduleDefaultTimeoutSeconds: 30
        # GangScheduleTimeoutJitterSeconds is the jitter in seconds for timeout.
        # This field is optional; if not set, 30 will be used as a default value.
        gangScheduleTimeoutJitterSeconds: 30

Documentation

Index

Constants

View Source
const (
	// PluginName is the name of the plugin.
	PluginName = names.Gang

	// GangScheduleTimeoutSecondsDefault is the default value for gang schedule timeout
	GangScheduleTimeoutSecondsDefault = 30

	// GangScheduleTimeoutSecondsLimitDefault is the default value of the upper limit value
	// for gang schedule timeout which user can define.
	// If user defines the value over this limit, scheduler apply the upper limit implicitly
	// This value will be used unless Plugin.GangScheduleTimeoutSecondsLimitDefault is set.
	GangScheduleTimeoutSecondsLimitDefault = 100

	// GangScheduleTimeoutJitterSecondsDefault is the default value of maximum jitter of gang
	// schedule timeout.
	// Timeout seconds of a gang scheduling will be calculated as:
	//   user-specified timeout + random value sampled from [0, maximum jitter)
	GangScheduleTimeoutJitterSecondsDefault = GangScheduleTimeoutSecondsDefault

	// StateKeyGangFirstPod is a key of CycleState.
	// Gang PreFilter plugin writes a value with this key if a given pod is the first of a new gang.
	// UniqueZone PreFilter plugin later reads CycleState with this key.
	StateKeyGangFirstPod = "GangFirstPod"
)

Variables

This section is empty.

Functions

func GangNameAnnotationKey

func GangNameAnnotationKey(prefix string) string

GangNameAnnotationKey is the annotation key to define gangs. Scheduler recognizes the pod belongs to gang "__gang_name__" in your namespace.

func GangScheduleTimeoutSecondsAnnotationKey

func GangScheduleTimeoutSecondsAnnotationKey(prefix string) string

GangScheduleTimeoutSecondsAnnotationKey is the annotation key to define schedule timeout of the gang. If all the pods in the gang are not scheduled in this time period, scheduler mark all the pods in the gang as 'unschedulable' and try to schedule another gang.

func GangSizeAnnotationKey

func GangSizeAnnotationKey(prefix string) string

GangSizeAnnotationKey is the annotation key to define size of the gang. Scheduler waits until the number pods which belongs to the gang specified are created.

func GangSizeOf

func GangSizeOf(pod *v1.Pod, annotationPrefix string) int

TODO: Return (int, error) or (int, bool) to handle error

func IsGang

func IsGang(pod *v1.Pod, gangAnnotationPrefix string) bool

func NewPlugin

func NewPlugin(configuration runtime.Object, fwkHandle framework.Handle) (framework.Plugin, error)

Types

type Gang

type Gang interface {
	fmt.Stringer

	// NameAndSpec returns the GangNameAndSpec of this Gang.
	NameAndSpec() *GangNameAndSpec

	// AddOrUpdate adds a given Pod to this Gang, or updates a Pod in this Gang.
	AddOrUpdate(*corev1.Pod)
	// Delete deletes a given Pod from this Gang. If the Pod is not in this Gang, Delete is a no-op.
	Delete(*corev1.Pod)

	// Pods returns all Pods in the gang.
	Pods() []*corev1.Pod
	// CountPod returns the number of Pods in this Gang.
	CountPod() int
	// CountPodIf returns the number of Pods that meets a given predicate in this Gang.
	CountPodIf(predicate func(*corev1.Pod) bool) int

	// SatisfiesInvariantForgScheduling checks that this Gang satisfy an invariant for gang
	// scheduling. If this Gang does not sastisfy the invariant, this plugin does not start
	// scheduling of this Gang and rejects it immediately.
	SatisfiesInvariantForScheduling(ScheduleTimeoutConfig) (bool, GangSchedulingEvent)

	// IterateOverPods iterates over the Pods in this Gang while applying a given function.
	// The function can mutate the Pods.
	IterateOverPods(func(pod *corev1.Pod))

	// IsAllNonCompleletedSpecIdenticalTo checks that the gang specs of all non-completed Pods in
	// this Gang are identical to a given gang spec.
	IsAllNonCompletedSpecIdenticalTo(GangSpec, ScheduleTimeoutConfig) bool

	// Mark this Gang as "being deleted now".
	SetDeleting(deleting bool)
	// Returns whether this Gang is now being deleted.
	IsDeleting() bool

	// EventMessage returns a message to be notified as a plugin response or Kubernetes event.
	EventMessage(event GangSchedulingEvent, pod *corev1.Pod) string
	EventMessageForPodFunc(event GangSchedulingEvent) func(*corev1.Pod) string

	GetPosition(podUID types.UID) PodPosition
	PutPosition(pod *corev1.Pod, position PodPosition)
	ReadyToGetSchedule() bool
	UnreadyToSchedulePodNames() []string
}

func NewGang

func NewGang(nameSpec GangNameAndSpec, gangAnnotationPrefix string) Gang

NewGang creates a new Gang. Gang interface methods are *not* thread-safe. But as far as accessed from Gangs, Gang methods are called sequentially because Gangs acquires a lock when accessing its gangs map.

type GangName

type GangName types.NamespacedName

func GangNameOf

func GangNameOf(pod *v1.Pod, annotationPrefix string) (GangName, bool)

func (GangName) String

func (gn GangName) String() string

type GangNameAndSpec

type GangNameAndSpec struct {
	Name GangName
	Spec GangSpec
}

func GangNameAndSpecOf

func GangNameAndSpecOf(pod *v1.Pod, config ScheduleTimeoutConfig, gangAnnotationPrefix string) (GangNameAndSpec, bool)

type GangSchedulingEvent

type GangSchedulingEvent string
const (
	// GangNotReady means the number of gang Pods isn't sufficient.
	GangNotReady GangSchedulingEvent = "GangNotReady"
	// GangNotReadyToSchedule means the number of gang Pods is sufficient, but not all Pods are ready to schedule,
	// which technically means some gang Pods are rejected by scheduler plugins other than gang in the scheduling cycle.
	GangNotReadyToSchedule   GangSchedulingEvent = "GangNotReadyToSchedule"
	GangSpecInvalid          GangSchedulingEvent = "GangSpecInvalid"
	GangWaitForTerminating   GangSchedulingEvent = "GangWaitForTerminating"
	GangFullyScheduled       GangSchedulingEvent = "GangFullyScheduled"
	GangWaitForReady         GangSchedulingEvent = "GangWaitForReady"
	GangSchedulingTimedOut   GangSchedulingEvent = "GangSchedulingTimedOut"
	GangReady                GangSchedulingEvent = "GangReady"
	GangOtherPodGetsRejected GangSchedulingEvent = "GangOtherPodGetsRejected"
	FillingRunningGang       GangSchedulingEvent = "FillingRunningGang"
	DeletedAsPartOfGang      GangSchedulingEvent = "DeletedAsPartOfGang"
)

type GangSpec

type GangSpec struct {
	Size                 int
	TimeoutBase          time.Duration
	TimeoutJitterSeconds int
}

type Gangs

type Gangs struct {
	// contains filtered or unexported fields
}

TODO: More distinguishable name

func NewGangs

func NewGangs(fwkHandle framework.Handle, client kubernetes.Interface, timeoutConfig ScheduleTimeoutConfig, gangAnnotationPrefix string) *Gangs

func (*Gangs) AddOrUpdate

func (gangs *Gangs) AddOrUpdate(pod *corev1.Pod, recorder events.EventRecorder)

func (*Gangs) Delete

func (gangs *Gangs) Delete(pod *corev1.Pod)

func (*Gangs) Permit

func (gangs *Gangs) Permit(state *framework.CycleState, pod *corev1.Pod) (retStatus *framework.Status, _ time.Duration)

func (*Gangs) PostFilter

func (gangs *Gangs) PostFilter(ctx context.Context, pod *corev1.Pod)

func (*Gangs) PreEnqueue

func (gangs *Gangs) PreEnqueue(pod *corev1.Pod) *framework.Status

PreEnqueue accept gang pods to be enqueued only when gang's position is PodPositionSchedulingCycle or PodPositionActiveQ. Thus, the possible scenarios the gang pods can pass here are: - All gang pods are rejected by this gang plugin, and PodsToActivate for them gets issued by the Permit. - Preemption happened for a gang Pod in the past scheduling cycle, and that Pod is moved to activeQ right after it moved to the unschedulable Pod pool.

func (*Gangs) PreFilter

func (gangs *Gangs) PreFilter(ctx context.Context, state *framework.CycleState, pod *corev1.Pod) (status *framework.Status)

func (*Gangs) String

func (gangs *Gangs) String() string

String implements fmt.Stringer. Returns a human-readable string representation of the gangs.

Does not acquire the lock.

func (*Gangs) Unreserve

func (gangs *Gangs) Unreserve(pod *corev1.Pod, recorder events.EventRecorder)

type Plugin

type Plugin struct {
	// contains filtered or unexported fields
}

func (*Plugin) EventsToRegister

func (p *Plugin) EventsToRegister() []framework.ClusterEventWithHint

func (*Plugin) Name

func (p *Plugin) Name() string

func (*Plugin) Permit

func (p *Plugin) Permit(
	ctx context.Context, state *framework.CycleState, pod *corev1.Pod, nodeName string,
) (*framework.Status, time.Duration)

func (*Plugin) PreEnqueue

func (p *Plugin) PreEnqueue(ctx context.Context, pod *corev1.Pod) *framework.Status

func (*Plugin) PreFilter

func (p *Plugin) PreFilter(
	ctx context.Context, state *framework.CycleState, pod *corev1.Pod,
) (*framework.PreFilterResult, *framework.Status)

func (*Plugin) PreFilterExtensions

func (p *Plugin) PreFilterExtensions() framework.PreFilterExtensions

func (*Plugin) Reserve

func (p *Plugin) Reserve(ctx context.Context, _ *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status

func (*Plugin) Unreserve

func (p *Plugin) Unreserve(ctx context.Context, _ *framework.CycleState, pod *corev1.Pod, nodeName string)

Unreserve is called when a waiting gang pod is rejected due to time out, and this rejects all waiting pods in the gang

type PluginConfig

type PluginConfig struct {
	// GangAnnotationPrefix is the prefix of all gang annotations.
	// This configuration is required; if not set, the plugin will return an error during its initialization.
	GangAnnotationPrefix string `json:"gangAnnotationPrefix"`
	// SchedulerName is the name of the scheduler.
	// This field is optional; if not set, the default scheduler name will be used.
	SchedulerName string `json:"schedulerName,omitempty"`
	// GangScheduleTimeoutSecondsLimit is the maximum timeout in seconds for gang scheduling.
	// If the timeout configured in the pod annotation exceeds this limit, the timeout will be set to this limit.
	// This field is optional; if not set, 100 will be used as a default value.
	GangScheduleTimeoutSecondsLimit int `json:"gangScheduleTimeoutSecondLimit,omitempty"`
	// GangScheduleDefaultTimeoutSeconds is the default timeout in seconds,
	// which will be used if the timeout is not set in the pod annotation.
	// This field is optional; if not set, 30 will be used as a default value.
	GangScheduleDefaultTimeoutSeconds int `json:"gangScheduleDefaultTimeoutSecond,omitempty"`
	// GangScheduleTimeoutJitterSeconds is the jitter in seconds for timeout.
	// This field is optional; if not set, 30 will be used as a default value.
	GangScheduleTimeoutJitterSeconds int `json:"gangScheduleTimeoutJitterSecond,omitempty"`
}

func DecodePluginConfig

func DecodePluginConfig(configuration runtime.Object) (*PluginConfig, error)

func (*PluginConfig) TimeoutConfig

func (config *PluginConfig) TimeoutConfig() ScheduleTimeoutConfig

type PodPosition

type PodPosition int

PodPosition represents the place where a pod can be.

const (
	// PodPositionUnknown represents that we don't know where a Pod is in the scheduler.
	//
	// The gang PreEnqueue is responsible to register a position for a newly created Pod.
	// Until that, a newly created Pod's position will be Unknown.
	PodPositionUnknown PodPosition = iota
	// PodPositionUnschedulablePodPool represents that a Pod is or should be in the Unschedulable Pod Pool.
	//
	// The gang PostFilter is responsible to change a position for a rejected Pod to PodPositionUnschedulablePodPool.
	PodPositionUnschedulablePodPool
	// PodPositionReadyToSchedule represents that a Pod is in the Unschedulable Pod Pool, but ready to get schedule AND PodsToActivate isn't issued for it yet.
	// When all Pods in the gang get PodPositionReadyToSchedule, then PodsToActivate is expected to get issued soon from the gang Permit.
	//
	// There are multiple scenario that a Pod can get PodPositionReadyToSchedule,
	// the most popular one among them is PreEnqueue changing the given Pod's position to PodPositionReadyToSchedule.
	// The scheduler tries to move a Pod from unschedulable Pod Pool to activeQ when an event which may make a Pod schedulable happens.
	// So, we can regard a Pod which is coming on PreEnqueue as a ready-to-schedule Pod.
	PodPositionReadyToSchedule
	// PodPositionActiveQ represents that a Pod is in ActiveQ. Or in the Unschedulable Pod Pool but PodsToActivate has been issued for the Pod.
	//
	// The gang Permit is responsible to change a position to PodPositionActiveQ when it issues PodsToActivate for the Pod.
	// Those Pods will soon reach the gang PreEnqueue, and get accepted to enqueued to activeQ.
	PodPositionActiveQ
	// PodPositionSchedulingCycle represents that a Pod is under scheduling.
	//
	// The gang PreFilter is responsible to change a position to PodPositionSchedulingCycle when it accepts the Pod.
	PodPositionSchedulingCycle
	// PodPositionWaitingOnPermit represents that a Pod is waiting on permit.
	//
	// The gang Permit is responsible to change a position to PodPositionWaitingOnPermit when the Pod go through Permit plugin with Wait status.
	PodPositionWaitingOnPermit
)

func (PodPosition) String

func (p PodPosition) String() string

type ScheduleTimeoutConfig

type ScheduleTimeoutConfig struct {
	DefaultSeconds int
	LimitSeconds   int
	JitterSeconds  int
}

type SchedulingGang

type SchedulingGang interface {
	Gang

	PreFilter(pod *corev1.Pod, timeoutConfig ScheduleTimeoutConfig) *framework.Status
	Permit(state *framework.CycleState, pod *corev1.Pod, timeoutConfig ScheduleTimeoutConfig) (*framework.Status, time.Duration)

	// Refresh rejects all waiting Pods and marks this SchedulingGang as done if it no longer
	// satisfies the invariant for gang scheduling.
	Refresh(ScheduleTimeoutConfig)

	// Timeout rejects all waiting Pods and mark this gang as done.
	// This is called by Unreserve plugin on timeout for a waiting gang Pod.
	Timeout()

	IsDone() bool

	// NonSchedulingGang returns the underlying Gang.
	NonSchedulingGang() Gang

	RejectWaitingPods(completionStatus GangSchedulingEvent, msgF msgForPodFunc)
}

func NewSchedulingGang

func NewSchedulingGang(gang Gang, fwkHandle framework.Handle, timeout time.Duration, gangAnnotationPrefix string) SchedulingGang

NewSchedulingGang creates a new SchedulingGang. SchedulingGang methods are thread-safe.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL