autoscaler

Published: Jun 5, 2018 License: Apache-2.0 Imports: 9 Imported by: 0

README

Autoscaling

Knative Serving Revisions are automatically scaled up and down according to incoming traffic.

Definitions

  • Knative Serving Revision -- a custom resource which is a running snapshot of the user's code (in a Container) and configuration.
  • Knative Serving Route -- a custom resource which exposes Revisions to clients via an Istio ingress rule.
  • Kubernetes Deployment -- a k8s resource which manages the lifecycle of individual Pods running Containers. One of these is running user code in each Revision.
  • Knative Serving Autoscaler -- another k8s Deployment (one per Revision) running a single Pod which watches request load on the Pods running user code. It increases and decreases the size of the Deployment running the user code in order to compensate for higher or lower traffic load.
  • Knative Serving Activator -- a k8s Deployment running a single, multi-tenant Pod (one per Cluster for all Revisions) which catches requests for Revisions with no Pods. It brings up Pods running user code (via the Revision controller) and forwards caught requests.
  • Concurrency -- the number of requests currently being served at a given moment. More QPS or higher latency means more concurrent requests.
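
For intuition, Little's law relates the two: at 100 requests per second with 200 ms average latency, roughly 100 × 0.2 = 20 requests are in flight at any moment.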

Behavior

Revisions have three autoscaling states:

  1. Active when they are actively serving requests,
  2. Reserve when they are scaled down to 0 Pods but are still in service, and
  3. Retired when they will no longer receive traffic.

When a Revision is actively serving requests, it increases and decreases the number of Pods to maintain the desired average number of concurrent requests per Pod. When requests are no longer being served, the Revision is put in the Reserve state. When the first request arrives, the Revision is put back in the Active state, and the request is queued until the Revision becomes ready.

In the Active state, each Revision has a Deployment which maintains the desired number of Pods. It also has an Autoscaler which watches traffic metrics and adjusts the Deployment's desired number of pods up and down. Each Pod reports its number of concurrent requests each second to the Autoscaler (how many clients are connected at that moment).

In the Reserve state, the Revision has no scheduled Pods and consumes no CPU. The Istio route rule for the Revision points to the single multi-tenant Activator, which catches traffic for all Reserve Revisions. When the Activator catches a request for a Reserve Revision, it flips the Revision to the Active state and then forwards requests to the Revision once it is ready.

In the Retired state, the Revision has no provisioned resources. No requests will be served for the Revision.

Context

   +---------------------+
   | ROUTE               |
   |                     |
   |   +-------------+   |
   |   | Istio Route |---------------+
   |   +-------------+   |           |
   |         |           |           |
   +---------|-----------+           |
             |                       |
             |                       |
             | inactive              | active
             |  route                | route
             |                       |
             |                       |
             |                +------|---------------------------------+
             V         watch  |      V                                 |
        +-----------+   first  |   +------+  create   +------------+    |
       | Activator |------------->| Pods |<----------| Deployment |    |
       +-----------+          |   +------+           +------------+    |
             |                |       |                     ^          |
             |   activate     |       |                     | resize   |
             +--------------->|       |                     |          |
                              |       |    metrics    +------------+   |
                              |       +-------------->| Autoscaler |   |
                              |                       +------------+   |
                              | REVISION                               |
                              +----------------------------------------+
                              

Design Goals

  1. Make it fast. Revisions should be able to scale from 0 to 1000 concurrent requests in 30 seconds or less.
  2. Make it light. Wherever possible the system should be able to figure out the right thing to do without the user's intervention or configuration.
  3. Make everything better. Creating custom components is a short-term strategy to get something working now. The long-term strategy is to make the underlying components better so that custom code can be replaced with configuration. E.g. Autoscaler should be replaced with the K8s Horizontal Pod Autoscaler and Custom Metrics.

Slow Brain / Fast Brain

The Knative Serving Autoscaler is split into two parts:

  1. Fast Brain that maintains the desired level of concurrent requests per Pod (satisfying Design Goal #1), and the
  2. Slow Brain that comes up with the desired level based on CPU, memory and latency statistics (satisfying Design Goal #2).

Fast Brain Implementation

This is subject to change as the Knative Serving implementation changes.

Code
Autoscaler

There is a proxy in the Knative Serving Pods (queue-proxy) which is responsible for enforcing request queue parameters (single- or multi-threaded) and for reporting concurrent-client metrics to the Autoscaler. If we can get rid of this and just use Envoy, that would be great (see Design Goal #3). The Knative Serving controller injects the identity of the Revision into the queue proxy's environment variables. When the queue proxy wakes up, it finds the Autoscaler for the Revision and establishes a websocket connection. Every 1 second, the queue proxy pushes a gob-serialized struct with the observed number of concurrent requests at that moment.
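
As a concrete illustration, a minimal reporting loop might look like the sketch below. It assumes the gorilla/websocket client; the autoscaler address and the currentConcurrency/requestsSinceLastStat helpers are hypothetical stand-ins for what the real proxy derives from its injected environment variables and its request-tracking middleware.

package main

import (
	"bytes"
	"encoding/gob"
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// Stat mirrors the struct documented in this package.
type Stat struct {
	Time                      *time.Time
	PodName                   string
	AverageConcurrentRequests float64
	RequestCount              int32
}

// Hypothetical stand-ins for the proxy's request-tracking middleware.
func currentConcurrency() float64  { return 0 }
func requestsSinceLastStat() int32 { return 0 }

func main() {
	// The address would come from environment variables injected by the controller.
	conn, _, err := websocket.DefaultDialer.Dial("ws://autoscaler-myrevision:8080", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Push one gob-serialized Stat per second.
	for range time.Tick(time.Second) {
		now := time.Now()
		stat := Stat{
			Time:                      &now,
			PodName:                   "mypod-0",
			AverageConcurrentRequests: currentConcurrency(),
			RequestCount:              requestsSinceLastStat(),
		}
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(stat); err != nil {
			log.Println(err)
			continue
		}
		if err := conn.WriteMessage(websocket.BinaryMessage, buf.Bytes()); err != nil {
			log.Println(err)
		}
	}
}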

The Autoscaler is also given the identity of the Revision through environment variables. When it wakes up, it starts a websocket-enabled http server. Queue proxies start sending their metrics to the Autoscaler and it maintains a 60-second sliding window of data points. The Autoscaler has two modes of operation, Panic Mode and Stable Mode.

Stable Mode

In Stable Mode the Autoscaler adjusts the size of the Deployment to achieve the desired average concurrency per Pod (currently hardcoded, later provided by the Slow Brain). It calculates the observed concurrency per pod by averaging all data points over the 60-second window. When it adjusts the size of the Deployment, it bases the desired Pod count on the number of Pods observed in the metrics stream, not the number of Pods in the Deployment spec. This is important to keep the Autoscaler from running away: there is a delay between when the Pod count is increased and when new Pods come online to serve requests and provide a metrics stream.
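
Under those assumptions, the calculation could be sketched as follows (names are illustrative, not the actual implementation; this reuses the Stat type from the earlier sketch and the standard math package):

// desiredStablePods averages all data points in the window and scales
// the observed pod count (unique pod names seen in the metrics stream,
// not the Deployment spec) toward the target concurrency.
func desiredStablePods(window []Stat, target float64) int32 {
	pods := map[string]bool{}
	total := 0.0
	for _, s := range window {
		pods[s.PodName] = true
		total += s.AverageConcurrentRequests
	}
	if len(window) == 0 || len(pods) == 0 {
		return 0
	}
	avgPerPod := total / float64(len(window))
	observedPods := float64(len(pods))
	return int32(math.Ceil(observedPods * avgPerPod / target))
}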

Panic Mode

The Autoscaler evaluates its metrics every 2 seconds. In addition to the 60-second window, it also keeps a 6-second window (the panic window). If the 6-second average concurrency reaches 2 times the desired average, the Autoscaler transitions into Panic Mode. In Panic Mode the Autoscaler bases all its decisions on the 6-second window, which makes it much more responsive to sudden increases in traffic. Every 2 seconds it adjusts the size of the Deployment to achieve the stable, desired average, capped at 10 times the currently observed Pod count. To prevent rapid fluctuations in the Pod count, the Autoscaler only increases Deployment size during Panic Mode, never decreases it. 60 seconds after the last Panic Mode increase to the Deployment size, the Autoscaler transitions back to Stable Mode and begins evaluating the 60-second window again.
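
The mode selection and the scale-up cap could be sketched like this (again illustrative, with the 60-second panic-exit timer elided):

// scale applies the rules above: enter panic when the 6-second average
// reaches 2x the target, cap growth at 10x the observed pod count, and
// never scale down while panicking.
func scale(stableAvg, panicAvg, target float64, observedPods, current int32, panicking bool) (int32, bool) {
	if panicAvg >= 2*target {
		panicking = true // exit back to stable mode happens 60s after the last panic increase
	}
	if !panicking {
		return int32(math.Ceil(float64(observedPods) * stableAvg / target)), false
	}
	desired := int32(math.Ceil(float64(observedPods) * panicAvg / target))
	if limit := observedPods * 10; desired > limit {
		desired = limit
	}
	if desired < current {
		desired = current // only increase during panic mode
	}
	return desired, true
}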

Deactivation

When the Autoscaler has observed an average concurrency per pod of 0.0 for some time (#305), it will transition the Revision into the Reserve state. This causes the Deployment and the Autoscaler to be turned down (or scaled to 0) and routes all traffic for the Revision to the Activator.
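
A sketch of that check (the exact idle threshold is still being discussed in #305):

// shouldDeactivate reports whether every data point in the
// scale-to-zero window shows zero concurrency.
func shouldDeactivate(window []Stat) bool {
	if len(window) == 0 {
		return false
	}
	for _, s := range window {
		if s.AverageConcurrentRequests > 0 {
			return false
		}
	}
	return true
}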

Activator

The Activator is a single multi-tenant component that catches traffic for all Reserve Revisions. It is responsible for activating those Revisions and then proxying the caught requests to the appropriate Pods. It would be preferable to have a hook in Istio to do this so we can get rid of the Activator (see Design Goal #3). When the Activator gets a request for a Reserve Revision, it calls the Knative Serving control plane to transition the Revision to the Active state. It takes a few seconds for all the resources to be provisioned, so more requests might arrive at the Activator in the meantime. The Activator establishes a watch for Pods belonging to the target Revision. Once the first Pod comes up, all enqueued requests are proxied to that Pod. Concurrently, the Knative Serving control plane updates the Istio route rules to take the Activator back out of the serving path.
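
A reduced sketch of that catch-and-forward behavior, using net/http/httputil; activating the Revision and watching for its first Pod are collapsed into a hypothetical waitForFirstPod helper:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// waitForFirstPod is a hypothetical stand-in: it would ask the control
// plane to activate the Revision, then block on a Pod watch until the
// first Pod is ready, returning its address.
func waitForFirstPod(revision string) (*url.URL, error) {
	return url.Parse("http://10.1.2.3:8080") // placeholder Pod address
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Requests effectively queue here until the first Pod is up.
		target, err := waitForFirstPod(r.Host)
		if err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}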

Slow Brain Implementation

Currently the Slow Brain is not implemented and the desired concurrency level is hardcoded at 1.0 (code).

Documentation

Overview

Package autoscaler calculates the number of pods necessary for the desired level of concurrency per pod (stableConcurrencyPerPod). It operates in two modes, stable mode and panic mode.

Stable mode calculates the average concurrency observed over the last 60 seconds and adjusts the observed pod count to achieve the target value. Current observed pod count is the number of unique pod names which show up in the last 60 seconds.

Panic mode calculates the average concurrency observed over the last 6 seconds and adjusts the observed pod count to achieve the stable target value. Panic mode is engaged when the observed 6 second average concurrency reaches 2x the target stable concurrency. Panic mode will last at least 60 seconds--longer if the 2x threshold is repeatedly breached. During panic mode the number of pods is never decreased in order to prevent flapping.
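
For example, with a target stable concurrency of 1.0: if 10 unique pods report an average of 2.0 concurrent requests over the 60-second window, the desired count is 10 × 2.0 / 1.0 = 20 pods. A 6-second average of 2.0 per pod would likewise engage panic mode, since it is 2x the 1.0 target.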

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Autoscaler

type Autoscaler struct {
	Config
	// contains filtered or unexported fields
}

Autoscaler stores current state of an instance of an autoscaler

func NewAutoscaler

func NewAutoscaler(config Config, reporter StatsReporter) *Autoscaler

NewAutoscaler creates a new instance of autoscaler

func (*Autoscaler) Record

func (a *Autoscaler) Record(ctx context.Context, stat Stat)

Record a data point. Not safe for concurrent access or concurrent access with Scale.

func (*Autoscaler) Scale

func (a *Autoscaler) Scale(ctx context.Context, now time.Time) (int32, bool)

Scale calculates the desired scale based on current statistics given the current time. Not safe for concurrent access or concurrent access with Record.
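
A hedged usage sketch (not from the package itself; cfg and reporter are assumed to be constructed elsewhere, since Config fields are k8sflag pointers normally fed from a ConfigMap):

a := NewAutoscaler(cfg, reporter)

// Record and Scale are not safe for concurrent use, so a single
// goroutine should drive both.
now := time.Now()
a.Record(ctx, Stat{
	Time:                      &now,
	PodName:                   "mypod-0",
	AverageConcurrentRequests: 3.0,
	RequestCount:              5,
})
if desired, ok := a.Scale(ctx, time.Now()); ok {
	// Resize the Revision's Deployment to `desired` replicas.
	_ = desired
}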

type Config

type Config struct {
	TargetConcurrency    *k8sflag.Float64Flag
	MaxScaleUpRate       *k8sflag.Float64Flag
	StableWindow         *k8sflag.DurationFlag
	PanicWindow          *k8sflag.DurationFlag
	ScaleToZeroThreshold *k8sflag.DurationFlag
}

Config defines the tunable autoscaler parameters

type Measurement

type Measurement int

Measurement represents the type of the autoscaler metric to be reported

const (
	// DesiredPodCountM is used for the pod count that autoscaler wants
	DesiredPodCountM Measurement = 0
	// RequestedPodCountM is used for the requested pod count from kubernetes
	RequestedPodCountM Measurement = 1
	// ActualPodCountM is used for the actual number of pods we have
	ActualPodCountM Measurement = 2
	// PanicM is used as a flag to indicate if autoscaler is in panic mode or not
	PanicM Measurement = 3
)

type Reporter

type Reporter struct {
	// contains filtered or unexported fields
}

Reporter holds cached metric objects to report autoscaler metrics

func NewStatsReporter

func NewStatsReporter(podNamespace string, config string, revision string) (*Reporter, error)

NewStatsReporter creates a reporter that collects and reports autoscaler metrics

func (*Reporter) Report

func (r *Reporter) Report(m Measurement, v int64) error

Report captures value v for measurement m
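
A brief usage sketch, using only the signatures documented here:

reporter, err := NewStatsReporter("default", "my-config", "my-revision")
if err != nil {
	log.Fatal(err)
}
// Report the pod count the autoscaler wants and the panic flag.
_ = reporter.Report(DesiredPodCountM, 20)
_ = reporter.Report(PanicM, 1) // 1 = in panic mode, 0 = stable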

type Stat

type Stat struct {
	// The time the data point was collected on the pod.
	Time *time.Time

	// The unique identity of this pod.  Used to count how many pods
	// are contributing to the metrics.
	PodName string

	// Average number of requests currently being handled by this pod.
	AverageConcurrentRequests float64

	// Number of requests received since last Stat (approximately QPS).
	RequestCount int32
}

Stat defines a single measurement at a point in time

type StatsReporter

type StatsReporter interface {
	Report(m Measurement, v int64) error
}

StatsReporter defines the interface for sending autoscaler metrics
