Test Cluster Operator for GKE

This operator provides API-driven cluster provisioning for integration and performance testing of software that integrates very deeply with Kubernetes APIs, cloud provider APIs and other elements of Kubernetes cluster infrastructure.

This project was developed at Isovalent for very specific needs and later open-sourced as-is. At present, a few code changes will be needed to remove assumptions about how it's deployed, and deployment docs are also needed. If you wish to use this operator and contribute to the project, please join the #testing channel on Cilium & eBPF Slack.

Motivation

NB: The current implementation focuses on GKE (and GCP); most of the ideas described below would apply to any managed Kubernetes provider, and some are even more general.

It is relatively easy to test a workload on Kubernetes, whether it's just an application comprised of multiple components, or a basic operator that glues together a handful of Kubernetes APIs.

Cluster-sharing is one option: the application under test is deployed into one or more namespaces in a large shared cluster. Another option is to set up small test clusters using something like kind or k3s/k3d within a CI environment.

If the application under test depends on non-namespaced resources, cluster-sharing is still possible with VirtualCluster. That way, instances of the application under test can be isolated from one another, but only if Kubernetes API boundaries are fully respected. A large underlying cluster is still used, but virtually divided into small "pretend-clusters". However, this only works if the application doesn't make assumptions about cloud provider APIs and doesn't attempt non-trivial modes of access to the underlying host OS or the network infrastructure.

When the application under test interacts with multiple Kubernetes APIs, presumes cluster-wide access, or even attempts to interact with the underlying host OS or the network infrastructure, any kind of cluster-sharing setup may prove challenging to use. It may also be deemed unrepresentative of the clusters that end-users run. Additionally, testing integrations with cloud provider APIs may have other implications. Applications that enable core functionality of a Kubernetes cluster often fall into this category, e.g. CNI implementations, storage controllers, service meshes, ingress controllers, etc. For some of these use-cases cluster-sharing is simply not viable, and something like kind or k3d is of very limited use.

All of the above applies to testing of applications by developers who are directly responsible for the application itself. End-users may also need to test the off-the-shelf applications they deploy. Quite commonly in a large organisation, an operations team will assemble a bundle of Kubernetes addons that defines a platform their organisation relies on. The operations team may not be able to make direct changes to the source code of some of these components to improve testability for cluster-sharing, or they simply won't have the confidence to test those components in a shared cluster. Even if one version is easily testable in a shared cluster, that may change in the future. While testing on kind or k3s remains an option, it may be undesirable because cloud provider integration needs to be tested as well, and it may simply be unrepresentative of the deployment target. Therefore, the operations team may have a strong preference for testing in a cluster that is provisioned in exactly the same way as the deployment target and has mostly identical or comparable properties.

These are just some of the use-cases that illustrate the need for a dedicated cluster for running integration or performance tests, one that matches the deployment target as closely as possible.

What does it take to obtain a cluster in GKE? Technically, it's possible to simply write a script that calls gcloud commands, relies on something like Terraform, or uses an API client to provision a cluster. This approach inevitably adds a lot of complexity to the CI job: it inherits all the failure modes of the provisioning and destruction processes, it needs to carry any additional infrastructure configuration (e.g. metric & log gathering), it widens access scopes, etc. Aside from the many steps that take time and are hard to optimise, a pool of pre-built clusters is possible, but only at the cost of making the script even more complex. Complex scripts of this kind are hard to maintain long-term, as scripts by nature don't offer a clear contract (especially shell scripts). The lack of a contract makes it too easy for anyone to tweak a shell script for an ad-hoc use-case without adding any tests. Over time, script evolution is hard to unwind, especially in a context where many developers contribute to the project. In contrast, an API offers many advantages: it's a contract, and the implementation can be optimised more easily.

Architectural goals of this project
  • Test Cluster API
    • enables developer and CI jobs to request clusters for running tests in a consistent and well-defined manner
    • provider abstraction that will enable future optimisations, e.g. pooling of pre-built clusters
  • Asynchronous workflow
    • avoid heavy-lifting logic in CI jobs that doesn't directly relate to building binaries or executing tests
    • avoid polling for status
      • once a cluster is ready, launch a test runner job inside the management cluster and report the results back to GitHub
  • Support multiple test cluster templates
    • do not assume there is only one type of test cluster configuration that's going to be used for all purposes
    • allow for pooling pre-built clusters based on commonly used templates
  • Include a common set of components in each test cluster
    • Prometheus
    • Log exporter for CI
You may ask...

How is this different from something like Cluster API?

The Test Cluster API aims to be much more high-level and shouldn't need to expose as many parameters as Cluster API does; in fact, it could be implemented on top of Cluster API. The initial implementation targets GKE and relies on Config Connector, which is similar to Cluster API in spirit.

What about other providers?

This is something the authors of this project are planning to explore, although it may not be done as part of the same project to begin with. One of the ideas is to create a generic provider based on either Terraform or Cluster API, possibly both.

How it works

There is a management cluster that runs on GKE; it has Config Connector, Cert Manager and Contour installed, along with the GKE Test Cluster operator ("the operator" from here onwards).

The user creates a CR similar to this:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: testme-1
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      image: cilium/cilium-test:8cfdbfe
      command:
      - /usr/local/bin/test-gke.sh
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  location: europe-west2-b
  region: europe-west2

The operator renders various objects for Config Connector and other APIs as defined in the basic template, substituting the given parameters (e.g. machineType, nodes, etc.); it then creates all of these objects and monitors the cluster until it's ready.

Once the test cluster is ready, the operator deploys the job using the given image and command, and ensures the job is authenticated to run against the test cluster. The job runs inside the management cluster. The test cluster is deleted upon job completion.
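
For illustration, here is a minimal sketch of what such a runner Job could look like for the example CR above. This is not the operator's literal output; the object name and service account are hypothetical:

apiVersion: batch/v1
kind: Job
metadata:
  name: testme-1-test-runner    # hypothetical generated name
  namespace: test-clusters
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      # hypothetical service account bound to credentials for the test cluster
      serviceAccountName: testme-1-admin
      containers:
      - name: test-runner
        image: cilium/cilium-test:8cfdbfe
        command:
        - /usr/local/bin/test-gke.sh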

The template is defined using CUE and can define any Kubernetes objects, such as Config Connector objects that define additional GCP resources, or other objects in the management cluster to support test execution. That being said, the implementation currently expects to find exactly one ContainerCluster as part of the template and is not fully generalised.
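
For the example above, the rendered output would include a Config Connector ContainerCluster and ContainerNodePool along these lines (a simplified sketch under the assumption of Config Connector's container.cnrm.cloud.google.com/v1beta1 API, not the basic template's literal output; the generated name is hypothetical and most fields are omitted):

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: testme-1-abc12    # hypothetical generated cluster name
  namespace: test-clusters
spec:
  location: europe-west2-b
---
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: testme-1-abc12
  namespace: test-clusters
spec:
  location: europe-west2-b
  clusterRef:
    name: testme-1-abc12
  nodeCount: 2
  nodeConfig:
    machineType: n1-standard-4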

As part of test cluster provisioning, Prometheus is deployed in the test cluster and metrics are federated to the Prometheus server in the management cluster, so all metrics from all test runs can be accessed centrally. In the future, other components can be added as needed.
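
The federation itself follows the standard Prometheus pattern: the management cluster's Prometheus scrapes the /federate endpoint of each test cluster's Prometheus. A generic sketch of such a scrape configuration (this is not the operator's actual config, and the target address is hypothetical):

scrape_configs:
- job_name: federate-test-clusters
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{job!=""}'
  static_configs:
  - targets:
    # hypothetical address of a test cluster's Prometheus
    - prom.test-clusters.example:9090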

Example 2

Here is what a TestClusterGKE object may look like with additional fields and status.

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: test-c6v87
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/run_in_test_cluster.sh
      - --prom-name=prom
      - --prom-ns=prom
      - --duration=30m
      configMap: test-c6v87-user
      image: cilium/hubble-perf-test:8cfdbfe
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2

status:
  clusterName: test-c6v87-fn86p
  conditions:
  - lastTransitionTime: "2020-11-17T09:29:33Z"
    message: All 2 dependencies are ready
    reason: AllDependenciesReady
    status: "True"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:22Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:33Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready

Using Test Cluster Requester

There is a simple Go program that serves as a client to the GKE Test Cluster Operator.

It can be used by CI jobs as well as developers.

Developer Usage

To run this program outside CI, you must ensure that Google Cloud SDK application default credentials are set up correctly; to do so, run:

gcloud auth application-default login

Run:

go run ./requester --namespace=test-clusters-dev --description="<your name and purpose of this cluster>"

CI Usage

This program supports the traditional GOOGLE_APPLICATION_CREDENTIALS environment variable; for convenience, it also supports GCP_SERVICE_ACCOUNT_KEY, which is expected to contain a base64-encoded JSON service account key (i.e. there is no need to write the data to a file).

For GitHub Actions, it's recommended to use the official image:

      - name: Request GKE test cluster
        uses: docker://quay.io/isovalent/gke-test-cluster-requester:ad06d7c2151d012901fc2ddc92406044f2ffba2d
        env:
          GCP_SERVICE_ACCOUNT_KEY: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: --namespace=... --image=...

Directories

Path          Synopsis
api
  cnrm        Package cnrm contains API Schema definitions for the CNRM API
  v1alpha1    Package v1alpha1 contains API Schema definitions for the clusters v1alpha1 API group (+kubebuilder:object:generate=true +groupName=clusters.ci.cilium.io)
  v1alpha2    Package v1alpha2 contains API Schema definitions for the clusters v1alpha2 API group (+kubebuilder:object:generate=true +groupName=clusters.ci.cilium.io)
config
logview       (module)
pkg
  client      based on https://gist.github.com/ahmetb/548059cdbf12fb571e4e2f1e29c48997
  github      based on https://gist.github.com/ahmetb/548059cdbf12fb571e4e2f1e29c48997
promview      (module)
