Nodes? Where we're going, we don't need.... nodes.
(what if kube was just the API that could talk to anything?)
prototype and call to ideate at https://t.co/XCv3iCd0rI
— Clayton Coleman (@smarterclayton) May 5, 2021
kcp is exploring re-use of the Kubernetes API at a higher level to orchestrate many different workloads and services across the hybrid cloud. It is presently a modified Kubernetes API server running standalone with an embedded etcd and a subset of types (including CustomResourceDefinitions). It doesn't know about Nodes, Pods, or Deployments by default.
You can communicate with kcp using kubectl, and likely much of the rest of the Kubernetes tooling ecosystem, as it is effectively just a Kubernetes API server.
An important construct in kcp is currently called a “logical cluster” (LC): a segment within the kcp etcd storage where CRDs can be registered independently of other LCs. Application developers or service providers could use these LCs to define CRDs that represent their workloads, then implement controllers that reconcile against their types in kcp, distributing those workloads across actual Kubernetes clusters and reconciling status back. This process should be transparent to the user most of the time, much as you typically don't need to think about which node your pods land on in Kubernetes today. Standard Kubernetes RBAC then becomes the mechanism for controlling access to those workloads/services, and for integrating with other workloads/services.
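As a concrete illustration, a service provider might register a CRD like the one below in their own logical cluster. This is an ordinary CustomResourceDefinition manifest; the group, kind, and kubectl context name here are hypothetical examples of mine, not anything kcp itself defines:

```yaml
# Hypothetical CRD registered in one logical cluster. Other logical
# clusters served by the same kcp instance would not see this type.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.dev
spec:
  group: example.dev
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
```

Assuming a kubeconfig context pointing at that logical cluster (the context name is made up), this could be applied with plain `kubectl --context my-logical-cluster apply -f widget-crd.yaml`, and controllers could then watch `Widget` resources there just as they would in a regular cluster.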
Why I Care
I’ve led the OpenShift Hive engineering team since its inception three years ago. Hive is a CRD-based Kubernetes operator for provisioning OpenShift clusters at scale: you define your desired OpenShift clusters, and our controllers make that happen. It’s also a backing component of OpenShift Dedicated and Red Hat OpenShift Service on AWS. This obviously requires far more scale and resiliency than one instance of an operator in one Kubernetes cluster can provide, so we run a number of Hive shards across various failure domains.
My team has often felt some discomfort with parts of what we ended up with, and we have been discussing with Clayton and others in the OpenShift organization how to improve the pain points. The crux of those concerns is that CRDs reside in the etcd of the same Kubernetes cluster where your operator is running, which causes problems in terms of both scale and continuity.
When your data resides in the Kubernetes etcd server and you start exploring the limits you can scale to, the harsh reality is that when you find them, your Kube cluster is coming down. This is the same cluster responsible for keeping your application up and running, and the same one you need for debugging problems. Recovery at that point becomes non-trivial; you’re often looking at scaling up memory, disk IOPS, or CPU for that cluster just to get things back online.
We invested quite heavily in finding those limits for Hive, and they actually surfaced a lot earlier than I expected. Some view this as a good thing, since you don’t want massive failure domains, but it can be difficult as an engineer to get your head around working on something that falls over at O(100) or O(1000) objects. You’re also at the mercy of other operators in that Kubernetes cluster: you may be able to identify your limits on a particular set of hardware, but you don’t necessarily know what else is running in that cluster that might collectively push you over the limit.
When your data resides in a Kubernetes cluster’s etcd, restoring from a disaster, or even moving data around to rebalance, becomes a real challenge. Naturally those clusters are HA and quite resilient in themselves, but if you do happen to lose one, restoring your data in another cluster is quite difficult. You cannot simply take an etcd snapshot and restore it to another cluster, as it contains substantial amounts of information about the cluster itself: its nodes, certificates, etc. This means you need to use some existing backup solution or roll your own, and you need to be exceedingly careful about catching all the related resources, and about how and what you store in your status subresources (as that is typically lost on a restore).
Part of this exposes a problem with the source of truth. In our case that line is somewhat fuzzy: the API service layer above all our shards is in some ways the source of truth, but there are important bits of data that live only in our shard etcds and their backups, and that service layer is not presently able to automatically move that data around as needed. This is an area where our architecture needs to improve.
Overall my team enjoys the type of applications we build. Declarative APIs and controllers are a pattern we enjoy and have confidence in, and working closely with Kubernetes internals has been a great experience. The question is how we can move past the limitations of having this type of application and its data lumped into a Kubernetes cluster’s etcd and tightly coupled to that cluster.
While all of this is very new to us and kcp is only a prototype, I want to understand how we could:
- leverage kcp as a source of truth
- leverage logical clusters to possibly model our failure domains
- write or reuse controllers to schedule the clusters we provision and manage onto actual Hive shards
- react to shard cluster failure and reschedule automatically
- react to shard cluster load and rebalance automatically
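To make the last three bullets concrete, here is a minimal sketch of the kind of shard-scheduling logic such a controller might implement: place each provisioned cluster on the healthy shard with the most free capacity, and evacuate shards that fail. This is my own illustration, not Hive or kcp code; the `Shard` type and function names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """A hypothetical Hive shard: a Kubernetes cluster hosting provisioned clusters."""
    name: str
    capacity: int
    healthy: bool = True
    clusters: set = field(default_factory=set)


def schedule(cluster: str, shards: list[Shard]) -> Shard:
    """Place a cluster on the healthy shard with the most free capacity."""
    candidates = [s for s in shards if s.healthy and len(s.clusters) < s.capacity]
    if not candidates:
        raise RuntimeError("no healthy shard with free capacity")
    target = max(candidates, key=lambda s: s.capacity - len(s.clusters))
    target.clusters.add(cluster)
    return target


def reschedule_failed(shards: list[Shard]) -> None:
    """Evacuate unhealthy shards by rescheduling their clusters elsewhere."""
    for shard in shards:
        if shard.healthy:
            continue
        for cluster in list(shard.clusters):
            shard.clusters.discard(cluster)
            schedule(cluster, shards)
```

In a real controller the same decision logic would run inside a reconcile loop watching cluster objects in kcp, with the shard assignment written back to the object's spec or status rather than held in memory.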
I’m hoping this will be the first in a series of posts exploring kcp. Up next will be some notes and thoughts on getting kcp up and running and doing something interesting.