Metal3

By Eduardo Minguez

Originally posted at https://www.underkube.com/posts/2019-06-25-metal3/

In this blog post, I’m going to give a high-level overview, in my own words, of what Metal3 is, the motivation behind it and some concepts related to a ‘baremetal operator’.

Let’s have some definitions!

Custom Resource Definition

The k8s API provides out-of-the-box objects such as pods, services, etc. There are a few methods of extending the k8s API (such as API aggregation), but since a few releases back, the k8s API can be easily extended with custom resource definitions (CRDs). Basically, this means you can create virtually any type of object definition in k8s (well, only users with cluster-admin capabilities can) with a yaml such as:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: crontabs.stable.example.com
spec:
  # group name to use for REST API: /apis/<group>/<version>
  group: stable.example.com
  # list of versions supported by this CustomResourceDefinition
  versions:
    - name: v1
      # Each version can be enabled/disabled by Served flag.
      served: true
      # One and only one version must be marked as the storage version.
      storage: true
  # either Namespaced or Cluster
  scope: Namespaced
  names:
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: crontabs
    # singular name to be used as an alias on the CLI and for display
    singular: crontab
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: CronTab
    # shortNames allow shorter string to match your resource on the CLI
    shortNames:
      - ct
  preserveUnknownFields: false
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          properties:
            cronSpec:
              type: string
            image:
              type: string
            replicas:
              type: integer

And after kubectl apply -f you can kubectl get crontabs.
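Just for illustration, a CronTab object matching that definition could look like this (the values are made up):

apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
  name: my-crontab
spec:
  cronSpec: "* * * * */5"
  image: my-cron-image
  replicas: 1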

There is a ton of information about CRDs out there, such as the official k8s documentation.

The CRD by itself is not useful per se, as nobody will take care of it (that’s why I said definition). It requires a controller to watch for those new objects and react to the different events affecting them.

Controller

A controller is basically a loop that watches the current status of an object and, if it differs from the desired status, fixes it (reconciliation). This is why k8s is ‘declarative’: you specify the object’s desired status instead of ‘how to do it’ (imperative).
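For example (not Metal3 specific, just a minimal sketch), in a Deployment you declare the desired number of replicas and the controllers keep reconciling the cluster towards that number, recreating pods if they disappear:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  # Desired state: 3 replicas; the controller drives the actual state towards it
  replicas: 3
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginx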

Again, there is tons of documentation (and examples) around the controller pattern, which is basically at the root of k8s, so I’ll let your google-fu take care of it :)

Operator

An Operator (in k8s slang) is an application running in your k8s cluster that deploys, manages and maintains (so, operates) a k8s application.

This k8s application (the one that the operator manages), can be as simple as a ‘hello world’ application containerized and deployed in your k8s cluster or it can be a much more complex thing, such as a database cluster.

The ‘operator’ is like a containerized ‘expert sysadmin’ that takes care of your application.

Bear in mind that the ‘expert’ tag (meaning the automation behind the operator) depends on the operator implementation… so there can be basic operators that only deploy your application or complex operators that handle day 2 operations such as upgrades, failovers, backup/restore, etc.

See the CoreOS operator definition for more information.

Cloud Controller Manager

k8s code is smart enough to be able to leverage the underlying infrastructure where the cluster is running, such as being able to create ‘LoadBalancer’ services, understand the cluster topology based on the cloud provider AZs where the nodes are running (for scheduling reasons), etc.

This task of ‘talking to the cloud provider’ is performed by the Cloud Controller Manager (CCM). For more information, you can take a look at the official k8s documentation regarding its architecture and administration (also, if you are brave enough, you can create your own cloud controller manager).
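For instance, when you create a Service of type LoadBalancer, it is the cloud provider integration that provisions the actual load balancer in the underlying cloud (a minimal sketch):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080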

Cluster API

The Cluster API implementation is a WIP ‘framework’ that allows a k8s cluster to manage itself, including the ability to create new clusters, add more nodes, etc. in a ‘k8s way’ (declarative, controllers, CRDs, etc.), so there are objects such as Cluster that can be expressed as k8s objects:

apiVersion: "cluster.k8s.io/v1alpha1"
kind: Cluster
metadata:
  name: mycluster
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.96.0.0/12"]
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    serviceDomain: "cluster.local"
  providerSpec: ...

but also Machine objects representing the individual hosts that form the cluster.
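A minimal Machine manifest for the same API version could look roughly like this (the providerSpec contents are provider-specific and only sketched here):

apiVersion: "cluster.k8s.io/v1alpha1"
kind: Machine
metadata:
  name: mycluster-worker-0
  labels:
    cluster.k8s.io/cluster-name: mycluster
spec:
  providerSpec: ...
  versions:
    kubelet: "1.14.0"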

There are some provider implementations in the wild, such as the AWS, Azure, GCP, OpenStack and vSphere ones, and the Cluster API project is driven by the SIG Cluster Lifecycle.

Please review the official Cluster API repository for more information.

Actuator

The actuator is a Cluster API interface that reacts to changes in Machine objects, reconciling the Machine status.

The actuator code is tightly coupled with the provider (that’s why it is an interface) such as the AWS one.

MachineSet vs Machine

To simplify, let’s say that MachineSets are to Machines what ReplicaSets are to Pods. So you can scale the Machines in your cluster just by changing the number of replicas of a MachineSet.
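As a sketch (field names follow the v1alpha1 API and the values are made up), scaling the workers of a cluster is just a matter of changing the replicas field of the corresponding MachineSet:

apiVersion: "cluster.k8s.io/v1alpha1"
kind: MachineSet
metadata:
  name: mycluster-workers
spec:
  replicas: 3
  selector:
    matchLabels:
      machineset-name: mycluster-workers
  template:
    metadata:
      labels:
        machineset-name: mycluster-workers
    spec:
      providerSpec: ...
      versions:
        kubelet: "1.14.0"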

Cluster API vs Cloud Providers

As we have seen, the Cluster API leverages the provider to manage the k8s infrastructure itself (clusters and nodes), while the CCM and the cloud provider integration leverage the cloud provider to provide supporting infrastructure.

Let’s say Cluster API is for the k8s administrators and the CCM is for the k8s users :)

Machine API

The OpenShift 4 Machine API is a combination of some of the upstream Cluster API with custom OpenShift resources and it is designed to work in conjunction with the Cluster Version Operator.

OpenShift’s Machine API Operator

The machine-api-operator is an operator that manages the Machine API objects in an OpenShift 4 cluster.

The operator is capable of creating machines in AWS and libvirt (more providers coming soon) via the Machine Controller, and it is included out of the box in OCP 4 (it can be deployed on vanilla k8s as well).
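On an OCP 4 cluster you can poke at those objects directly; for instance (assuming the machine.openshift.io API group, which is where those resources currently live):

$ kubectl api-resources --api-group=machine.openshift.io
$ kubectl get machinesets -n openshift-machine-api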

Baremetal

A baremetal server (or bare-metal) is just a computer server.

In the last few years, terms such as virtualization, containers, serverless, etc. have been popular, but at the end of the day, all the code running on top of a SaaS, PaaS or IaaS is actually running on a real physical server sitting in a datacenter, wired to routers, switches and power. That server is a ‘baremetal’ server.

If you are used to cloud providers and instances, you probably don’t know the pains of baremetal management… including things such as connecting to the virtual console (usually it requires an old Java version) to debug issues, configuring PXE for provisioning baremetal hosts (or attaching ISOs via the virtual console… or physically inserting a CD/DVD into the tray if you are ‘lucky’ enough…), configuring VLANs for traffic isolation, etc.

That kind of operation is not ‘cloud ready’, and there are tools that provide baremetal management, such as MAAS or Ironic.

Ironic

OpenStack bare metal provisioning (or Ironic) is an open source project (or even better, a number of open source projects) to manage baremetal hosts. Ironic saves the administrator from dealing with PXE configuration, manual deployments, etc. and provides a well-defined API and a series of plugins to interact with different baremetal models and vendors.

Ironic is used in OpenStack to provide baremetal hosts, but there are some projects (such as bifrost) that allow using Ironic ‘standalone’ (so, no OpenStack required).

Metal3

Metal3 is a project aimed at providing a baremetal operator that implements the Cluster API framework required to be able to manage baremetal in a k8s way (easy peasy!). It uses Ironic under the hood to avoid reinventing the wheel, but consider that an implementation detail that may change.

The Metal3 baremetal operator watches for BareMetalHost (CRD) objects defined as:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: my-worker-0
spec:
  online: true
  bootMACAddress: 00:11:22:33:44:55
  bmc:
    address: ipmi://my-worker-0.ipmi.example.com
    credentialsName: my-worker-0-bmc-secret

There are a few more fields in the BareMetalHost object such as the image, hardware profile, etc.
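The bmc.credentialsName above points to a regular k8s Secret in the same namespace containing the BMC username and password, something along these lines (values made up):

apiVersion: v1
kind: Secret
metadata:
  name: my-worker-0-bmc-secret
type: Opaque
data:
  username: YWRtaW4=        # base64 of 'admin'
  password: cGFzc3dvcmQ=    # base64 of 'password'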

The Metal3 project is actually divided into two different components:

baremetal-operator

The Metal3 baremetal-operator is the component that manages baremetal hosts. It exposes a new BareMetalHost custom resource in the k8s API that lets you manage hosts in a declarative way.

cluster-api-provider-baremetal

The Metal3 cluster-api-provider-baremetal includes the integration with the Cluster API project. This provider currently includes a Machine actuator that acts as a client of the BareMetalHost custom resources.

BareMetalHost vs Machine vs Node

  • BareMetalHost is a Metal3 object
  • Machine is a Cluster API object
  • Node is where the pods run :)

Those three concepts are linked in a 1:1:1 relationship meaning:

A BareMetalHost created with Metal3 maps to a Machine object and, once the installation procedure finishes, a new Kubernetes node will be added to the cluster.

$ kubectl get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
my-node-0.example.com                        Ready    master   25h   v1.14.0


$ kubectl get machines --all-namespaces
NAMESPACE               NAME                  INSTANCE   STATE   TYPE   REGION   ZONE   AGE
openshift-machine-api   my-node-0                                                   25h


$ kubectl get baremetalhosts --all-namespaces
NAMESPACE               NAME        STATUS   PROVISIONING STATUS   MACHINE                 BMC              HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   my-node-0   OK       provisioned           my-node-0.example.com   ipmi://1.2.3.4   unknown            true

The 1:1 relationship between the BareMetalHost and the Machine is stored in the machineRef field of the BareMetalHost object:

$ kubectl get baremetalhost/my-node-0 -n openshift-machine-api -o jsonpath='{.spec.machineRef}'


map[name:my-node-0 namespace:openshift-machine-api]

And, on the Machine side, in an annotation:

$ kubectl get machines my-node-0 -n openshift-machine-api -o jsonpath='{.metadata.annotations}'
map[metal3.io/BareMetalHost:openshift-machine-api/my-node-0]

The node reference is stored in the .status.nodeRef.name field in the Machine object:

$ kubectl get machine my-node-0 -n openshift-machine-api -o jsonpath='{.status.nodeRef.name}'


my-node-0.example.com

Recap

Being able to ‘just scale a node’ in k8s involves a lot of underlying concepts and technologies behind the scenes :)