Concepts for single-cluster deployments

Understand single-cluster deployment with synchronous replication.

This topic describes a single-cluster setup and the behavior to expect. It outlines the requirements of the high availability architecture and describes the benefits and tradeoffs.

When to use this setup

Use this setup for Keycloak deployments that run in a single environment with transparent networking.

To provide a more concrete example, the following chapter assumes a deployment contained within a single Kubernetes cluster. The same concepts could be applied to a set of virtual or physical machines and a manual or scripted deployment.

Single or multiple availability-zones

The behavior and high-availability guarantees of the Keycloak deployment are ultimately determined by the configuration of the Kubernetes cluster. Typically, Kubernetes clusters are deployed in a single availability-zone; however, to increase fault tolerance, the cluster can be deployed across multiple availability-zones.

The Keycloak Operator defines the following topology spread constraints by default to prefer that Keycloak Pods are deployed on distinct nodes and in distinct availability-zones when possible:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              app: "keycloak"
              app.kubernetes.io/managed-by: "keycloak-operator"
              app.kubernetes.io/instance: "keycloak"
              app.kubernetes.io/component: "server"
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              app: "keycloak"
              app.kubernetes.io/managed-by: "keycloak-operator"
              app.kubernetes.io/instance: "keycloak"
              app.kubernetes.io/component: "server"

To ensure high availability across multiple availability-zones, it is crucial that the database can also withstand zone failures, as Keycloak depends on the underlying database remaining available.

Failures which this setup can survive

Deploying Keycloak in a single availability-zone or across multiple availability-zones changes the high-availability characteristics significantly, so these architectures are considered independently.

Single Zone

Failure: Keycloak Pod
Recovery: Multiple Keycloak Pods run in a cluster. If one instance fails, some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Kubernetes Node
Recovery: Multiple Keycloak Pods run in a cluster. If the host node dies, all Pods on that node fail, and some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Keycloak clustering connectivity
Recovery: If the connectivity between Kubernetes nodes is lost, data cannot be sent between Keycloak Pods hosted on those nodes. Incoming requests might receive an error message or be delayed for some seconds. Keycloak eventually removes the unreachable Pods from its local view and stops sending data to them.
RPO1: No data loss
RTO2: Seconds to minutes

Table footnotes:

1 Recovery point objective, assuming all parts of the setup were healthy at the time the failure occurred.
2 Recovery time objective.

Multiple Zones

Failure: Database node3
Recovery: If the writer instance fails, the database can promote a reader instance in the same or another zone to be the new writer.
RPO1: No data loss
RTO2: Seconds to minutes (depending on the database)

Failure: Keycloak Pod
Recovery: Multiple Keycloak Pods run in a cluster. If one instance fails, some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Kubernetes Node
Recovery: Multiple Keycloak Pods run in a cluster. If the host node dies, all Pods on that node fail, and some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Availability zone
Recovery: If an availability-zone fails, all Keycloak Pods hosted in that zone also fail. Deploying at least as many Keycloak replicas as there are availability-zones ensures that other Pods remain available to service requests, so no data is lost and downtime stays minimal (see the example after the table footnotes).
RPO1: No data loss
RTO2: Seconds

Failure: Database connectivity
Recovery: If the connectivity between availability-zones is lost, synchronous replication fails. Some requests might receive an error message or be delayed for a few seconds. Manual operations might be necessary depending on the database.
RPO1: No data loss3
RTO2: Seconds to minutes (depending on the database)

Failure: Keycloak clustering connectivity
Recovery: If the connectivity between Kubernetes nodes is lost, data cannot be sent between Keycloak Pods hosted on those nodes. Incoming requests might receive an error message or be delayed for some seconds. Keycloak eventually removes the unreachable Pods from its local view and stops sending data to them.
RPO1: No data loss
RTO2: Seconds to minutes

Table footnotes:

1 Recovery point objective, assuming all parts of the setup were healthy at the time the failure occurred.
2 Recovery time objective.
3 Assumes that the database is also replicated across multiple availability-zones.
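
For example, in a Kubernetes cluster that spans three availability-zones, the Keycloak custom resource can request at least three replicas so that each zone can host a Pod. The following is a minimal sketch; the resource name and the assumption of three zones are illustrative only:

      apiVersion: k8s.keycloak.org/v2alpha1
      kind: Keycloak
      metadata:
        name: keycloak
      spec:
        # At least one replica per availability-zone (three zones assumed here) so that
        # a single zone failure still leaves Pods available to serve requests.
        instances: 3
        # Database, hostname, and HTTP/TLS settings are omitted for brevity.

Together with the default topology spread constraints shown earlier, the scheduler then prefers to place one Pod in each zone.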

Known limitations

  1. Downtime during rollouts of Keycloak upgrades

    This can be overcome for patch releases by enabling the feature described in Checking if rolling updates are possible.

  2. Multiple node failures can result in a loss of entries from the authenticationSessions, loginFailures and actionTokens caches if the number of node failures is greater than or equal to the cache’s configured num_owners, which by default is 2.

  3. Deployments using the default topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway may experience data loss on node or availability-zone failure if multiple Pods are scheduled on the failed node or in the failed zone.

    Users can mitigate this scenario by defining topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule to ensure that Pods are always scheduled evenly across zones and nodes (see the sketch after this list). However, this can result in some Keycloak instances not being deployed if the constraints cannot be satisfied.

    As Infinispan is unaware of the network topology when distributing cache entries, data loss can still occur on node or availability-zone failure if all num_owners copies of a cache entry are stored in the failed node or zone. You can restrict the total number of Keycloak instances to the number of available nodes or availability-zones by defining a requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity rule for nodes and zones. However, this comes at the expense of scalability, as the number of Keycloak instances that can be provisioned is then limited to the number of nodes or availability-zones in your Kubernetes cluster.

    See the Operator Advanced configuration guide for details on how to configure custom anti-affinity and topologySpreadConstraints policies.

  4. The Operator does not configure the site’s name (see Configuring distributed caches) in the Pods, as its value is not available via the Downward API. The machine name option is configured using spec.nodeName from the node where the Pod is scheduled.
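
The following sketch illustrates the stricter scheduling rules described in limitation 3, expressed as plain Pod spec fields. How these fields are passed to the Operator depends on your Operator version, so treat their exact placement in the Keycloak custom resource as an assumption and consult the Operator Advanced configuration guide referenced above; the app: "keycloak" label selector is likewise only an example:

      # Stricter alternative to the default constraints: leave a Pod unscheduled
      # rather than allow several Pods to land in the same zone or on the same node.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: "keycloak"
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: "keycloak"
      # Optional hard anti-affinity: at most one Keycloak Pod per availability-zone,
      # which limits the number of instances to the number of zones as noted above.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "topology.kubernetes.io/zone"
              labelSelector:
                matchLabels:
                  app: "keycloak"

With whenUnsatisfiable: DoNotSchedule, Pods that cannot be spread evenly remain unscheduled instead of being placed unevenly, which is the trade-off between data safety and capacity described above.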

Next steps

Continue reading in the Building blocks single-cluster deployments guide to find blueprints for the different building blocks.