Concepts for single-cluster deployments

Understand single-cluster deployment with synchronous replication.

This topic describes a single-cluster setup and the behavior to expect. It outlines the requirements of the high availability architecture and describes the benefits and tradeoffs.

When to use this setup

Use this setup for Keycloak deployments that run in a single environment with transparent networking.

To provide a more concrete example, the following chapter assumes a deployment contained within a single Kubernetes cluster. The same concepts could be applied to a set of virtual or physical machines and a manual or scripted deployment.

Single or multiple availability-zones

The behavior and high-availability guarantees of the Keycloak deployment are ultimately determined by the configuration of the Kubernetes cluster. Typically, Kubernetes clusters are deployed in a single availability-zone; however, to increase fault tolerance, the cluster can be deployed across multiple availability-zones.

The Keycloak Operator defines the following topology spread constraints by default to prefer that Keycloak Pods are deployed on distinct nodes and in distinct availability-zones when possible:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              app: "keycloak"
              app.kubernetes.io/managed-by: "keycloak-operator"
              app.kubernetes.io/instance: "keycloak"
              app.kubernetes.io/component: "server"
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              app: "keycloak"
              app.kubernetes.io/managed-by: "keycloak-operator"
              app.kubernetes.io/instance: "keycloak"
              app.kubernetes.io/component: "server"

To ensure high availability across multiple availability-zones, it is crucial that the database can also withstand zone failures, as Keycloak depends on the underlying database remaining available.

Failures which this setup can survive

Deploying Keycloak in a single availability-zone or across multiple availability-zones changes the high-availability characteristics significantly, so these architectures are considered independently.

Single Zone

Failure: Keycloak Pod
Recovery: Multiple Keycloak Pods run in a cluster. If one instance fails, some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Kubernetes Node
Recovery: Multiple Keycloak Pods run in a cluster. If the host node dies, all Pods on that node fail, and some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Keycloak clustering connectivity
Recovery: If the connectivity between Kubernetes nodes is lost, data cannot be sent between Keycloak Pods hosted on those nodes. Incoming requests might receive an error message or be delayed for some seconds. Keycloak eventually removes the unreachable Pods from its local view and stops sending data to them.
RPO1: No data loss
RTO2: Seconds to minutes

Table footnotes:

1 Recovery point objective, assuming all parts of the setup were healthy at the time the failure occurred.
2 Recovery time objective.

Multiple Zones

Failure: Database node3
Recovery: If the writer instance fails, the database can promote a reader instance in the same or another zone to be the new writer.
RPO1: No data loss
RTO2: Seconds to minutes (depending on the database)

Failure: Keycloak Pod
Recovery: Multiple Keycloak Pods run in a cluster. If one instance fails, some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Kubernetes Node
Recovery: Multiple Keycloak Pods run in a cluster. If the host node dies, all Pods on that node fail, and some incoming requests might receive an error message or be delayed for some seconds.
RPO1: No data loss
RTO2: Less than 30 seconds

Failure: Availability zone
Recovery: If an availability-zone fails, all Keycloak Pods hosted in that zone also fail. Deploying at least as many Keycloak replicas as there are availability-zones ensures that other Pods remain available to service requests, so no data is lost and downtime stays minimal (see the example after the table footnotes).
RPO1: No data loss
RTO2: Seconds

Failure: Database connectivity
Recovery: If the connectivity between availability-zones is lost, synchronous replication fails. Some requests might receive an error message or be delayed for a few seconds. Manual operations might be necessary depending on the database.
RPO1: No data loss3
RTO2: Seconds to minutes (depending on the database)

Failure: Keycloak clustering connectivity
Recovery: If the connectivity between Kubernetes nodes is lost, data cannot be sent between Keycloak Pods hosted on those nodes. Incoming requests might receive an error message or be delayed for some seconds. Keycloak eventually removes the unreachable Pods from its local view and stops sending data to them.
RPO1: No data loss
RTO2: Seconds to minutes

Table footnotes:

1 Recovery point objective, assuming all parts of the setup were healthy at the time the failure occurred.
2 Recovery time objective.
3 Assumes that the database is also replicated across multiple availability-zones.
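
For example, in a Kubernetes cluster that spans three availability-zones, the Keycloak custom resource can request at least three replicas so that each zone can host a Pod. The following is a minimal sketch; the resource name and the assumption of three zones are illustrative only:

      apiVersion: k8s.keycloak.org/v2alpha1
      kind: Keycloak
      metadata:
        name: keycloak
      spec:
        # At least one replica per availability-zone (three zones assumed here) so that
        # a single zone failure still leaves Pods available to serve requests.
        instances: 3
        # Database, hostname, and HTTP/TLS settings are omitted for brevity.

Together with the default topology spread constraints shown earlier, the scheduler then prefers to place one Pod in each zone.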

Known limitations

  1. Downtime during rollouts of Keycloak upgrades

    This can be overcome for patch releases by enabling the feature described in Checking if rolling updates are possible.

  2. Multiple node failures can result in a loss of entries from the authenticationSessions, loginFailures and actionTokens caches if the number of node failures is greater than or equal to the cache’s configured num_owners, which by default is 2.

  3. Deployments using the default topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway may experience data loss on node or availability-zone failure if multiple Pods are scheduled on the failed node or in the failed zone.

    Users can mitigate this scenario by defining topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule to ensure that Pods are always scheduled evenly across zones and nodes (see the sketch after this list). However, this can result in some Keycloak instances not being deployed if the constraints cannot be satisfied.

    As Infinispan is unaware of the network topology when distributing cache entries, data loss can still occur on node or availability-zone failure if all num_owners copies of a cache entry are stored in the failed node or zone. You can restrict the total number of Keycloak instances to the number of available nodes or availability-zones by defining a requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity rule for nodes and zones. However, this comes at the expense of scalability, as the number of Keycloak instances that can be provisioned is then limited to the number of nodes or availability-zones in your Kubernetes cluster.

    See the Operator Advanced configuration guide for details on how to configure custom anti-affinity and topologySpreadConstraints policies.

  4. The Operator does not configure the site’s name (see Configuring distributed caches) in the Pods, as its value is not available via the Downward API. The machine name option is configured using spec.nodeName from the node where the Pod is scheduled.
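
The following sketch illustrates the stricter scheduling rules described in limitation 3, expressed as plain Pod spec fields. How these fields are passed to the Operator depends on your Operator version, so treat their exact placement in the Keycloak custom resource as an assumption and consult the Operator Advanced configuration guide referenced above; the app: "keycloak" label selector is likewise only an example:

      # Stricter alternative to the default constraints: leave a Pod unscheduled
      # rather than allow several Pods to land in the same zone or on the same node.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: "keycloak"
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              app: "keycloak"
      # Optional hard anti-affinity: at most one Keycloak Pod per availability-zone,
      # which limits the number of instances to the number of zones as noted above.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "topology.kubernetes.io/zone"
              labelSelector:
                matchLabels:
                  app: "keycloak"

With whenUnsatisfiable: DoNotSchedule, Pods that cannot be spread evenly remain unscheduled instead of being placed unevenly, which is the trade-off between data safety and capacity described above.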

Next steps

Continue reading in the Building blocks single-cluster deployments guide to find blueprints for the different building blocks.