Recover from site failures with a Multi-Site Setup

December 18 2023 by Alexander Schwartz, Kamesh Akella

For a Customer Identity and Access Management (CIAM) system, high availability is essential, as it is the single point through which customers log in to all systems. For Keycloak 23, there is a new and updated High Availability guide describing multi-site setups. With detailed instructions and blueprints targeting cloud infrastructure, it is documented, tested, and ready to be tried out.

Read on to find out what is new, and take a peek behind the scenes at how this setup has been evaluated, tested and improved. Finally, we give an outlook on when this will no longer be a preview feature.

Improved documentation and new blueprints

The recent updates to Keycloak’s multi-site setup mark a significant milestone. Keycloak 23 includes an opinionated guide on setting up Keycloak in a multi-site configuration including blueprints for a cloud setup.

(Diagram: active-passive synchronization)

The high-level topics of this documentation are:

Concept and building block overview

These guides give an overview of the concepts and building blocks of the Keycloak multi-site architecture, answering questions such as:

  • What does an active-passive setup with Keycloak look like?

  • How to use an external database?

  • How to tune the resources for each of these architectural components?

Blueprints for building blocks

A series of guides on how to deploy Keycloak in various configurations on Amazon Web Services (AWS).

Operational procedures

These guides include detailed operational procedures, ensuring that users can set up and operate their multi-site Keycloak instances efficiently.

Validation of the multi-site setup

Before we published the guides above, we worked on tooling that allows us both to experiment and to get reproducible results for performance, scalability and chaos testing of our solution.

With these tools, we first tested a single-site setup and, once that worked sufficiently well, a multi-site setup.

All these tools are available as open source, and we invite you to review them, give us feedback, and use them in your environment to run your own performance benchmarks and regression tests:

Dataset Provider

Install this into a Keycloak server in a test environment, and create as many users, clients, groups, etc. as you need to run your performance benchmark. Keycloak caches a lot of information in its internal caches, and so does the database, so you will be able to spot some problems only when you have the right amount of data in your database.
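
Once the provider is installed, datasets are created by calling its REST endpoints. The invocation below is hypothetical: the endpoint and parameter names are assumptions, so check the Dataset Provider documentation for the exact URLs your version supports.

    # Hypothetical example – create test users in a realm and poll the status of the
    # long-running task; endpoint and parameter names may differ in your version
    curl "https://keycloak.example.com/realms/test-realm/dataset/create-users?count=10000"
    curl "https://keycloak.example.com/realms/test-realm/dataset/status"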

Benchmark

This contains ready-to-use scenarios for authentication flows and for Keycloak’s admin REST endpoints. If it does not fit your needs yet, use it as a library to create your own Gatling scenarios based on existing and custom steps. The tests are deployed as a JAR and a shell script wrapper, so you only need to install Java on your load runners and you are ready to go.
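
As an illustration, a benchmark run could be started as shown below; the flags and the scenario name are assumptions on our part, so consult the Benchmark documentation for the scenarios and options shipped with your version.

    # Hypothetical example – run an authorization-code login scenario against a test
    # installation; flag and scenario names may differ in your version
    ./kcb.sh --scenario=keycloak.scenario.authentication.AuthorizationCode \
      --server-url=https://keycloak.example.com \
      --realm-name=test-realm \
      --users-per-sec=50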

Dedicated EC2 load drivers

Use these Ansible playbooks to spin up a set of EC2 instances to drive load against a Keycloak test installation, and aggregate the results.

Automated OpenShift installation on AWS

Based on Red Hat OpenShift Service on AWS (ROSA), use the scripts to provision an instance with monitoring, logging and useful Operators preconfigured, ready to deploy Keycloak.

Automated Aurora installation

Set up an Aurora database in different variants, regional or global, and connect it to a ROSA environment.

Opinionated Keycloak deployment for Minikube or OpenShift

This deploys Keycloak with additional monitoring and debugging tools so we can look at metrics, logs and traces as needed.

Scripted AWS Route 53 load balancer

Set up Route 53 for an active-passive setup to distribute the load to two Keycloak deployments in different OpenShift clusters.

Scripted Multi-AZ deployment

Every weekday we create a new Multi-AZ setup from scratch using GitHub Actions, run a performance test suite, and record the results. This way we catch functional and performance regressions as they occur.

Thank you to everyone in the community who has already tried out these tools, found bugs and submitted ideas for improvements!

Keycloak got better for everyone

When using the tools listed above, we were able to reproduce several situations where Keycloak needed to improve. Here are some of the improvements which are available in Keycloak 23 for both single-site and multi-site setups:

Non-Blocking liveness probe

When running Keycloak under high load, requests might queue up in a Keycloak instance. The more requests queue up, the longer it takes to reply to them. In previous versions, requests to the liveness probe (/health/live) were queued as well, so the probe eventually timed out and Kubernetes restarted the Pod. In the latest version of Keycloak, the probe is re-implemented to be non-blocking, so it does not queue, does not time out, and the Pod is not restarted under high load.
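
For illustration, this is roughly what a liveness probe pointing at that endpoint can look like in a Pod spec. This is only a sketch: the port and timing values are assumptions, the health endpoints must be enabled (health-enabled=true), and the Keycloak Operator configures probes for you when you use it.

    # Sketch of a Kubernetes liveness probe for a Keycloak container (illustrative values)
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 1
      failureThreshold: 3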

Load Shedding

When requests are queued as described above, the caller does not get a response in time, and the Pod might eventually run out of resources such as memory or network connections. The recommended recipe is to drop requests early when an instance will not be able to serve them in time, which is called load shedding. Keycloak 23 now supports the new option http-max-queued-requests that limits the number of concurrent blocking requests. When the limit is exceeded, Keycloak immediately returns the response 503 Service Unavailable. This has two benefits: the caller receives an immediate response and can retry later, and resources are freed on the server side immediately.
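
The limit is set like any other server option; the value below is purely illustrative and should be sized for your deployment:

    # Illustrative value only – requests beyond this queue limit are rejected with a 503
    bin/kc.sh start --http-max-queued-requests=1000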

Prevented cache stampede for realms and clients

When a new Keycloak instance starts or restarts, its caches are empty. If, under high load, parallel requests for the same realm or the same client arrive on a Keycloak node, previous versions of Keycloak loaded the data from the database in each parallel request. This caused a spike in database connection usage and an initial response delay. The same happens when a realm or client entry in the cache is evicted, for example, because it was modified. The latest version of Keycloak prevents this: each Keycloak instance fetches the data from the database once, and all other parallel requests then use this data without querying the database again (see #21521, #22988 and #24202).
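
The general pattern behind such a fix is to collapse parallel cache misses for the same key into a single database load. The sketch below illustrates that pattern in plain Java; it is not Keycloak’s actual implementation.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Function;

    // Illustrative sketch only – not Keycloak's code. When many threads miss the
    // cache for the same key at the same time, computeIfAbsent runs the loader at
    // most once; the other threads wait briefly and then reuse the single loaded
    // value instead of each querying the database.
    public class SingleLoaderCache<K, V> {

        private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();

        public V get(K key, Function<K, V> loadFromDatabase) {
            return cache.computeIfAbsent(key, loadFromDatabase);
        }
    }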

Align the number of JGroups threads with the number of Quarkus threads

The more Keycloak instances run in a cluster, and the more requests are processed in parallel, the higher the load on the JGroups thread pool. The JGroups thread pool ensures smooth communication for Keycloak’s embedded Infinispan, and exceeding its capacity can lead to timeouts in the internal Infinispan communication. The high-availability docs now describe how to configure the Quarkus thread pool so that it does not exceed the capacity of the JGroups thread pool.
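
As a sketch, and assuming the http-pool-max-threads option available in recent Keycloak releases, the worker pool could be capped as follows; the value is illustrative and should follow the sizing guidance in the high-availability docs:

    # Illustrative only – cap the HTTP worker threads so that the total across all
    # Keycloak nodes stays within the capacity of the JGroups thread pool
    bin/kc.sh start --http-pool-max-threads=50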

Improved Infinispan Metrics

The embedded Infinispan provides improved metrics that allow you to monitor your cluster. The metrics exposed by Keycloak’s metrics endpoint now contain only the Infinispan metrics of the current node, so they no longer block if another Pod is currently starting up or shutting down (ISPN-15042 and ISPN-15072). This gives you better visibility of your cluster during those critical moments. The metrics can now expose the cache names as labels by adding <metrics names-as-tags="true" /> to the Infinispan XML configuration, which makes them simpler to plot in dashboards. Additional metrics are available for the latencies between sites.
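
For example, the attribute goes into the cache container of the Infinispan configuration file shipped with Keycloak; the excerpt below is a sketch that assumes the default file name (conf/cache-ispn.xml) and container name, and omits the cache definitions:

    <!-- conf/cache-ispn.xml (excerpt, illustrative) – expose cache names as metric labels -->
    <cache-container name="keycloak">
        <metrics names-as-tags="true" />
        <!-- existing transport and cache definitions remain unchanged -->
    </cache-container>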

Reliable Infinispan operations

We tested Infinispan and its communication layer JGroups thoroughly, and were able to fix situations where a state transfer stalled (ISPN-14982) or an initial state transfer failed. The GossipRouter used in the multi-site setup now also works in situations where a load balancer has multiple IP addresses (JGRP-2722, JGRP-2721, infinispan-operator#1857, and infinispan-operator#1856).

Can the blueprints or scripts be used in production?

As part of our testing, we optimized Keycloak, and those optimizations are built into the product. They are available without the need for additional configuration, except for the JGroups thread pool configuration. While the configuration of Keycloak on Kubernetes might match a production environment quite closely, we expect the database, network, load balancer and security hardening to differ in every organization, so you will need to adapt it to your needs.

This is why we chose to document the blueprints as text, so you can learn about the choices we made and why different aspects are configured in one setup, while others are at their default settings.

The scripts we use for the automated setup in the Keycloak Benchmark project focus on high availability and mix this with configurations that are simple to debug and analyze from an engineering perspective. A production-ready setup would not have that functionality, so we do not recommend using the scripts as is. Still, they can serve as a starting point for your own automation.

Read the guides and give it a try!

At the moment, we are running the final tests for an active-passive setup and are working toward automating more tests. We are also looking for feedback from the community in this GitHub discussion on multi-site setups: Do you like what you see here? Is something missing? Your feedback is essential!

Once our tests are complete, and we receive feedback from the community, we plan to make it a fully supported feature. This is a huge opportunity for the community to engage with this setup, try it in your environment, and share your findings. Let’s build a stronger and more resilient Keycloak together!