Simulate failures of Keycloak in Kubernetes

How to automate the simulation of failures of Keycloak Pods in a Kubernetes environment to test the recovery of Keycloak after a failure.

Why failure testing

The introduction to the chaos testing tool krkn contains an excellent writeup about why chaos testing tools are needed in general.

Running the failure test using kc-chaos.sh script

Preparations

  • Extract the keycloak-benchmark-${version}.[zip|tar.gz] file

  • Follow the Preparing Keycloak for load testing guide

  • Make sure you can access the Kubernetes cluster from where you are planning to run the failure tests, and that you can run commands such as kubectl get pods -n keycloak-keycloak, as shown in the sketch after this list
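
As a minimal sketch, assuming the tar.gz variant of the archive and the default keycloak-keycloak namespace (both are placeholders for your own setup), the preparation steps could look like this:

tar -xzf keycloak-benchmark-${version}.tar.gz
# confirm cluster access and that the Keycloak Pods are running
kubectl get pods -n keycloak-keycloak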

Simulating load

Use the Running benchmarks from the CLI guide to simulate load against a specific Kubernetes environment.
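
As an illustration only, and assuming the benchmark CLI (kcb.sh) from the extracted archive is used, a load run could look similar to the following. The scenario name, server URL, and rates are placeholders; the guide linked above is the authoritative reference for the available options.

./kcb.sh --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=https://keycloak.example.com/ \
  --realm-name=realm-0 \
  --users-per-sec=10 \
  --measurement=600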

Running the failure tests

Once there is enough load going against the Keycloak application hosted on an existing Kubernetes/OpenShift cluster, execute the command below:

./kc-chaos.sh <RESULT_DIR_PATH>

Set the environment variables below to configure how and where this script gets executed.

INITIAL_DELAY_SECS

Time in seconds the script waits before it triggers the first failure.

CHAOS_DELAY_SECS

Time in seconds the script waits between simulating failures.

PROJECT

Namespace of the Keycloak pods.
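
For example, to wait two minutes before the first failure, pause one minute between failures, and target Pods in the keycloak-keycloak namespace (all values are illustrative):

export INITIAL_DELAY_SECS=120
export CHAOS_DELAY_SECS=60
export PROJECT=keycloak-keycloak
./kc-chaos.sh results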

Collecting the results

The chaos script also collects information about Keycloak failures, Keycloak Pod utilization, Keycloak Pod restarts, and Keycloak logs, both before killing a Keycloak Pod and at the end of the run, and stores them under the results/logs directory.

Running the failure test using Krkn Chaos testing framework

We integrated the chaos testing framework krkn as part of a Taskfile (Taskfile.yaml) and created individual tasks to run the pod-scenarios test against different components within the multi-site setup of Keycloak on Kubernetes. The tasks focus on simulating Pod failure scenarios for the Keycloak and Infinispan applications.

Preparations

  • This Taskfile requires Podman to be installed and configured on the system.

  • Make sure to set the required environment variables before running the tasks.

  • You can customize the behavior of the tasks by overriding the default values for the variables, as shown in the example after this list.
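
For example, assuming the tasks are run from the directory containing Taskfile.yaml, the prerequisites can be checked and default values overridden directly on the command line (the task name and values are illustrative; the available tasks are described in the following sections):

podman --version   # Podman must be installed and configured
task kill-keycloak ITERATIONS=3 WAIT_DURATION=60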

Pod failure scenario

This is an internal task that provides the core functionality for running Kraken pod failure scenarios. It uses the pod-scenarios image from the krkn-chaos/krkn-hub repository. The task requires the following variables:

ROSA_CLUSTER_NAME

The name of the ROSA cluster (if ROSA_LOGIN is set to true).

NAMESPACE

The Kubernetes namespace.

POD_LABEL

A label selector to identify the target pods.

EXPECTED_POD_COUNT

The expected number of pods after the disruption.

Optionally, the following variables may be set too:

ROSA_LOGIN (default: true)

If true, it uses rosa and oc login to authenticate into the OpenShift cluster.

KUBECONFIG

If ROSA_LOGIN is set to false, set this variable to the path of the kubeconfig file for the authenticated Kubernetes cluster.

DISRUPTION_COUNT (default 1)

The number of Pods to kill.

WAIT_DURATION (default 30)

The waiting time in seconds before starting the next iteration (if ITERATIONS > 1).

ITERATIONS (default 1)

How many times the scenario is run. Each iteration kills one or more Pods.

POD_LABEL (depends on the task)

A label selector to identify the target pods.

EXPECTED_POD_COUNT (depends on the task)

The expected number of pods before the disruption.
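
For example, when not authenticating through rosa, one of the tasks described below could be invoked against an already-authenticated cluster like this (the task name and kubeconfig path are illustrative):

task kill-infinispan ROSA_LOGIN=false KUBECONFIG=$HOME/.kube/config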

Infinispan Gossip Router Pod failure

This task kills the JGroups Gossip Router pod in the Infinispan cluster. It calls the kraken-pod-scenarios task with specific values for POD_LABEL, DISRUPTION_COUNT, and EXPECTED_POD_COUNT.

To execute the task, run the following command.

task kill-gossip-router

Right now, the kill-gossip-router task fails with a "timeout while waiting for pods to come up" error message, which needs to be fixed and is currently tracked in a GitHub issue.

Infinispan Pod failure

This task kills a random Infinispan pod. It calls the kraken-pod-scenarios task with appropriate values for POD_LABEL, DISRUPTION_COUNT, and EXPECTED_POD_COUNT. The default value for EXPECTED_POD_COUNT is calculated based on the CROSS_DC_ISPN_REPLICAS variable (or 3 if not set).

To execute the task, run the following command.

task kill-infinispan
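
If the deployment runs a non-default number of Infinispan replicas, the expected Pod count can also be passed explicitly on the command line, for example:

task kill-infinispan EXPECTED_POD_COUNT=4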

Keycloak Pod failure

This task kills a random Keycloak pod. It calls the kraken-pod-scenarios task with specific values for POD_LABEL, DISRUPTION_COUNT, and EXPECTED_POD_COUNT. The default value for EXPECTED_POD_COUNT is calculated based on the KC_INSTANCES variable (or 1 if not set).

To execute the task, run the following command.

task kill-keycloak

Zone Outage scenario

This task disrupts the network and isolates one of the availability zones. It uses the zone outage scenario image from the krkn-chaos/krkn-hub repository.

To execute the task, run the following command.

task zone-outage

The task requires the following variables:

ROSA_CLUSTER_NAME

The name of the ROSA cluster, used to fetch the nodes and their subnets.

REGION

The AWS region where the multi-AZ cluster is deployed.

Optionally, the following variables may be set too:

AVAILABILITY_ZONE

Sets the name of the availability zone to isolate, for example eu-west-1b. It defaults to the availability zone with the a suffix. This availability zone will be isolated from the remaining ones.

ROSA_LOGIN (default: true)

If true, it uses rosa and oc login to authenticate into the OpenShift cluster.

KUBECONFIG

If ROSA_LOGIN is set to false, set this variable to the path of the kubeconfig file for the authenticated Kubernetes cluster.

DURATION (default 120)

It sets the duration of the outage in seconds.

WAIT_DURATION (default 30)

The waiting time in seconds before starting the next iteration (if ITERATIONS > 1).

ITERATIONS (default 1)

How many times the scenario is run.

This task requires the aws CLI tool to be installed and authenticated.

It uses aws configure get <key> to pass the credentials into the Kraken chaos container as environment variables.
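
Putting it together, an outage of a specific availability zone could be triggered as follows; the cluster name, region, zone, and duration are placeholders:

aws configure list   # confirm the aws CLI is installed and authenticated
task zone-outage ROSA_CLUSTER_NAME=my-rosa-cluster REGION=eu-west-1 AVAILABILITY_ZONE=eu-west-1b DURATION=300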

Limitations

  • Currently, we are not able to inspect the Krkn report, which gets generated inside the Kraken pod but is removed afterwards because the pod uses ephemeral storage. A fix is planned and tracked in a GitHub issue.