Switch back to the primary site

This guide describes the operational procedures necessary to switch back to the primary site after a failover or switchover to the secondary site.

It applies to a setup as outlined in the Concepts for active-passive deployments guide, together with the blueprints outlined in the Building blocks active-passive deployments guide.

When to use this procedure

These procedures bring the primary site back to operation when the secondary site is handling all the traffic. At the end of the guide, the primary site is online again and handles the traffic.

This procedure is necessary when the primary site has lost its state in Infinispan, a network partition occurred between the primary and the secondary site while the secondary site was active, or the replication was disabled as described in the Switch over to the secondary site guide.

If the data in Infinispan on both sites is still in sync, the procedure for Infinispan can be skipped.

See the Multi-site deployments guide for different operational procedures.

Procedures

Infinispan Cluster

For the context of this guide, Site-A is the primary site, recovering back to operation, and Site-B is the secondary site, running in production.

After the Infinispan in the primary site is back online and has joined the cross-site channel (see Deploy Infinispan for HA with the Infinispan Operator#verifying-the-deployment on how to verify the Infinispan deployment), the state transfer must be manually started from the secondary site.

After clearing the state in the primary site, the state transfer copies the full state from the secondary site to the primary site, and it must complete before the primary site can start handling incoming requests.

Transferring the full state may impact the Infinispan cluster's performance by increasing response times and/or resource usage.

The first procedure is to delete any stale data from the primary site.

  1. Log in to the primary site.

  2. Shut down Keycloak. This action clears all Keycloak caches and prevents the state of Keycloak from becoming out-of-sync with Infinispan.

    When deploying Keycloak using the Keycloak Operator, change the number of Keycloak instances in the Keycloak Custom Resource to 0.

  3. Connect to the Infinispan cluster using the Infinispan CLI tool:

    Command:
    kubectl -n keycloak exec -it pods/infinispan-0 -- ./bin/cli.sh --trustall --connect https://127.0.0.1:11222

    It asks for the username and password for the Infinispan cluster. Those credentials are the ones set in the configuring credentials section of the Deploy Infinispan for HA with the Infinispan Operator guide.

    Output:
    Username: developer
    Password:
    [infinispan-0-29897@ISPN//containers/default]>
    The pod name depends on the cluster name defined in the Infinispan CR. The connection can be made with any pod in the Infinispan cluster.
  4. Disable the replication from the primary site to the secondary site by running the following command. This prevents the clear request from reaching the secondary site and deleting the correct cached data there.

    Command:
    site take-offline --all-caches --site=site-b
    Output:
    {
      "offlineClientSessions" : "ok",
      "authenticationSessions" : "ok",
      "sessions" : "ok",
      "clientSessions" : "ok",
      "work" : "ok",
      "offlineSessions" : "ok",
      "loginFailures" : "ok",
      "actionTokens" : "ok"
    }
  5. Check that the replication status is offline.

    Command:
    site status --all-caches --site=site-b
    Output:
    {
      "status" : "offline"
    }

    If the status is not offline, repeat the previous step.

    Make sure the replication is offline; otherwise, the clear request will clear the data on both sites.
  6. Clear all the cached data in the primary site using the following commands:

    Command:
    clearcache actionTokens
    clearcache authenticationSessions
    clearcache clientSessions
    clearcache loginFailures
    clearcache offlineClientSessions
    clearcache offlineSessions
    clearcache sessions
    clearcache work

    These commands do not print any output.

  7. Re-enable the cross-site replication from the primary site to the secondary site.

    Command:
    site bring-online --all-caches --site=site-b
    Output:
    {
      "offlineClientSessions" : "ok",
      "authenticationSessions" : "ok",
      "sessions" : "ok",
      "clientSessions" : "ok",
      "work" : "ok",
      "offlineSessions" : "ok",
      "loginFailures" : "ok",
      "actionTokens" : "ok"
    }
  8. Check that the replication status is online.

    Command:
    site status --all-caches --site=site-b
    Output:
    {
      "status" : "online"
    }
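The CLI steps 4 to 7 above can be scripted. The following is a minimal sketch that only generates a file listing the commands in order, using the cache names from this guide; how you feed the file to the CLI (for example, via its batch mode) depends on your setup, and you should still verify with `site status` that replication is offline before the clear commands run.

```shell
# Sketch: generate the command sequence for steps 4 to 7 (take the backup
# site offline, clear every Keycloak cache, bring the backup site online).
CACHES="actionTokens authenticationSessions clientSessions loginFailures \
offlineClientSessions offlineSessions sessions work"

{
  echo "site take-offline --all-caches --site=site-b"
  for CACHE in $CACHES; do
    echo "clearcache $CACHE"
  done
  echo "site bring-online --all-caches --site=site-b"
} > clear-primary.batch

cat clear-primary.batch
```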

Now we are ready to transfer the state from the secondary site to the primary site.

  1. Log in to the secondary site.

  2. Connect to the Infinispan cluster using the Infinispan CLI tool:

    Command:
    kubectl -n keycloak exec -it pods/infinispan-0 -- ./bin/cli.sh --trustall --connect https://127.0.0.1:11222

    It asks for the username and password for the Infinispan cluster. Those credentials are the ones set in the configuring credentials section of the Deploy Infinispan for HA with the Infinispan Operator guide.

    Output:
    Username: developer
    Password:
    [infinispan-0-29897@ISPN//containers/default]>
    The pod name depends on the cluster name defined in the Infinispan CR. The connection can be made with any pod in the Infinispan cluster.
  3. Trigger the state transfer from the secondary site to the primary site.

    Command:
    site push-site-state --all-caches --site=site-a
    Output:
    {
      "offlineClientSessions" : "ok",
      "authenticationSessions" : "ok",
      "sessions" : "ok",
      "clientSessions" : "ok",
      "work" : "ok",
      "offlineSessions" : "ok",
      "loginFailures" : "ok",
      "actionTokens" : "ok"
    }
  4. Check that the replication status is online for all caches.

    Command:
    site status --all-caches --site=site-a
    Output:
    {
      "status" : "online"
    }
  5. Wait for the state transfer to complete by checking the output of the push-site-status command for all caches.

    Command:
    site push-site-status --cache=actionTokens
    site push-site-status --cache=authenticationSessions
    site push-site-status --cache=clientSessions
    site push-site-status --cache=loginFailures
    site push-site-status --cache=offlineClientSessions
    site push-site-status --cache=offlineSessions
    site push-site-status --cache=sessions
    site push-site-status --cache=work
    Output:
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }
    {
      "site-a" : "OK"
    }

    Check the table in the Cross-Site Documentation for the possible status values.

    If an error is reported, repeat the state transfer for that specific cache.

    Command:
    site push-site-state --cache=<cache-name> --site=site-a
  6. Clear/reset the state transfer status with the following commands:

    Command:
    site clear-push-site-status --cache=actionTokens
    site clear-push-site-status --cache=authenticationSessions
    site clear-push-site-status --cache=clientSessions
    site clear-push-site-status --cache=loginFailures
    site clear-push-site-status --cache=offlineClientSessions
    site clear-push-site-status --cache=offlineSessions
    site clear-push-site-status --cache=sessions
    site clear-push-site-status --cache=work
    Output:
    "ok"
    "ok"
    "ok"
    "ok"
    "ok"
    "ok"
    "ok"
    "ok"
  7. Log in to the primary site.

  8. Start Keycloak.

    When deploying Keycloak using the Keycloak Operator, change the number of Keycloak instances in the Keycloak Custom Resource to the original value.
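Waiting for the state transfer in step 5 above also lends itself to scripting. Below is a minimal sketch of the completion check, assuming the JSON shape shown in the example output; `transfer_ok` is a hypothetical helper, not part of the Infinispan CLI.

```shell
# Hypothetical helper: succeeds when the push-site-status JSON for a cache
# reports "OK" and no "ERROR"; any error means the transfer for that cache
# should be repeated with `site push-site-state --cache=<cache-name> --site=site-a`.
transfer_ok() {
  status_json="$1"   # output of `site push-site-status --cache=<name>`
  echo "$status_json" | grep -q '"OK"' && ! echo "$status_json" | grep -q '"ERROR"'
}

transfer_ok '{ "site-a" : "OK" }'    && echo "transfer complete"
transfer_ok '{ "site-a" : "ERROR" }' || echo "transfer failed, repeat push-site-state"
```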

Both Infinispan clusters are now in sync, and the switchover from the secondary site back to the primary site can be performed.

AWS Aurora Database

Assuming a Regional multi-AZ Aurora deployment, the current writer instance should be in the same region as the active Keycloak cluster to avoid latencies and communication across availability zones.

Switching the writer instance of Aurora will lead to a short downtime. Keeping the writer instance in the other site, with a slightly higher latency, might be acceptable for some deployments. Therefore, this switch might be deferred to a maintenance window or skipped altogether, depending on the circumstances of the deployment.

To change the writer instance, run a failover. This change will make the database unavailable for a short time. Keycloak will need to re-establish database connections.

To fail over the writer instance to the other AZ, issue the following command:

aws rds failover-db-cluster  --db-cluster-identifier ...
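As a sketch, the steps might look as follows. The cluster identifier `keycloak-aurora` and the instance name are placeholders for illustration, and the `DRY_RUN` guard (default on here) only prints the commands so the sequence can be reviewed before touching the cluster.

```shell
# Dry-run sketch; replace "keycloak-aurora" and the instance name with yours.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Find the current writer instance of the cluster.
run aws rds describe-db-clusters \
  --db-cluster-identifier keycloak-aurora \
  --query "DBClusters[0].DBClusterMembers[?IsClusterWriter].DBInstanceIdentifier" \
  --output text

# Fail over to a reader instance located in the primary site's availability zone.
run aws rds failover-db-cluster \
  --db-cluster-identifier keycloak-aurora \
  --target-db-instance-identifier keycloak-aurora-instance-1
```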

Route53

If switching over to the secondary site was triggered by changing the health endpoint, edit the health check in AWS to point to the correct endpoint again (/lb-check). After a few minutes, clients will notice the change and traffic will gradually move back to the primary site.
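Restoring the health check can also be done with the AWS CLI. The following sketch assumes the health check id was noted during the switchover; with no id set, it only prints what it would run.

```shell
# Sketch: repoint the Route53 health check at the always-healthy endpoint.
# HEALTH_CHECK_ID is a placeholder to be filled in from your deployment.
HEALTH_CHECK_ID=${HEALTH_CHECK_ID:-}

if [ -n "$HEALTH_CHECK_ID" ]; then
  aws route53 update-health-check \
    --health-check-id "$HEALTH_CHECK_ID" \
    --resource-path "/lb-check"
else
  echo "HEALTH_CHECK_ID not set; would run:" \
       "aws route53 update-health-check --health-check-id <id> --resource-path /lb-check"
fi
```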

Further reading

See Concepts to automate Infinispan CLI commands on how to automate Infinispan CLI commands.
