Rebuild Safous App Gateway Cluster

Background

Rebuilding the App Gateway cluster is necessary when we want to change the number of App Gateways in the cluster.

  • Rebuilding is required if we want to decrease the number of App Gateways in the cluster.
  • If we want to increase the number, we can simply add new App Gateways (they will join the existing cluster and increase the cluster size automatically).

For example:

A cluster that consists of 4 App Gateways has a failure tolerance of 1. This means that if 1 App Gateway fails (3 App Gateways remain), the cluster still functions in read-write mode, because it still has a leader. But if 1 more App Gateway fails (2 App Gateways remain), the cluster changes to read-only mode, because a majority quorum is lost and no leader can be elected.

4 App GW -> 1 Failure Tolerance
1 App GW fails -> 3 App GW remain -> Still in read-write mode
1 more App GW fails -> 2 App GW remain -> change to read-only mode (no leader)

So in this case, we may want to rebuild the cluster while 3 App Gateways still remain, so that the new cluster consists of 3 App Gateways, which also has a failure tolerance of 1. Then, when 1 more App Gateway fails (2 App Gateways remain), the cluster still functions in read-write mode, because it still has a leader.

4 App GW -> 1 Failure Tolerance
1 App GW fails -> 3 App GW remain -> Still in read-write mode
Rebuild the cluster -> new cluster of 3 App GW -> 1 Failure Tolerance
1 more App GW fails -> 2 App GW remain -> Still in read-write mode

Then, when the 1st failed App Gateway comes back, we reset the cluster state only on that App Gateway, so that it joins the new cluster.

Note: see here for an explanation of the behavior of a Safous cluster. Clusters of 4 App Gateways and 3 App Gateways have the same failure tolerance: 1.
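These tolerance numbers follow from standard Raft quorum arithmetic (generic Raft math, not a Safous-specific tool), which can be checked with a quick shell loop:

```shell
# Raft quorum math: a cluster of N nodes needs floor(N/2)+1 votes for a
# quorum, so it tolerates N - quorum failures before going read-only.
for n in 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "nodes=$n quorum=$quorum failure_tolerance=$tolerance"
done
```

Note that both 4 nodes and 3 nodes print `failure_tolerance=1`, which is why shrinking from 4 to 3 App Gateways loses nothing in tolerance.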

Check the Cluster Identity

We can check the cluster identity by looking at the log of one of the App GWs:

sudo docker logs -f config_idac_1 
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:01 I [idac] idac pid is 165
2023/11/18 15:28:01 I [idac] idac version: 50dd7a2e93694f2e8b9a664ebf05b5e9f6c845af-build-1690983990
-4.3.3
2023/11/18 15:28:01 I [idac] set process max number of file descriptors to: 1048576
2023/11/18 15:28:01 I [ping-cluster] did not detect a cluster ... continuing (reason: failed to call [tcp.ztna.safous.com:443]: EOF)
2023/11/18 15:28:01 I [raft-main] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-app] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-logs] bootstrapping new cluster
2023/11/18 15:28:01 I [idac] affinity: established a new affinity: "108.137.34.66:443"
2023/11/18 15:28:07 I [raft-main] new state: Candidate
2023/11/18 15:28:07 I [raft-main] new state: Leader
2023/11/18 15:28:07 I [raft-main] new leader: 914ca723
2023/11/18 15:28:07 I [main-db] pushing a seed
2023/11/18 15:28:07 I [main-db] seed applied successfully
2023/11/18 15:28:08 I [raft-logs] new state: Candidate
2023/11/18 15:28:08 I [raft-logs] new state: Leader
2023/11/18 15:28:08 I [raft-logs] new leader: 914ca723
2023/11/18 15:28:09 I [raft-app] new state: Candidate
2023/11/18 15:28:09 I [raft-app] new state: Leader
2023/11/18 15:28:09 I [raft-app] new leader: 914ca723
2023/11/18 15:28:09 I [main-db] waiting for migrations to be done by leader
2023/11/18 15:28:09 I [idac] starting role [blobs]
2023/11/18 15:28:09 I [idac] starting system bootstrap
2023/11/18 15:28:09 I [main-db] no migrations required
2023/11/18 15:28:11 I [idac] finished system bootstrap successfully
2023/11/18 15:28:11 I [main-db] no migrations required
2023/11/18 15:28:12 I [idac] idac started on site "home3"
2023/11/18 15:28:12 I [idac] accepting connections from upstream
2023/11/18 15:28:12 I [idac] serving internal NativeSSH monitoring service on: idac:2222
2023/11/18 15:30:03 I [raft-app] node a91c172d at raft-app-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-logs] node a91c172d at raft-logs-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=10
2023/11/18 15:30:03 I [raft-main] node a91c172d at raft-main-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=71

In the above example, the IDAC (App Gateway container) has ID 914ca723. When it bootstraps, it asks the Safous POP whether there is an existing cluster. Since it does not find one, it bootstraps a new cluster for the domain and appoints itself leader. We can then see the other App Gateway (ID a91c172d) joining the cluster of 914ca723.
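To pull the cluster identity out of the log without scrolling, the id and leader lines can be extracted with awk. This is a sketch that assumes the log format shown above; it is demonstrated on two sample lines, and in practice you would pipe in `sudo docker logs config_idac_1 2>&1` instead of the here-doc:

```shell
# Extract this node's idac id and the current raft-main leader id.
# The here-doc stands in for the real container log.
awk '/idac id:/ {id=$NF}
     /\[raft-main\] new leader:/ {leader=$NF}
     END {print "id=" id " leader=" leader}' <<'EOF'
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:07 I [raft-main] new leader: 914ca723
EOF
# → id=914ca723 leader=914ca723
```

If `id` and `leader` match, this node is the cluster leader; if they differ, the node is a follower in the leader's cluster.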

If we check that other App Gateway, we can see logs like this:

2023/11/18 15:30:02 I [-] idac id: a91c172d
2023/11/18 15:30:03 I [-] idac id: a91c172d
2023/11/18 15:30:02 I [idac] idac pid is 165
2023/11/18 15:30:02 I [idac] idac version: 50dd7a2e93694f2e8b9a664ebf05b5e9f6c845af-build-1690983990
-4.3.3
2023/11/18 15:30:02 I [idac] set process max number of file descriptors to: 1048576
2023/11/18 15:30:02 I [idac] affinity dialer set to static host: 108.137.34.66
2023/11/18 15:30:03 I [raft-app] trying to join a cluster
2023/11/18 15:30:03 I [raft-main] trying to join a cluster
2023/11/18 15:30:03 I [raft-logs] trying to join a cluster
2023/11/18 15:30:03 I [raft-app] joining cluster at index 26
2023/11/18 15:30:03 I [raft-logs] joining cluster at index 10
2023/11/18 15:30:03 I [raft-main] joining cluster at index 71
2023/11/18 15:30:04 I [raft-main] new leader: 914ca723
2023/11/18 15:30:04 I [raft-logs] new leader: 914ca723
2023/11/18 15:30:04 I [raft-app] new leader: 914ca723
2023/11/18 15:30:04 I [raft-logs] successfully applied logs up to index 10
2023/11/18 15:30:04 I [raft-app] successfully applied logs up to index 26
2023/11/18 15:30:05 I [main-db] seed applied successfully
2023/11/18 15:30:05 I [raft-main] successfully applied logs up to index 71
2023/11/18 15:30:05 I [main-db] waiting for migrations to be done by leader
2023/11/18 15:30:05 I [idac] starting role [blobs]
2023/11/18 15:30:05 I [idac] starting system bootstrap
2023/11/18 15:30:05 I [idac] finished system bootstrap successfully
2023/11/18 15:30:05 I [idac] idac started on site "Hongkong"
2023/11/18 15:30:06 I [idac] accepting connections from upstream
2023/11/18 15:30:06 I [idac] serving internal NativeSSH monitoring service on: idac:2222

How to Rebuild App Gateway Cluster

Rebuilding an App Gateway cluster consists of 3 steps:

  1. Shut down the container (requires downtime)
  2. Remove the raft folder (note: best practice is to rename it instead of removing it, for rollback purposes)
  3. Bring the container back up
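The three steps above can be sketched as a single per-node script. This is a dry-run sketch using the paths from this document: with DRY_RUN=1 (the default here) it only prints the commands so the sequence can be reviewed; run it with DRY_RUN=0 on the actual App GW to execute them.

```shell
# run() prints the command in dry-run mode instead of executing it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run sudo docker-compose --project-directory /etc/cyolo/config down   # step 1: stop (downtime starts)
run sudo mv /etc/cyolo/raft /etc/cyolo/raft-bak                      # step 2: rename, keep for rollback
run sudo docker-compose --project-directory /etc/cyolo/config up -d  # step 3: bring back up
```

Remember the ordering constraint described below: the first App GW brought back up becomes the new cluster leader, so run step 3 on the intended leader first.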

For example: 

Let's say we have 4 App Gateways: App GW 1, 2, 3, 4. They all join a cluster in domain *.tenant.ztna.safous.com, where App GW 1 is the leader. We want to decommission App GW 1, then rebuild the cluster so that it only has 3 App GWs.

Cluster of App GW 1, 2, 3, 4 in *.tenant.ztna.safous.com
App GW 1 is cluster leader
Decommission App GW 1, rebuild the cluster

Here's what we will do on App GW 2, 3, 4:

  • Shut down the container (requires downtime)
On App GW 2, 3, 4: 

cd /etc/cyolo/config
sudo docker-compose down
  • Rename the raft folder
On App GW 2, 3, 4: 

cd /etc/cyolo
sudo mv raft raft-bak
  • Then, we will bring up the App GWs (note: the first App GW brought up becomes the new cluster leader). Let's say we want App GW 3 to become the new cluster leader.
First App GW brought up -> App GW 3
On App GW 3:

cd /etc/cyolo/config
sudo docker-compose up -d

Then App GW 2:

cd /etc/cyolo/config
sudo docker-compose up -d

Then App GW 4:

cd /etc/cyolo/config
sudo docker-compose up -d
  • Check the logs to verify that the new cluster is built, App GW 3 becomes the new cluster leader, and App GW 2 & 4 join the new cluster. 
On App GW 2, 3, 4: 

sudo docker logs -f config_idac_1

You should see logs like this in App GW 3:

2023/11/18 15:28:01 I [ping-cluster] did not detect a cluster ... continuing (reason: failed to call [tcp.ztna.safous.com:443]: EOF)
2023/11/18 15:28:01 I [raft-main] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-app] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-logs] bootstrapping new cluster

2023/11/18 15:28:07 I [raft-main] new state: Candidate
2023/11/18 15:28:07 I [raft-main] new state: Leader
2023/11/18 15:28:07 I [raft-main] new leader: <idac id of App GW 3>
2023/11/18 15:28:08 I [raft-logs] new state: Candidate
2023/11/18 15:28:08 I [raft-logs] new state: Leader
2023/11/18 15:28:08 I [raft-logs] new leader: <idac id of App GW 3>
2023/11/18 15:28:09 I [raft-app] new state: Candidate
2023/11/18 15:28:09 I [raft-app] new state: Leader
2023/11/18 15:28:09 I [raft-app] new leader: <idac id of App GW 3>

2023/11/18 15:30:03 I [raft-app] node <idac id of App GW 2/4> at raft-app-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-logs] node <idac id of App GW 2/4> at raft-logs-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=10
2023/11/18 15:30:03 I [raft-main] node <idac id of App GW 2/4> at raft-main-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=71

And logs like this on App GW 2 & 4:

2023/11/18 15:30:03 I [raft-app] trying to join a cluster
2023/11/18 15:30:03 I [raft-main] trying to join a cluster
2023/11/18 15:30:03 I [raft-logs] trying to join a cluster
2023/11/18 15:30:03 I [raft-app] joining cluster at index 26
2023/11/18 15:30:03 I [raft-logs] joining cluster at index 10
2023/11/18 15:30:03 I [raft-main] joining cluster at index 71
2023/11/18 15:30:04 I [raft-main] new leader: <idac id of App GW 3>
2023/11/18 15:30:04 I [raft-logs] new leader: <idac id of App GW 3>
2023/11/18 15:30:04 I [raft-app] new leader: <idac id of App GW 3>
  • That's it; you now have a new cluster that consists of 3 App GWs, with App GW 3 as the leader. 
  • Check the Admin Portal (Account, Applications, Policies, Configurations) and the User Portal to verify that everything works normally.
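As a quick sanity check on the new leader's log, the "successfully joined" events can be counted: since each follower produces one such line per raft group (raft-main, raft-app, raft-logs), 2 followers should yield 6 lines in total. A sketch on two sample lines; in practice pipe in `sudo docker logs config_idac_1 2>&1` instead of the here-doc:

```shell
# Count peer-join events in the leader's log (here-doc stands in for the log).
grep -c 'successfully joined' <<'EOF'
2023/11/18 15:30:03 I [raft-app] node a91c172d at raft-app-a91c172d.tenant.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-main] node a91c172d at raft-main-a91c172d.tenant.ztna.safous.com:443 successfully joined at index=71
EOF
# → 2
```

A count lower than expected means one of the followers has not joined all three raft groups yet.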