Rebuild Safous App Gateway Cluster

Background

Rebuilding the App Gateway cluster is necessary when we want to change the number of App Gateways in the cluster.

  • Rebuilding is required if we want to decrease the number of App Gateways in the cluster.
  • If we want to increase the number, we can simply add new App Gateways (they will join the existing cluster and increase the cluster size automatically).

For example:

A cluster that consists of 4 App Gateways has a failure tolerance of 1. This means that if 1 App Gateway fails (3 App Gateways remain), the cluster still functions in read-write mode, because it still has a leader. But if 1 more App Gateway fails (2 App Gateways remain), the cluster changes to read-only mode, because a majority quorum is lost and no leader can be elected.

4 App GW -> 1 Failure Tolerance
1 App GW fails -> 3 App GW remain -> Still in read-write mode
1 more App GW fails -> 2 App GW remain -> change to read-only mode (no leader)

So in this case, we may want to rebuild the cluster while 3 App Gateways still remain, so that the new cluster consists of 3 App Gateways, which also has a failure tolerance of 1. Then, when 1 more App Gateway fails (2 App Gateways remain), the cluster still functions in read-write mode, because it still has a leader.

4 App GW -> 1 Failure Tolerance
1 App GW fails -> 3 App GW remain -> Still in read-write mode
Rebuild the cluster -> new cluster of 3 App GW -> 1 Failure Tolerance
1 more App GW fails -> 2 App GW remain -> Still in read-write mode

Then, when the 1st failed App Gateway comes back, we reset the cluster state only on that App Gateway, so that it joins the new cluster.

Note: see here for an explanation of the behavior of a Safous cluster. Clusters of 4 App Gateways and 3 App Gateways have the same failure tolerance: 1.
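These tolerance numbers follow from standard Raft quorum arithmetic (generic Raft math, not a Safous-specific tool), which can be checked with a quick shell loop:

```shell
# Raft quorum math: a cluster of N nodes needs floor(N/2)+1 votes for a
# quorum, so it tolerates N - quorum failures before going read-only.
for n in 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "nodes=$n quorum=$quorum failure_tolerance=$tolerance"
done
```

Note that both 4 nodes and 3 nodes print `failure_tolerance=1`, which is why shrinking from 4 to 3 App Gateways loses nothing in tolerance.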

Check the Cluster Identity

We can check the cluster identity by looking at the log of one of the App GWs:

sudo docker logs -f config_idac_1 
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:01 I [idac] idac pid is 165
2023/11/18 15:28:01 I [idac] idac version: 50dd7a2e93694f2e8b9a664ebf05b5e9f6c845af-build-1690983990
-4.3.3
2023/11/18 15:28:01 I [idac] set process max number of file descriptors to: 1048576
2023/11/18 15:28:01 I [ping-cluster] did not detect a cluster ... continuing (reason: failed to call [tcp.ztna.safous.com:443]: EOF)
2023/11/18 15:28:01 I [raft-main] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-app] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-logs] bootstrapping new cluster
2023/11/18 15:28:01 I [idac] affinity: established a new affinity: "108.137.34.66:443"
2023/11/18 15:28:07 I [raft-main] new state: Candidate
2023/11/18 15:28:07 I [raft-main] new state: Leader
2023/11/18 15:28:07 I [raft-main] new leader: 914ca723
2023/11/18 15:28:07 I [main-db] pushing a seed
2023/11/18 15:28:07 I [main-db] seed applied successfully
2023/11/18 15:28:08 I [raft-logs] new state: Candidate
2023/11/18 15:28:08 I [raft-logs] new state: Leader
2023/11/18 15:28:08 I [raft-logs] new leader: 914ca723
2023/11/18 15:28:09 I [raft-app] new state: Candidate
2023/11/18 15:28:09 I [raft-app] new state: Leader
2023/11/18 15:28:09 I [raft-app] new leader: 914ca723
2023/11/18 15:28:09 I [main-db] waiting for migrations to be done by leader
2023/11/18 15:28:09 I [idac] starting role [blobs]
2023/11/18 15:28:09 I [idac] starting system bootstrap
2023/11/18 15:28:09 I [main-db] no migrations required
2023/11/18 15:28:11 I [idac] finished system bootstrap successfully
2023/11/18 15:28:11 I [main-db] no migrations required
2023/11/18 15:28:12 I [idac] idac started on site "home3"
2023/11/18 15:28:12 I [idac] accepting connections from upstream
2023/11/18 15:28:12 I [idac] serving internal NativeSSH monitoring service on: idac:2222
2023/11/18 15:30:03 I [raft-app] node a91c172d at raft-app-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-logs] node a91c172d at raft-logs-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=10
2023/11/18 15:30:03 I [raft-main] node a91c172d at raft-main-a91c172d.salftest.ztna.safous.com:443 successfully joined at index=71

In the above example, the IDAC (App Gateway container) has ID 914ca723. When it bootstraps, it asks the Safous POP whether there is an existing cluster. Since it does not find one, it bootstraps a new cluster for the domain and appoints itself leader. We can then see the other App Gateway (ID a91c172d) joining the cluster of 914ca723.
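To pull the cluster identity out of the log without scrolling, the id and leader lines can be extracted with awk. This is a sketch that assumes the log format shown above; it is demonstrated on two sample lines, and in practice you would pipe in `sudo docker logs config_idac_1 2>&1` instead of the here-doc:

```shell
# Extract this node's idac id and the current raft-main leader id.
# The here-doc stands in for the real container log.
awk '/idac id:/ {id=$NF}
     /\[raft-main\] new leader:/ {leader=$NF}
     END {print "id=" id " leader=" leader}' <<'EOF'
2023/11/18 15:28:01 I [-] idac id: 914ca723
2023/11/18 15:28:07 I [raft-main] new leader: 914ca723
EOF
# → id=914ca723 leader=914ca723
```

If `id` and `leader` match, this node is the cluster leader; if they differ, the node is a follower in the leader's cluster.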

If we check that other App Gateway, we can see logs like this:

2023/11/18 15:30:02 I [-] idac id: a91c172d
2023/11/18 15:30:03 I [-] idac id: a91c172d
2023/11/18 15:30:02 I [idac] idac pid is 165
2023/11/18 15:30:02 I [idac] idac version: 50dd7a2e93694f2e8b9a664ebf05b5e9f6c845af-build-1690983990
-4.3.3
2023/11/18 15:30:02 I [idac] set process max number of file descriptors to: 1048576
2023/11/18 15:30:02 I [idac] affinity dialer set to static host: 108.137.34.66
2023/11/18 15:30:03 I [raft-app] trying to join a cluster
2023/11/18 15:30:03 I [raft-main] trying to join a cluster
2023/11/18 15:30:03 I [raft-logs] trying to join a cluster
2023/11/18 15:30:03 I [raft-app] joining cluster at index 26
2023/11/18 15:30:03 I [raft-logs] joining cluster at index 10
2023/11/18 15:30:03 I [raft-main] joining cluster at index 71
2023/11/18 15:30:04 I [raft-main] new leader: 914ca723
2023/11/18 15:30:04 I [raft-logs] new leader: 914ca723
2023/11/18 15:30:04 I [raft-app] new leader: 914ca723
2023/11/18 15:30:04 I [raft-logs] successfully applied logs up to index 10
2023/11/18 15:30:04 I [raft-app] successfully applied logs up to index 26
2023/11/18 15:30:05 I [main-db] seed applied successfully
2023/11/18 15:30:05 I [raft-main] successfully applied logs up to index 71
2023/11/18 15:30:05 I [main-db] waiting for migrations to be done by leader
2023/11/18 15:30:05 I [idac] starting role [blobs]
2023/11/18 15:30:05 I [idac] starting system bootstrap
2023/11/18 15:30:05 I [idac] finished system bootstrap successfully
2023/11/18 15:30:05 I [idac] idac started on site "Hongkong"
2023/11/18 15:30:06 I [idac] accepting connections from upstream
2023/11/18 15:30:06 I [idac] serving internal NativeSSH monitoring service on: idac:2222

How to Rebuild App Gateway Cluster

Rebuilding an App Gateway cluster consists of 3 steps:

  1. Shut down the container (requires downtime)
  2. Remove the raft folder (note: best practice is to rename it instead of removing it, for rollback purposes)
  3. Bring the container back up
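The three steps above can be sketched as a single per-node script. This is a dry-run sketch using the paths from this document: with DRY_RUN=1 (the default here) it only prints the commands so the sequence can be reviewed; run it with DRY_RUN=0 on the actual App GW to execute them.

```shell
# run() prints the command in dry-run mode instead of executing it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run sudo docker-compose --project-directory /etc/cyolo/config down   # step 1: stop (downtime starts)
run sudo mv /etc/cyolo/raft /etc/cyolo/raft-bak                      # step 2: rename, keep for rollback
run sudo docker-compose --project-directory /etc/cyolo/config up -d  # step 3: bring back up
```

Remember the ordering constraint described below: the first App GW brought back up becomes the new cluster leader, so run step 3 on the intended leader first.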

For example: 

Let's say we have 4 App Gateways: App GW 1, 2, 3, 4. They all join a cluster in domain *.tenant.ztna.safous.com, where App GW 1 is the leader. We want to decommission App GW 1, then rebuild the cluster so that it only has 3 App GWs.

Cluster of App GW 1, 2, 3, 4 in *.tenant.ztna.safous.com
App GW 1 is cluster leader
Decommission App GW 1, rebuild the cluster

Here's what we will do on App GW 2, 3, 4:

  • Shut down the container (requires downtime)
On App GW 2, 3, 4: 

cd /etc/cyolo/config
sudo docker-compose down
  • Rename the raft folder
On App GW 2, 3, 4: 

cd /etc/cyolo
sudo mv raft raft-bak
  • Then, we will bring up the App GWs (note: the first App GW brought up becomes the new cluster leader). Let's say we want App GW 3 to become the new cluster leader.
First App GW brought up -> App GW 3
On App GW 3:

cd /etc/cyolo/config
sudo docker-compose up -d

Then App GW 2:

cd /etc/cyolo/config
sudo docker-compose up -d

Then App GW 4:

cd /etc/cyolo/config
sudo docker-compose up -d
  • Check the logs to verify that the new cluster is built, App GW 3 becomes the new cluster leader, and App GW 2 & 4 join the new cluster. 
On App GW 2, 3, 4: 

sudo docker logs -f config_idac_1

You should see logs like this in App GW 3:

2023/11/18 15:28:01 I [ping-cluster] did not detect a cluster ... continuing (reason: failed to call [tcp.ztna.safous.com:443]: EOF)
2023/11/18 15:28:01 I [raft-main] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-app] bootstrapping new cluster
2023/11/18 15:28:01 I [raft-logs] bootstrapping new cluster

2023/11/18 15:28:07 I [raft-main] new state: Candidate
2023/11/18 15:28:07 I [raft-main] new state: Leader
2023/11/18 15:28:07 I [raft-main] new leader: <idac id of App GW 3>
2023/11/18 15:28:08 I [raft-logs] new state: Candidate
2023/11/18 15:28:08 I [raft-logs] new state: Leader
2023/11/18 15:28:08 I [raft-logs] new leader: <idac id of App GW 3>
2023/11/18 15:28:09 I [raft-app] new state: Candidate
2023/11/18 15:28:09 I [raft-app] new state: Leader
2023/11/18 15:28:09 I [raft-app] new leader: <idac id of App GW 3>

2023/11/18 15:30:03 I [raft-app] node <idac id of App GW 2/4> at raft-app-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-logs] node <idac id of App GW 2/4> at raft-logs-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=10
2023/11/18 15:30:03 I [raft-main] node <idac id of App GW 2/4> at raft-main-<idac id of App GW 2/4>.<tenant>.ztna.safous.com:443 successfully joined at index=71

And logs like this on App GW 2 & 4:

2023/11/18 15:30:03 I [raft-app] trying to join a cluster
2023/11/18 15:30:03 I [raft-main] trying to join a cluster
2023/11/18 15:30:03 I [raft-logs] trying to join a cluster
2023/11/18 15:30:03 I [raft-app] joining cluster at index 26
2023/11/18 15:30:03 I [raft-logs] joining cluster at index 10
2023/11/18 15:30:03 I [raft-main] joining cluster at index 71
2023/11/18 15:30:04 I [raft-main] new leader: <idac id of App GW 3>
2023/11/18 15:30:04 I [raft-logs] new leader: <idac id of App GW 3>
2023/11/18 15:30:04 I [raft-app] new leader: <idac id of App GW 3>
  • That's it; you now have a new cluster that consists of 3 App GWs, with App GW 3 as the leader. 
  • Check the Admin Portal (Account, Applications, Policies, Configurations) and the User Portal to verify that everything works normally.
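As a quick sanity check on the new leader's log, the "successfully joined" events can be counted: since each follower produces one such line per raft group (raft-main, raft-app, raft-logs), 2 followers should yield 6 lines in total. A sketch on two sample lines; in practice pipe in `sudo docker logs config_idac_1 2>&1` instead of the here-doc:

```shell
# Count peer-join events in the leader's log (here-doc stands in for the log).
grep -c 'successfully joined' <<'EOF'
2023/11/18 15:30:03 I [raft-app] node a91c172d at raft-app-a91c172d.tenant.ztna.safous.com:443 successfully joined at index=26
2023/11/18 15:30:03 I [raft-main] node a91c172d at raft-main-a91c172d.tenant.ztna.safous.com:443 successfully joined at index=71
EOF
# → 2
```

A count lower than expected means one of the followers has not joined all three raft groups yet.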