Our Cardano Sentinel

From Organic Design wiki
Revision as of 22:04, 21 January 2020 by Nad (talk | contribs) (Node restarting)

This may be just a problem on the testnet, but nodes seem to get stuck a lot and require a restart. This used to happen a lot with Masternodes as well, and the general solution was to run a "sentinel" script that checks on the node and performs the necessary processes when something's not right. In the case of Jormungandr a simple restart of the node seems to resolve the issue, so I made this sentinel.pl script that is to be run from the crontab every minute. It checks whether a new block hash has appeared within a certain time, and if not it restarts the node. If it's still stuck on the same block even after restarting, then the node has probably gotten itself onto a fork, so the sentinel backs up the chain data and logs and restarts the node from a clean slate.
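The escalation described above (do nothing, then restart, then reset) can be sketched as a small state machine. This is only an illustration in Python, not the code of sentinel.pl. The names State and next_action and the action strings are made up for the example, and giving the node a fresh timeout window after each action is an assumption.

```python
from dataclasses import dataclass


@dataclass
class State:
    last_hash: str = ""        # most recent block hash seen
    last_change: float = 0.0   # time the hash last changed (or we last acted)
    restarted: bool = False    # True once a plain restart was tried for this hash


def next_action(state, current_hash, now, timeout=200):
    """Return 'ok', 'restart' or 'reset', updating state in place."""
    if current_hash != state.last_hash:
        # The chain moved on, so forget any earlier restart attempt.
        state.last_hash = current_hash
        state.last_change = now
        state.restarted = False
        return "ok"
    if now - state.last_change <= timeout:
        return "ok"
    # Stuck on the same hash for longer than the timeout.
    state.last_change = now  # assumption: allow a full timeout before acting again
    if not state.restarted:
        state.restarted = True
        return "restart"
    # A restart did not help, so the node is probably on a fork.
    return "reset"
```

Calling next_action once per polling period with the node's current block hash gives "restart" the first time the timeout is exceeded and "reset" if the same hash is still there a full timeout later.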

Dependencies

The script must be stored in the same directory as the jormungandr configuration, and there must also be a start.sh script that is able to start jormungandr in the background. The start script should in turn call the cardano-update-conf.pl script, which updates the jormungandr configuration to the currently reachable peers (as shown on adapools.org/peers). The start script should also allow the --quick option to be passed through to the update script, which removes all peers from the configuration to allow for quick restarts when necessary.

Configuration

The configuration for the sentinel is a JSON file called sentinel.conf which contains the following parameters:

| Name | Meaning | Default |
|------|---------|---------|
| period | The number of seconds between each call of the script. | 5 |
| timeout | The number of seconds of being on the same block height after which we should consider the node stuck. | 200 |
| maxUptime | The maximum number of seconds of uptime before a quick restart is enforced. | 86400 |
| poolTool | The pooltool.io user name to use for publishing our block height and receiving the current known maximum height. | - |
| portion | Our portion of the node's stake (used in the end of epoch report). | 0 |
| email | The email address to send the end of epoch report to. | - |
| logFile | The name of the sentinel log file. | sentinel.log |
| dataDir | The location of the Jormungandr data. | storage |
| snapshots | The location and prefix of the data snapshots. | snapshots/storage-backup- |
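As an illustration of how these parameters fit together, here is a minimal Python sketch that overlays a sentinel.conf on the defaults from the table. The real sentinel is a Perl script and may handle its configuration differently. The load_config function, the example values and the use of None for the two parameters without a default are all assumptions made for the example.

```python
import json

# Defaults taken from the table above; None stands for "-" (no default).
DEFAULTS = {
    "period": 5,
    "timeout": 200,
    "maxUptime": 86400,
    "poolTool": None,
    "portion": 0,
    "email": None,
    "logFile": "sentinel.log",
    "dataDir": "storage",
    "snapshots": "snapshots/storage-backup-",
}


def load_config(text):
    """Parse the JSON text of a sentinel.conf and fill in missing defaults."""
    user = json.loads(text)
    return {**DEFAULTS, **user}


# Hypothetical sentinel.conf content overriding two parameters.
EXAMPLE = '{"timeout": 300, "email": "ops@example.com"}'
conf = load_config(EXAMPLE)
```

Any parameter left out of the file keeps its default, so a sentinel.conf only needs to list the values that differ.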

The sentinel.log file

The script writes the situation to a file in the same directory called sentinel.log, which will look like the following snippet. The first part in square brackets is the Unix timestamp of the entry, followed by a duration after the slash. Next come the epoch, slot, block height and hash, and finally information on the tax received, rewards sent, amount staked with the pool and the pending leaders at that point in time.

[1579599911/015] Epoch:38 Slot:26145 Block:117783-0 Hash:6b2...7cd Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579599922/011] Epoch:38 Slot:26151 Block:117784-- Hash:8a4...b25 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579599943/000] Doing regular shutdown...
[1579599961/080] Bootstrapping
[1579600046/124] Epoch:38 Slot:26162 Block:117786-8 Hash:fd9...379 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579600151/105] Epoch:38 Slot:26265 Block:117795-0 Hash:3ab...a88 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941

Note: The sentinel misses blocks that are generated quicker than the polling period, so the log shouldn't be used as a definitive chain report. For this application it's only the slow updates that we're concerned about, so missing blocks are not a problem.
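If you want to process the log with your own tools, the entry format shown above is easy to pick apart. The following Python sketch is based only on the sample lines in this section. The parse_line function and the dictionary keys "time", "duration" and "message" are made up for the example, and field values are left as strings because the meaning of suffixes such as the one after the block height is not assumed.

```python
import re

# "[<unix time>/<duration>] <rest of line>"
ENTRY = re.compile(r"^\[(\d+)/(\d+)\]\s+(.*)$")


def parse_line(line):
    """Parse one sentinel.log line into a dict, or return None if it doesn't match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    entry = {"time": int(m.group(1)), "duration": int(m.group(2))}
    rest = m.group(3)
    if not rest.startswith("Epoch:"):
        # Status lines such as "Bootstrapping" carry free text only.
        entry["message"] = rest
        return entry
    fields = {}
    key = None
    for token in rest.split():
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = [value]
        elif key is not None:
            # Tokens without a colon belong to the previous key (the Leaders list).
            fields[key].append(token)
    for name, values in fields.items():
        if name == "Leaders":
            entry[name] = [v for v in values if v]
        else:
            entry[name] = values[0]
    return entry
```

For a block entry, the result has the Epoch, Slot, Block, Hash, Tax, Rewards and Stake values as strings and Leaders as a list of epoch.slot strings.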

Node restarting

The sentinel restarts the node if it is stuck on the same block for more than timeout seconds. If it finds that it's still stuck on the same height even after restarting, it restarts again using a snapshot of the storage from the day before, if one exists (the current storage is backed up first). It looks for a file with a name of the format snapshots/storage-backup-YYYY-MM-DD, which can be created with the following script called daily from the crontab:

cp -pR /root/cardano/storage /root/cardano/snapshots/storage-backup-`date +%Y-%m-%d`

If no snapshot is found from the previous day, the storage directory is removed completely so that it will be rebuilt automatically when Jormungandr starts.
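The choice between restoring yesterday's snapshot and starting from empty storage can be sketched as follows. This is a Python illustration of the rule described above, not the sentinel's own code. The function names, the "restore" and "wipe" labels and the base argument are assumptions, and the sketch only decides what to do without touching any files.

```python
from datetime import date, timedelta
from pathlib import Path


def snapshot_path(today, prefix="snapshots/storage-backup-", base=Path(".")):
    """Path of the snapshot made the day before `today` (prefix + YYYY-MM-DD)."""
    yesterday = today - timedelta(days=1)
    return base / (prefix + yesterday.strftime("%Y-%m-%d"))


def recovery_plan(today, prefix="snapshots/storage-backup-", base=Path(".")):
    """Return ('restore', path) if yesterday's snapshot exists, else ('wipe', None)."""
    snapshot = snapshot_path(today, prefix, base)
    if snapshot.is_dir():
        return ("restore", snapshot)
    # No snapshot: the storage is removed and rebuilt when the node starts.
    return ("wipe", None)
```

With the default prefix, a run on 21 January 2020 looks for snapshots/storage-backup-2020-01-20 relative to the sentinel's directory.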

It has been found that the CPU usage can slowly increase over many hours, and even when this doesn't happen, people have found that nodes can begin to suffer from the "Eek" problem after they've been up for a very long time. For these reasons, regular restarts are initiated when the node has been running for longer than maxUptime seconds, but only if the node is not very close to the epoch start or end, at least 1000 slots away from the next leader slot (to ensure that a long restart doesn't cause a block to be missed), and at least 100 slots away from the last leader slot (to ensure no broadcast issues). This regular restart uses the --quick option so that the trusted peers are removed to avoid the slow bootstrap process. This is OK since the node would have been in sync and connected to other peers, which will reconnect with it again after the restart.
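The conditions for a regular restart can be put together in one check. This Python sketch follows the rules in the paragraph above and is not the sentinel's actual code. The function name and parameters are made up, leader slots are assumed to be slot numbers within the current epoch, and epoch_margin is a placeholder value because the text only says "not very close" to the epoch boundaries.

```python
def regular_restart_due(uptime, slot, leader_slots, slots_per_epoch,
                        max_uptime=86400, epoch_margin=500,
                        next_gap=1000, last_gap=100):
    """Return True if a regular --quick restart is allowed right now."""
    if uptime <= max_uptime:
        return False
    # Stay away from the epoch boundaries (epoch_margin is an assumed value).
    if slot < epoch_margin or slot > slots_per_epoch - epoch_margin:
        return False
    upcoming = [s for s in leader_slots if s >= slot]
    past = [s for s in leader_slots if s < slot]
    # Too close to our next leader slot: a long restart could miss the block.
    if upcoming and min(upcoming) - slot < next_gap:
        return False
    # Too soon after our last leader slot: the block may still be propagating.
    if past and slot - max(past) < last_gap:
        return False
    return True
```

For example, with the leader slots from the log above (26701, 38287 and 39941), a node at slot 26145 would not be restarted even after exceeding maxUptime, because the next leader slot is fewer than 1000 slots away.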

CPU usage graph showing the effect of regular restarts

End of epoch report

The message that is emailed at the end of each epoch reports on the number of blocks that were produced by your node during the epoch and how many should have been produced. See the troubleshooting section below for an explanation as to why the expected number of blocks may not be produced. The email looks like the following example. I've highlighted some parts to show the information that the sentinel extracts from the registration repo.

Pudim a gatinha com fome has produced 6 blocks during epoch 21!
Epoch 21 has finished!

Pudim a gatinha com fome (PUDIM) produced 6 out of the 7 blocks scheduled for the epoch.

All your stake are belong with PUDIM!
https://pudim.od.gy

Note: You may want to install minimal mail sending capability on the server by using an external SMTP server.

See also