Our Cardano Sentinel

This may just be a problem on the testnet, but nodes seem to get stuck a lot and require a restart. This used to happen a lot with Masternodes as well, and the general solution was to run a "sentinel" script that checks on the node and performs the necessary actions when something's not right. In the case of Jormungandr a simple restart of the node seems to resolve the issue, so I made this sentinel.pl script, which is to be run from the crontab every minute. It checks whether a new block hash has been created within a certain time, and if not it restarts the node. If the node is still stuck on the same block even after restarting, then it has probably gotten itself onto a fork, so the sentinel backs up the chain data and logs and restarts the node from a clean slate.
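
For example, a crontab entry along the following lines runs the script every minute; the directory is taken from the snapshot example further down and may differ on your server:

# Hypothetical crontab entry: run the sentinel every minute from the directory
# that holds the jormungandr configuration and the sentinel script
* * * * * cd /root/cardano && ./sentinel.pl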

Dependencies

The script must be stored in the same directory as the jormungandr configuration, and there must also be a start.sh script that is able to start jormungandr in the background. The start script should in turn call the cardano-update-conf.pl script, which updates the jormungandr configuration to the currently reachable peers (as shown on adapools.org/peers). The start script should also allow the --quick option to be passed through to the update script, which removes all peers from the configuration to allow for quick restarts when necessary.
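
A minimal start.sh could look something like the sketch below. Only the behaviour described above comes from this page; the file names, log redirection and jormungandr options shown here are assumptions and will need adjusting to your own node:

#!/bin/bash
# start.sh - hypothetical sketch only; file names and options are assumptions.
# Refresh the peer list (or strip it entirely when --quick is passed through),
# then start jormungandr in the background.
cd "$(dirname "$0")"
./cardano-update-conf.pl "$@"
# A real node would also need its genesis block hash option; omitted here.
nohup jormungandr --config node-config.yaml --secret node-secret.yaml > jormungandr.log 2>&1 &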

Configuration

The configuration for the sentinel is a JSON file called sentinel.conf which contains the following parameters (an example is shown below the table):

Name        Meaning
period      The number of seconds between each call of the script.
timeout     The number of seconds of being on the same block height after which we should consider the node stuck.
maxUptime   The maximum number of seconds of uptime before a quick restart is enforced.
poolTool    The pooltool.io user name to use for publishing our block height and receiving the current known maximum height.
portion     Our portion of the node's stake (used in the end of epoch report).
email       The email address to send the end of epoch report to.
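
As a concrete illustration, a sentinel.conf might look like the following; every value here is made up (apart from a 60-second period matching the one-minute crontab) and should be replaced with your own:

{
    "period": 60,
    "timeout": 200,
    "maxUptime": 86400,
    "poolTool": "your-pooltool-username",
    "portion": 0.5,
    "email": "you@example.com"
}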

The sentinel.log file

The script writes the current situation to a file in the same directory called sentinel.log, which will look like the following snippet. The first part in square brackets is the Unix timestamp of the entry, followed by a duration after the slash. Then come the epoch, slot, block height and hash, and finally information on the tax received, rewards sent, amount staked with the pool, and the pending leaders at that point in time.

[1579599911/015] Epoch:38 Slot:26145 Block:117783-0 Hash:6b2...7cd Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579599922/011] Epoch:38 Slot:26151 Block:117784-- Hash:8a4...b25 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579599943/000] Doing regular shutdown...
[1579599961/080] Bootstrapping
[1579600046/124] Epoch:38 Slot:26162 Block:117786-8 Hash:fd9...379 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941
[1579600151/105] Epoch:38 Slot:26265 Block:117795-0 Hash:3ab...a88 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders:38.26701 38.38287 38.39941

Note1: The sentinel misses blocks that are generated more quickly than the polling period, so the log shouldn't be used as a definitive chain report. For this application it's only the slow updates that we're concerned about, so missing blocks are not a problem.

Node restarting

The sentinel restarts the node if it is stuck on the same block for more than timeout seconds. If it finds that it's still stuck at the same height even after restarting, it restarts again using a snapshot of the storage from the day before, if one exists (the current storage is backed up first). It looks for a file with a name of the format snapshots/storage-backup-YYYY-MM-DD, which can be created with the following command called daily from crontab:

cp -pR /root/cardano/storage /root/cardano/snapshots/storage-backup-`date +%Y-%m-%d`
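
For example, a crontab entry such as the following (the time of day is arbitrary) takes the snapshot once a day; note that percent signs must be escaped in crontab:

# Hypothetical crontab entry: snapshot the storage directory once a day, just after midnight
5 0 * * * cp -pR /root/cardano/storage /root/cardano/snapshots/storage-backup-`date +\%Y-\%m-\%d`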

If no snapshot is found from the previous day, the storage directory is removed completely so that it will be rebuilt automatically when Jormungandr starts.

It has been found that CPU usage can slowly increase over many hours, and even if this doesn't happen, people have found that nodes can begin to suffer from the "Eek" problem after they've been up for a very long time. For these reasons, regular restarts are initiated when the node has been running for longer than maxUptime seconds, but only if the system is at least 500 slots into the epoch and at least 1000 slots away from the next leader slot, to ensure that the restart doesn't cause a block to be missed. This regular restart uses the --quick option so that the trusted peers are removed to avoid the slow bootstrap process; this is fine since the node will have been in sync and connected to other peers, which will reconnect with it again after the restart.
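
The eligibility check described above boils down to something like the following Perl sketch; the variable names are made up for illustration and don't necessarily match those used in sentinel.pl:

# Hypothetical sketch of the regular-restart conditions (not the actual sentinel.pl code)
my $dueForRestart =
       $uptime > $maxUptime                   # up for longer than maxUptime seconds
    && $slot > 500                            # at least 500 slots into the epoch
    && $nextLeaderSlot - $slot > 1000;        # next leader slot at least 1000 slots away
# When due, the node is stopped and started again via "./start.sh --quick",
# which drops the trusted peers and so skips the slow bootstrap.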

End of epoch report

The message that is emailed at the end of each epoch reports on the number of blocks that were produced by your node during the epoch and how many should have been produced. See the troubleshooting section below for an explanation as to why the expected number of blocks may not be produced. The email looks like the following example. I've highlighted some parts to show the information that the sentinel extracts from the registration repo.

Pudim a gatinha com fome has produced 6 blocks during epoch 21!
Epoch 21 has finished!

Pudim a gatinha com fome (PUDIM) produced 6 out of the 7 blocks scheduled for the epoch.

All your stake are belong with PUDIM!
https://pudim.od.gy

Note1: You may want to install minimal mail sending capability on the server by using an external SMTP server.
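
One way to do this (the tool and all values below are just an example, not something the sentinel requires) is to install msmtp with its sendmail-compatible wrapper and point it at your SMTP provider:

# Hypothetical example on Debian/Ubuntu; any sendmail-compatible relay will do
apt-get install msmtp msmtp-mta

# /etc/msmtprc - every value below is a placeholder for your own SMTP account
account default
host smtp.example.com
port 587
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
auth on
user sentinel@example.com
password changeme
from sentinel@example.com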

See also