Our Cardano Sentinel

From Organic Design
Revision as of 14:36, 4 March 2020 by Nad (Configuration: minBefore and minAfter)

This may be just a problem on the testnet, but nodes seem to get stuck a lot and require a restart. This used to happen a lot with Masternodes as well, and the general solution was to run a "sentinel" script that checks on the node and performs the necessary processes when something's not right. In the case of Jormungandr a simple restart of the node seems to resolve the issue, so I made this sentinel.pl script which should be run in the background with nohup ./sentinel.pl &. It checks whether a new block hash has appeared within a certain time, and if not it restarts the node. If the node is still stuck on the same block even after restarting, then it has probably gotten itself onto a fork, so the sentinel backs up the chain data and logs and restarts the node from the previous day's backup.
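The decision described above can be sketched as a small shell function. This is only an illustration of the logic, not code taken from sentinel.pl: the function name next_action and its three arguments are hypothetical, and the real script works out these values itself.

```shell
# Hypothetical helper illustrating the sentinel's escalation logic.
# $1 = seconds since the block hash last changed
# $2 = the timeout setting, in seconds
# $3 = 1 if the node has already been restarted on this same block, else 0
next_action() {
    if [ "$1" -le "$2" ]; then
        echo "ok"                 # a new block arrived recently enough
    elif [ "$3" -eq 0 ]; then
        echo "restart"            # stuck: try a plain restart first
    else
        echo "restore-snapshot"   # still stuck after a restart: assume a fork
    fi
}

next_action 15 200 0     # prints "ok"
next_action 250 200 0    # prints "restart"
next_action 250 200 1    # prints "restore-snapshot"
```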

The CPU usage seems to creep up as well, and once it gets too high the node starts missing blocks, so a regular restart is also done as long as there's no upcoming leader slot (Michael Fazio has analysed this extensively and submitted this PR which looks like it will resolve it if accepted).

Installation

The script must be stored in the same directory as the Jormungandr configuration, and there must also be a start.sh script that is able to start Jormungandr in the background. The start script should in turn call the cardano-update-conf.pl script, which updates the Jormungandr configuration with the currently reachable peers (as shown on adapools.org/peers). The start script should also allow the --quick option to be passed through to the update script, which removes all peers from the configuration to allow for quick restarts when necessary.

The script expects the output from Jormungandr to be logged to a file, which by default is called debug.log. The new block announcements from this log are expected to be directed to another file, called blocks.log by default, which can be done with a simple tail -f command. This allows the sentinel to poll the size of the block log instead of making huge numbers of calls to the REST interface. This saves on resources and also makes the sentinel much closer to real-time, because the block log size can be polled much more frequently with little overhead. This polling loop lasts a maximum of 5 seconds, after which the script carries out its usual checks that Jormungandr is running and not stuck.
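The idea of polling the block log size can be sketched in shell as follows. This is a simplified illustration rather than the sentinel's own code: the function names log_size and wait_for_block are made up for this example, and the one-second polling interval is an assumption chosen for portability (the real script may well poll more often).

```shell
# Print the size of a file in bytes, or 0 if it does not exist yet.
log_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Wait until the block log grows or the time limit is reached.
# $1 = block log file, $2 = maximum number of seconds to wait
# Returns 0 as soon as the log has grown, 1 if the time limit passes first.
wait_for_block() {
    wfb_start=$(log_size "$1")
    wfb_waited=0
    while [ "$wfb_waited" -lt "$2" ]; do
        if [ "$(log_size "$1")" -gt "$wfb_start" ]; then
            return 0
        fi
        sleep 1
        wfb_waited=$((wfb_waited + 1))
    done
    return 1
}
```

For example, `wait_for_block blocks.log 5 && echo "new block"` returns as soon as a new announcement is appended, and otherwise gives up after about five seconds so that the other checks can run.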

A small script should be made and called regularly from the crontab to ensure that the tail command and the sentinel itself are running, and to restart them if not, for example:

#!/usr/bin/perl
# Watchdog: run from cron to keep the block-log tail and the sentinel alive
use strict;
use warnings;
use File::Basename;
use Cwd qw( realpath );
chdir realpath( dirname(__FILE__) );
# system() is used rather than exec(), because exec() never returns and the sentinel check would be skipped
system( 'tail -f debug.log | grep --line-buffered "applied block to storage" >> blocks.log &' ) unless qx( ps x | grep "[t]ail -f debug.log" );
system( './sentinel.pl &' ) unless qx( ps x | grep "[s]entinel" );

Configuration

The configuration for the sentinel is a JSON file called sentinel.conf which contains the following parameters:

| Name | Meaning | Default |
| --- | --- | --- |
| timeout | The number of seconds of being on the same block height after which we should consider the node stuck. | 200 |
| maxUptime | The maximum number of seconds of uptime before a quick restart is enforced, or 0 for no restarting. | 0 |
| minBefore | Regular restarts cannot occur less than this number of slots before a leader slot or epoch transition. | 500 |
| minAfter | Regular restarts cannot occur less than this number of slots after a leader slot or epoch transition. | 50 |
| poolTool | The pooltool.io user name to use for publishing our block height and receiving the current known maximum height. | - |
| portion | Your portion of the node's stake (used in the end of epoch report). | 0 |
| accountHex | The hex value of the rewards address that your portion is sent to. | - |
| email | The email address to send the end of epoch report to. | - |
| logFile | The name of the sentinel log file. | sentinel.log |
| debugLog | The name of the Jormungandr log file. | debug.log |
| blockLog | The name of the log file of block announcements. | blocks.log |
| dataDir | The location of the Jormungandr data. | storage |
| snapshots | The location and prefix of the data snapshots. | snapshots/storage-backup- |
| explorer | The URL of the preferred block explorer to use for the epoch reports. | Shelley Explorer |
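As an illustration, the following writes an example configuration to a file called sentinel.conf.example (a made-up name, so that an existing sentinel.conf is not overwritten). The values are the defaults from the table, apart from maxUptime (12 hours, an arbitrary choice) and the placeholder email address. It is an assumption that numeric values are written unquoted and that omitted parameters fall back to their defaults; parameters whose exact format isn't described here (poolTool, portion, accountHex, explorer) are left out.

```shell
# Write an example configuration; every value is a placeholder to adjust.
cat > sentinel.conf.example <<'EOF'
{
    "timeout"   : 200,
    "maxUptime" : 43200,
    "minBefore" : 500,
    "minAfter"  : 50,
    "email"     : "you@example.com",
    "logFile"   : "sentinel.log",
    "debugLog"  : "debug.log",
    "blockLog"  : "blocks.log",
    "dataDir"   : "storage",
    "snapshots" : "snapshots/storage-backup-"
}
EOF
```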

The accountHex value can be a bit tricky to find. First, to find the normal bech32 form, check PoolTool by clicking on your pool's ticker and looking at the list of delegators. You may have to check the balance of a few addresses before you find the one that matches your balance. To check the balance of an address use:

./jcli rest v0 account get addr1...0123 --host "http://127.0.0.1:3100/api"


Once you've found which bech32 address is yours, use the following command to get the ed25519 public key of the address:

./jcli address info addr1...0123


Then use the public key in the following command to get the final required hex value:

echo "ed25519_..." | ./jcli key to-bytes

The sentinel.log file

The script writes its status to a file in the same directory called sentinel.log, which will look like the following snippet. The first part in square brackets is the Unix timestamp of the entry, followed by a duration after the slash. Next come the epoch, slot, block height and hash, and finally information on the tax received, rewards sent, amount staked with the pool and the pending leaders at that point in time.

[1579599911/015] Epoch:38 Slot:26145 Block:117783-0 Hash:6b2...7cd Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders(3):26701 38287 39941
[1579599922/011] Epoch:38 Slot:26151 Block:117784-- Hash:8a4...b25 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders(3):26701 38287 39941
[1579599943/000] Doing regular shutdown...
[1579599961/080] Bootstrapping
[1579600046/124] Epoch:38 Slot:26162 Block:117786-8 Hash:fd9...379 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders(3):26701 38287 39941
[1579600151/105] Epoch:38 Slot:26265 Block:117795-0 Hash:3ab...a88 Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders(3):26701 38287 39941
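Since each status line is a series of Name:value pairs, individual fields are easy to pull out with standard tools. The following is a minimal sketch that assumes the line format shown above; the helper names log_field and log_time are hypothetical and not part of the sentinel.

```shell
# Print the value of one named field (e.g. Epoch, Slot, Block, Hash) from a status line.
# $1 = field name, $2 = the log line. Prints nothing if the field is absent.
log_field() {
    printf '%s\n' "$2" | sed -n "s/.* $1:\([^ ]*\).*/\1/p"
}

# Print the Unix timestamp from the square brackets at the start of a line.
log_time() {
    printf '%s\n' "$1" | sed -n 's/^\[\([0-9][0-9]*\)\/.*/\1/p'
}

line='[1579599911/015] Epoch:38 Slot:26145 Block:117783-0 Hash:6b2...7cd Tax:132051923 Rewards:6470539 Stake:26354993728 Leaders(3):26701 38287 39941'
log_time "$line"         # prints 1579599911
log_field Slot "$line"   # prints 26145
```

Lines such as "Doing regular shutdown..." carry no fields, so log_field prints nothing for them.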

Node restarting

The sentinel restarts the node if it is stuck on the same block for more than timeout seconds. If it finds that the node is still stuck on the same height even after restarting, it restarts again using a snapshot of the storage from the day before if one exists (the current storage is backed up first). It looks for a file with a name of the format snapshots/storage-backup-YYYY-MM-DD, which can be created by running the following command daily from crontab:

cp -pR /root/cardano/storage /root/cardano/snapshots/storage-backup-`date -I`

If no snapshot is found from the previous day, the storage directory is removed completely so that it will be rebuilt automatically when Jormungandr starts.
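The recovery step can be sketched in shell as follows. This is an illustration of the behaviour described above, not the sentinel's actual code: the function name recover_storage and the backup destination passed as the fourth argument are hypothetical, since the name the sentinel gives to its backup of the stuck storage isn't stated here.

```shell
# Replace stuck storage with yesterday's snapshot, or remove it if there is none.
# $1 = data directory          (e.g. storage)
# $2 = snapshot prefix         (e.g. snapshots/storage-backup-)
# $3 = yesterday's date        (YYYY-MM-DD)
# $4 = where to keep the stuck storage (hypothetical name, chosen by the caller)
recover_storage() {
    rs_data=${1:?data directory required}
    rs_snapshot="$2$3"
    rs_backup=${4:?backup path required}
    if [ -d "$rs_snapshot" ]; then
        mv "$rs_data" "$rs_backup" || return 1   # back up the current storage first
        cp -pR "$rs_snapshot" "$rs_data"         # then restore the snapshot
        echo "restored"
    else
        rm -rf "$rs_data"                        # no snapshot: let the node rebuild it
        echo "removed"
    fi
}
```

With GNU date, yesterday's date in the required format can be obtained with `date -I -d yesterday`, so a call might look like `recover_storage storage snapshots/storage-backup- "$(date -I -d yesterday)" storage-stuck`.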

It has been found that CPU usage can slowly increase over many hours, and even when this doesn't happen, people have found that nodes can begin to suffer from the "Eek" problem after they've been up for a very long time. For these reasons, a regular restart is initiated when the node has been running for longer than maxUptime seconds, but only if the node is at least minBefore slots (500 by default) before the next leader slot or epoch transition, so that a restart doesn't cause a block to be missed, and at least minAfter slots (50 by default) after the last one, to avoid block broadcast issues. This regular restart uses the --quick option so that the trusted peers are removed to avoid the slow bootstrap process. This is OK since the node would have been in sync and connected to other peers, which will reconnect with it after the restart.
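The conditions for a regular restart can be expressed as a single check. The sketch below uses the maxUptime, minBefore and minAfter settings from the configuration table; the function name restart_allowed and its argument order are hypothetical, and it assumes the caller has already worked out the slots of the nearest past and future leader slot or epoch transition.

```shell
# Return 0 if a regular restart may happen now, 1 otherwise.
# $1 = node uptime in seconds     $2 = maxUptime
# $3 = current slot               $4 = slot of the last leader slot or epoch transition
# $5 = slot of the next leader slot or epoch transition
# $6 = minBefore                  $7 = minAfter
restart_allowed() {
    [ "$2" -gt 0 ] || return 1               # maxUptime of 0 disables regular restarts
    [ "$1" -gt "$2" ] || return 1            # node has not been up long enough yet
    [ $(( $5 - $3 )) -ge "$6" ] || return 1  # too close to the next event
    [ $(( $3 - $4 )) -ge "$7" ] || return 1  # too soon after the last event
    return 0
}

# Slot 26151 with the next leader slot at 26701 leaves 550 slots, which passes minBefore=500
restart_allowed 50000 43200 26151 20000 26701 500 50 && echo "restart now"
```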

CPU usage graph showing the effect of regular restarts

End of epoch report

The message that is emailed at the end of each epoch reports on the number of blocks that were produced by your node during the epoch and how many should have been produced. See the troubleshooting section of the set up article for an explanation as to why the expected number of blocks may not be produced. The email looks like the following example. I've highlighted some parts to show the information that the sentinel extracts from the registration repo.

Pudim a gatinha com fome has produced 7 blocks during epoch 41!
Epoch 41 has finished!

Pudim a gatinha com fome (PUDIM) produced 7 out of the 9 blocks scheduled for the epoch.

Total reward during the epoch was ₳7,634.14 (from total stake of ₳26,755,734.78)

Our reward was ₳297.56 (from ₳972,329.97 stake and ₳20.13 in fees)
Tax received by the pool was ₳155.80

Total earnings for the epoch is ₳453.36

All your stake are belong with PUDIM!
https://pudim.od.gy

Note1: You may want to install minimal mail sending capability on the server by using an external SMTP server.

Using the node's own explorer interface

The epoch report uses the GraphQL explorer interface to query the number of blocks the node created during the epoch that were accepted by the network. By default it uses the Shelley explorer, but this is often quite unreliable, so it can be better to use the node's own explorer interface.


First, the explorer interface needs to be enabled in your Jormungandr configuration by adding the following and restarting the node:

"explorer" : {
   "enabled" : true
}


Then you need to add the "explorer" parameter to the sentinel.conf file which will be in the following format, but with your own REST port:

"explorer" : "http://localhost:3100/explorer/graphql"
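To check that the explorer interface is answering before pointing the sentinel at it, a minimal GraphQL request can be sent with curl. This sketch deliberately assumes nothing about the explorer's schema: it only asks for __typename, which any GraphQL endpoint should be able to answer. The helper name explorer_ping is made up for this example.

```shell
# Smallest possible GraphQL request body; __typename is defined by the GraphQL
# specification itself, so no knowledge of the explorer's schema is assumed.
EXPLORER_QUERY='{"query":"{ __typename }"}'

# Send the query to the GraphQL endpoint given as $1 and print the raw response.
explorer_ping() {
    curl -s -X POST -H 'Content-Type: application/json' \
         --data "$EXPLORER_QUERY" "$1"
}
```

For example, `explorer_ping http://localhost:3100/explorer/graphql` should print a short JSON response if the explorer is enabled, and nothing (or an error) if it is not.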

Other monitors

See also