Our Cardano Sentinel

From Organic Design wiki
Revision as of 16:18, 20 January 2020 by Nad (talk | contribs) (Configuration)

This may be just a problem on the testnet, but nodes seem to get stuck a lot and require a restart. This used to happen a lot with Masternodes as well, and the general solution was to run a "sentinel" script that checks on the node and perform the necessary processes when something's not right. In the case of Jormungandr a simple restart of the node seems to resolve the issue, so I made this sentinel.pl script that is to be run from the crontab every minute. It checks if no new block hash has been created for more than a certain time, and if not it restarts the node. If it's still stuck on the same block even after restarting, then the node has probably gotten itself onto a fork, so the sentinel backs up the chain data and logs and restarts the node from a clean slate.

Dependencies

The script must be stored in the same directory as the jormungandr configuration and there must also be a start.sh script that is able to start jormungandr in the background. The start should in turn call the cardano-update-conf.pl script which updates jormungandr configuration to the currently reachable peers (as shown on adapools.org/peers). This start script should also allow passing of the --quick option to the update script which removes all peers from the configuration to allow for quick restarts when necessary.

Configuration

The configuration for the sentinel is a JSON file called sentinel.conf which contains the following parameters:

Name Meaning
period The number of seconds between each call of the script.
timeout The number of seconds of being on the same block height after which we should consider the node stuck.
poolTool The pooltool.io user name to use for publishing out block height and receiving the current known maximum height.
portion Our portion of the nodes stake (used in the end of epoch report).
email The email address to send the end of epoch report to.

The sentinel.log file

The script writes the situation to a file in the same directory called sentinel.log which will look like the following snippet. The first part in square brackets is the Unix timestamp of the entry followed by a duration after the slash.

[1577145836/040] Epoch:10 Slot:8709 Block:34597 Hash:7c1db1bbb77d88636349371d5e7ef60e3cb8a2257e2d9aa8d109eecce3236689
[1577145891/055] Epoch:10 Slot:8735 Block:34598 Hash:fb84bb2a0e48233837d1d5773219b3be9661f333f3aaea41867372f0ed224197
[1577145916/025] Epoch:10 Slot:8746 Block:34599 Hash:c3da1a7f72815280f70e1956b33fda993ad788317515dbd2a5c74e96293b5c2b
[1577146041/125] Stuck on 10.8746, restarting node...
[1577146046/000] Status unknown, check if the node is running!
[1577146051/030] Bootstrapping
[1577146086/000] Epoch:10 Slot:8821 Block:34600 Hash:c93189408bcf7e9fe954e72502f720b661002fb365c3283b16b0a8260bd2cf4f
[1577146126/040] Epoch:10 Slot:8854 Block:34601 Hash:c3af8796c932f635067757b4cbaa1c58c6e7623d1660902172f0cf59b439d12e
[1577146136/010] Epoch:10 Slot:8858 Block:34602 Hash:b21af406419002e5b57b7e3bee7dec5378d10f76670d7780a642fbe1d2e60082
[1577146167/031] Epoch:10 Slot:8872 Block:34603 Hash:573c2b9142eff7435d2aa1cd657b808546c592c62e3710d0143670032ad0fecc

Note1: The sentinel misses blocks that are generated quicker than the polling period, so the log shouldn't be used as a definitive chain report, for this application it's only the slow updates that we're concerned about, so missing blocks are not a problem.

Note2: The sentinel appends scheduled leader slots to the first block entry in a new epoch.

End of epoch report

The message that is emailed at the end of each epoch reports on the number of blocks that were produced by your node during the epoch and how many should have been produced. See the troubleshooting section below for an explanation as to why the expected number of blocks may not be produced. The email looks like the following example. I've highlighted some parts to show the information that the sentinel extracts from the registration repo.

Pudim a gatinha com fome has produced 6 blocks during epoch 21!
Epoch 21 has finished!

Pudim a gatinha com fome (PUDIM) produced 6 out of the 7 blocks scheduled for the epoch.

All your stake are belong with PUDIM!
https://pudim.od.gy

Note1: You may want to install minimal mail sending capability on the server by using an external SMTP server.

See also