Our Cardano Sentinel

From Organic Design wiki
Revision as of 11:02, 20 January 2020 by Nad (talk | contribs) (from main set p article)

This may be just a problem on the testnet, but nodes seem to get stuck a lot and require a restart. This used to happen a lot with Masternodes as well, and the general solution was to run a "sentinel" script that checks on the node and performs the necessary processes when something's not right. In the case of Jormungandr a simple restart of the node seems to resolve the issue, so I made this sentinel.pl script, which is to be run from the crontab every minute. It checks whether a new block hash has appeared within a certain time, and if none has, it restarts the node. If the node is still stuck on the same block even after restarting, then it has probably gotten itself onto a fork, so the sentinel backs up the chain data and logs and restarts the node from a clean slate.
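The two-stage recovery described above can be sketched as a small decision function. This is only an illustration in Python, not the logic of sentinel.pl itself; the names SentinelState and decide, and the action strings, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentinelState:
    last_hash: Optional[str] = None      # most recent block hash seen
    last_change: float = 0.0             # time at which the hash last changed
    restarted_on: Optional[str] = None   # hash the node was stuck on at the last restart


def decide(state: SentinelState, current_hash: str, now: float, stuck_after: float) -> str:
    """Return "ok", "restart" or "reset", updating state in place."""
    if current_hash != state.last_hash:
        # A new block arrived, so the node is making progress.
        state.last_hash = current_hash
        state.last_change = now
        state.restarted_on = None
        return "ok"
    if now - state.last_change <= stuck_after:
        return "ok"  # no new block yet, but still within the allowed time
    # The node is stuck: start a fresh waiting period before acting again.
    state.last_change = now
    if state.restarted_on == current_hash:
        # Already restarted on this block and still stuck: probably a fork,
        # so the caller should back up the chain data and start clean.
        state.restarted_on = None
        return "reset"
    state.restarted_on = current_hash
    return "restart"
```

A caller would run start.sh on "restart", and on "reset" would first move the chain data and logs aside before starting the node again.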

Dependencies

Todo...

Configuration

The script takes three parameters. The first is the number of seconds to wait between polling the node; since the script is launched from the crontab every minute, this must divide evenly into 60 (the log below was produced with a 5 second period). The second is the number of seconds of seeing no new block hash after which we should assume that the node is stuck and restart it. The third is your email address, so that you can be notified at the end of each epoch about the number of blocks created. The script assumes that a local node is running, and expects the node's yaml config file and a script called start.sh, which starts the node in the background with all the necessary parameters, to both be in the same directory as the script.
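As a rough illustration of the three parameters, here is how they might be read and sanity-checked. This is a Python sketch rather than the Perl of sentinel.pl, and the names Settings and parse_settings are made up for the example.

```python
from typing import NamedTuple, Sequence


class Settings(NamedTuple):
    poll_seconds: int    # first parameter: seconds between polls of the node
    stuck_seconds: int   # second parameter: seconds without a new hash before restarting
    email: str           # third parameter: address for the end of epoch report


def parse_settings(argv: Sequence[str]) -> Settings:
    """Turn the three command line parameters into a Settings tuple."""
    if len(argv) != 3:
        raise ValueError("expected: <poll seconds> <stuck seconds> <email>")
    poll, stuck = int(argv[0]), int(argv[1])
    if poll <= 0 or stuck <= 0:
        raise ValueError("both durations must be positive numbers of seconds")
    return Settings(poll, stuck, argv[2])
```

For example, parse_settings(["5", "120", "you@example.com"]) would describe a sentinel that polls every 5 seconds and treats 120 seconds without a new block hash as stuck.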

The sentinel.log file

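The script writes the current status to a file called sentinel.log in the same directory, which will look like the following snippet. The part in square brackets is the Unix timestamp of the entry, followed after the slash by a duration in seconds. For consecutive block entries in the snippet, the duration matches the time elapsed since the previous entry.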

[1577145836/040] Epoch:10 Slot:8709 Block:34597 Hash:7c1db1bbb77d88636349371d5e7ef60e3cb8a2257e2d9aa8d109eecce3236689
[1577145891/055] Epoch:10 Slot:8735 Block:34598 Hash:fb84bb2a0e48233837d1d5773219b3be9661f333f3aaea41867372f0ed224197
[1577145916/025] Epoch:10 Slot:8746 Block:34599 Hash:c3da1a7f72815280f70e1956b33fda993ad788317515dbd2a5c74e96293b5c2b
[1577146041/125] Stuck on 10.8746, restarting node...
[1577146046/000] Status unknown, check if the node is running!
[1577146051/030] Bootstrapping
[1577146086/000] Epoch:10 Slot:8821 Block:34600 Hash:c93189408bcf7e9fe954e72502f720b661002fb365c3283b16b0a8260bd2cf4f
[1577146126/040] Epoch:10 Slot:8854 Block:34601 Hash:c3af8796c932f635067757b4cbaa1c58c6e7623d1660902172f0cf59b439d12e
[1577146136/010] Epoch:10 Slot:8858 Block:34602 Hash:b21af406419002e5b57b7e3bee7dec5378d10f76670d7780a642fbe1d2e60082
[1577146167/031] Epoch:10 Slot:8872 Block:34603 Hash:573c2b9142eff7435d2aa1cd657b808546c592c62e3710d0143670032ad0fecc

Note1: The sentinel misses blocks that are generated more quickly than the polling period, so the log shouldn't be used as a definitive chain report. For this application it's only the slow updates that we're concerned about, so missing blocks are not a problem.

Note2: The sentinel appends scheduled leader slots to the first block entry in a new epoch.
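If you want to pull figures out of sentinel.log, the block entries are regular enough to match with a single pattern. The following Python sketch is based only on the snippet above; the name parse_block_line is hypothetical, and the pattern deliberately does not match to the end of the line, because the first entry of an epoch carries extra leader slot information.

```python
import re
from typing import Optional

# Matches entries such as:
# [1577145891/055] Epoch:10 Slot:8735 Block:34598 Hash:fb84...
BLOCK_LINE = re.compile(
    r"^\[(?P<time>\d+)/(?P<duration>\d+)\] "
    r"Epoch:(?P<epoch>\d+) Slot:(?P<slot>\d+) "
    r"Block:(?P<block>\d+) Hash:(?P<hash>[0-9a-f]+)"
)


def parse_block_line(line: str) -> Optional[dict]:
    """Return the fields of a block entry, or None for any other kind of entry."""
    match = BLOCK_LINE.match(line)
    if match is None:
        return None  # e.g. "Stuck on ...", "Bootstrapping" or "Status unknown ..."
    entry = {key: int(value) for key, value in match.groupdict().items() if key != "hash"}
    entry["hash"] = match.group("hash")
    return entry
```

Entries that return None are the status messages, so counting them between block entries gives a quick picture of how often the node needed attention.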

End of epoch report

The message that is emailed at the end of each epoch reports on the number of blocks that were produced by your node during the epoch and how many should have been produced. See the troubleshooting section below for an explanation as to why the expected number of blocks may not be produced. The email looks like the following example. I've highlighted some parts to show the information that the sentinel extracts from the registration repo.

Pudim a gatinha com fome has produced 6 blocks during epoch 21!
Epoch 21 has finished!

Pudim a gatinha com fome (PUDIM) produced 6 out of the 7 blocks scheduled for the epoch.

All your stake are belong with PUDIM!
https://pudim.od.gy

Note1: You may want to install minimal mail sending capability on the server by using an external SMTP server.
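For reference, the subject and the opening lines of the report could be assembled along these lines. This is a Python sketch whose wording is copied from the example email above; the function name epoch_report and its parameters are hypothetical, and it leaves out the closing lines of the example.

```python
from email.message import EmailMessage


def epoch_report(name: str, ticker: str, epoch: int,
                 produced: int, scheduled: int, to_addr: str) -> EmailMessage:
    """Build an end of epoch report modelled on the example email."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"{name} has produced {produced} blocks during epoch {epoch}!"
    msg.set_content(
        f"Epoch {epoch} has finished!\n\n"
        f"{name} ({ticker}) produced {produced} out of the {scheduled} "
        "blocks scheduled for the epoch.\n"
    )
    return msg
```

The resulting message could then be passed to send_message on an smtplib.SMTP connection to whichever external SMTP server you have set up.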

See also