Set up a Cardano ITN staking pool

From Organic Design wiki
Revision as of 19:35, 22 January 2020 by Nad (talk | contribs) (Some blocks are not created at all: leaders log of rejected block)

To run a staking pool you'll need a reasonable server that is on a reliable high-bandwidth connection. First you need to install the node software, then create the cryptographically signed staking pool parameters associated with a funded pledge address, and then finally register your pool so it shows up in the staking wallets. This section is mainly based on the instructions at Stake Pool Operators How-To with a few differences.

Dependencies

It's a good idea to install chrony and add a pool closest to your server. Accurate time means less likelihood of rejected blocks. Use chronyc tracking to check the current status of the time, and chronyc sources to see which actual time servers are being used.
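As a rough illustration, the offset reported by chronyc tracking can also be checked from a script. This is only a sketch: it assumes the usual "System time : N seconds fast/slow of NTP time" line in the output, the helper names are made up for this example, and the 0.05 second limit is an arbitrary value rather than a documented threshold.

```sh
# Print the clock offset in seconds from `chronyc tracking` output read on stdin.
tracking_offset() {
    awk '/^System time/ { print $4 }'
}

# Succeed only if that offset is no larger than the limit (in seconds) given as $1.
# Fails if no "System time" line is found at all.
offset_ok() {
    awk -v max="$1" '
        /^System time/ { found = 1; ok = ($4 + 0 <= max + 0) }
        END { exit !(found && ok) }'
}

# Example use on a server with chrony installed (0.05 is an assumed limit):
#   chronyc tracking | offset_ok 0.05 || echo "clock offset too large" >&2
```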

Install and configure Jormungandr

We start by installing the latest release (not a pre-release) of Jormungandr from the official repo (it's a good idea to subscribe to the repo's feed so you know as soon as new stable releases are available). I found installing from source pretty straightforward using their instructions too; the only issue was that I needed to install the pkg-config package with apt in addition to their listed prerequisites. Note that you need to log out and back in again for the Rust paths to take effect, and the final binaries are located in ~/.cargo/bin.

One important difference from their configuration procedure is that we need to use the itn_rewards_v1 configuration rather than the beta configuration. There are a couple of other differences too. First, I changed the port to 3000, as this seems to be what the vast majority of nodes on the network are running, with 3100 for the internal REST interface. I also changed the logging output to stdout, and had to add a storage location to make the chain data persistent. The first few sections of your config file should look something like this:

{
  "log": [
    {
      "format": "plain",
      "level": "info",
      "output": "stdout"
    }
  ],
  "storage": "./storage/",
  "p2p": {
    "listen_address": "/ip4/0.0.0.0/tcp/3000",
    "public_address": "/ip4/1.2.3.4/tcp/3000",
    "topics_of_interest": {
      "blocks": "high",
      "messages": "high"
    },
    . . .

Another issue is that I was not able to find any genesis hash in the configuration as it says there should be, so I had to obtain it myself from the last page of slots for epoch 0 which yields this (later I found this parameter and others here). It's best to put this genesis hash into a file called genesis-hash.txt so that it can be referred to easily from other programs when needed.
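A mistyped hash only shows up as a startup failure, so a quick sanity check on genesis-hash.txt may be worthwhile. This sketch assumes the hash is 64 hexadecimal characters, like the block hashes shown elsewhere on this page; valid_hash is a hypothetical helper name.

```sh
# Succeed if the file named in $1 holds a single 64-character hexadecimal hash
# (surrounding whitespace and the trailing newline are ignored).
valid_hash() {
    hash=$(tr -d ' \n\r' < "$1")
    printf '%s\n' "$hash" | grep -Eqx '[0-9a-fA-F]{64}'
}

# Example: valid_hash genesis-hash.txt || echo "genesis-hash.txt looks wrong" >&2
```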

I created a script called start.sh to run it with the correct parameters in the background and redirect its output to a log file. It also calls another script I made called cardano-update-conf.pl, which rebuilds the config file with the current list of good peers available from adapools.org/peers. You'll need to download this script into your pool's directory as well if you want to use it.

#!/bin/sh
./cardano-update-conf.pl
nohup ./jormungandr --config itn_rewards_v1-config.yaml --genesis-block-hash `cat genesis-hash.txt` >> debug.log &

If you see no errors in the log and the daemon keeps running, you can check the sync progress by running the node stats command and checking that the lastBlockDate matches the current epoch and slot shown in the Shelley explorer.

./jcli rest v0 node stats get --host "http://127.0.0.1:3100/api"
blockRecvCnt: 41
lastBlockContentSize: 0
lastBlockDate: 7.35623
lastBlockFees: 0
lastBlockHash: "f78c64c030383899ebb1b25dac7ae9d360d222d0b80320323375dc51762651d2"
lastBlockHeight: 26342
lastBlockSum: 0
lastBlockTime: "2019-12-21T15:15:20+00:00"
lastBlockTx: 0
state: "Running"
txRecvCnt: 45
uptime: 886
version: "jormungandr 0.8.3-8f276c0"
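For scripting, a single field can be pulled out of this output. The following is a minimal sketch that assumes the "key: value" layout shown above; stat_field is a hypothetical helper name, not part of jcli.

```sh
# Print one field (named in $1) from `node stats get` output read on stdin,
# with any double quotes around the value removed.
stat_field() {
    awk -v key="$1:" '$1 == key { sub(/^[^:]*: */, ""); gsub(/"/, ""); print }'
}

# Example use against a running node:
#   date=$(./jcli rest v0 node stats get --host "http://127.0.0.1:3100/api" | stat_field lastBlockDate)
#   echo "epoch ${date%%.*}, slot ${date#*.}"
```

The epoch and slot printed this way can then be compared by eye, or by another script, with what the Shelley explorer shows.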

To shut the node down gracefully use:

./jcli rest v0 shutdown get --host "http://127.0.0.1:3100/api"

Create and fund a reward address

Now we need to create three files for our reward account: a public/private key pair and its corresponding ADA address, which I did by following the instructions in how to register your stake pool on the chain.

./jcli key generate --type ed25519 | tee owner.prv | ./jcli key to-public > owner.pub
./jcli address account --testing --prefix addr `cat owner.pub` > owner.addr

You can then send funds (minimum 510 ADA) to the address in the owner.addr file from Daedalus or Yoroi, and then check the balance:

./jcli rest v0 account get `cat owner.addr` -h http://127.0.0.1:3100/api
counter: 0
delegation:
  pools: []
last_rewards:
  epoch: 0
  reward: 0
value: 550000000
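The value field is in lovelace (1 ADA is 1,000,000 lovelace), so the 550000000 above is 550 ADA. A small sketch for checking this from a script follows; the helper names are made up for this example and it assumes the output layout shown above.

```sh
# Print the account balance in ADA from `account get` output read on stdin.
balance_ada() {
    awk '$1 == "value:" { printf "%.6f\n", $2 / 1000000 }'
}

# Succeed if the balance read from stdin is at least the number of ADA given as $1.
has_min_ada() {
    awk -v min="$1" '
        $1 == "value:" { found = 1; ok = ($2 >= min * 1000000) }
        END { exit !(found && ok) }'
}

# Example use, with the 510 ADA minimum mentioned above:
#   ./jcli rest v0 account get `cat owner.addr` -h http://127.0.0.1:3100/api | has_min_ada 510
```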

Create the stake pool and publish to the blockchain

Finally we need to create the stake pool itself, which can be done by calling the handy createStakePool.sh and send-certificate.sh scripts. You only need to run the former script, which calls the latter, but make sure both are executable first. The script takes four parameters: the REST listening port (3100 in our configuration), the fixed tax (in lovelace), the tax percentage as a fraction, and the private key of your reward address that you put in the owner.prv file above. For example:

./createStakePool.sh 3100 1000000 5/100 OWNER_PRIV_KEY | tee results.txt

This will create a pool that takes 1 ADA (1M lovelace) as a fixed rate, plus 5%. Note that the instructions say you need another tax_limit parameter, but this must have been removed at some point. The script returns two important values that you need to keep, the Pool ID and Pool Owner, and by appending the tee command all the output is also captured in results.txt. It also creates the important node_secret.yaml file that is used when starting jormungandr from now on. Check the output for errors and for successful signing and sending of the new pool registration transaction. You should see something like this in your output:

## 10. Encode and send the transaction
56ded95ea6868470337272ef899264abb5c27dcdd2f9aae839924dca19b5dd3f
 ## 11. Remove the temporary files
 ## Waiting for new block to be created (timeout = 200 blocks = 400s)
New block was created - 8fe7ac108640778ca53ce4d38ed8b7b6092454770d4aaf04a38ec548cc66b330

Note: If anything goes wrong in this process, you're best off creating a new pledge address before trying again, because if you end up with more than one pool operating on the same pledge address, only the last one will work.

Backup your pool data

IMPORTANT: As soon as you've created an address and node secret, create a directory for it using the stake pool ID, or its first few characters, as the name, and copy all the specific files into it so you have them in case you need them later. For example, you need them if you want to retire the pool or sign any messages as that pool owner. Even if it's just a dummy run and you're sure you'll never need to refer to them again, do it anyway! The files are:

  • node_secret.yaml
  • owner.prv
  • owner.pub
  • owner.addr
  • stake_pool.id
  • results.txt

If you ever need to rebuild your pool, for example if you need to move server, then simply put these files into the directory after you've put all the program files and scripts in place. When you then run jormungandr it will start as that pool and begin retrieving the blockchain data.
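The backup step described in this section could be scripted along these lines. This is a sketch only: backup_pool is a hypothetical helper, and using the first eight characters of stake_pool.id for the directory name is just one reading of "its first few characters".

```sh
# Copy the pool's key files from directory $1 into a new sub-directory of $2
# named after the first eight characters of the pool ID.
backup_pool() {
    src=$1
    id=$(head -n 1 "$src/stake_pool.id" | cut -c1-8)
    [ -n "$id" ] || return 1            # no readable pool ID, nothing to name the backup
    dest=$2/$id
    mkdir -p "$dest" || return 1
    for f in node_secret.yaml owner.prv owner.pub owner.addr stake_pool.id results.txt; do
        cp -p "$src/$f" "$dest/" || return 1
    done
    chmod 700 "$dest"                   # the directory holds private keys
}

# Example, run from the pool directory (the destination path is an assumption):
#   backup_pool . "$HOME/pool-backups"
```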

Start your pool!

Now you're ready to shut your node down and restart it with your secret key parameter to start it as a pool!

./jcli rest v0 shutdown get --host "http://127.0.0.1:3100/api"
nohup ./jormungandr --config itn_rewards_v1-config.yaml --secret node_secret.yaml --genesis-block-hash `cat genesis-hash.txt` >> debug.log &

Note: Remember to add the --secret node_secret.yaml parameter to the command in your start.sh script.
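One way to avoid forgetting the parameter is to let start.sh add it whenever the secret file exists. A minimal sketch, assuming the file names used on this page; node_args is a hypothetical helper.

```sh
# Print the jormungandr arguments for the pool directory given as $1,
# adding --secret only once node_secret.yaml exists there.
node_args() {
    args="--config itn_rewards_v1-config.yaml --genesis-block-hash $(cat "$1/genesis-hash.txt")"
    if [ -f "$1/node_secret.yaml" ]; then
        args="$args --secret node_secret.yaml"
    fi
    printf '%s\n' "$args"
}

# In start.sh (word splitting of the unquoted substitution is intended):
#   ./cardano-update-conf.pl
#   nohup ./jormungandr $(node_args .) >> debug.log &
```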

Register your pool in the official registry

To allow people to delegate their stake to your pool in a supporting wallet, you need to add your pool to the public registry. This is done by creating a JSON file containing your pool's details and a signature file made by signing the JSON content with the owner's private key, and then committing these two files to the registry's GitHub repo.

The name of the JSON file is your owner public key from the owner.pub file with a .json file extension appended. The content of the file is as follows. The "owner" field is the same key as used in the filename, and the "pledge_address" field is the owner address from the owner.addr file.

{
  "owner": "OWNER_PUBKEY",
  "name": "Pudim o gatinho com fome",
  "description": "All your stake are belong with PUDIM!",
  "ticker": "PUDIM",
  "homepage": "https://pudim.od.gy",
  "pledge_address": "OWNER_ADDR"
}

Note: This file must be valid JSON (e.g. you must use double quotes) otherwise the pull request will fail.
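Generating the file from owner.pub and owner.addr avoids copy-and-paste mistakes in the key, the address and the filename. The following is a sketch only: write_metadata is a hypothetical helper, and it does no JSON escaping, so the text arguments must not contain double quotes or backslashes.

```sh
# Write the registry metadata file OWNER_PUBKEY.json into directory $1, reading
# the key and address from owner.pub and owner.addr in that directory.
# $2..$5 are the pool name, description, ticker and homepage.
write_metadata() {
    pub=$(cat "$1/owner.pub") || return 1
    addr=$(cat "$1/owner.addr") || return 1
    cat > "$1/$pub.json" <<EOF
{
  "owner": "$pub",
  "name": "$2",
  "description": "$3",
  "ticker": "$4",
  "homepage": "$5",
  "pledge_address": "$addr"
}
EOF
}

# Example, run from the pool directory:
#   write_metadata . "Pudim o gatinho com fome" "All your stake are belong with PUDIM!" PUDIM https://pudim.od.gy
```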


Then to sign this JSON file with the owner's private key, you use jcli as follows:

./jcli key sign --secret-key owner.prv --output `cat owner.pub`.sig `cat owner.pub`.json


You then need to fork the Cardano Foundation's incentivized-testnet-stakepool-registry GitHub repo, clone your new fork of it, add your two files into the registry directory, add, commit and push them, and then create a pull request on the GitHub site from your forked repo's page. It's always best to create a new branch for a pull request, because all commits, even those made after you've opened the request, are automatically included in the pull request at the upstream repo. The following example assumes that you've cloned the repo into your pool directory; if you haven't, adjust the path to your keys as necessary.

git clone git@github.com:YOUR_GITHUB_USERNAME/incentivized-testnet-stakepool-registry.git
cd incentivized-testnet-stakepool-registry
git checkout -b PUDIM
cd registry
cp ../../OWNER_PUBKEY.* ./
git add *
git commit -m "PUDIM"
git push --set-upstream origin PUDIM

The pull request will verify that your JSON is valid and that your signature verifies, and if so the team should approve it for inclusion in the registry shortly after, and your stake pool will be listed in the delegation interface of Daedalus!

The image below shows two pull requests to the official registry repo. Our PUDIM registration has passed, meaning that the JSON syntax is all correct and the signature has been successfully verified, but the MORON pool registration has not been successful. Even after a successful pull request, manual validation is required by the Cardano team; ours was accepted the next day.

Pudim-pool.jpg
Pudim-pool-accepted.jpg

Note: If you want to remove your pool from the registry or change its details, see these details, which involve creating two pull requests: one for a signed "voiding" of the old metadata file, and another to add the new metadata and signature files.

Register with PoolTool.io

As a pool operator, it's a good idea to sign up with pooltool.io and claim your pool on the site, i.e. find your pool in the list after you have an account and claim it to associate it with yourself as the owner. This is a great site for seeing clear statistics about your pool in real time, and how it compares to other pools.

Users who are signed up and run a pool can send their current pool block height to the site, which allows the site to know the current maximum height of the network. This allows the PoolTool site to show you clearly if your node is up to date. Using this information, the site can also display the approximate percentage of nodes that are synchronised.

Pooltool-network-info.jpg


You can also supply your PoolTool user ID as the fourth parameter to the sentinel script, and it will take care of sharing your node's block height with the PoolTool site so you can easily see there whether it's up to date, as shown below in the green bubble to the right. If it's green, the node is less than ten blocks behind the maximum, which is considered synchronised. If a node gets ten or more blocks behind, the bubble becomes orange and then red.

Pudim-on-pooltool.jpg

By providing your block height to PoolTool, they will return the current maximum height across all shared heights they've received, which allows your sentinel to show how far behind your node is. This is shown as a negative integer appended to the block number in the log, as in the example below. Most of the blocks should be appended with "-0". Sometimes you'll see block heights appended with "--", which means the request to PoolTool failed, for example due to taking longer than the 1s timeout limit imposed by the sentinel script.

[1577918246/109] Epoch:19 Slot:6085 Block:62588-2 Hash:7f0ed4a88a80104aea8e9162fe618b6f8d3d480773dd94eb3b66729c9bdd4c7b
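To get a feel for how often the node falls behind, the lag can be pulled out of the sentinel log. This is a rough sketch that assumes the log line format shown above; block_lag is a hypothetical helper and sentinel.log is an assumed file name.

```sh
# Read sentinel log lines on stdin and print the lag (blocks behind) of each
# line that has a numeric one. Lines ending in "--" (failed PoolTool request)
# and lines without a Block field are skipped.
block_lag() {
    sed -n 's/.* Block:[0-9][0-9]*-\([0-9][0-9]*\) .*/\1/p'
}

# Example: count log entries that were two or more blocks behind:
#   block_lag < sentinel.log | awk '$1 >= 2' | wc -l
```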

Note: If you are regularly behind by a block or two, you may have CPU overloading issues. Check your CPU usage, and if a single CPU is often peaking at 100% you'll need to reduce your max_connections setting or change your hardware, because in that state many of your blocks will be produced too late to be accepted by the network.

Troubleshooting & questions

Maintaining a running pool

Pools currently require a lot of babysitting. This should improve a lot by the time we get to the mainnet, but for now the node needs a lot of attention. Most operators have their own custom scripts to take as much of the headache out of it as possible. We have a script called the Sentinel to take care of things for us. It checks if the node is stuck and restarts it if so, restarts it again using the previous day's storage directory if it can't get past the bootstrap period after a few goes, sends a report at the end of each epoch, and more.

Bootstrapping

The bootstrap phase is the main hold-up (unless you also need to download the genesis block and start from scratch). It's best to restart if the bootstrapping phase doesn't complete in, say, ten minutes. If your node was pretty close to the tip when it was stopped, and it's only been stopped for a minute or so, then you can actually start it with no entries in the trusted peers list and you'll begin getting connections without needing to go through the bootstrap phase at all.
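The restart-after-ten-minutes idea could be sketched as below. This is not the Sentinel itself, just an illustration: the helper names are made up, the 30 second polling interval is arbitrary, and it assumes the REST interface answers node stats requests with a state other than "Running" while the node is bootstrapping.

```sh
# Print the node state (e.g. Running) from `node stats get` output on stdin;
# prints nothing if there is no state line.
node_state() {
    awk '$1 == "state:" { gsub(/"/, "", $2); print $2 }'
}

# Succeed (meaning "restart now") if the state in $1 is not Running and the
# seconds waited so far ($2) have reached the limit ($3).
needs_restart() {
    [ "$1" != "Running" ] && [ "$2" -ge "$3" ]
}

# Polling loop to run right after starting the node (600s = ten minutes):
#   waited=0
#   while sleep 30; do
#       state=$(./jcli rest v0 node stats get --host "http://127.0.0.1:3100/api" | node_state)
#       [ "$state" = "Running" ] && break
#       waited=$((waited + 30))
#       if needs_restart "$state" "$waited" 600; then
#           ./jcli rest v0 shutdown get --host "http://127.0.0.1:3100/api"
#           sleep 5
#           ./start.sh
#           waited=0
#       fi
#   done
```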

Some blocks are not created at all

Sometimes you see your slot come and go and your pool didn't even attempt to create a block at all, and nobody else did either! In this case it's likely because your server's system time is off, so check your chrony configuration. If this is the case you'll see the following errors in your log at the times you should have been creating a block:

Eek... Too late, we missed an event schedule, system time might be off?

This will be associated with a rejected block that looks like this in the leaders log:

scheduled_at_date: "39.41535"
  scheduled_at_time: "2020-01-22T18:18:07+00:00"
  status:
    Rejected:
      reason: Missed the deadline to compute the schedule
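When comparing log entries with your system clock, it can help to convert an epoch.slot date into a timestamp. The sketch below assumes ITN parameters of 2-second slots and 43200 slots per epoch (one day), and a genesis time of 2019-12-13T19:13:37Z, which is back-calculated from the scheduled_at_date and scheduled_at_time pair in the entry above rather than taken from any official source.

```sh
# Assumed ITN parameters (see the note above): genesis time as a Unix
# timestamp, slot length in seconds, and slots per epoch.
GENESIS=1576264417
SLOT_SECONDS=2
SLOTS_PER_EPOCH=43200

# Print the Unix timestamp at which the slot given as EPOCH.SLOT starts.
slot_to_unix() {
    epoch=${1%%.*}
    slot=${1#*.}
    echo $(( GENESIS + (epoch * SLOTS_PER_EPOCH + slot) * SLOT_SECONDS ))
}

# Example: slot_to_unix 39.41535 prints 1579717087, i.e. 2020-01-22T18:18:07Z,
# which matches the scheduled_at_time in the rejected block above.
```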

Some blocks are rejected by the network shortly after creation

Sometimes pools create blocks that exist for a short time and then disappear. I had this happen with 20.11202, 20.18839 and 20.21269. These showed up and my monitor picked them up and emailed the block creation event (except for the second one, which must have been too short-lived, but I see it in the debug log). Other people have had this happen too and have raised issues #1427, #1469 and #1472 about it. The consensus seems to be that it's due to the node not being perfectly in sync at the time it creates the block, which means that the block gets created with an older parent than it should and others on the longer chain replace it. Issue #1446 recommends a flag be added in the leader logs output to show blocks that didn't make it onto the main chain. These forks are worse in times of network instability, and even with a fully synced node they can happen a lot. In my case around 13 out of 17 blocks were rejected during epoch 26 (although to be fair only around half of those rejections were due to forks; the others were due to the node bootstrapping after being stuck). Apparently installing chrony to have more accurate time can help, but I have personally not noticed any improvement. I raised #1532 as this seems to be more than just normal protocol behaviour, and #1503 talks about how many nodes are deliberately creating multiple adversarial forks.

Maxed-out CPU can also lead to block rejection

If the CPU is maxed out then that will also cause blocks to be produced too slowly. Reducing the max_connections setting reduces CPU usage, but there also seem to be issues with some choices of VPS that cause high CPU usage with Jormungandr (especially since the 0.8.6 release); most likely differences in the IO backend are responsible for this. Changing from a Linode to a Digital Ocean "Droplet" made a huge difference for us.

Jormungandr on Linode.jpg
Jormungandr on Digital Ocean.jpg

On the left is the CPU usage of the Linode, running almost continuously at 100% even with the max_connections setting at the extremely conservative default value of 256. The sharp drops you see are when the node gets stuck and is restarted by the sentinel, which happens very frequently. The Linode costs $20/mo, has 4GB of RAM and two Xeon E5-2697v4 CPUs running at 2.3GHz with a cache size of 16MB.

On the right is the CPU graph for the new Droplet, also $20/mo with 4GB of RAM and two CPUs at the same speed of 2.3GHz, but slightly more powerful Xeon Gold 6140 CPUs with 24MB of cache. As you can see the difference is startling and cannot be attributed only to the slightly better CPU. After 12 hours of operation it settles down to around 20% average CPU usage even with double the max_connections setting compared to the Linode. The glitches you can see are from restarts when we changed the max_connections setting from 256 to 1024 which resulted in some peaks of around 90% (too short to appear on the graph), so we then settled on 512.

Issue #1599 was recently raised about the problem of CPU usage creeping up for many people, causing their nodes to get out of sync.

Do you miss out on the leader elections if your node is offline at the start of the epoch?

No. The elections are a deterministic pseudo-random process which doesn't require communication between nodes to organise, so it's possible for a node to be disconnected at the start of the epoch and still know its leader schedule when it comes back online after the epoch has started. Note that although the leader schedules (including who would lead a slot if the primary choice was a no-show, or the next choice was a no-show as well, etc.) are deterministic, the process is based on a seed which is derived from the hashes of the blocks of the previous epoch, and is therefore impossible to know before that epoch is complete.

As evidence that a node doesn't need to be present at the start of an epoch in order to participate in block creation within it, you can see below a terrible start to epoch 27 by my node, where it couldn't get out of the bootstrapping phase for over an hour, during which time the transition from epoch 26 to 27 took place. Yet the leader schedule is still populated, and blocks in that epoch, including the first one at 27.4386, were successfully created.

[1578596306/180] Epoch:26 Slot:42744 Block:84910-12 Hash:14c232d74bbfcb4b86221bde4ebf02cad5ad78b039460030019578fd873d16e3 Tax:98595022 Stake:27038332991274
[1578596711/405] Stuck on 26.42744, restarting node...
[1578596716/000] Node is not running, starting now...
[1578596723/403] Bootstrapping
             . . .
[1578602533/353] Bootstrapping
[1578602891/000] Epoch:27 Slot:2663 Block:85109-6 Hash:1cfeadb6c4ba0afcbd04173fb35862c53a2f1ddfa0b8846ddd2d7d2291274160 Tax:118954432 Stake:27046455059377
[1578602941/050] Epoch:27 Slot:2807 Block:85115-0 Hash:d747df3c0becbd8efb023416d723d6190d7d45ae295bdbc6ce0fa8bd94ad34f1 Tax:118954432 Stake:27046455059377
[1578603001/060] Epoch:27 Slot:2865 Block:85116-0 Hash:7fbb072402c7dbe126ed53987fc78b2657cbb4568be6f9ce2ffc4fb143159d1a Tax:118954432 Stake:27046455059377
./jcli rest v0 leaders logs get --host "http://127.0.0.1:3100/api" | grep date
  scheduled_at_date: "27.4386"
  scheduled_at_date: "27.8080"
  scheduled_at_date: "27.34182"
  scheduled_at_date: "27.42107"
  scheduled_at_date: "27.35573"
  scheduled_at_date: "27.15571"
  scheduled_at_date: "27.18500"
  scheduled_at_date: "27.19565"
  scheduled_at_date: "27.33988"
  scheduled_at_date: "27.29712"
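The leaders log is not listed in slot order, so it can be handy to sort it to see which of your slots comes next. A small sketch, assuming the scheduled_at_date lines look like those above; sorted_schedule is a hypothetical helper name.

```sh
# Read `leaders logs get` output on stdin and print the scheduled dates
# (EPOCH.SLOT) in chronological order, one per line.
sorted_schedule() {
    awk '/scheduled_at_date:/ { gsub(/"/, "", $NF); print $NF }' |
        sort -t . -k 1,1n -k 2,2n
}

# Example use against a running node:
#   ./jcli rest v0 leaders logs get --host "http://127.0.0.1:3100/api" | sorted_schedule
```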
