Cardano ITN staking pool FAQ
Contents
- 1 What's the best practice for maintaining a running pool?
- 2 Is there a way to improve bootstrapping?
- 3 My node created a block, why's it not in the chain?
- 4 Why doesn't my node create some of its blocks?
- 5 How many peer connections are actually established?
- 6 Do you miss out on the leader elections if your node is offline at the start of the epoch?
- 7 Are all storage directories the same?
- 8 See also
What's the best practice for maintaining a running pool?
Pools currently require a lot of babysitting. This should improve a lot by the time we get to the mainnet, but for now the node needs frequent attention, so most operators have their own custom scripts to take as much of the headache out of it as possible. We have a script called sentinel.pl to take care of things for us: it checks whether the node is stuck and restarts it if so, restarts it again using the previous day's storage directory if it can't get past the bootstrap period after a few attempts, sends a report at the end of each epoch, and more.
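As an illustration (not our actual sentinel.pl, just a minimal sketch assuming the REST API is on port 3100 and the node runs as a systemd service called jormungandr), a stuck-node check can be as simple as comparing the reported block height a few minutes apart:
#!/bin/bash
# Minimal stuck-node check: restart the node if the block height stops moving.
API="http://127.0.0.1:3100/api"
height() { ./jcli rest v0 node stats get -h "$API" 2>/dev/null | grep lastBlockHeight | awk '{ print $2 }'; }
before=$(height)
sleep 300
after=$(height)
if [ -z "$after" ] || [ "$after" = "$before" ]; then
  echo "Node appears stuck at height ${before:-unknown}, restarting..."
  sudo systemctl restart jormungandr    # adjust to however you run the node
fi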
Is there a way to improve bootstrapping?
The bootstrap phase is the main hold-up (unless you also need to download the genesis block and start from scratch). It's best to restart the node if the bootstrapping phase doesn't complete within, say, ten minutes.
If your node was pretty close to the tip when it was stopped, and it's only been stopped for a minute or so, then you can actually start it with no entries in the trusted peers list and you'll begin getting connections without the need for going through the bootstrap phase at all. I have a --quick option in my cardano-update-conf.pl script that removes all trusted peers for this purpose.
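For reference, the relevant fragment of the node's YAML config with all trusted peers removed would look something like this (a sketch assuming the usual p2p section layout):
p2p:
  trusted_peers: []    # empty list: the node starts without going through the bootstrap phase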
Make sure your ulimit is 4096 or higher:
ulimit -n 4096
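To make that limit persistent across restarts when the node runs under systemd (a sketch assuming a unit named jormungandr.service; adjust the name to your setup), add it to the service unit and reload:
# /etc/systemd/system/jormungandr.service (excerpt)
[Service]
LimitNOFILE=4096

sudo systemctl daemon-reload
sudo systemctl restart jormungandr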
You can try forcing bootstrapping to jump to the next trusted peer by killing the current connection. First find the address of the peer the node is currently connected to, then kill that connection:
sudo netstat -alnp | grep jormungandr
sudo ss -K dst Peer_IP
My node created a block, why's it not in the chain?
Competitive forks: Sometimes pools create blocks that exist for a short time and then disappear. This is a natural aspect of the network because slots often have more than one leader, which results in a forking situation called a "slot battle". Competitive forks can also result from different nodes assigned to very close slots both creating their block from the same parent block, which is called a "height battle". Being better connected, producing blocks quickly and having accurate system time all help your chances in competitive forks. Setting up chrony properly so you have very accurate time will help your chances a lot, and having a higher max_connections setting can help too - see our sample config for where to add this setting.
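A quick way to confirm your clock really is tight (assuming you're using chrony) is to check the reported offset; it should be within a few milliseconds:
chronyc tracking | grep "System time"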
PoolTool has a great new feature showing the winners and losers of all the recent "competitive forks" at pooltool.io/competitive. Here's an example of our node PUDIM winning a "height battle" :-)
Adversarial forks: There is also a problem in the network where some nodes create many forks on purpose so that other nodes will more often get themselves onto the wrong forks. These are called adversarial forks, and they give the creator an advantage. You can see a list of adversarial forks on the pooltool.io/health page. A few pools, such as the LION pools, are renowned for creating huge numbers of adversarial forks and should be shamed and boycotted for this antisocial behaviour. Most of the other adversarial forks are mistakes by operators who run clusters and mess up their leader selection process.
Low CPU resource: If the CPU is maxed out then blocks will also be produced too slowly. Reducing the max_connections setting reduces CPU usage, but there also seem to be issues with some choices of VPS that cause high CPU usage with Jormungandr (especially since the 0.8.6 release); differences in the IO backend are most likely responsible for this. Changing from a Linode to a Digital Ocean "Droplet" made a huge difference for us.
On the left is the CPU usage of the Linode, running almost continuously at 100% even with the max_connections setting at the extremely conservative default value of 256. The sharp drops you see are when the node gets stuck and is restarted by the sentinel, which happens very frequently. The Linode costs $20/mo, has 4GB of RAM and two Xeon E5-2697v4 CPUs running at 2.3GHz with a 16MB cache.
On the right is the CPU graph for the new Droplet, also $20/mo with 4GB of RAM and two CPUs at the same speed of 2.3GHz, but slightly more powerful Xeon Gold 6140 CPUs with 24MB of cache. As you can see the difference is startling and cannot be attributed only to the slightly better CPU. After 12 hours of operation it settles down to around 20% average CPU usage even with double the max_connections setting compared to the Linode. The glitches you can see are from restarts when we changed the max_connections setting from 256 to 1024 which resulted in some peaks of around 90% (too short to appear on the graph), so we then settled on 512.
Issue #1599 was recently raised about the problem of CPU usage creeping up for many operators, causing their nodes to get out of sync.
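To spot-check whether your own node is affected, a one-off look at the jormungandr process is enough (this assumes a single jormungandr process is running):
top -b -n 1 -p "$(pidof jormungandr)" | tail -n 1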
Why doesn't my node create some of its blocks?
Sometimes you see your slot come and go and your pool didn't even attempt to create a block, and nobody else did either! In that case it's likely because your server's system time is off, so check your chrony configuration. If this is the case you'll see the following error in your log at the times you should have been creating a block:
Eek... Too late, we missed an event schedule, system time might be off?
This will be associated with a rejected block that looks like this in the leaders log:
scheduled_at_date: "39.41535"
scheduled_at_time: "2020-01-22T18:18:07+00:00"
status:
  Rejected:
    reason: Missed the deadline to compute the schedule
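You can check whether any of your scheduled blocks were rejected this way by filtering the leaders log; the host and port here match the REST settings used elsewhere on this page:
./jcli rest v0 leaders logs get --host "http://127.0.0.1:3100/api" | grep -A2 Rejected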
How many peer connections are actually established?
Setting max_connections is one thing, but knowing how many connections are actually established is another. Count all established TCP connections on the host:
netstat -an | grep ESTABLISHED | wc -l
Count the peers that jormungandr itself reports:
./jcli rest v0 network stats get -h http://localhost:3100/api | grep addr | wc -l
Or get a breakdown of TCP connections by state:
netstat -tn | tail -n +3 | awk '{ print $6 }' | sort | uniq -c | sort -n
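To count only the connections involving the node itself rather than everything on the host, you can filter on the node's p2p port (this assumes the default listen port 3000; adjust to your public_address):
ss -tn state established '( sport = :3000 or dport = :3000 )' | tail -n +2 | wc -l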
Do you miss out on the leader elections if your node is offline at the start of the epoch?
No. The elections are a deterministic pseudo-random process which doesn't require communication between nodes to organise, so it's possible for a node to be disconnected at the start of the epoch and still know its leader schedule when it comes back online after the epoch has started. Note that although the leader schedules (including who would lead a slot if the primary choice was a no-show, or the next choice was a no-show as well, etc.) are deterministic, the process is based on a seed derived from the hashes of the blocks of the previous epoch, and is therefore impossible to know before that epoch is complete.
As evidence that a node doesn't need to be present at the start of an epoch in order to participate in block creation within it, you can see below a terrible start to epoch 27 by my node, which couldn't get out of the bootstrapping phase for over an hour, during which time the transition from epoch 26 to 27 took place. Yet the leader schedule is still populated, and blocks in that epoch, including the first one at 27.4386, were successfully created.
[1578596306/180] Epoch:26 Slot:42744 Block:84910-12 Hash:14c232d74bbfcb4b86221bde4ebf02cad5ad78b039460030019578fd873d16e3 Tax:98595022 Stake:27038332991274
[1578596711/405] Stuck on 26.42744, restarting node...
[1578596716/000] Node is not running, starting now...
[1578596723/403] Bootstrapping
. . .
[1578602533/353] Bootstrapping
[1578602891/000] Epoch:27 Slot:2663 Block:85109-6 Hash:1cfeadb6c4ba0afcbd04173fb35862c53a2f1ddfa0b8846ddd2d7d2291274160 Tax:118954432 Stake:27046455059377
[1578602941/050] Epoch:27 Slot:2807 Block:85115-0 Hash:d747df3c0becbd8efb023416d723d6190d7d45ae295bdbc6ce0fa8bd94ad34f1 Tax:118954432 Stake:27046455059377
[1578603001/060] Epoch:27 Slot:2865 Block:85116-0 Hash:7fbb072402c7dbe126ed53987fc78b2657cbb4568be6f9ce2ffc4fb143159d1a Tax:118954432 Stake:27046455059377
./jcli rest v0 leaders logs get --host "http://127.0.0.1:3100/api" | grep date
scheduled_at_date: "27.4386"
scheduled_at_date: "27.8080"
scheduled_at_date: "27.34182"
scheduled_at_date: "27.42107"
scheduled_at_date: "27.35573"
scheduled_at_date: "27.15571"
scheduled_at_date: "27.18500"
scheduled_at_date: "27.19565"
scheduled_at_date: "27.33988"
scheduled_at_date: "27.29712"
Are all storage directories the same?
All the storage directories are compatible, so if your node is stuck, can't bootstrap, or can't retrieve the genesis block, you can use someone else's storage directory, or even use your own one from your local Daedalus node.
But even though they're all compatible, they're actually also unique, because the storage contains all the blocks that the node ever received or created, even those that were removed from the chain after being re-organised due to a fork.
Here's some evidence to show that it works like this: someone in the Telegram channel asked if anyone else could see the block he created. I checked the hash, which started with 30c519, and my pool node recognised it, yet it didn't exist in some other nodes, including my own local Daedalus node. I extracted his pool ID from the block data and then listed all the blocks created in that epoch by that pool according to my pool node, and sure enough 30c519 did not show up. Here's what I did:
$ block=`./jcli rest v0 block 30c519c55c12e6ed55639bde6d3b3e97e947b7f40aeeb15e7d5ae4b3ac164450 get --host "http://127.0.0.1:3100/api"`
$ tools/crypto/cardano-pool-blocks-list.pl --pool=`echo ${block:168:64}` --explorer=http://localhost:3100/explorer/graphql --epoch=49
Total blocks created: 844
1 Date:49.10247 Height:151324 Hash:3c2dd31f5267c513f103a834c65e07d0a92876d800ce31b285ee1a8841a2b57f
2 Date:49.9987 Height:151310 Hash:c5eaa1519376fef0a408c1c1039b70a7c803defe9e17a082564fcc272b4b5219
3 Date:49.9010 Height:151241 Hash:6e8c1718cf794c26a4075a93c6a5568b3910b5806f09f7e7fd74fb86a74adf3f
4 Date:49.4343 Height:150860 Hash:20991eac6975e6578d72558b426f4febef1ab2a5fb7a2fb5e893364cbcc7723e
5 Date:49.4113 Height:150838 Hash:2115cf40c80605048d12f9fc2382f3af381d0a7bea9015e1e216a470bc7b9742
6 Date:49.2184 Height:150673 Hash:55b115c90f6067d217cfacda5192bc2c21f9d33753996c17252a06c8b6804674
7 Date:49.1551 Height:150604 Hash:5635722da0968e2d74991ac478dbe16fbe46fee7ddb487a82159ede44c13af9a
8 Date:49.1512 Height:150598 Hash:af825b898c8436b793c0a29b9e2937452d256615bfb441c910ca112e82ee7db0
9 Date:49.1231 Height:150580 Hash:4171ae1ba5a7aef433d4c4f61a63e31a95a6336769255795b259eb4e917ce922
10 Date:49.571 Height:150535 Hash:0fcb0937e0223221d161bb2fb8275305db157dd6c21e636af7d938b75530246c
As can be seen, when I use my cardano-pool-blocks-list.pl script to ask my node to list all the blocks created by the pool that created the 30c519 block, that block does not show up. This is interesting as it shows that a node continues to have knowledge of block hashes even after they are discarded from the chain when a fork is resolved. What's more interesting is that even after a node restart, 30c519 is still queryable, which means these blocks are held in the storage. This means that everyone's storage is unique and holds information about the forks it has experienced, and also that storage could be slimmed down a lot.