It's been almost a week and so far I'm really happy with the Reliability Canary system. Two canaries have died thus far. Both deaths were caused by a known issue with the ethereum-rpc-client that causes it to overload the geth JSON-RPC server, which in turn causes the scheduling client to crash.
I've been putting some extra work into the rpc client and the newly released ethereum-ipc-client to improve their reliability. This has primarily been focused around reducing the number of RPC calls that are made, adding some caching, and re-architecting the clients so that they don't overload the RPC server when a high number of requests are being made.
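The caching mentioned above can be sketched roughly like this. This is not the actual client code, just a minimal illustration of short-TTL caching for polled RPC results; the function names and the 5-second window are my own placeholders.

```python
import time
from functools import wraps


def cache_rpc_result(ttl_seconds):
    """Cache an RPC call's result for a short window so tight polling
    loops don't translate into a flood of JSON-RPC requests."""
    def decorator(fn):
        cache = {}

        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]           # fresh enough: skip the RPC call
            result = fn(*args)
            cache[args] = (now, result)
            return result
        return wrapper
    return decorator


rpc_calls = []

@cache_rpc_result(ttl_seconds=5)
def get_block_number():
    rpc_calls.append(1)  # stand-in for a real eth_blockNumber request
    return 1234


get_block_number()
get_block_number()
assert len(rpc_calls) == 1  # the second lookup was served from cache
```

The same result for a short TTL is a reasonable trade for data like the latest block number, which only changes every several seconds anyway.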
These changes have been incorporated into the ethereum-alarm-clock-client in the 0.7.2-beta1 release. I'll be monitoring the latest canary contract as well as the scheduler process to see if these changes result in the reliability increase that I'm hoping to see.
Update 2016/01/05: The last few days have been brutal and I've got the dead canaries to prove it. This set of canary contracts has done an awesome job of finding the weakest links in the Alarm service, and I think it's worthwhile to go over some of the things that I found.
Canary #1 died due to two separate bugs. The first was that the alarm client wasn't stable enough to run for extended periods of time and wasn't able to recover from certain crash conditions. This typically occurred a few hours after launch, so I had set up my scheduling server to restart the process every 10 minutes. At the time, the client only watched for contracts that were scheduled to be called on future blocks. This meant that if the client got restarted at the same time that the target block was mined, the call would end up getting dropped.
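One way to close that restart gap is to scan a window that also reaches a few blocks into the past, so a call whose target block was mined mid-restart still gets picked up. The sketch below is my own illustration, not the client's actual code; the function name and window sizes are assumptions.

```python
def blocks_to_watch(current_block, look_behind=8, look_ahead=40):
    """Return the range of target blocks to check after startup.

    Including a few already-mined blocks (look_behind) means a call
    whose target block arrived while the process was restarting is
    re-checked instead of being silently dropped.
    """
    return range(current_block - look_behind, current_block + look_ahead + 1)


window = blocks_to_watch(1000)
assert 995 in window   # recently mined blocks get a second look
assert 1030 in window  # upcoming target blocks are still watched
```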
Canary #2 died in an attempt to fix the client's stability by switching to interacting with geth over a socket. I thought this would be more reliable than making HTTP requests to the JSON-RPC server. When I deployed it for testing, everything seemed fine for a while, but a bit more than a day in, the IPC client crashed. As part of developing the IPC client, I implemented a system that lets asynchronous code interact with the client while the client itself only makes requests synchronously. After this crash I realized I might be able to fix the RPC client's reliability issues with the same approach, avoiding the bursts of concurrent requests that overload the RPC server.
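The "many callers, one request at a time" pattern can be sketched with a queue and a single worker thread. This is a generic illustration of the technique, assuming a pluggable transport callable; it is not the actual IPC client code.

```python
import queue
import threading


class SerializedClient:
    """Accept requests from many threads but issue them one at a time,
    so the underlying server never sees a burst of concurrent requests."""

    def __init__(self, make_request):
        self._make_request = make_request  # the real transport (HTTP or socket)
        self._requests = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Single worker loop: requests are serialized here.
        while True:
            method, params, reply = self._requests.get()
            try:
                reply.put((True, self._make_request(method, params)))
            except Exception as exc:  # hand errors back to the caller
                reply.put((False, exc))

    def request(self, method, params=None):
        reply = queue.Queue()  # one-shot channel for this caller
        self._requests.put((method, params or [], reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value


# Toy transport standing in for an actual JSON-RPC or IPC round-trip.
client = SerializedClient(lambda method, params: {"method": method})
results = []
threads = [
    threading.Thread(target=lambda: results.append(client.request("eth_blockNumber")))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(results) == 5
```

Callers still block until their own answer comes back, but from the server's point of view there is only ever one request in flight.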
Canary #5 died because, now that the RPC client no longer got restarted every 10 minutes, a new bug was exposed: I was re-adding the same handler to a logger over and over, and since the re-added handler was writing to a log file, the server eventually ran out of file descriptors.
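The fix for that class of bug is a one-line guard: only attach the handler if the logger doesn't already have one. A minimal sketch of the failure and the guard, with names of my own choosing:

```python
import logging


def get_logger(name, logfile="/tmp/canary-demo.log"):
    """Return a logger, attaching the file handler only once.

    Calling addHandler() on every scheduling cycle stacks up duplicate
    handlers; each FileHandler holds an open file descriptor, so a
    long-running process eventually exhausts them.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # guard against re-adding the same handler
        handler = logging.FileHandler(logfile, delay=True)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


# Simulate a client that re-initializes its logger on every cycle.
for _ in range(100):
    log = get_logger("alarm.client")

assert len(log.handlers) == 1  # still a single handler, a single descriptor
```

Without the guard, the same loop would leave 100 handlers (and 100 open descriptors) on the logger, since `logging.getLogger` returns the same long-lived object each time.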
It's worth noting that 100% of these failures were due to code that interacts with the alarm service and not the service itself.
At the time of writing this, Canary #6 is alive and the latest deployed version of the client has been running since this morning with no apparent issues. I've been a bit embarrassed as one canary after another died, but each of those deaths identified a weak point, and more importantly it did so in a very public way. Reminds me of the build server lava lamp. The positives that have come out of this are pretty cool as well.
As I found bugs in the client that only appeared in production in unexpected ways, I realized I needed a way to test the new client without switching off the old one. That meant provisioning a new server, a process I have yet to automate, so I'd been putting it off. This forced the issue, and I made sure to take detailed notes while provisioning the new server so that I have a starting point for automating the process.
Since I was now running two schedulers, I didn't want them competing over calls, so I needed to implement the call claiming logic in the alarm client. The second scheduling server is also doing a better job of staying connected to its peers.
Fingers crossed that the canary carnage is at an end.
Update 2016/01/16: Canary #7 hasn't missed a beat for 101 heartbeats (each heartbeat is 2 hours/480 blocks)!