Full transparency - server downtime
Gavin Courtney | November 24, 2016 | ChurchSuite Updates

Update: 25 November 12:11am

All of our checks and tests show the primary web server to be stable and fully functional. We will continue to actively monitor the status, but early signs indicate a complete success. Thank you for bearing with us.

Update: 24 November 11:31pm

We're back online and are beginning the process of verifying that the hardware replacement has been a success and that all systems are operating as normal.

Update: 24 November 11:20pm

Our maintenance period has begun. More details will follow as we progress through the task.

Update: 24 November 11:20am

We have scheduled replacement hardware to be installed in our primary web server at 11pm tonight. We expect ChurchApp to be offline for approximately 30 minutes during this period, after which the infrastructure should be restored to full health.

==========

Over the past 10 days, we've had two periods where ChurchApp has been partially or fully unavailable, so I want to try and explain what's been happening and what we're doing to resolve this situation.

Firstly, we don't consider any unplanned downtime acceptable. Since ChurchApp started in 2012, we have achieved uptime well in excess of 99.99%, making the problems we've seen in the past week unprecedented and something we're working incredibly hard to prevent from happening again. We take the responsibility of hosting and making your data available very seriously and we're truly sorry for any time that we fail in that. We want to be as open and honest as possible in this regard so that you know exactly what has happened and can have confidence in what we're doing to resolve this.

Recent problems

On Monday 14th November, we had a period of unavailability between approximately 4am and 6am GMT where ChurchApp was offline. The primary web server, through which all web browser requests are handled, was not responding to requests for new connections and was slow to reboot. Once it had restarted, we identified the software package that caused the slow reboot, which was subsequently updated, bringing everything back online and working.

On Thursday 17th November, we scheduled a reboot of our core infrastructure at 7am GMT in order to apply software updates and ensure that all systems were restored to a full working state. This went ahead without a hitch and gave us confidence that the previous problems had been addressed.

On Sunday 20th November, we had a period from 10:19am until 10:34am GMT where parts of ChurchApp were not available, which we believed was due to a memory fault. We were able to isolate the problem and restore full services within 15 minutes of the problem arising, but we're aware that this impacted a number of our customers as it was a critical time period on a Sunday morning.

On the back of Sunday's problem, based on the evidence we had gathered, we believed that a potential memory problem existed in our infrastructure. As a precaution we scheduled routine maintenance to replace the memory on our primary web server last night, Wednesday 23rd November. The memory was replaced and full service was restored within 10 minutes; however, a further problem developed shortly after 2am on Thursday 24th November.

At around 4am, it became apparent that the memory (which had been replaced) was not the cause and that a different problem was presenting itself. Between 4am and 5:30am, we investigated a wide range of potential problems, with ChurchApp being intermittently unavailable at times during this period. At around 5:45am we discovered that one of the hard disks in our RAID configuration was failing and that this was the cause of the problem. We made an immediate decision to take the primary web server offline in order to remove the faulty disk and allow ChurchApp to run off the other disks in our RAID configuration. Once this change was made and the server rebooted, ChurchApp became available again and has remained stable since, with no customer data lost.

Current status

As I write now (Thursday 24th November 8:30am), ChurchApp is stable and has been online without interruption since approximately 5:55am. No data has been lost - the RAID configuration worked as intended and whilst one disk failed, the others continued to retain all customer data.

We continue to work with our hosting provider to assess the situation and will be scheduling a brief period of maintenance tonight in order to restore our RAID configuration to full capacity.

Going forward

Whilst we could never have foreseen the events of these past 10 days, for a number of weeks now we've been working on making our infrastructure more redundant and resilient. Our developers and engineers have been building out test environments to deploy ChurchApp to, and we're looking to fully move to a higher capacity and more resilient infrastructure in the coming weeks.

In addition to the infrastructure work that's already ongoing, we've also identified a number of ways that we can, and will, improve in the interim period. That said, we're pleased that we do have Disaster Recovery processes in place (which we would have preferred not to have to use), and also that we've been able to respond in a timely manner with no customer data lost.

Final thoughts

In the midst of a chaotic time, particularly this morning, I'm thankful to our customers who extended us grace as we sought to investigate these hardware problems. There's a lot of room for improvement from our side, something that we will begin working on immediately, but I'm pleased with how we responded and that the systems we already had in place worked.

As ever, if you have any questions, please just get in touch.

Gavin Courtney, ChurchApp