Disaster Recovery

Below is an outline of the Disaster Recovery Process followed by Delib. This process is in place 24/7, 365 days of the year.

Here's a summary of the stages, and the target timescales. Each stage is detailed later in this article.

Stage | When?
1. Detection & definition | When notified by our monitoring systems or a customer
2. On-call team alerted | As soon as possible after detection of a critical issue
3. Initial investigation & assessment | Within 1.5 hours of detection
4. Customer notification | Within 2 hours of detection, as specified in our Service Level Agreement
5. Resolution | Depends on the complexity of the problem; our target resolution times for product and infrastructure issues are set out in our Service Level Agreement
6. Report & document | Within 1 working day of resolution
7. Review & retrospective | Within 3 working days of resolution
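
As an illustration only, the target timescales in the summary above can be turned into concrete deadlines for a given incident. The sketch below is not Delib's tooling: the function names are hypothetical, and it assumes that a working day means Monday to Friday and that public holidays are ignored.

```python
from datetime import datetime, timedelta

# Targets taken from the summary of stages above.
ASSESSMENT_TARGET = timedelta(hours=1.5)
NOTIFICATION_TARGET = timedelta(hours=2)
REPORT_WORKING_DAYS = 1
REVIEW_WORKING_DAYS = 3


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance by whole working days, skipping Saturday and Sunday.

    Assumption: working days are Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def detection_targets(detected_at: datetime) -> dict[str, datetime]:
    """Deadlines that are measured from the moment of detection."""
    return {
        "assessment": detected_at + ASSESSMENT_TARGET,
        "customer_notification": detected_at + NOTIFICATION_TARGET,
    }


def resolution_targets(resolved_at: datetime) -> dict[str, datetime]:
    """Deadlines that are measured from the moment of resolution."""
    return {
        "report": add_working_days(resolved_at, REPORT_WORKING_DAYS),
        "review": add_working_days(resolved_at, REVIEW_WORKING_DAYS),
    }
```

For an issue resolved late on a Friday, this sketch places the report target on the following Monday and the review target on the following Wednesday.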

1. Detection & definition

Detection: we'll be made aware of a critical issue by one of the following methods:

(a) Automated alert from our monitoring systems
(b) Internal detection (e.g. from investigations arising from third-party security announcements)
(c) Customer or end-user report

Definition: we use the following questions to decide whether an issue is critical:

Has a customer site been unavailable to the general public for more than 10 minutes?
Is there a reproducible issue which prevents a user from entering or submitting data?
Is there a reproducible issue which causes unavoidable or unexpected data loss?
Is there a bug or security vulnerability that constitutes a realistic threat to privacy?
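
The four questions above can be read as a simple checklist. The following is a minimal sketch, assuming that a "yes" to any one question is enough to make an issue critical (the policy does not state this explicitly). The class and field names are hypothetical.

```python
from dataclasses import dataclass

# Threshold from the first question: unavailable for more than 10 minutes.
UNAVAILABILITY_THRESHOLD_MINUTES = 10


@dataclass
class IssueReport:
    """Answers to the four questions for one reported issue (hypothetical shape)."""
    minutes_unavailable: float = 0.0
    blocks_data_entry: bool = False   # reproducible; user cannot enter or submit data
    causes_data_loss: bool = False    # reproducible; unavoidable or unexpected data loss
    threatens_privacy: bool = False   # bug or vulnerability with a realistic privacy threat


def critical_reasons(report: IssueReport) -> list[str]:
    """Return the questions that were answered 'yes' for this report."""
    reasons = []
    if report.minutes_unavailable > UNAVAILABILITY_THRESHOLD_MINUTES:
        reasons.append("site unavailable to the public for more than 10 minutes")
    if report.blocks_data_entry:
        reasons.append("users cannot enter or submit data")
    if report.causes_data_loss:
        reasons.append("unavoidable or unexpected data loss")
    if report.threatens_privacy:
        reasons.append("realistic threat to privacy")
    return reasons


def is_critical(report: IssueReport) -> bool:
    # Assumption: any single 'yes' makes the issue critical.
    return bool(critical_reasons(report))
```

Returning the list of reasons, rather than only a yes or no, keeps a record of why an issue was escalated.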

2. On-call team alerted

If one of our monitoring systems picks up a critical issue, the on-call team will be alerted by email and text message. Unavailability lasting 10 minutes or longer is reported automatically.
The on-call team will include at least one technical team member.

When? As soon as possible after detection of a critical issue
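
To make the 10-minute reporting rule concrete, here is a rough sketch of how continuous downtime might be measured from a series of availability checks. It does not describe Delib's actual monitoring systems: the input format (pairs of timestamp and up/down status) and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# From the policy: unavailability lasting 10 minutes or longer is reported.
ALERT_AFTER = timedelta(minutes=10)


def first_alert_time(checks: Iterable[tuple[datetime, bool]]) -> Optional[datetime]:
    """Return the time of the first check at which the site has been
    continuously down for ALERT_AFTER or longer, or None if that never happens.

    `checks` is assumed to be ordered by time; each item is (checked_at, is_up).
    """
    down_since: Optional[datetime] = None
    for checked_at, is_up in checks:
        if is_up:
            down_since = None  # any successful check resets the downtime clock
            continue
        if down_since is None:
            down_since = checked_at
        if checked_at - down_since >= ALERT_AFTER:
            return checked_at
    return None
```

In this sketch a single successful check resets the clock, so brief interruptions shorter than 10 minutes do not page the on-call team.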

3. Initial investigation and assessment

The technical lead aims to establish the cause of the issue, and to assess the severity and likely duration of the service interruption.

Ideally, this will include:

Identification of the root cause
An assessment of the severity and scale of the problem, including which customers are affected
An estimated time to resolution

When? Within 1.5 hours of detection

4. Customer notification

Affected customer(s) will be contacted to inform them of the service interruption and to confirm that Delib are actively investigating the problem.

This communication will most likely be by email, but the channel depends on the severity of the incident. Any wider-reaching issues may be posted as a homepage announcement on delib.zendesk.com in the first instance, ahead of any direct communication.

When? Our Service Level Agreement gives a maximum initial response time for critical errors of 2 hours.

5. Resolution

Once the technical on-call lead has assessed the problem, they will report back to the on-call customer success manager as follows:

If the problem can be easily solved, it will be fixed. The technical lead will report back to the customer success manager, and document the problem and solution.
or
If the issue is more complex, a resolution plan will be put in place to address the service outage. This may require more technical team members to be contacted, or the investigation to be continued in office hours. An interim report summarising the expected cause and the steps to resolution will be provided to the customer success manager.

In both cases, any information we have will be communicated to affected customers by the on-call customer success manager. The customer success manager will continue to keep all affected customers updated with progress until we reach a resolution to the issue.

When? This depends on the complexity of the problem, but our target resolution times are set out in our Service Level Agreement.

6. Reporting, documentation and tidying up

Once the problem has been resolved, the customer success manager will provide a written account for affected customers and for Delib's own reference. This will include:

How the problem was detected
The scope of the problem and how it may have affected end-user interaction
The root cause
Steps to resolution, including any measures put in place to mitigate the risk of repeat occurrence
Total downtime
Any service credit or other compensation offered by Delib, should the error have caused us to miss our Service Level Agreement targets

When? All of this should happen within 1 working day of the resolution of the issue.
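
The contents of the written account listed above could be captured in a simple record, so that nothing is left out before it is sent. This is a sketch only: the field names are hypothetical, and it assumes the compensation entry is needed only when Service Level Agreement targets were missed.

```python
from dataclasses import dataclass, fields
from datetime import timedelta
from typing import Optional


@dataclass
class IncidentReport:
    """One written account of a resolved incident (hypothetical shape)."""
    how_detected: str = ""
    scope: str = ""                  # including the effect on end-user interaction
    root_cause: str = ""
    steps_to_resolution: str = ""    # including measures to reduce the risk of a repeat
    total_downtime: Optional[timedelta] = None
    compensation: Optional[str] = None  # service credit or other compensation, if any


def missing_sections(report: IncidentReport, sla_missed: bool) -> list[str]:
    """List the sections that still need to be filled in."""
    missing = []
    for field in fields(report):
        if field.name == "compensation" and not sla_missed:
            continue  # assumption: only required when SLA targets were missed
        value = getattr(report, field.name)
        if value is None or value == "":
            missing.append(field.name)
    return missing
```

A total downtime of zero is treated as a completed entry, since only unset values count as missing.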

7. Review and retrospective

Once the error has been resolved, Delib will hold a retrospective to identify any long-term countermeasures which can be put in place to prevent a recurrence of the issue.

This disaster recovery process is also reviewed to identify any improvements that can be made.

When? Within 3 working days of resolution

Other information

Would we ever take sites offline?

Taking a site offline will be the informed decision of Delib's Managing Director, who will be given a full brief of the situation by the on-call team. We will ask ourselves some specific questions to determine whether this may be necessary:

If the site stays online, could users submit data that gets lost without them knowing?
If the site stays online, could any existing data loss or corruption be made worse?
If the site stays online, is there a possibility of the loss or exposure of any personal information?
Conversely, if the site is taken offline, could any existing data loss be exacerbated?

This is a last resort for us, and we would never take sites offline unless leaving them online would pose more of a risk to the customer(s) or their respondents.

Policy reviewed February 2024


