SNIA (Storage Networking Industry Association) definitions
Disaster Recovery (DR): The recovery of data, access to data and associated processing through a comprehensive process of setting up a redundant site with the recovery of operational data to continue business operations after a loss of use of all or part of a data center.
This involves not only an essential set of data but also an essential set of all the hardware and software to continue processing of that data and business. Any disaster recovery may involve some amount of downtime.
Business Continuity (BC): Processes and/or procedures for ensuring continued business operations.
Maximum Tolerable Outage (MTO): The maximum acceptable time period during which recovery must become effective before an outage compromises the ability of a business unit to achieve its business objectives.
Recovery Time Objective (RTO): The maximum acceptable time period required to bring one or more applications and associated data back from an outage to a correct operational state.
Recovery Point Objective (RPO): The maximum acceptable time period prior to a failure or disaster during which changes to data may be lost as a consequence of recovery.
Data changes preceding the failure or disaster by at least this time period are preserved by recovery. Zero is a valid value and is equivalent to a "zero data loss" requirement.
Failover: The automatic substitution of a functionally equivalent system component for a failed one.
High Availability: The ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest. High availability is most often achieved through failure tolerance. High availability is not an easily quantifiable term. Both the bounds of a system that is called highly available and the degree to which its availability is extraordinary must be clearly understood on a case-by-case basis.
DR / BC in the context of AWS
AWS Overall DR Plan: The process steps to recover from a situation where we can’t use our primary AWS account (e.g. recover from an AWS account takeover or self-destruct). The Platform Services group is accountable for the overall DR Plan.
AWS Service DR Plan: The process steps to recover each individual AWS system architecture component for a given service within the primary AWS account (e.g., a system doesn’t come up after an AWS hardware failure, recovery from a bad upgrade, a compromised machine, or a corrupted volume); a.k.a. a “mini-disaster” or service disruption plan. Service owners are accountable for Service DR Plans for their services. All plans should be tested prior to go-live, and on a periodic basis once live. Service owners may delegate responsibility for Service DR Plans to Support Owners in some circumstances.
Note for the Banner to Ancillary Applications Migration (BAAM) to AWS project: We will have one overall Disaster Recovery Plan, as well as a DR Plan for each service/architecture component that will migrate to AWS. We will not address Business Continuity at this time; that will be a follow-on effort.
Guidance for creating an AWS Service DR Plan
Using the System Architecture Diagram for the service as a guide,
- Identify the service as SaaS (vendor subscription) or IaaS (OIT-operated)
- Identify example failure scenarios
  - Component failure causes data loss or corruption
  - An authorized user deletes something by mistake
  - A data problem is discovered that appears to be the result of an action from 6 months ago
  - etc.
- Document RTO/RPO for each of the identified failure scenarios
- Identify all components, and the recovery process, RTO, and RPO for each (see the component inventory sketch after this list)
- Identify any assumptions / dependencies
  - e.g. License keys, passphrases, private keys needed for restore
- Describe overall plan
- Identify test steps
- Conduct the test and update plan / steps used
- Create or update ServiceNow Knowledge Base article with the finalized plan / test steps
  - Title naming convention: INT: [Service Name] in AWS Disaster Recovery (DR) Test Plan
  - Metadata: aws_operations_guide, Disaster_recovery, DR_plan, DR
  - Search for existing examples (using the metadata above) for ideas!
- Review plan at least annually
- Communicate plan and any residual risk to the relevant stakeholders
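For IaaS services, the component list for the plan can be pulled from the AWS APIs rather than maintained by hand. Below is a minimal boto3 sketch, assuming components carry a "Service" tag; the tag key, the "OnBase" value, and the region are illustrative assumptions rather than established conventions.

```python
# Minimal component-inventory sketch for a Service DR Plan (assumes a "Service"
# tag convention; tag key/value and region below are illustrative only).
import boto3

SERVICE_TAG_VALUE = "OnBase"  # hypothetical service tag value

ec2 = boto3.client("ec2", region_name="us-east-1")

# List every EC2 instance carrying the service tag so each one can be given a
# recovery procedure, RTO, and RPO in the Service DR Plan.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:Service", "Values": [SERVICE_TAG_VALUE]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        name = next(
            (t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"),
            instance["InstanceId"],
        )
        volumes = [
            m["Ebs"]["VolumeId"]
            for m in instance.get("BlockDeviceMappings", [])
            if "Ebs" in m
        ]
        print(f"{name} ({instance['InstanceId']}): volumes {volumes}")
```

The same idea extends to load balancers, databases, and file shares; the point is that every item the query returns should appear in the plan with its own recovery procedure.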
We should have the following goals in mind for system recovery design:
- General guidance:
  - As a baseline, we would like to be able to recover any individual service component within 4 hours, and the whole service within 24 hrs
  - NOTE: We are less concerned with the RPO/RTO targets for components with high availability or automated failover
- Component guidance:
  - Linux & Windows servers - 24 hrs RPO
    - (Identify as auto-build, snapshot, or other)
  - Oracle / SQL Server - 15 min RPO
  - All AWS components - Maximum 4 hr RTO (see the snapshot-age spot check after this list)
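Where servers are protected by EBS snapshots (rather than auto-build), the 24 hr server RPO can be spot-checked by comparing the age of the newest snapshot per volume against the target. A minimal sketch, again assuming a "Service" tag on the snapshots; the tag and region are assumptions:

```python
# Minimal RPO spot check: flag volumes whose newest snapshot is older than the
# 24 hr server RPO target. The "Service" tag filter and region are assumptions.
from datetime import datetime, timedelta, timezone

import boto3

SERVER_RPO = timedelta(hours=24)

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:Service", "Values": ["OnBase"]}],  # hypothetical tag
)["Snapshots"]

# Keep only the most recent snapshot per volume.
newest = {}
for snap in snapshots:
    vol = snap["VolumeId"]
    if vol not in newest or snap["StartTime"] > newest[vol]:
        newest[vol] = snap["StartTime"]

now = datetime.now(timezone.utc)
for vol, taken in sorted(newest.items()):
    status = "OK" if now - taken <= SERVER_RPO else "RPO MISSED"
    print(f"{vol}: last snapshot {taken:%Y-%m-%d %H:%M} UTC [{status}]")
```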
Example AWS Service DR Plan
Service: OnBase
Assumptions: backups are available; overall RTO - OnBase back up within 4 hrs
- AWS Haproxy Load Balancers
  - Rebuild or restore from snapshot - 1 hr
  - Option - create autoscale group set to minimum of 1
- AWS SQL Server Database
  - Restore from backup - 1 hr (see the point-in-time restore sketch after this plan)
  - 15 minutes potential data loss on Database (15 min transaction backup schedule)
- AWS App Servers / IIS Server / Processing Servers
  - Restore from snapshot or rebuild - 1-2 hrs
- Swap space - on prem currently
  - Local on Filers - NetApp snapshot
  - When we move to Avere, we will have failover - minutes to recover
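The 15 minutes of potential data loss corresponds to restoring the database to the most recent transaction backup. If the SQL Server database were hosted on RDS (the plan above does not say whether it is RDS or SQL Server on EC2), the restore step might look roughly like the sketch below; the instance identifiers are hypothetical.

```python
# Rough sketch of the database recovery step, assuming an RDS SQL Server
# instance with automated backups; identifiers below are hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Point-in-time restore to the latest restorable time. The plan's 15-minute
# transaction backup schedule bounds worst-case data loss at the 15 min RPO.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="onbase-sqlserver",           # hypothetical
    TargetDBInstanceIdentifier="onbase-sqlserver-restored",   # hypothetical
    UseLatestRestorableTime=True,
)

# Wait until the restored instance is available before repointing the app
# servers at the new endpoint.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="onbase-sqlserver-restored")
print("Restore complete; update application connection strings to the new endpoint.")
```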
Example AWS Service DR Test
- Note: Representative component testing (rather than comprehensive testing) may be reasonable based on the characteristics of the specific service, at the discretion of the service owner.
- Assumptions:
  - This is a test of an app server issue only, covering two scenarios: one where an Amazon Machine Image (AMI) is taken ahead of time, and one where you need to use Cloud Protection Manager to restore.
  - All other components of the infrastructure are functioning as expected (database, network, authentication, etc.)
- How will we verify a successful test? [Varies by service - work with system functionality experts to identify key transactions that will test out system functioning. These are typically maintenance-window checkout type tests.]
  - Log on to OnBase - add documents, retrieve documents, annotate documents
- Test steps:
  - Exact list of all instance names:
  - Exact list of all ELBs and ALBs:
  - Create an AMI for recovery of the TEST volume as a precaution (in case the restore fails) - name it something different - use “RECOVERY” in the name (see the first sketch after these test steps) (Platforms team)
  - Do a health check on any ALBs and ELBs before shutting down the service! This helps make sure it's actually working before you blow the instance away. (Platforms team)
  - For Linux machines, restore SSH PasswordAuthentication (Platforms team)
  - Change Management - change environment tag to Terminate (Change Management)
  - Terminate TEST instances and EBS volumes (Platforms team)
  - Restore TEST instances from Cloud Protection Manager referencing AMIs (i.e. rebuild the instance) (Platforms team)
    - Restore instance & EBS volumes
    - Note the new instance ID
    - Verify the instance keeps its IP address (it is rare for this not to be the case, but still verify)
    - Verify roles, tags, and security groups are intact
    - Adjust the load balancer to the new instance ID (if applicable)
    - Add back CloudWatch alarms to the new instance ID (see the second sketch after these test steps)
    - Note - if the original AMI ID has changed, it will need to be updated
  - Verify connections to key integration points (Service owner)
  - Perform verification of key service functionality - perform a transaction on the application (Service owner)
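The pre-test precautions above (the “RECOVERY” AMI and the load balancer health check) could be scripted roughly as in the sketch below. The instance ID and target group ARN are placeholders, and it assumes the TEST instances sit behind an ALB target group.

```python
# Sketch of the pre-test precautions: take a "RECOVERY" AMI and confirm the
# instance is healthy behind the ALB before terminating it. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elbv2 = boto3.client("elbv2", region_name="us-east-1")

TEST_INSTANCE_ID = "i-0123456789abcdef0"                       # placeholder
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:placeholder"  # placeholder

# 1. Precautionary AMI with "RECOVERY" in the name, in case the Cloud
#    Protection Manager restore fails.
image = ec2.create_image(
    InstanceId=TEST_INSTANCE_ID,
    Name="onbase-test-RECOVERY-pre-dr-test",
    Description="Precautionary image taken before DR test",
)
print("RECOVERY AMI:", image["ImageId"])

# 2. Health check on the target group so the test starts from a known-good state.
health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for target in health["TargetHealthDescriptions"]:
    print(target["Target"]["Id"], target["TargetHealth"]["State"])
```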
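Likewise, a sketch of the post-restore steps: swap the load balancer target to the new instance ID and re-create a CloudWatch alarm against it. The identifiers and alarm settings are examples only; real thresholds will vary by service.

```python
# Sketch of the post-restore steps: repoint the ALB target group at the
# restored instance and re-create a status-check alarm. IDs are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

OLD_INSTANCE_ID = "i-0123456789abcdef0"                        # placeholder (terminated)
NEW_INSTANCE_ID = "i-0fedcba9876543210"                        # placeholder (restored)
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:placeholder"  # placeholder

# Swap the load balancer target from the terminated instance to the restored one.
elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN,
                         Targets=[{"Id": OLD_INSTANCE_ID}])
elbv2.register_targets(TargetGroupArn=TARGET_GROUP_ARN,
                       Targets=[{"Id": NEW_INSTANCE_ID}])

# Re-create a status-check alarm against the new instance ID (example settings).
cloudwatch.put_metric_alarm(
    AlarmName=f"onbase-status-check-{NEW_INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": NEW_INSTANCE_ID}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
)
```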