ServiceNow

Overview

This document describes the hosting strategy adopted by the Databases Services team for hosting My SQL databases in the Amazon cloud (AWS). The document describes the factors that affect the tenancy of a particular database as well as the guidelines for deploying, migrating, monitoring, recovering and backing up MySQL databases in AWS.

Database Tenancy

General Considerations

The following are the major factors to take into consideration when migrating a MySQL database to AWS or implementing new MySQL database in AWS:

MySQL database version specific features.
The size of the database.
The sensitivity of the data in the database.
The availability expectations of the database.
The compute requirements of the database.
The unique behavioral and/or security requirements of the database or application that the database supports.

On a case-by-case basis an application’s database will be evaluated against the above criteria to determine the tenancy of the database.

To maximize our investment in the AWS infrastructure the default implementation will be a multi-tenant MySQL RDS instance. Exceptions to the default implementation will be considered as described below.

Database Size

Applications with large databases (>100 GiB) may be a candidate for a separate AWS server depending on performance requirements of the application or any special backup/recovery requirements.

Data Sensitivity

Sensitive data (to be determined by the customer following the published classifications of data by InfoSec) will not be mixed on the same server with non-sensitive data.

Availability Requirements

If a customer has a demonstrated need for a highly available database, an HA instance will be made available. The bar for MySQL HA in AWS will be very high given the low recovery time we have demonstrated in a single-instance recovery scenario.

Compute Requirements

When an application’s compute requirements will affect other database’s performance on a shared instance it will either need to be migrated to a separate instance or the existing server will be upgraded. The current AWS cost model generally makes a new instance more cost effective.

Client/Third Party Needs

Certain third party applications are written so that they require elevated privileges to run, employ sub-standard security, or are otherwise “badly behaved”. These applications will be candidates for migration to a dedicated server or to a designated less secure server.

Database Deployment

Using Amazon’s Elastic Compute Cloud Service

Deploying MySQL on an Elastic Compute Cloud (EC2) instance offers the greatest flexibility. Because EC2 gives full control of the instance installation all of the features of MySQL is possible.

Advantages and limitations of EC2

Using EC2 leverages the infrastructure and services built by the OIT in AWS. Server level backups and other shared services are available. Full access to the underlying operating system allows for custom processing and full access to the MySQL management features allows for better performance tuning. The limitation of MySQL on EC2 is that Elastic Block Storage (EBS) volumes cannot be resized as storage needs increase, either new volumes must be added or a downtime taken to transfer the data to a new, larger volume.

Using Amazon’s Relational Database Service

Amazon Relational Database Service (RDS) is the preferred implementation for MySQL. Systems using MySQL are generally smaller and have longer RPO/RTO than enterprise systems. The biggest drawback to RDS is cross-account disaster recovery which has to be planned for and implemented manually, RDS native backups are only portable across region, not across account.

Advantages and Limitations of RDS

For non-OIT departments that want to implement an RDS MySQL Server on their own we can offer the following analysis of the service and link to Amazon’s published documentation.

“Amazon RDS makes it easy to set up, operate, and scale MySQL deployments in the cloud. With Amazon RDS, you can deploy scalable MySQL deployments in minutes with cost-efficient and resizable hardware capacity. Amazon RDS frees you up to focus on application development by managing time-consuming database administration tasks including backups, software patching, monitoring, scaling and replication.” - https://aws.amazon.com/rds/mysql (01/2016)

There are some limitations that should be considered before committing to the RDS platform. It is a very good basic database platform, but disables most advanced management features, reserving them for itself.

RDS backups are limited to 35 days, if there are archive needs longer than that a custom backup scheme outside what RDS provides will need to be implemented.
Terminating an RDS instance will, by default, delete all the backups. You also need to explicitly request a backup when stopping your instance to avoid data loss.
You have no access to the underlying operating system resources, if you have jobs that rely on external data files (imports/exports) or batch scripts they cannot be used on an RDS instance. This also restricts access to many instance administrative functions.

Item 2 means that you need to be very careful when restoring from backup so that you do not lose backup history. The recovery method is to create a NEW instance with its own 35-day backup history, removing the original instance removes all its associated backups which means you need to keep it until the new instance catches up.

Taking into account the advantages and limitations of Amazon’s RDS will help you determine if it is a viable solution for your non-OIT supported database needs.

MySQL deployment on EC2

Server Commissioning

Follow the standard commissioning process to request a new server. The following are the accepted defaults for MySQL:

Machine size as above
An EBS volumes of appropriate size for data, binary logs, and backups
Addition of the IAM role assume-database-ec2-roles

MySQL setup automated by Ansible

The current standard build script for MySQL will be available in the ND OIT BitBucket repository.

MySQL deployment on RDS

To deploy an RDS instance follow the steps outlined in the migration section below. Also make sure that the new instance commissioned in Service Now and a Reserved Instance (RI) purchased, if needed.

High Availability

High availability is not a feature of MySQL currently being supported, although if future needs dictate it will be considered.

Migration Methodology

How to migrate a MySQL database to EC2

The migration of MySQL databases to AWS is relatively straightforward. The database(s) for the application will be backed up, the backups transferred to AWS and recovered to the appropriate server as determined by the above criteria. Any server-level objects will have to be identified, scripted, and recreated on the target server.

Determine the target MySQL instance, or commission a new one per the criteria outlined in the Database Tenancy and Database Deployment sections.
Backup the source database(s) using a native MySQL backup.
Script all needed non-database objects.
Restore the source database backup(s) to the target server.
Run the scripts created in step 3 to create the non-database objects.
Review and test.

How to migrate a MySQL database to RDS

Determine the target MySQL RDS instance, or commission a new one per the criteria outlined in the Database Tenancy and Database Deployment sections.
Backup source database using MySQL dump.
Restore the source database backup to the RDS instance using the command-line utilities on a separate EC2 instance with MySQL tools installed or using MySQL Workbench
Review and test.

Database Monitoring

Cloudwatch

Cloudwatch will offers sufficient monitoring for MySQL servers.

Nagios

The Nagios MySQL plugin will offer additional monitoring and alerting options in the future.

Database Backup and Recovery

Data backups (EC2)

For EC2 based instances native backup tools will be used on the following schedule

Weekly Full backups on Sunday, retained locally for 15 days, retained in S3 for three months
Daily full backups, retained locally for 15 days, retained in S3 for two weeks
Monthly Full backups on the first of the month retained locally for 15 days and in S3 for 18 months
Binary log backups will be made in intervals not to exceed 30 minutes in production or 4 hours in non-production systems. Transaction log backup retention will match weekly backup retention.

Retention on the local machine is controlled by the backup script, retention in S3 will be driven by S3 lifecycle policies.

Server Backups (EC2)

Currently Platform Services uses Cloud Protection Manager to snapshot EC2 instances on a daily basis. Through the configuration, the snapshot can include all mounted volumes (disks) or only include root volume (ie, C: drive). The snapshot backups will be kept for 30 days for all environments (dev, test and prod) within local region; the snapshot backups for production instances will also be stored cross regionally with a separate AWS account (backup account).

Recovery Scenarios (EC2)

Seven likely failure and recovery scenarios are outlined below:

An individual database crashes.

The database will be recovered from its backup stored locally. Depending on the size of the database, recovery time can be few minutes to an hour.
Data volume crashes.

First, a new EBS volume will be created and mounted to the EC2 instance. This will take less than 10 minutes. Additional 10 plus minutes is needed to create folder layout and bring up MySQL. Then each of databases will be recovered from its backup (similar to scenario 1). The total down time can be 25 minutes or more depending on the number of databases on the server and the size of each database.
Backup Volume crashes.

A new EBS volume will be created and mounted to the EC2 instance. Then the folder layout will be re-generated and a full database backup performed on all databases. Copy backup files from S3 to the backup volume if there is need. The total dba working time will be less than 20 minute and server down time is zero.
EC2 instance crashes but the root volume is intact.

In the event of a hardware crash where no EBS volumes are affected the instance will need to be rebooted. Total downtime will be less than 10 minutes.
Both EC2 instance and root volume (C: drive) crash.

Recovery option 1: The snapshots of the EC2 instance and root volume, which were taken by Core Services, will be used to re-instantiate a new EC2 instance and the EBS volumes re-mounted to the new instance. The total down time is unknown at this point.

Recovery option 2: provision a new EC2 from Cloud Formation script. Install MySQL along with upgrades/patches vi ansible. Mount EBS volumes and bring up the database server. The total down time will be less than 30 minutes.
A table was accidentally deleted.

Restore the database from its backups taken prior to the time when the table was deleted; give the restored database a different name. Import the table into its original database. The total dba working time will be about 10 minutes plus, which depends on the size of the database and the number of backups needed. The total database down time is zero.

If the table was deleted 8 days ago, the specific backup files (a full backup, few differential and transaction log backups) will be copied from S3 to the EBS volume. Then we perform the steps stated above.
Loss of EC2 instance and all EBS volumes.

In a disaster event, we could lose EC2 instance and storage. We’ll recover the EC2 instance and root volume in the way described in scenario 5. Then create other three EBS volumes and copy the backups from S3 to EBS volumes. Bring up the server and restore each of databases from its backups.

Recovery Methodology (EC2)

The standard backup script and crontab will create a daily full backup using mysqldump, the corresponding .pos file will indicate the last binlog used prior to the full backup. The binlogs will be enabled and backed up every 15 minutes (also via cron). To recover to a point in time choose the latest full backup and all the changelog.* files up to the desired recovery point (you may need to go to s3 for files older than 8 days). The mysqlbinlog utility is used to extract SQL from the changelog.* files, then the full dump and all the extracted SQL from the binlogs can be replayed into the database. On the final changelog file use the --stop-datetime ‘YY-MM-DD HH:MM:SS’ option to stop SQL extraction the the desired recovery point.

Example Recovery

Copy and unzip the full backup
- cd /tmp
- cp /mysql/bkup/2016-03-08-162440.sql.gz .
- gunzip 2016-03-08-162440.sql.gz
- 2016-03-08-162440.pos indicates the last used log was 39 so begin log recovery with 40
Copy and extract the changelog files
- cp /mysql/logs/changelog.000040 .
- cp /mysql/logs/changelog.000041 .
- mysqlbinlog changelog.000040 > changelog.000040.sql
- mysqlbinlog --stop-datetime ‘2016-03-10 06:11:11’ changelog.000041 > changelog.000041.sql
Run the resulting SQL to recover the database(s)
- mysql -u root < 2016-03-08-162440.sql
- mysql -u root < changelog.000040.sql
- mysql -u root < changelog.000041.sql
Query the database to ensure the data is what is expected
Take a new full backup

Backup and Recovery (RDS)

Backup

For RDS instances snapshot retention will be set at the maximum of 35 days.

Recovery

Recovery of a failed RDS instance will use the point-in-time recovery provided by the AWS console. This will require communicating a new endpoint address to the end users.

Recovery of a single database will require a point-in-time recovery of the instance, then a native dump and restore of the individual database.

Disaster Recovery

The standard scenario for DR is the logical destruction of the DCND account. In this case the RDS and its snapshots will be gone. With no mechanism to move snapshots to DRND, recovery will be dependent on out-of-band database dumps taken by the end users or application admin.

Database Security Standards

Based on best practices published by The Center for Internet Security (CIS) for MySQL.

Adapted for practice at the University of Notre Dame in March 2016 by the Databases Services team.