Overview

This document describes the hosting strategy adopted by the Database Services team for hosting Microsoft SQL Server databases in the Amazon cloud (AWS). It covers the factors that determine the tenancy of a particular database as well as guidelines for deploying, migrating, monitoring, backing up, and recovering Microsoft SQL Server databases in AWS.

Database Tenancy

General Considerations

The subsections below describe the major factors to take into consideration when migrating a SQL Server database to AWS or implementing a new SQL Server database in AWS.

On a case-by-case basis an application's database will be evaluated against these criteria to determine the tenancy of the database.

To maximize our investment in the AWS infrastructure, the default implementation will be a multi-tenant SQL Server Standard Edition instance on an r5.2xlarge EC2 instance. Standard Edition offers the features most often required by our application population, including basic implementations of the HA tools. Should licensing cost become an issue, Standard Edition offers significant savings over the Enterprise, BI, or Datacenter editions. The EC2 platform offers the balance of compute power (CPU) and memory appropriate for a well-performing SQL Server workload at the scale we anticipate implementing on an individual server.

Exceptions to the default implementation will be considered as described below.

Database Feature Requirements

The current preferred SQL Server installation is SQL Server 2016 Standard Edition on an r5.2xlarge AWS EC2 instance. Applications that require Enterprise Edition features or a different version will require an additional SQL Server EC2 instance.

Database Size

Applications with large databases (>100 GiB) may be candidates for a separate AWS server, depending on the performance requirements of the application or any special backup/recovery requirements.
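
For reference, current database sizes can be checked with a query like the following (a minimal sketch; it assumes the SqlServer PowerShell module is available, and the instance name is a placeholder):

# Report the combined data and log size of each database in GiB
Import-Module SqlServer
$query = @"
SELECT d.name AS DatabaseName,
       CAST(SUM(mf.size) * 8.0 / 1024 / 1024 AS DECIMAL(10,2)) AS SizeGiB  -- size is counted in 8 KB pages
FROM sys.master_files mf
JOIN sys.databases d ON d.database_id = mf.database_id
GROUP BY d.name
ORDER BY SizeGiB DESC;
"@
Invoke-Sqlcmd -ServerInstance 'sqlstd1-prod' -Query $query | Format-Table -AutoSize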

Data Sensitivity

Sensitive data (as determined by the customer following InfoSec's published data classifications) will not be mixed on the same server with non-sensitive data.

Availability Requirements

If a customer has a demonstrated need for a highly available database, an HA instance will be made available. The bar for SQL Server HA in AWS will be very high given the low recovery time we have demonstrated in a single-instance recovery scenario (as low as 21 minutes to recover a server).

Compute Requirements

When an application's compute requirements will affect the performance of other databases on a shared instance, it will either need to be migrated to a separate instance or the existing server will be upgraded. The current AWS cost model generally makes a new instance more cost effective.

Client/Third Party Needs

Certain third-party applications are written so that they require elevated privileges to run, employ sub-standard security, or are otherwise "badly behaved". These applications will be candidates for migration to a dedicated server or to a designated, less-secure server.

Database Deployment

Using Amazon’s Elastic Compute Cloud Service

Deploying SQL Server on an Elastic Compute Cloud (EC2) instance offers the greatest flexibility. Because EC2 gives full control of the instance and its installation, all of the features of SQL Server are available.

Advantages and limitations of EC2

Using EC2 leverages the infrastructure and services built by OIT in AWS: server-level backups, Active Directory, and other shared services are available. Full access to the underlying operating system allows for custom processing such as data imports, and full access to the SQL Server management features allows for better performance tuning. The main limitation of SQL Server on EC2 is that Elastic Block Storage (EBS) volumes cannot be resized as storage needs increase; either new volumes must be added or a downtime must be taken to transfer the data to a new, larger volume.
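
As an illustration of the add-a-volume workaround, a new volume can be created and attached with the AWS Tools for PowerShell (a sketch only; the instance ID, size, and device name are placeholders):

# Create a new 100 GiB gp2 volume in the instance's availability zone and attach it
$instanceId = 'i-0123456789abcdef0'
$az = (Get-EC2Instance -InstanceId $instanceId).Instances[0].Placement.AvailabilityZone
$volume = New-EC2Volume -AvailabilityZone $az -Size 100 -VolumeType gp2
Add-EC2Volume -InstanceId $instanceId -VolumeId $volume.VolumeId -Device 'xvdg'
# The new volume must still be brought online and formatted in Windows Disk Management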

Using Amazon’s Relational Database Service

Some of the limitations of Amazon's SQL Server RDS offering make it unsuitable as a general-purpose platform for hosting MS SQL Server in our environment. The most notable limitations are the lack of direct Active Directory integration and of the SQL Server Agent service. For OIT-managed SQL Server databases we will provide Notre Dame standard management of the instance.

Advantages and Limitations of RDS

For non-OIT departments that want to implement an RDS SQL Server on their own, we can offer the following analysis of the service and a link to Amazon's published documentation.

“Amazon RDS for Microsoft SQL Server makes it easy to set up, operate, and scale SQL Server deployments in the cloud. With Amazon RDS, you can deploy multiple editions of SQL Server (2008 R2 and 2012) including Express, Web, Standard and Enterprise, in minutes with cost-efficient and re-sizable compute capacity. Amazon RDS frees you up to focus on application development by managing time-consuming database administration tasks including provisioning, backups, software patching, monitoring, and hardware scaling.” - aws.amazon.com/rds/sqlserver (09/2015)

There are some limitations that should be considered before committing to the RDS platform. It is a very good basic database platform, but Amazon disables most advanced management features, reserving them for itself.

  1. RDS for SQL Server is not re-scalable at the storage layer. While you can add memory and compute, you cannot change the size of the disks or the IOPS without rebuilding the server. Maximum growth must be considered at build time.
  2. RDS requires AWS Directory Service to be Active Directory aware, adding a layer of complexity to the authentication infrastructure.
  3. RDS backups are limited to 35 days; if you have archive needs longer than that you will need to create a custom backup scheme outside what RDS provides.
  4. Terminating an RDS instance will, by default, delete all of its backups. You need to explicitly request a final backup when stopping your instance to avoid data loss (see the sketch after this list).
  5. You have no access to the underlying operating system resources; if you have jobs that rely on external data files (imports/exports), PowerShell, or batch scripts, they cannot be used on an RDS instance. This also restricts access to many instance administrative functions.
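
As an illustration of item 4, a final snapshot can be requested when terminating an instance with the AWS Tools for PowerShell (a sketch; the identifiers are placeholders):

# Terminate an RDS instance but keep a final snapshot of it
Remove-RDSDBInstance -DBInstanceIdentifier 'mydb-rds' `
    -FinalDBSnapshotIdentifier 'mydb-rds-final-20160601' -Force
# Omitting -FinalDBSnapshotIdentifier (or passing -SkipFinalSnapshot $true) discards the automated backups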

Items 2 and 5 may eliminate RDS from consideration for many applications that rely on AD or external resources.

Item 4 means that you need to be very careful when restoring from backup so that you do not lose backup history. The recovery method is to create a NEW instance with its own 35-day backup history; removing the original instance removes all of its associated backups, which means you need to keep it until the new instance catches up.

Taking into account the advantages and limitations of Amazon’s RDS will help you determine if it is a viable solution for your non-OIT supported database needs.

SQL Server deployment on EC2

Server Commissioning

Follow the standard commissioning process to request a new server. The platform group has a CloudFormation script specifically for SQL Server machines with the following defaults:

  1. Machine size r5.2xlarge
  2. Four EBS volumes of appropriate size:
    1. Data
    2. Tlog
    3. Backup
    4. TempDB
  3. Attachment to the domain, with the Database Services group added to the local Administrators group
  4. Addition of the role assume-database-ec2-roles


SQL Server setup automated by Powershell

The PowerShell script ServerSetup.ps1, stored in the S3 bucket sqlserverbackups.dcnd under BootstrapFiles, performs the remaining SQL Server setup on the new server.

To copy it to the server, run the following PowerShell snippet:

# Determine the region from the EC2 instance metadata service
$AZ = ( Invoke-WebRequest 'http://169.254.169.254/latest/meta-data/placement/availability-zone' -UseBasicParsing ).Content
$Region = $AZ.Substring(0, $AZ.Length - 1)

# Download the setup script from S3 to the ephemeral drive
$BucketName = "sqlserverbackups.dcnd"
$KeyName = "BootstrapFiles/ServerSetup.ps1"
$FileName = "z:/ServerSetup.ps1"
Read-S3Object -BucketName $BucketName -Key $KeyName -File $FileName -Region $Region

Then run the ServerSetup.ps1 script as Administrator.

SQL Server deployment on RDS

To deploy a SQL Server RDS instance, follow the steps outlined in the migration section below. Also make sure that the instance is commissioned in ServiceNow and a Reserved Instance (RI) is purchased, if needed.

High Availability

SQL Server high availability is accomplished by using Availability Groups (AGs). AGs require two servers, preferably in different AWS Availability Zones. Request that a Windows Server Failover Cluster (WSFC) be commissioned; there is a CloudFormation script that will build it. When the WSFC is available, SQL Server Enterprise Edition needs to be installed on each node. Once the database is built, the AG wizard will guide you through creating and populating the AG.

SQL Server Availability Groups are a high-availability option, but the relatively low RTO achievable in AWS without HA should preclude their use in most cases.
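
For reference, the T-SQL the AG wizard produces is roughly equivalent to the following (a simplified sketch; server, endpoint, and database names are placeholders, and the secondary must already hold restored copies of the database):

# Create a two-replica Availability Group and join the secondary (simplified sketch)
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance 'SQLNODE1' -Query @"
CREATE AVAILABILITY GROUP [AG1]
FOR DATABASE [AppDb]
REPLICA ON
  N'SQLNODE1' WITH (ENDPOINT_URL = N'TCP://sqlnode1.example.edu:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, FAILOVER_MODE = AUTOMATIC),
  N'SQLNODE2' WITH (ENDPOINT_URL = N'TCP://sqlnode2.example.edu:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, FAILOVER_MODE = AUTOMATIC);
"@
# On the secondary replica, join the AG
Invoke-Sqlcmd -ServerInstance 'SQLNODE2' -Query "ALTER AVAILABILITY GROUP [AG1] JOIN;"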


Migration Methodology

How to migrate a SQL Server database to EC2

The migration of SQL Server databases to AWS is relatively straightforward. The database(s) for the application will be backed up, the backups transferred to AWS, and the databases restored to the appropriate server as determined by the above criteria. Any server-level objects will have to be identified, scripted, and recreated on the target server; these may include items like logins, SQL Server Agent jobs, linked servers, or SSIS packages. The DBAs have tools that will help simplify this and will work in concert with CITS to re-deploy custom objects.

  1. Determine the target SQL Server instance, or commission a new one per the criteria outlined in the Database Tenancy and Database Deployment sections.
  2. Back up the source database(s) using a native SQL Server backup; compression is advised to reduce transfer time to the cloud (a sketch of this step and step 4 follows the list).
  3. Script all needed non-database objects such as logins and agent jobs. In most cases this can be accomplished by right-clicking the database object and selecting "script" from the tasks menu. For security objects use the Idera Permissions Extractor tool.
  4. Restore the source database backup(s) to the target server.
  5. Run the scripts created in step 3 to create the non-database objects.
  6. Review and test.
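
A minimal sketch of steps 2 and 4 (server names, database names, logical file names, and paths are placeholders; it assumes the SqlServer and AWS PowerShell modules are available):

# Step 2: take a compressed native backup on the source server
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance 'oldserver' -Query @"
BACKUP DATABASE [AppDb] TO DISK = N'F:\Backup\AppDb_migrate.bak'
WITH COMPRESSION, COPY_ONLY, STATS = 10;
"@
# Transfer the backup to the cloud through S3
Write-S3Object -BucketName 'sqlserverbackups.dcnd' -Key 'migration/AppDb_migrate.bak' -File 'F:\Backup\AppDb_migrate.bak'
# Step 4: after pulling the file down with Read-S3Object, restore on the target instance
Invoke-Sqlcmd -ServerInstance 'sqlstd1-prod' -Query @"
RESTORE DATABASE [AppDb] FROM DISK = N'F:\Backup\AppDb_migrate.bak'
WITH MOVE N'AppDb' TO N'D:\Data\AppDb.mdf',
     MOVE N'AppDb_log' TO N'E:\Tlog\AppDb_log.ldf', STATS = 10;
"@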

How to migrate a SQL Server database to RDS

  1. Build an RDS instance, or decide which one to target.
  2. Back up the source database using a SQL Server native backup; compression is advised to reduce transfer time to the cloud.
  3. Script all needed non-database objects such as logins. Use the Idera Permissions Extractor tool; some small alteration of the created script may be needed.
  4. Restore the source database backup to an intermediate server (such as an EC2 instance in AWS).
  5. Use the SQL Azure Migration Wizard from CodePlex to extract DDL and data (into bcp files; see the sketch after this list). There are limits as to what the wizard can migrate, and the scripts should be reviewed for completeness. The wizard wants to put primary keys on non-indexed tables; these may need to be removed from the script or changed.
  6. Run the (altered) SQL Azure Migration Wizard scripts.
  7. Run the scripts created in step 3.
  8. Review and test.
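
For reference, bcp data files like those the wizard produces in step 5 can also be exported and loaded manually (a sketch; table, server, and credential values are placeholders, and RDS requires SQL authentication):

# Export a table in native format from the intermediate server (trusted connection)
& bcp.exe "AppDb.dbo.Orders" out "C:\migrate\Orders.bcp" -S "intermediate-server" -T -n
# Load it into the RDS instance over SQL authentication
& bcp.exe "AppDb.dbo.Orders" in "C:\migrate\Orders.bcp" -S "mydb-rds.xxxxxxxx.us-east-1.rds.amazonaws.com" -U "admin" -P "********" -n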

Database Monitoring

SCOM

SCOM is used for alerting on OS events such as server start/stop.

Idera (Diagnostic Manager/performance monitor, Prod only)

The Idera Diagnostic Manager will be used to monitor and alert on all production SQL Server instances.

PBM/EPM (Policy-based management)

The built-in Policy Based Management tool and the open-source Enterprise Policy Management (EPM) Framework wrapper will be used to monitor proper configuration of all SQL Server instances.

UCP (performance history, non-production)

A Utility Control Point (an MSSQL built-in feature) will be used to monitor non-production SQL Server instances and production instances not reachable by Idera.


Database Backup and Recovery

Data backups

Native SQL Server backup tools will be used on the following schedule: weekly full backups, daily differential and transaction log backups, and monthly copy-only full backups for long-term archive (see the retention table below).

Retention on the local machine is controlled by the PowerShell-driven backup script; retention in S3 is driven by S3 lifecycle policies.
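
For reference, the native backup types in this schedule look like the following (a simplified sketch of the commands involved; database names and paths are placeholders, and file names follow the pattern described under the S3 layout below):

# Simplified examples of each backup type in the schedule
Import-Module SqlServer
$inst = 'sqlstd1-prod'
# Weekly full backup
Invoke-Sqlcmd -ServerInstance $inst -Query "BACKUP DATABASE [AppDb] TO DISK = N'F:\Backup\AppDb_20160528070011.bak' WITH COMPRESSION;"
# Daily differential backup
Invoke-Sqlcmd -ServerInstance $inst -Query "BACKUP DATABASE [AppDb] TO DISK = N'F:\Backup\AppDb_20160529070011.dif' WITH DIFFERENTIAL, COMPRESSION;"
# Transaction log backup
Invoke-Sqlcmd -ServerInstance $inst -Query "BACKUP LOG [AppDb] TO DISK = N'F:\Backup\AppDb_20160529120001.trn' WITH COMPRESSION;"
# Monthly copy-only full backup for the long-term archive
Invoke-Sqlcmd -ServerInstance $inst -Query "BACKUP DATABASE [AppDb] TO DISK = N'F:\Backup\AppDb_20160601070011.bak' WITH COPY_ONLY, COMPRESSION;"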

Database Backups Archive (Long Term Storage) Layout on S3

SQL Server backup copies are stored on local disk for only 8 days; as soon as a backup is produced, it is copied from local disk to S3 for long-term storage. On S3, we create lifecycle policies to implement data retention. Since an S3 lifecycle policy can be applied to a folder and affect all its subfolders, we structure the S3 storage space as below:

<bucketName>/<environment>/<frequency>/<sqlServerName>/<databaseName>/<backupFileName>

 

bucketName is “sqlserverbackups.dcnd”.

environment takes two values: “np” stands for non-production; “prod” stands for production.

frequency takes three values: “daily”, “weekly” and “monthly”. S3 lifecycle policies are applied at this level of folders.

 

Policy Name             Rule Target     Comment
prod/daily 15days       prod/daily/     Production daily differential and transaction log backups will be kept for 15 days
prod/weekly 90 days     prod/weekly/    Production weekly full backups will be kept for 90 days
prod/monthly 18 months  prod/monthly/   Production monthly copy-only full backups will be kept for 18 months
np/daily 8 days         np/daily/       Non-production daily differential and transaction log backups will be kept for 8 days
np/weekly 90 days       np/weekly/      Non-production weekly full backups will be kept for 90 days
np/monthly 6 months     np/monthly/     Non-production monthly copy-only full backups will be kept for 6 months

 

sqlServerName takes names such as “sqlmon3-prod”, “sqlstd1-prod”, …, etc.

databaseName takes the names of the databases on the individual SQL Server.

backupFileName takes names with the pattern <databaseName>_<dateAndTime>.<fileExtension>, such as "dbautility_20160528070011.bak".
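
A lifecycle rule at the frequency-folder level can be created with the AWS Tools for PowerShell, roughly as follows (a sketch of the prod/daily rule from the table above; the Amazon.S3.Model types ship with the AWS tools):

# Expire objects under prod/daily/ after 15 days
$rule = New-Object Amazon.S3.Model.LifecycleRule
$rule.Id = 'prod/daily 15days'
$rule.Prefix = 'prod/daily/'
$rule.Status = 'Enabled'
$rule.Expiration = New-Object Amazon.S3.Model.LifecycleRuleExpiration
$rule.Expiration.Days = 15
$config = New-Object Amazon.S3.Model.LifecycleConfiguration
$config.Rules.Add($rule)
Write-S3LifecycleConfiguration -BucketName 'sqlserverbackups.dcnd' -Configuration $config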

Server Backups

Currently Core Services uses Cloud Protection Manager to snapshot EC2 instances on a daily basis. Through configuration, the snapshot can include all mounted volumes (disks) or only the root volume (i.e., the C: drive). Snapshot backups will be kept for 30 days for all environments (dev, test, and prod) within the local region; snapshot backups for production instances will also be stored cross-region in a separate AWS account (the backup account).

Recovery Scenarios

There are five volumes mounted on an EC2 instance for SQL Server (four EBS volumes plus the ephemeral instance storage):

  1. Root volume (C: drive): where the SQL Server binaries are stored.
  2. Data volume (D: drive): where database data files and home-grown scripts are stored.
  3. Tlog volume (E: drive): where database transaction log files are stored.
  4. Backup volume (F: drive): where database backups are stored. This volume is usually referred to as local backup storage.
  5. Ephemeral storage (Z: drive): where TempDB data files and transaction log files are stored. This volume is recreated each time the instance is booted (r5d family of EC2).

Seven likely failure and recovery scenarios are outlined below:

  1. An individual database crashes.

    The database will be recovered from its backup stored on the F: drive. The recovery time is close to what it takes on an on-premise SQL Server. Depending on the size of the database, it can be a few minutes to an hour.

  2. Data volume and/or Tlog volume crashes.

    First, a new EBS volume will be created and mounted to the EC2 instance; this will take less than 10 minutes. An additional 10-plus minutes is needed to create the folder layout and bring up SQL Server. Then each database will be recovered from its backup (similar to scenario 1). The total downtime can be 25 minutes or more, depending on the number of databases on the SQL Server and the size of each database.

  3. Backup Volume crashes.

    A new EBS volume will be created and mounted to the EC2 instance. Then the folder layout is re-generated and a full database backup is performed on all databases. Backup files are copied from S3 to the backup volume if needed. The total DBA working time will be less than 20 minutes, and SQL Server downtime is zero.

  4. EC2 instance crashes but the root volume (C: drive) is intact.

    In the event of a hardware crash where no EBS volumes are affected, the instance will need to be rebooted. Total downtime will be less than 10 minutes.

  5. Both EC2 instance and root volume (C: drive) crash.

    Recovery option 1: the snapshots of the EC2 instance and root volume taken by Core Services will be used to re-instantiate a new EC2 instance, and the EBS volumes will be re-mounted to the new instance. The total downtime is unknown at this point.

    Recovery option 2: provision a new EC2 instance from the CloudFormation script, install SQL Server along with upgrades/patches, mount the EBS volumes (D:, E:, F:), and bring up SQL Server. The total downtime will be less than 30 minutes.

  6. A table was accidentally deleted.

    Restore the database from backups taken prior to the time the table was deleted, giving the restored database a different name, then import the table into its original database (see the sketch after this list). The total DBA working time will be about 10 minutes or more, depending on the size of the database and the number of backups needed. The total SQL Server downtime is zero.

    If the table was deleted more than 8 days ago, the specific backup files (a full backup plus any differential and transaction log backups) will first be copied from S3 to the EBS volume. Then we perform the steps stated above.

  7. Loss of EC2 instance and all EBS volumes.

    In a disaster event, we could lose the EC2 instance and all storage. We will recover the EC2 instance and root volume as described in scenario 5, then create the other three EBS volumes and copy the backups from S3 to them. Finally, bring up SQL Server and restore each database from its backups.
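
A sketch of the table-recovery steps in scenario 6 (database, table, logical file, and backup file names are placeholders; the restore is simplified to a single full backup):

# Restore a copy of the database under a new name, then pull the table back
Import-Module SqlServer
Invoke-Sqlcmd -ServerInstance 'sqlstd1-prod' -Query @"
RESTORE DATABASE [AppDb_recover] FROM DISK = N'F:\Backup\AppDb_20160528070011.bak'
WITH MOVE N'AppDb' TO N'D:\Data\AppDb_recover.mdf',
     MOVE N'AppDb_log' TO N'E:\Tlog\AppDb_recover.ldf';
SELECT * INTO AppDb.dbo.Orders FROM AppDb_recover.dbo.Orders;
"@
# Drop the recovery copy once the table is back
Invoke-Sqlcmd -ServerInstance 'sqlstd1-prod' -Query "DROP DATABASE [AppDb_recover];"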

Database Security Standards

These standards are based on the best practices published by the Center for Internet Security (CIS) for SQL Server. The currently available document is version 1.0.0, covering SQL Server versions up to 2016. They were adapted for practice at the University of Notre Dame in March 2016 by the Database Services team.

Unless specifically authorized, database instances will reside in the private subnet of the VPC with the appropriate security groups attached.

EBS volumes will be encrypted.
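
Compliance with the encryption standard can be spot-checked with the AWS Tools for PowerShell (a sketch):

# List any unencrypted EBS volumes in the current account and region
Get-EC2Volume | Where-Object { -not $_.Encrypted } |
    Select-Object VolumeId, Size, State, @{ n = 'Instance'; e = { $_.Attachments.InstanceId } }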

The Notre Dame password policy will be enforced.