Overview
This document describes the hosting strategy adopted by the Databases Services team for hosting Microsoft SQL Server databases in the Amazon cloud (AWS). It covers the factors that affect the tenancy of a particular database as well as guidelines for deploying, migrating, monitoring, recovering, and backing up Microsoft SQL Server databases in AWS.
Database Tenancy
General Considerations
The following are the major factors to consider when migrating a SQL Server database to AWS or implementing a new SQL Server database in AWS:
- Microsoft SQL Server version-specific features.
- The size of the database.
- The number of databases on one server.
- The sensitivity of the data in the database.
- The availability expectations of the database.
- The compute requirements of the database.
- The unique behavioral and/or security requirements of the database or application that the database supports.
On a case-by-case basis, an application’s database will be evaluated against the above criteria to determine the tenancy of the database.
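These criteria can be summarized as a simple decision sketch. This is illustrative Python only, not an official policy implementation; the field names and thresholds below are assumptions for the example (the 100 GiB size threshold mirrors the Database Size guideline later in this document):

```python
# Hypothetical sketch of the tenancy evaluation; field names are assumptions.
def needs_dedicated_instance(db):
    """db is a dict describing the application's database."""
    return any([
        db.get("requires_enterprise_features", False),  # edition/version needs
        db.get("size_gib", 0) > 100,                    # large database
        db.get("sensitive_data", False),                # InfoSec classification
        db.get("high_availability", False),             # demonstrated HA need
        db.get("heavy_compute", False),                 # would affect neighbors
        db.get("badly_behaved", False),                 # elevated privileges etc.
    ])

# A small, well-behaved database lands on the shared Standard Edition host.
print(needs_dedicated_instance({"size_gib": 20}))   # False
print(needs_dedicated_instance({"size_gib": 250}))  # True
```

Any single criterion is enough to move a database off the shared multi-tenant host, which is why the sketch uses `any` rather than a weighted score.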
To maximize our investment in the AWS infrastructure, the default implementation will be a multi-tenant SQL Server Standard Edition r6i EC2 instance. Standard Edition offers the features most often required by our application population, including basic implementations of the HA tools. Should licensing cost become an issue, Standard Edition offers significant savings over the Enterprise, BI, or Datacenter editions. The r6i EC2 type offers the balance of compute power (CPU) and memory appropriate for a well-performing SQL Server workload at the scale we anticipate implementing on an individual server.
Exceptions to the default implementation will be considered as described below.
Database Feature Requirements
The current preferred SQL Server install is SQL Server 2019 Standard Edition on an R6i AWS EC2 instance. Applications that require Enterprise edition features or a different version will be considered and discussed with clients.
Database Size
Applications with large databases (>100 GiB) may be candidates for a separate SQL Server on its own EC2 instance, depending on the performance requirements of the application or any special backup/recovery requirements.
Data Sensitivity
Sensitive data (to be determined by the customer following the published classifications of data by InfoSec) will not be mixed on the same server with non-sensitive data.
Availability Requirements
If a customer has a demonstrated need for a highly available database, an HA instance will be made available. The bar for SQL Server HA in AWS will be very high given the low recovery time we have demonstrated in a single-instance recovery scenario (as low as 21 minutes to recover a server).
Compute Requirements
When an application’s compute requirements will affect other databases’ performance on a shared instance, it will either need to be migrated to a separate instance or the existing server will be upgraded. The current AWS cost model generally makes a new instance more cost effective.
Client/Third Party Needs
Certain third-party applications are written so that they require elevated privileges to run, employ sub-standard security, or are otherwise “badly behaved”. These applications will be candidates for migration to a dedicated server or to a designated less secure server.
Database Deployment
Using Amazon’s Elastic Compute Cloud Service
Deploying SQL Server on an Elastic Compute Cloud (EC2) instance offers the greatest flexibility. Because EC2 gives full control of the instance, installation of all SQL Server features is possible.
Advantages and Limitations of EC2
Using EC2 leverages the infrastructure and services built by OIT in AWS. Server-level backups, Active Directory, and other shared services are available. Full access to the underlying operating system allows for custom processing such as data imports, and full access to the SQL Server management features allows for better performance tuning. We are able to use some of the more advanced SQL Server features, such as sending emails, FILESTREAM, linked servers, etc. The drawback of SQL Server on EC2 is that we must manage both the host and SQL Server ourselves: upgrading and patching the server, taking full backups of volumes, and so on.
Using Amazon’s Relational Database Service
Some of the limitations of Amazon’s SQL Server RDS offering make it unsuitable as a general purpose platform for hosting MS SQL Servers in our environment. The most notable limitation is the lack of Active Directory integration. Since we will need to implement for ourselves many of the database management processes offered by RDS, there is no advantage to incurring the additional cost for an AWS managed system. For OIT-managed SQL Server databases we will provide Notre Dame standard management of the instance.
Advantages and Limitations of RDS
For non-OIT departments that want to implement an RDS SQL Server on their own, we can offer the following analysis of the service and a link to Amazon’s published documentation.
“SQL Server is a relational database management system developed by Microsoft. Amazon RDS for SQL Server makes it easy to set up, operate, and scale SQL Server deployments in the cloud. With Amazon RDS, you can deploy multiple editions of SQL Server (2014, 2016, 2017 and 2019), including Express, Web, Standard, and Enterprise, in minutes with cost-efficient and re-sizable compute capacity. Amazon RDS frees you up to focus on application development by managing time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.”
https://aws.amazon.com/rds/sqlserver/ (10/2022)
There are some limitations that should be considered before committing to the RDS platform. It is a very good basic database platform, but it disables most advanced management features, reserving them for the service itself.
1. RDS is not integrated with our Active Directory; you cannot use NetIDs or EDS groups for authentication. RDS can now work with AWS Directory Service for Microsoft Active Directory (also called AWS Managed Microsoft AD) to enable Windows Authentication, but Notre Dame has not yet purchased or implemented AWS Directory Service; we will explore this feature if that changes.
2. We cannot control the RDS backup retention policy.
3. Terminating an RDS instance will, by default, delete all of its backups. You also need to explicitly request a backup when stopping your instance to avoid data loss.
4. You have no access to the underlying operating system resources; if you have jobs that rely on external data files (imports/exports), PowerShell, or batch scripts, they cannot be used on an RDS instance. This also restricts access to many instance administrative functions.
5. Restoring, refreshing, or cloning a specific database on an RDS instance is complicated and time-consuming.
6. RDS does not support some Microsoft SQL Server features, such as FILESTREAM, Maintenance Plans, Replication, Linked Server (Oracle), etc.
Items 1 and 4 may eliminate RDS from consideration for many applications that rely on AD or external resources.
Item 3 means that you need to be very careful when restoring from backup so that you do not lose backup history. The recovery method is to create a NEW instance with its own 35-day backup history; removing the original instance removes all of its associated backups, so you need to keep it until the new instance’s history catches up.
Taking into account the advantages and limitations of Amazon’s RDS will help you determine if it is a viable solution for your non-OIT supported database needs.
SQL Server deployment on EC2
Server Commissioning
R6i instances are memory optimized; they are designed to deliver fast performance for workloads that process large data sets in memory. In general, we choose R6i-type EC2 instances to host SQL Server.
An Amazon Elastic Block Store (EBS) volume is a durable, block-level storage device that can be attached to an EC2 instance. General Purpose SSD (gp3) volumes offer a leading balance of price and performance for SQL Server workloads when configured appropriately. In general, we choose EBS gp3 volumes for our SQL Server hosts.
In ServiceNow, create a request to commission an EC2 R6i.xx instance as a SQL Server host, filling in all fields on the form. Our platform group has a CloudFormation script specifically for SQL Server machines. By default, the IAM role assume-database-ec2-roles should be attached to the newly built EC2 instance, and ADND/OE DBAS is added to the Administrators group on the host.
For each EC2 instance, we request five EBS gp3 volumes attached as local drives. The names and usages of the local drives are listed below:
C: -- store binaries
D: -- store the MSSQLxx.MSSQLSERVER directory (generated during SQL Server installation), user databases’ data files, and our scripts
E: -- store user databases’ transaction log files
F: -- store database backup files
Z: -- store temp database data files and transaction log files
SQL Server setup is partially done by PowerShell and T-SQL scripts. Here is the list of steps:
- Ensure there are 5 drives: C:, D:, E:, F:, and Z:;
- Run SQLSetupPreStep.ps1. This script creates the folders C:\sqlserver_system_database_file_backup, D:\Scripts, D:\MSSQL\data, E:\MSSQL\tlog, F:\MSSQL\backup, F:\MSSQL\sqlAgentJob, Z:\MSSQL\data, Z:\MSSQL\tlog, etc. It also copies home-grown database maintenance scripts from an S3 bucket to the local D:\Scripts.
- Install SQL Server and apply patches.
- Bounce the host, which will set some environment variables including the path to SQLPS.
- Create the DBATools database by restoring it from D:\Scripts\DBATools.bak. Open MaintenanceSolution_ND.sql in SSMS and search for “ND Env”; if needed, change the code between the “ND Env beginning” and “ND Env ending” markers. Execute the script to create the procedures and tables in the DBATools database.
- Modify D:\Scripts\SQLServerMoveBackup.ps1 and SQLServerArchiveSystemDBFiles.ps1: add the new SQL Server name to the hash table variable “$hash_env”, and update $pathToSysDBFiles in SQLServerArchiveSystemDBFiles.ps1.
- Log into the new SQL Server to execute "SQL Server Alert and Email Setup.sql" (sets up database alerts and email) and “CreateMain8Jobs.sql" (sets up five database backup jobs, one job to copy and move backups, one job to do index defragmentation, and one job to check data integrity). Adjust the eight jobs’ schedules.
- Test the database server connection from a desktop; if the new SQL Server is not accessible, verify that the local inbound SQL_TCP and SQL_UDP firewall rules exist on the new EC2 instance.
- Set up the weekly job to clean job history.
- Stop SQL Server; run D:\Scripts\SQLServerArchiveSystemDBFiles.ps1 to save the system database files to the C: drive and to S3.
- Add the SQL Server to Idera SQL Diagnostic Manager (if it is a prod server) on SQLIDERA1-PROD.
- Register it for Microsoft SQL Server Central Management and Master/Target SQL Agent jobs on SQLMON5-PROD.
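The drive and folder layout used throughout the steps above can be sketched as follows. The real pre-step script is PowerShell (SQLSetupPreStep.ps1); this Python version is only an illustration of the directory tree it produces, with a configurable root standing in for the drive roots so it can run anywhere:

```python
import os

# Directory layout from the commissioning standard above. The real script
# is PowerShell; this sketch only illustrates the tree it creates.
LAYOUT = {
    "C:": [r"sqlserver_system_database_file_backup"],
    "D:": [r"Scripts", r"MSSQL\data"],
    "E:": [r"MSSQL\tlog"],
    "F:": [r"MSSQL\backup", r"MSSQL\sqlAgentJob"],
    "Z:": [r"MSSQL\data", r"MSSQL\tlog"],
}

def create_layout(root):
    """Create the folder tree under `root`, one subtree per drive."""
    for drive, folders in LAYOUT.items():
        for folder in folders:
            path = os.path.join(root, drive.rstrip(":"), *folder.split("\\"))
            os.makedirs(path, exist_ok=True)
```

For example, `create_layout("/tmp/sqlhost")` produces `/tmp/sqlhost/F/MSSQL/backup` and the rest of the tree.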
SQL Server deployment on RDS
To deploy a SQL Server RDS instance, follow the steps outlined in the migration section below. Also make sure that the Core Services team is aware of the new instance so that it is included in the master server list and a Reserved Instance (RI) is purchased, if needed.
High Availability
SQL Server High Availability is accomplished by using Availability Groups. SQL Server AGs require two servers, preferably in different AWS Availability Zones. Request that a Windows Server Failover Cluster (WSFC) be commissioned; there is a CloudFormation script that will build it. When the WSFC is available, SQL Server Enterprise Edition needs to be installed on each node. Once the database is built, the AG wizard will guide you through creating and populating the AG.
SQL Server Availability Groups are a high availability option, but the relatively low RTO achievable in AWS without HA should preclude their use in most cases.
Migration Methodology
How to migrate a SQL Server database to EC2
The migration of SQL Server databases to AWS is relatively straightforward. The database(s) for the application will be backed up, the backups transferred to AWS and recovered to the appropriate server as determined by the above criteria. Any server-level objects will have to be identified, scripted, and recreated on the target server. These may include items like logins, SQL Server Agent jobs, Linked servers or SSIS packages.
- Determine the target SQL Server instance, or commission a new one per the criteria outlined in the Database Tenancy and Database Deployment sections.
- Back up the source database(s) using a native SQL Server backup; compression is advised to reduce transfer time to the cloud.
- Script all needed non-database objects such as logins and agent jobs. In most cases this can be accomplished by right-clicking the database object and selecting “script” from the tasks menu. For security objects use the Idera Permissions Extractor tool.
- Restore the source database backup(s) to the target server.
- Run the scripts created in step 3 to create the non-database objects.
- Review and test.
How to migrate a SQL Server database to RDS
- Build an RDS, or decide which one to target.
- Back up the source database using a native SQL Server backup; compression is advised to reduce transfer time to the cloud.
- Script all needed non-database objects such as logins. Windows domain authentication works only with AWS Directory Service for Microsoft Active Directory (also called AWS Managed Microsoft AD). Currently, Notre Dame has not implemented AWS Managed Microsoft AD, so RDS cannot use Windows Authentication and all Windows logins need to be converted to SQL Server logins. Use the Idera Permissions Extractor tool; some small alterations of the generated script may be needed.
- Connect to the RDS instance through SSMS; Restore the source database backup.
- Run the scripts created in step 3 to migrate login accounts.
- Review and test.
Database Monitoring
Idera (Diagnostic Manager/performance monitor, Prod only)
The Idera Diagnostic Manager will be used to monitor and alert on all production SQL Server instances.
UCP (performance history, non-production)
A Utility Control Point (an MSSQL built-in feature) will be used to monitor non-production SQL Server instances and production instances not being monitored by Idera.
Database Backup and Recovery
Database backups
Native SQL Server backup tools will be used. All backup files are generated and stored on an attached EBS volume, then copied to an S3 bucket in a different AWS account within 2 hours.
- Weekly full backups, retained locally for 8 days.
- Daily differential backups, retained locally for 8 days.
- Monthly full backups, taken on the first of the month, retained locally for 8 days.
- Transaction log backups will be made in intervals not to exceed 15 minutes in production or 4 hours in non-production systems. Transaction log backup retention will match differential backup retention.
Retention on the local machine is controlled by the PowerShell-driven backup script; retention in S3 is driven by S3 lifecycle policies.
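The backup cadence above can be sketched as a small scheduling function. This is illustrative Python, not the actual SQL Agent job logic; in particular, the weekday chosen for the weekly full is an assumption (the real jobs' schedules are adjusted per server), and transaction log backups run throughout the day regardless of which full/differential runs:

```python
import datetime

def backup_type(day: datetime.date, weekly_full_weekday: int = 6) -> str:
    """Return which full/differential backup runs on a given date.

    Mirrors the schedule above: a full on the 1st of the month, a weekly
    full on one fixed weekday (Sunday here -- an assumed default), and a
    daily differential otherwise.
    """
    if day.day == 1:
        return "monthly-full"
    if day.weekday() == weekly_full_weekday:
        return "weekly-full"
    return "daily-differential"

print(backup_type(datetime.date(2024, 5, 1)))  # monthly-full
print(backup_type(datetime.date(2024, 5, 2)))  # daily-differential
```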
Database backup Encryption
Before SQL Server 2014, Transparent Data Encryption (TDE) was commonly used to encrypt data at rest; however, it significantly reduces query performance and increases CPU consumption. Since SQL Server 2014, the Backup Encryption option allows us to encrypt just the backup.
- Prepare a SQL Server so we can take an encrypted database backup
To encrypt a SQL Server backup as it is created, we need an SMK, a DMK, and a certificate, and we must back them up first.
- Service Master Key (SMK): The SMK is automatically generated and stored in the master database when we install SQL Server, and is unique to each instance. It is used to encrypt the Database Master Key.
- Database Master Key (DMK): The DMK is unique to the master database of each instance. It is used to protect the certificate or asymmetric key.
- Certificate: A certificate can contain a private key that is protected by the Database Master Key, or an asymmetric key (not recommended). It is used to encrypt the database backup.
- Check the Existence of DMK
SELECT * FROM master.sys.symmetric_keys
- Create A DMK
-- Create the Database Master Key (assuming the database master key does not exist)
USE master
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD='password'
GO
- Create A Certificate
-- Create Backup Certificate;
--Note: Without the EXPIRY_DATE parameter, SQL Server creates a certificate that expires one year in the future by default.
USE master
GO
CREATE CERTIFICATE Certificate
WITH SUBJECT = 'SQL Backup Certificate';
GO
- Backup the key and certificate
-- Backup the Service Master Key
USE master
GO
BACKUP SERVICE MASTER KEY
TO FILE = 'filepath\SMKfilename.key'
ENCRYPTION BY PASSWORD = 'SMKpassword';
GO
-- Backup the Database Master Key
BACKUP MASTER KEY
TO FILE = 'filepath\DMKfilename.key'
ENCRYPTION BY PASSWORD = 'DMKpassword';
GO
-- Backup the Certificate
BACKUP CERTIFICATE Certificate
TO FILE = 'filepath\CertificateFilename.cer'
WITH PRIVATE KEY(
FILE = 'filepath\CertificateKeyFilename.key',
ENCRYPTION BY PASSWORD = 'CertificatePassword');
GO
- Backup database with encryption
BACKUP DATABASE [DBATools]
TO DISK = N'F:\MSSQL\backup\DBATools_encrypted.bak'
WITH
COPY_ONLY,
COMPRESSION,
ENCRYPTION(ALGORITHM = AES_256, SERVER CERTIFICATE = [BackupCert_ndpddb1p])
GO
Note: in the backup job, the backup step is defined as
EXECUTE dbo.DatabaseBackup
@Databases = 'USER_DATABASES',
@BackupType = 'FULL',
@Encrypt = 'Y',
@ServerCertificate = 'BackupCert_ndpddb1p',
@EncryptionAlgorithm='AES_256',
@Verify = 'Y',
@Compress = 'Y',
@CheckSum = 'Y'
- Check if a sql server database backup is encrypted
RESTORE HEADERONLY
FROM DISK = N'filepath\Filename.bak';
GO
Run the above statement in SSMS, and check the values in EncryptorThumbprint and EncryptorType columns. If it is not encrypted, they will be NULL.
- Restore a database with an encrypted backup on the same SQL Server where the backup was taken
This works just like restoring a database from an unencrypted backup.
- Restore a database with an encrypted backup (taken on server A) on a different SQL Server (server B)
- On server B, create a Master Key
CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'MasterKey_Password';
- Copy the certificate file CertificateFilename.cer and its private key file CertificateKeyFilename.key from their backup location on Server A to a location local to Server B. Connect to Server B in SSMS and create the certificate from the backups:
CREATE CERTIFICATE BackupEncryptTestCert
FROM FILE = 'filepath\CertificateFilename.cer'
WITH PRIVATE KEY
(
FILE = 'filepath\CertificateKeyFilename.key',
DECRYPTION BY PASSWORD = 'CertificatePassword'
)
- Restore the database on Server B
- Certificate expired
When the certificate expires, the backup job will fail with the following error:
“The Certificate specified for backup encryption has expired. [SQLSTATE 42000].
BACKUP DATABASE is terminating abnormally. [SQLSTATE 42000]”
How to fix:
Step 1: create a new certificate with a new expiration date
--create a backup certificate that will expire on Jan 31, 2033
USE master
CREATE CERTIFICATE BackupCert_ndpddb1p_2023 WITH SUBJECT = 'ndpd-db1-prod Backup Encryption Certificate',
EXPIRY_DATE = '20330131';
GO
--backup CERTIFICATE BackupCert_ndpddb1p_2023.cer and its private key
BACKUP CERTIFICATE BackupCert_ndpddb1p_2023
TO FILE = 'C:\sqlserver_system_database_file_backup\BackupCert_ndpddb1p_2023.cer'
WITH PRIVATE KEY
(
FILE = 'C:\sqlserver_system_database_file_backup\BACKUP_CERTIFICATE_PRIVATE_KEY_ndpddb1p_2023.key'
,ENCRYPTION BY PASSWORD = 'xxx'
);
Step 2: modify the backup jobs to use the new cert.
Database Backups Archive (Long Term Storage) Layout on S3
SQL Server backup copies are stored on a local disk for only 8 days; as soon as a backup is produced, it is copied from local disk to S3 for long-term storage. On S3, we create lifecycle policies to implement data retention. Since an S3 lifecycle policy can be applied to a folder and affect all of its subfolders, we structure the S3 storage space as below:
<bucketName>/<environment>/<frequency>/<sqlServerName>/<databaseName>/<backupFileName>
bucketName is “sqlserverbackups.drdatand”.
environment takes two values: “np” stands for non-production; “prod” stands for production.
frequency takes three values: “daily”, “weekly” and “monthly”. S3 lifecycle policies are applied at this level of folders.
Policy Name | Rule Target | Comment
prod/daily 15days | prod/daily/ | Production daily differential backups and transaction log backups will be kept for 15 days
prod/weekly 90 days | prod/weekly/ | Production weekly full backups will be kept for 90 days
prod/monthly 18 months | prod/monthly/ | Production monthly copy-only full backups will be kept for 18 months
np/daily 8 days | np/daily/ | Non-production daily differential backups and transaction log backups will be kept for 8 days
np/weekly 90 days | np/weekly/ | Non-production weekly full backups will be kept for 90 days
np/monthly 6 months | np/monthly/ | Non-production monthly copy-only full backups will be kept for 6 months
sqlServerName takes names such as “sqlmon5-prod”, “sqlstd1-prod”, …, etc.
databaseName takes the names of all databases on the individual SQL Server.
backupFileName takes names with the pattern <databaseName>_<dateAndTime>.<fileExtension>, such as “dbautility_20200328070011.bak”.
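The layout above can be expressed as a small key-builder sketch. This is illustrative Python, not the actual PowerShell move script; the key it returns is relative to the bucket (sqlserverbackups.drdatand), and the timestamp format is inferred from the example file name above:

```python
import datetime

def s3_backup_key(environment, frequency, server, database,
                  backup_time, ext="bak"):
    """Build the S3 object key for a backup file per the layout above.

    `backup_time` is a datetime; the <dateAndTime> piece matches the
    example name "dbautility_20200328070011.bak".
    """
    assert environment in ("np", "prod")
    assert frequency in ("daily", "weekly", "monthly")
    filename = f"{database}_{backup_time:%Y%m%d%H%M%S}.{ext}"
    return f"{environment}/{frequency}/{server}/{database}/{filename}"

key = s3_backup_key("prod", "weekly", "sqlstd1-prod", "dbautility",
                    datetime.datetime(2020, 3, 28, 7, 0, 11))
print(key)  # prod/weekly/sqlstd1-prod/dbautility/dbautility_20200328070011.bak
```

Because the frequency folder sits above the server and database folders, a single lifecycle rule targeting, e.g., prod/weekly/ covers every server's weekly fulls.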
Server Backups
Currently, Cloud Platform Services uses Cloud Protection Manager to snapshot EC2 instances on a daily basis. Through configuration, the snapshot can include all mounted volumes (disks) or only the root volume (i.e., the C: drive). The snapshot backups will be kept for 30 days for all environments (dev, test, and prod) within the local region; the snapshot backups for production instances will also be stored cross-region in a separate AWS account (backup account).
Recovery Scenarios
There are five EBS volumes mounted on an EC2 instance for SQL Server.
- Root Volume (C: drive): where SQL Server binaries are stored.
- Data Volume (D: drive): where database data files and home-grown scripts are stored.
- Tlog Volume (E: drive): where database transaction log files are stored.
- Backup Volume (F: drive): where database backups are stored. This volume is usually referred to as local backup storage.
- TempDB storage (Z: drive): where TempDB data files and transaction log files are stored. This volume is recreated each time the instance is booted.
Seven likely failure and recovery scenarios are outlined below:
- An individual database crashes.
The database will be recovered from its backup stored on the F: drive. The recovery time is close to what it takes on an on-premises SQL Server; depending on the size of the database, it can be a few minutes to an hour.
- Data volume, and/or Tlog volume crashes.
First, a new EBS volume will be created and mounted to the EC2 instance; this will take less than 10 minutes. An additional 10-plus minutes is needed to create the folder layout and bring up SQL Server. Then each database will be recovered from its backup (similar to scenario 1). The total downtime can be 25 minutes or more, depending on the number of databases on the SQL Server and the size of each database.
- Backup Volume crashes.
A new EBS volume will be created and mounted to the EC2 instance. Then regenerate the folder layout and perform a full database backup of all databases. Copy backup files from S3 to the backup volume if needed. The total DBA working time will be less than 20 minutes, and SQL Server downtime is zero.
- EC2 instance crashes but the root volume (C: drive) is intact.
In the event of a hardware crash where no EBS volumes are affected, the instance will need to be rebooted. Total downtime will be less than 10 minutes.
- Both EC2 instance and root volume (C: drive) crash.
Recovery option 1: The snapshots of the EC2 instance and root volume, which were taken by Cloud Platform Services, will be used to re-instantiate a new EC2 instance, and the EBS volumes will be re-mounted to the new instance. The total downtime is unknown at this point.
Recovery option 2: Provision a new EC2 instance from the CloudFormation script. Mount the EBS volumes (D:, E:, F:). Download the SQL Server installation files from S3 and install SQL Server along with upgrades/patches. Bring up SQL Server. The total downtime will be less than 30 minutes.
- A table was accidentally deleted.
Restore the database from backups taken prior to the time the table was deleted, giving the restored database a different name. Import the table into its original database with the overwrite option. The total DBA working time will be about 10 minutes or more, depending on the size of the database and the number of backups needed. The total SQL Server downtime is zero.
If the table was deleted more than 8 days ago, the specific backup files (a full backup and a few differential and transaction log backups) will first need to be downloaded from S3 to the EBS volume; then we perform the steps stated above.
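The choice of backup files for a point-in-time restore like this can be sketched as follows. This is illustrative Python of the selection logic only; the actual restore is performed with native SQL Server RESTORE commands (WITH NORECOVERY, and STOPAT for the final log) in SSMS or T-SQL:

```python
def restore_chain(backups, target_time):
    """Pick the backups needed to restore to `target_time`.

    `backups` is a list of (type, finish_time) tuples where type is
    "full", "diff", or "log". Returns the latest full before the target,
    the latest diff after that full (if any), then every log backup after
    the last full/diff up to and including the first one past the target.
    """
    fulls = [t for k, t in backups if k == "full" and t <= target_time]
    if not fulls:
        raise ValueError("no full backup before the target time")
    chain = [("full", max(fulls))]
    diffs = [t for k, t in backups
             if k == "diff" and chain[0][1] < t <= target_time]
    if diffs:
        chain.append(("diff", max(diffs)))
    start = chain[-1][1]
    logs = sorted(t for k, t in backups if k == "log" and t > start)
    for t in logs:
        chain.append(("log", t))
        if t >= target_time:
            break  # this log backup spans the target time
    return chain
```

For a Monday full, a noon differential, and 15-minute log backups, the chain is the full, the noon diff, and the logs bracketing the deletion time.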
- Loss of EC2 instance and all EBS volumes.
In a disaster event, we could lose the EC2 instance and its storage. We will recover the EC2 instance and root volume as described in scenario 5, then create the other three EBS volumes and copy the backups from S3 to them. Bring up SQL Server and restore each database from its backups.