Overview

This document describes the hosting strategy adopted by the Databases Services team for hosting PostgreSQL databases in the Amazon cloud (AWS). The document describes the factors that affect the tenancy of a particular database as well as the guidelines for deploying, migrating, monitoring, recovering and backing up PostgreSQL databases in AWS.

Database Tenancy

General Considerations

The following are the major factors to take into consideration when migrating a PostgreSQL database to AWS or implementing new PostgreSQL database in AWS:

PostgreSQL database version specific features.
The size of the database.
The sensitivity of the data in the database.
The availability expectations of the database.
The compute requirements of the database.
The unique behavioral and/or security requirements of the database or application that the database supports.

On a case-by-case basis an application’s database will be evaluated against the above criteria to determine the tenancy of the database.

To maximize our investment in the AWS infrastructure the default implementation will be a multi-tenant PostgreSQL Version 9.4 db.t2.small RDS instance. Exceptions to the default implementation will be considered as described below.

Database Feature Requirements

PostgreSQL database should be no more than 2 releases behind the latest available from AWS on a db.t3.small RDS instance subject to application/vendor limitation

Database Size

Applications with large databases (>100 GiB) may be a candidate for a separate AWS server depending on performance requirements of the application or any special backup/recovery requirements.

Data Sensitivity

Sensitive data (to be determined by the customer following the published classifications of data by InfoSec) will not be mixed on the same server with non-sensitive data.

Availability Requirements

If a customer has a demonstrated need for a highly available database a HA instance will be made available. The bar for PostgreSQL HA in AWS will be very high given the low recovery time we have demonstrated in a single-instance recovery scenario (as low as 21 minutes to recover a server).

Compute Requirements

When an application’s compute requirements will affect other database’s performance on a shared instance it will either need to be migrated to a separate instance or the existing server will be upgraded. The current AWS cost model generally makes a new instance more cost effective.

Client/Third Party Needs

Certain third party applications are written so that they require elevated privileges to run, employ sub-standard security, or are otherwise “badly behaved”. These applications will be candidates for migration to a dedicated server or to a designated less secure server.

Database Deployment

Using Amazon’s Elastic Compute Cloud Service

Deploying PostgreSQL on an Elastic Compute Cloud (EC2) instance offers the greatest flexibility. Because EC2 gives full control of the instance installation all of the features of PostgreSQL is possible.

Advantages and limitations of EC2

Using EC2 leverages the infrastructure and services built by the OIT in AWS. Server level backups and other shared services are available. Full access to the underlying operating system allows for custom processing such as data imports and full access to the PostgreSQL management features allows for better performance tuning. The limitation of PostgreSQL on EC2 is that Elastic Block Storage (EBS) volumes cannot be resized as storage needs increase, either new volumes must be added or a downtime taken to transfer the data to a new, larger volume.

Using Amazon’s Relational Database Service

Amazon’s PostgreSQL RDS offering provides a managed platform for hosting PostgreSQL in our environment. For OIT-managed PostgreSQL databases, we have a strong preference for using the Amazon RDS to reduce management overhead of the database.

Advantages and Limitations of RDS

For non-OIT departments that want to implement an RDS PostgreSQL on their own we can offer the following analysis of the service and link to Amazon’s published documentation.

“Amazon RDS makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud. With Amazon RDS, you can deploy scalable PostgreSQL deployments in minutes with cost-efficient and resizable hardware capacity. Amazon RDS manages complex and time-consuming administrative tasks such as PostgreSQL software installation and upgrades; storage management; replication for high availability and read throughput; and backups for disaster recovery.” - aws.amazon.com/rds/postgresql (04/2016)

There are some limitations that should be considered before committing to the RDS platform. It is a very good basic database platform, but disables most advanced management features, reserving them for itself.

RDS backups are limited to 35 days, if you have archive needs longer than that you will need to create a custom backup scheme outside what RDS provides.
Terminating an RDS instance will, by default, delete all the backups. You also need to explicitly request a backup when stopping your instance to avoid data loss.
You have no access to the underlying operating system resources, if you have jobs that rely on external data files (imports/exports) or batch scripts they cannot be used on an RDS instance. This also restricts access to many instance administrative functions.

Item 2 means that you need to be very careful when restoring from backup so that you do not lose backup history. The recovery method is to create a NEW instance with its own 35-day backup history, removing the original instance removes all its associated backups which means you need to keep it until the new instance catches up.

Taking into account the advantages and limitations of Amazon’s RDS will help you determine if it is a viable solution for your non-OIT supported database needs.

PostgreSQL deployment on EC2

Server Commissioning

Follow the standard commissioning process to request a new server.

PostgreSQL deployment on RDS

To deploy an RDS instance follow the steps outlined in the migration section below. Also make sure that the Core Services team is aware of the new instance so that is included in the master server list and a Reserved Instance (RI) purchased, if needed.

High Availability

High availability is not a feature of PostgreSQL currently being offered, although if future needs dictate it will be considered.

Migration Methodology

How to migrate a PostgreSQL database using Amazon Migration Services

Amazon’s migration services are available to migrate databases from an on-premises source to an AWS hosted target (or vice versa) in a homogeneous or heterogeneous fashion

How to migrate a PostgreSQL database to EC2

The migration of PostgreSQL databases to AWS is relatively straightforward. The database(s) for the application will be backed up, the backups transferred to AWS and recovered to the appropriate server as determined by the above criteria. Any server-level objects will have to be identified, scripted, and recreated on the target server.

Determine the target PostgreSQL instance, or commission a new one per the criteria outlined in the Database Tenancy and Database Deployment sections.
Backup the source database(s) using a native PostgreSQL backup, compression is advised to reduce transfer time to the cloud.
Script all needed non-database objects.
Restore the source database backup(s) to the target server.
Run the scripts created in step 3 to create the non-database objects.
Review and test.

How to migrate a PostgreSQL database to RDS.

Build an RDS, or decide which one to target.
Backup source database using PostgreSQL native backup, compression is advised to reduce transfer time to the cloud.
Script all needed non-database objects such as Logins.
Restore the source database backup to the RDS instance using the command-line utilities
Run the scripts created in step 3
Review and test.

Database Monitoring

Basic Cloud Watch metrics and alerts will be configured.

Database Backup and Recovery

Data backups

Native backup tools will be used on the following schedule

Weekly Full backups on Sunday, retained locally for 8 days, retained in S3 for three months

Daily differential backups, retained locally for 8 days, retained in S3 for two weeks

Monthly Full backups on the first of the month retained locally for 8 days and in S3 for 18 months

Transaction log backups will be made in intervals not to exceed 15 minutes in production or 4 hours in non-production systems. Transaction log backup retention will match differential backup retention.

Retention on the local machine is controlled by the backup script, retention in S3 will be driven by S3 lifecycle policies.

Server Backups

Currently Core Services uses Cloud Protection Manager to snapshot EC2 instances on a daily basis. Through the configuration, the snapshot can include all mounted volumes (disks) or only include root volume (ie, C: drive). The snapshot backups will be kept for 30 days for all environments (dev, test and prod) within local region; the snapshot backups for production instances will also be stored cross regionally with a separate AWS account (backup account).

Recovery Scenarios

Seven likely failure and recovery scenarios are outlined below:

An individual database crashes.

The database will be recovered from its backup stored locally. Depending on the size of the database, recovery time can be few minutes to an hour.
Data volume crashes.

First, a new EBS volume will be created and mounted to the EC2 instance. This will take less than 10 minutes. Additional 10 plus minutes is needed to create folder layout and bring up PostgreSQL. Then each of databases will be recovered from its backup (similar to scenario 1). The total down time can be 25 minutes plus, which depends on the number of databases on the PostgreSQL and the size of each database.
Backup Volume crashes.

A new EBS volume will be created and mounted to the EC2 instance. Then re-generate folder layout and perform a full database backup on all databases. Copy backup files from S3 to the backup volume if there is need. The total dba working time will be less than 20 minute and PostgreSQL down time is zero.
EC2 instance crashes but the root volume is intact.

In the event of a hardware crash where no EBS volumes are affected the instance will need to be rebooted. Total downtime will be less than 10 minutes.
Both EC2 instance and root volume (C: drive) crash.

Recovery option 1: The snapshots of the EC2 instance and root volume, which were taken by Core Services, will be used to re-instantiate a new EC2 instance and the EBS volumes re-mounted to the new instance. The total down time is unknown at this point.

Recovery option 2: provision a new EC2 from Cloud Formation script. Install PostgreSQL along with upgrades/patches. Mount EBS volumes (D:, E:, F:); bring up the PostgreSQL. The total down time will be less than 30 minutes.
A table was accidentally deleted.

Restore the database from its backups taken prior to the time when the table was deleted; give the restored database a different name. Import the table into its original database. The total dba working time will be about 10 minutes plus, which depends on the size of the database and the number of backups needed. The total PostgreSQL down time is zero.

If the table was deleted 8 days ago, the specific backup files (a full backup, few differential and transaction log backups) will be copied from S3 to the EBS volume. Then we perform the steps stated above.
Loss of EC2 instance and all EBS volumes.

In a disaster event, we could lose EC2 instance and storage. We’ll recover the EC2 instance and root volume in the way described in scenario 5. Then create other three EBS volumes and copy the backups from S3 to EBS volumes. Bring up PostgreSQL and restore each of databases from its backups.

Backup and Recovery (RDS)

Backup

For RDS instances snapshot retention will be set at the maximum of 35 days.

Recovery

Recovery of a failed RDS instance will use the point-in-time recovery provided by the AWS console. This will require communicating a new endpoint address to the end users.

Recovery of a single database will require a point-in-time recovery of the instance, then a native dump and restore of the individual database.

Disaster Recovery

The standard scenario for DR is the logical destruction of the DCND account. In this case the RDS and its snapshots will be gone. With no mechanism to move snapshots to DRND, recovery will be dependent on out-of-band database dumps taken by the end users or application admin.

Database Security Standards

Database security standards will be based on Center for Internet Security standards adapted for the university’s needs.