Migrating from Oracle to Snowflake

This document is meant to serve as a migration plan for SQL Server databases and tables from on-premises and/or cloud IaaS to the Snowflake Data Cloud.

Oracle is a relational database (RDBMS) that frequently has significant licensing costs and is used to house a company’s most crucial data. In addition, Oracle needs a lot of setup and upkeep work, both for the hardware (network, storage, OS patching, configuration) and the software (configuration, updates). If the data on these systems has to be used for analytical reasons, there are frequently additional licensing costs for the OLAP counterparts, or there is an increased danger that analytical workloads will interfere with OLTP systems that are supporting end users. These factors might soon add up to a pricey product that puts a strain on the IT crew and the budget.

Beyond licensing fees, there are other motives for wishing to migrate, some of which were mentioned above. Although the migration strategy outlined here is applicable to far more systems than just Oracle, the focus of this document is on Oracle migrations.

Why Snowflake Over Oracle?

The Snowflake Data Cloud was designed with the cloud in mind, and allows its users to interface with the software without having to worry about the infrastructure it runs on or how to install it. Between the reduction in operational complexity, the pay-for-what-you-use pricing model, and the ability to isolate compute workloads there are numerous ways to reduce costs associated with performing analytical tasks. Some other benefits and capabilities include:

Data sharing: Easily share data securely within your organization or externally with your customers.
Zero copy cloning: Create multiple ‘copies’ of tables, schemas, or databases without actually copying the data. This saves on the time to copy and reduces data storage costs.
Separate compute and storage: Scale your compute and storage independent of one another, and isolate compute power for jobs that need their own dedicated warehouse.
No hardware provisioning: No hardware to provision, just a t-shirt sized warehouse available as needed within seconds.

Snowflake Considerations

Snowflake is built on public cloud infrastructure, and can be deployed to Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP). When moving to Snowflake, there are some considerations regarding which cloud platform to use. Refer to the Snowflake documentation to assist with choosing the right platform for your organization.

For the purposes of this migration plan, AWS technologies will be used when options are available.

Migration Goals and Requirements

The primary goals of this migration are to reduce the costs and operational burden of running licenced OLAP database systems on-premises, in favor of a cloud native, pay-for-what-you-use SaaS counterpart. It seems only natural that a migration process would have similar goals. This will be a foundational concept applied throughout the migration plan.

Further, it would be difficult to build a solution without some sort of requirements. There is a lot of value in knowing requirements up front, as they will help to drive the design of a system and can often be used to create SLA’s by which the solution can be evaluated against. This solution must:

Migrate specific databases/schemas/tables (defined by configuration) for the source relational database systems – Oracle.
Migrate to an equivalent target database/schema/table in Snowflake.
Migrate both retrospective and prospective data.
Migrate prospective data from the source database to Snowflake with a latency that is under five minutes.

Plan for Migration

The data flow for migrating any source system to Snowflake can be distilled into the following three box architecture:

Snowflake Migration three-box Architecture

This picture is pretty straightforward, but immediately introduces something not obvious; a small pitstop for the source data in an ‘external stage’. This is because Snowflake has no native tools for pulling data directly from a database. Instead it uses the concept of a Stage, which can be either internal or external, as a point to load data from and unload data to.

Snowflake Stages are effectively pointers to public cloud object storage locations and metadata about the files in that location. These object storage locations are ideal for staging data because they scale nearly infinitely and often large enterprises would build data lakes on this storage, making it convenient to get data into Snowflake and allow for data in Snowflake to be easily contributed back to the lake.

The two arrows represent distinct processing steps of the migration that will be outlined in the coming sections. Per the migration goals, each of these steps should attempt to minimize costs by utilizing cloud infrastructure (preferring SaaS solutions to PaaS, PaaS to IaaS, and IaaS to on-premises) and allowing cost to scale with the migration.

Migration – Source to Stage

Snowflake has many technology partners that assist with moving data from a source database to an object storage location that can be used as an external stage. These include FiveTran, Attunity, Informatica, HVR, and Streamsets, all of which phData has past experience with, but this migration will instead focus on a cloud native solution from AWS, the Database Migration Service (DMS).

AWS DMS is a data migration service that supports highly configurable data migrations and replications between various sources and targets. Oracle is included as an available source system, making this the perfect tool to handle data collection from our source Oracle database. Targets include Apache Kafka, multiple AWS services like S3, DynamoDB, and Kinesis but for the purpose of this paper we are going to focus on S3 for Snowflake consumption.

AWS DMS appears to meet the migration goals defined above. It is a SaaS solution by AWS that is primarily setup by configuration. Costs scale with the number and size of its replication instances needed for a migration. It’s replication tasks have settings for bulk load migration to support retrospective data, as well as CDC for prospective data. It also supports S3 as a write target which is going to be used as the stage for Snowflake. The replication tasks achieve near real time latency for CDC by using Oracles’s native API to read archived redo logs as they are written to.

Building this step of the migration is primarily done by configuring the AWS DMS to produce the desired results, but it isn’t the only thing that needs to be done. Ensuring that the DMS infrastructure is authorized to access both the Oracle database and the S3 target is important, as is setting up the source database to produce the data that is needed for migration.

DMS Infrastructure

Replication Instance: This is the infrastructure that is running a replication task. Creation of this resource will require network configuration (VPCs, Subnets, SecurityGroups) and instance class/size details. Additional considerations include whether or not this instance is publicly available and whether or not it is highly available.
Replication Task: This is the definition of how data is replicated from the source to the target. The type of the migration is defined here, CDC and/or Full (bulk) Load, as well as settings that include (but isn’t limited to) logging, CDC tuning, error handling, bulk load tuning, validation, and transformation. This is where a majority of configuration of this step of the migration will be. 
Source Endpoint: A holder for the connection details for the source of a replication task. This will be the Oracle connection details such as: username, password, hostname, port, and database.
Target Endpoint: Similar to the source endpoint, this is a holder of connection details for the target. For S3 this would include bucket, object key (folder) and IAM Role, but also includes additional configuration details such as encryption, compression, max file size and data format.
S3 Bucket: The storage location for the external stage for Snowflake.
IAM Role (not shown): A target of S3 is authorized by IAM role, rather than a database user, and so it is necessary to include an IAM Role. This role must have access to write to the S3 location, and have a trust policy that allows the DMS service the ability to assume the role.

Authentication and Authorization

The replication task must authenticate against both the Oracle database and the S3 bucket to perform the first step of the migration. Authentication into both systems is specified via the respective endpoints. For Oracle, the database username and password provided would be for a user specifically created for use by AWS DMS. For the target, the IAM Role is the identity, and must have an attached trust policy that specifies AWS DMS as a valid principal.

Authorization for the Oracle database is specified via a series of GRANTS to allow access to the database redo logs. For the target, the IAM role must include a policy that allows it to write to the target S3 bucket, and the target S3 bucket’s policy must allow the IAM role to write to it.

See the appendix for further details on authentication and authorization to these systems.

Source Database Preparation

Setting up a source database to be properly utilized by AWS DMS requires a bit more configuration than just providing it a user. In general, the following needs to be set up:

For one time batch, it is typically sufficient to allow the DMS database user to query the tables that need to be migrated.
To include ongoing replication requires database specific actions to enable data capture and provide access to this data to the DMS database user.

For Oracle, the ongoing replication would be handled by enabling redo logs, persisting them for a reasonable amount of time, and allowing the DMS database user to access them. How this is achieved differs from vendor to vendor, and from self-managed to hosted setups.

See the appendix for details on setting up an Oracle database to work with AWS DMS.

Replication Output

Replication tasks will output to S3 with a default data format of comma separated values (.csv), but the format is configurable with Parquet being recommended. This can be updated via the extra connection attributes provided during Target Endpoint creation.

When CDC is involved, it is important to note a couple points about the data that is output.

The schema for a given table will reflect that table’s schema, but will also include an additional ‘timestamp’ field and ‘op’ field to reflect the point in time a change happened and the operation that the change reflects, either I[nsert], U[pdate], or D[elete].
The data is an append-only log of ongoing changes added to the end of the bulk load of data. This matters because this data will need to be reassembled in Snowflake for the tables to be reflective of the data in the source tables.

Migration – Stage to Snowflake

Once data is in the external stage, the remainder of the migration can be achieved using Snowflake resources and capabilities.

Even though Snowflake is the primary requirement for this solution, it is still valid to evaluate it against the migration goals. Snowflake is a SaaS solution that builds data warehouse systems using SQL commands. With Snowflake, costs accrue for storage use and compute use on a per-second basis. Snowflakes pipe and task objects support building low latency data pipelines. Finally, the data landing in S3 can be treated the same through the Snowflake pipeline, whether retrospective or prospective.

Building this step of the migration involves configuring a couple components to enable authorized access to the data in S3 and to ensure timely delivery to the Snowflake pipe. The remainder of this step would then be built by executing DDL to build out the pipe, tables, stream and task.

Snowflake and AWS Infrastructure

S3 Bucket: The storage location for the external stage for Snowflake.
Storage Integration: The storage integration holds the IAM entity that is authorized to access the S3 location that holds the output of the first step of the migration.
External Stage: The Snowflake object that refers to the actual S3 storage location that holds the output of the first step of the migration. A storage integration is usually provided when creating an external stage, which acts as the authentication to access the S3 bucket.
SQS Queue: Snowflake has internal SQS Queues that are used to provide Snowflake pipes with the data they need in near real time. Additional configuration to an S3 bucket, in the form of adding a bucket event notification, is necessary to get pipes working automatically. As the first step of the migration moves CDC data into the S3 bucket, the event notifications will push that data into the SQS queue, which will trigger the Snowflake pipes into action.

Authentication and Authorization

Snowflake has the concept of a Storage Integration that it uses to provide credential-less access to AWS S3 external stages. These are setup using an IAM Role that is provided access to the S3 location. This role is then set up with a trust policy that allows a Snowflake IAM role to assume the original role, effectively delegating its abilities to the Snowflake role.

The process of properly creating a storage integration has a few back and forth steps, which is essentially creating a long term authentication to Snowflake via the trust policy. The policies attached to the IAM role and the S3 bucket are what defines the authorization.

Migration Pipeline

This step of the migration involves two Snowflake tables. The first is a representation of the data that mirrors what is in S3, an append-only table that contains the operations performed against the original source database, the ‘change table’. The second is reflective of the (mostly) current state of the original source database, the ‘reporting table’, which is the final target for the data in the migration.

The pipe, often referred to as Snowpipe in Snowflake marketing materials, is used to keep the change table up to date with the latest data from S3. A pipe is effectively a COPY INTO statement that is listening for changes to data in a stage. As new CDC data lands in S3 from the first step of the migration, S3 object notifications signal the pipe of the new data. Upon receiving a signal, the pipe queues its COPY INTO command for a Snowflake managed warehouse to copy the data into the change table. Charges for the warehouse are billed per second with a small overhead for every 1000 files.

Table Streams keep track of DML changes to a table allowing for action to be taken against the delta. Streams can be queried just like a table can, and the contents of the stream are cleared when queried. The stream on the change table is keeping track of the changes to the change table, which will only be insert operations for this migration, as the data coming from S3 is either the data from the bulk load or the CDC data from the source database.

Tasks are scheduled execution of a specified SQL statement. Creation of a task requires providing a warehouse with which the query will execute. The migration for this step uses a task that starts up periodically, checks to see if the stream has any new changes, and executes a MERGE INTO statement to the reporting table.

Together these Snowflake resources work together to take the CDC data as it lands in S3 and reassemble it to a representation that mimics the source, with some degree of latency.