As the most widely used cloud data warehouse, Amazon Redshift makes it easy and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.
We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless lets you run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What's New. In addition to improving performance and scale, you can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.
To take advantage of the performance, security, and scale of Amazon Redshift, customers want to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.
Solution overview
The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.
The solution consists of two stages:
- Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
- Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift
For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database, using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file and runs the corresponding COPY commands on the Amazon Redshift database, using Redshift compute capacity and parallelism to load the data from the same S3 bucket.
Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.
Because all the heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.
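For illustration, the following is a minimal sketch of the two kinds of COPY commands involved; the table, stage, bucket, and IAM role names are placeholders, not the exact statements the solution generates.
```sql
-- Snowflake: unload one table in parallel to an external stage backed by the S3 bucket
-- (the stage name unload_to_s3 and the Parquet format are assumptions for this sketch).
COPY INTO @unload_to_s3/lineitem/
  FROM snowflake_sample_data.tpch_sf1.lineitem
  FILE_FORMAT = (TYPE = PARQUET);

-- Amazon Redshift: load the staged files in parallel into the target table.
COPY tpch_sf1.lineitem
FROM 's3://<data-s3-bucket>/lineitem/'
IAM_ROLE '<redshift-iam-role-arn>'
FORMAT AS PARQUET;
```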
The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.
The high-level steps to implement the solution are as follows:
- Generate the Snowflake SQL file.
- Deploy the CloudFormation template to provision the required resources and permissions.
- Provide Snowflake access to the newly created S3 bucket.
- Run the AWS Glue workflow to migrate the data.
Prerequisites
Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.
If you don't provide the latest version of the package, the CloudFormation template uses a pre-built .whl file, which may not be on the most current version of the Snowflake Connector for Python.
By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. A stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3; it accepts the database name, schema name, and stage name as parameters.
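As a rough illustration of what such a procedure produces, a query along the following lines generates one COPY INTO command per table in the schema. The stage name (unload_to_s3) and the Parquet file format are assumptions here; the actual procedure takes the database, schema, and stage names as parameters.
```sql
-- Sketch only: build one Snowflake COPY INTO command per table, unloading each table
-- to its own prefix in the external stage.
SELECT 'COPY INTO @unload_to_s3/' || table_name || '/ FROM '
       || table_catalog || '.' || table_schema || '.' || table_name
       || ' FILE_FORMAT = (TYPE = PARQUET);' AS copy_command
FROM snowflake_sample_data.information_schema.tables
WHERE table_schema = 'TPCH_SF1'
  AND table_type = 'BASE TABLE';
```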
Deploy the required resources and permissions using AWS CloudFormation
You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.
- Sign in to the AWS Management Console, preferably as an admin user.
- Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
- Choose Launch Stack:
- Choose Next.
- For Stack name, enter a meaningful name for the stack, for example, blog-resources.
The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.
- For Snowflake Unload SQL Script, this defaults to the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
- For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
- For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
- For Snowflake Account Name, enter your Snowflake account name.
You can use a query such as the following in Snowflake to return your Snowflake account name:
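```sql
-- One option: CURRENT_ACCOUNT() returns the account locator for the current session.
-- Depending on how you connect, you may need the full account identifier
-- (organization and account name, or locator plus region) instead.
SELECT CURRENT_ACCOUNT();
```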
- For Snowflake Username, enter your user name for connecting to the Snowflake account.
- For Snowflake Password, enter the password for the preceding user.
- For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.
Make sure the aforementioned user has access to the warehouse.
- For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
- For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.
- For VPC CIDR Block, enter the desired CIDR block for the Redshift cluster. The default is 10.0.0.0/16.
- For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
- For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
- For Redshift Load SQL Script, this defaults to the S3 location (URI) of a SQL file that loads the sample data from Amazon S3 into Redshift.
A database view in Redshift is used to dynamically generate the Redshift COPY commands required to migrate the dataset from Amazon S3; it accepts the schema name as the filter criterion.
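As a hedged sketch of that idea (not necessarily the exact view deployed by the template), a view along these lines emits one COPY command per user table; the {{bucket}} and {{iam_role}} placeholders stand for values the Glue job supplies through its PARAMS argument.
```sql
-- Sketch only: generate one Redshift COPY command per user table. The load script can
-- filter this view by schema name and run each returned statement.
CREATE OR REPLACE VIEW public.v_generate_copy_command AS
SELECT schemaname AS table_schema,
       tablename  AS table_name,
       'COPY ' || schemaname || '.' || tablename
       || ' FROM ''s3://{{bucket}}/' || tablename || '/'''
       || ' IAM_ROLE ''{{iam_role}}'''
       || ' FORMAT AS PARQUET;' AS copy_command
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
```
For example, the load script could run SELECT copy_command FROM public.v_generate_copy_command WHERE table_schema = 'tpch_sf1'; and execute each statement it returns.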
- For Redshift Database Name, enter your desired database name, for example, dev.
- For Number of Redshift Nodes, enter the desired number of compute nodes, for example, 2.
- For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
- For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
- For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.
- Choose Next.
- Review and choose Create stack.
It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified, for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.
Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket
In order for Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage for Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.
When you create an external stage in Snowflake, use the value of TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.
Make sure to name the external stage unload_to_s3 if you're migrating the sample data using the default scripts provided in the CloudFormation template.
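For reference, the following is a minimal sketch of the stage definition, assuming a storage integration (here named s3_int) has already been configured per those instructions; <TargetDataS3Bucket> stands for the bucket name from the stack outputs.
```sql
-- Sketch only: external stage pointing at the data S3 bucket through a storage integration.
CREATE OR REPLACE STAGE unload_to_s3
  URL = 's3://<TargetDataS3Bucket>/'
  STORAGE_INTEGRATION = s3_int;
```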
Convert Snowflake tables to Amazon Redshift
You can simply run DDL statements to create the TPCH_SF1 schema objects in Amazon Redshift; a partial sketch follows. You can also use the AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.
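For illustration, the table layouts follow the TPC-H specification; this sketch shows the schema and two of the eight tables (the remaining tables follow the same pattern).
```sql
-- Sketch of the target objects in Amazon Redshift for the sample migration.
CREATE SCHEMA IF NOT EXISTS tpch_sf1;

CREATE TABLE IF NOT EXISTS tpch_sf1.region (
    r_regionkey INTEGER NOT NULL,
    r_name      VARCHAR(25),
    r_comment   VARCHAR(152)
);

CREATE TABLE IF NOT EXISTS tpch_sf1.nation (
    n_nationkey INTEGER NOT NULL,
    n_name      VARCHAR(25),
    n_regionkey INTEGER,
    n_comment   VARCHAR(152)
);
```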
Run an AWS Glue workflow for data migration
When you're ready to start the data migration, complete the following steps:
- On the AWS Glue console, choose Workflows in the navigation pane.
- Select the workflow to run (<stack name>-snowflake-to-redshift-migration).
- On the Actions menu, choose Run.
- To check the status of the workflow, choose the workflow and, on the History tab, select the Run ID and choose View run details.
- When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of the data.
- To get the row counts of all tables migrated from Snowflake to Amazon Redshift, run a row count query in Amazon Redshift, such as the one sketched after these steps. Make sure to adjust the schema filter value accordingly if you're not migrating the sample data.
- To compare and validate the data, run the corresponding row count query in Snowflake (also sketched after these steps).
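A sketch of the Amazon Redshift row count check, assuming the sample data landed in the tpch_sf1 schema (tbl_rows comes from table metadata, so treat it as approximate):
```sql
-- Row counts per table on the Amazon Redshift side.
SELECT "table" AS table_name, tbl_rows
FROM svv_table_info
WHERE "schema" = 'tpch_sf1'
ORDER BY "table";
```
And a corresponding sketch for the Snowflake side, using the sample database's information schema:
```sql
-- Row counts per table on the Snowflake side.
SELECT table_name, row_count
FROM snowflake_sample_data.information_schema.tables
WHERE table_schema = 'TPCH_SF1'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;
```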
Clean up
To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.
Conclusion
In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can often be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing the processing down to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.
With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you're interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.
Appendix: Resources deployed by AWS CloudFormation
The CloudFormation stack deploys the following resources in your AWS account:
- Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
- Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
- AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store the credentials for Snowflake and Amazon Redshift.
- VPC endpoints – Two VPC endpoints are deployed to establish a private connection from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
- IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so that they're not as permissive as presented in this post (which were set for simplicity and demonstration). In particular, AWS Glue and Amazon Redshift don't require all the actions granted in the *FullAccess policies, which could be considered overly permissive.
- Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn't publicly accessible.
- AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
- AWS Glue jobs – Two AWS Glue Python shell jobs are created:
  - <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is provided in S3. The Snowflake job accepts two parameters:
    - SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
    - SECRET – The Secrets Manager ARN that stores the Snowflake connection details.
  - <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script link is provided in S3. The Amazon Redshift job accepts three parameters:
    - SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide a custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
    - SECRET – The Secrets Manager ARN that stores the Amazon Redshift connection details.
    - PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided, separated by a comma.
- AWS Glue workflow – The orchestration of the Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed through an AWS Glue workflow. The workflow <stack name>-snowflake-to-redshift-migration is run later to perform the actual data migration.
About the Authors
Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.
Julia Beck is an Analytics Specialist Solutions Architect at AWS. She helps customers validate analytics solutions by architecting proof of concept workloads designed to meet their specific needs.