
Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell


As the most widely used cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.

We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a broad range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What's New. In addition to improving performance and scale, you can also gain up to 3 times better price performance with Amazon Redshift than other cloud data warehouses.

To take advantage of the performance, security, and scale of Amazon Redshift, customers want to migrate their data from their existing cloud warehouse in a way that is both cost-optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all of the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.

Solution overview

The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.

Architecture

The solution consists of two stages:

  • Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
  • Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift

For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket and runs the relevant COPY commands on the Snowflake database, using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file and runs the corresponding COPY commands on the Amazon Redshift database, using Redshift compute capacity and parallelism to load the data from the same S3 bucket.
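The following is a minimal sketch of the extract pattern, not the packaged job script: a Python shell job downloads the unload SQL file from Amazon S3 and runs each COPY command on Snowflake through the Snowflake Connector for Python. The bucket, key, connection values, and statement-splitting logic shown here are illustrative assumptions; in the deployed solution, the file location is passed as a job parameter and credentials are resolved from AWS Secrets Manager.

import boto3
import snowflake.connector

# Read the unload SQL script from Amazon S3 (bucket and key are placeholders)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="<script-bucket>", Key="snowflake_unload.sql")
script = obj["Body"].read().decode("utf-8")

# Connect to Snowflake; in the deployed jobs these values come from Secrets Manager
conn = snowflake.connector.connect(
    account="<account_name>",
    user="<user_name>",
    password="<password>",
    warehouse="<warehouse_name>",
)

# Run each COPY INTO @stage command; Snowflake compute performs the unload in parallel
cur = conn.cursor()
for statement in [s.strip() for s in script.split(";") if s.strip()]:
    cur.execute(statement)
conn.close()

The second job follows the same pattern against Amazon Redshift, so AWS Glue only supplies the lightweight Python runtime while the warehouses do the data movement.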

Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all the heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

Glue Workflow

The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

  1. Generate the Snowflake SQL file.
  2. Deploy the CloudFormation template to provision the required resources and permissions.
  3. Provide Snowflake access to the newly created S3 bucket.
  4. Run the AWS Glue workflow to migrate the data.

Prerequisites

Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.

If you don't provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be the most current version of the Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as parameters.

CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
   returns varchar not null
   language javascript
   as
   $$
var return_value = "";
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema='" + SCHEMA_NAME + "'" ;
   var sql_statement = snowflake.createStatement(
          {
          sqlText: sql_query
          }
       );
/* Creates result set */
var result_scan = sql_statement.execute();
/* Builds one COPY INTO command per table, unloading gzipped pipe-delimited files to the stage */
while (result_scan.next())  {
       return_value += "COPY INTO @" + STAGE_NAME + "/" + result_scan.getColumnValue(3) + "/";
       return_value += " FROM " + result_scan.getColumnValue(1) + "." + result_scan.getColumnValue(2) + "." + result_scan.getColumnValue(3);
       return_value += " FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
       return_value += "\n";
       return_value += "OVERWRITE = TRUE;";
       return_value += "\n";
       }

return return_value;
$$
;
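For example, the following sketch (run outside the CloudFormation-provisioned jobs, with placeholder connection values) calls the stored procedure for the default sample dataset and the unload_to_s3 stage, then saves the generated COPY commands to a SQL file that you can upload to Amazon S3 and reference in the Snowflake Unload SQL Script parameter:

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_name>",
    user="<user_name>",
    password="<password>",
    warehouse="<warehouse_name>",
    database="<database_with_generate_copy>",
)

# Generate the COPY INTO commands for every table in the sample schema
cur = conn.cursor()
cur.execute("CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3')")
unload_sql = cur.fetchone()[0]

# Save the script locally; upload it to S3 to use it as the Snowflake Unload SQL Script
with open("snowflake_unload.sql", "w") as f:
    f.write(unload_sql)
conn.close()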

Deploy the required resources and permissions using AWS CloudFormation

You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

  1. Sign in to the AWS Management Console, ideally as an admin user.
  2. Select your desired Region, ideally the same Region where your Snowflake instance is provisioned.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a meaningful name for the stack, for example, blog-resources.

The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.

  1. For Snowflake Unload SQL Script, the parameter defaults to the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
  2. For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
  3. For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
  4. For Snowflake Account Name, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

SELECT CURRENT_ACCOUNT();

  1. For Snowflake Username, enter your user name to connect to the Snowflake account.
  2. For Snowflake Password, enter the password for the preceding user.
  3. For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

  1. For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
  2. For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.

CFN Param Snowflake

  1. For VPC CIDR Block, enter the desired CIDR block of the Redshift cluster. The default is 10.0.0.0/16.
  2. For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
  3. For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
  4. For Redshift Load SQL Script, the parameter defaults to the S3 location (URI) of a SQL file that migrates the sample data in S3 to Redshift.

The following database view in Redshift is used to dynamically generate the Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criteria.

CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
    schemaname ,
    tablename  ,
    seq        ,
    ddl
FROM
    (
        SELECT
            table_id   ,
            schemaname ,
            tablename  ,
            seq        ,
            ddl
        FROM
            (
                --COPY TABLE
                SELECT
                    c.oid::bigint  as table_id   ,
                    n.nspname      AS schemaname ,
                    c.relname      AS tablename  ,
                    0              AS seq        ,
                    'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --COPY TABLE continued                
                UNION                
                SELECT
                    c.oid::bigint as table_id   ,
                    n.nspname     AS schemaname ,
                    c.relname     AS tablename  ,
                    2             AS seq        ,
                    '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --END SEMICOLON                
                UNION                
                SELECT
                    c.oid::bigint as table_id  ,
                    n.nspname     AS schemaname,
                    c.relname     AS tablename ,
                    600000005     AS seq       ,
                    ';'           AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r' 
             )
        ORDER BY
            table_id  ,
            schemaname,
            tablename ,
            seq 
    );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
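To illustrate how the generated script is consumed, the following sketch (an assumed pattern, not the packaged load script, using the redshift_connector package) substitutes the ${1} and ${2} placeholders with the Amazon Redshift IAM role and the staging bucket path passed through the PARAMS job argument, and then runs the COPY commands on Amazon Redshift:

import redshift_connector

# Values supplied through the PARAMS job argument in the deployed solution (placeholders here)
iam_role = "arn:aws:iam::<account-id>:role/<redshift-copy-role>"   # replaces ${1}
data_path = "s3://<TargetDataS3Bucket>/"                           # replaces ${2}

# redshift_load.sql holds the output of the v_generate_copy view
with open("redshift_load.sql") as f:
    load_sql = f.read().replace("${1}", iam_role).replace("${2}", data_path)

conn = redshift_connector.connect(
    host="<cluster-endpoint>",
    database="dev",
    user="<user_name>",
    password="<password>",
    port=5439,
)
cur = conn.cursor()
for statement in [s.strip() for s in load_sql.split(";") if s.strip()]:
    cur.execute(statement)   # Redshift loads each table in parallel from the staged files
conn.commit()
conn.close()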

  1. For Redshift Database Name, enter your desired database name, for example, dev.
  2. For Number of Redshift Nodes, enter the desired number of compute nodes, for example, 2.
  3. For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
  4. For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
  5. For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.

CFN Param Redshift 1 CFN Param Redshift 2

  1. Choose Next.
  2. Review and choose Create stack.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.

Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket

In order for Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage that grant Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.

When you create an external stage in Snowflake, use the value for TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.

CF Output

Make sure to name the external stage unload_to_s3 if you're migrating the sample data using the default scripts provided in the CloudFormation template.

Convert Snowflake tables to Amazon Redshift

You can simply run the following DDL statements to create the TPCH_SF1 schema objects in Amazon Redshift. You can also use the AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.

CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
  c_custkey int8 not null ,
  c_name varchar(25) not null,
  c_address varchar(40) not null,
  c_nationkey int4 not null,
  c_phone char(15) not null,
  c_acctbal numeric(12,2) not null,
  c_mktsegment char(10) not null,
  c_comment varchar(117) not null,
  Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
  l_orderkey int8 not null ,
  l_partkey int8 not null,
  l_suppkey int4 not null,
  l_linenumber int4 not null,
  l_quantity numeric(12,2) not null,
  l_extendedprice numeric(12,2) not null,
  l_discount numeric(12,2) not null,
  l_tax numeric(12,2) not null,
  l_returnflag char(1) not null,
  l_linestatus char(1) not null,
  l_shipdate date not null ,
  l_commitdate date not null,
  l_receiptdate date not null,
  l_shipinstruct char(25) not null,
  l_shipmode char(10) not null,
  l_comment varchar(44) not null,
  Primary Key(L_ORDERKEY, L_LINENUMBER)
)  ;

CREATE TABLE nation (
  n_nationkey int4 not null,
  n_name char(25) not null ,
  n_regionkey int4 not null,
  n_comment varchar(152) not null,
  Primary Key(N_NATIONKEY)
) ;

CREATE TABLE orders (
  o_orderkey int8 not null,
  o_custkey int8 not null,
  o_orderstatus char(1) not null,
  o_totalprice numeric(12,2) not null,
  o_orderdate date not null,
  o_orderpriority char(15) not null,
  o_clerk char(15) not null,
  o_shippriority int4 not null,
  o_comment varchar(79) not null,
  Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
  p_partkey int8 not null ,
  p_name varchar(55) not null,
  p_mfgr char(25) not null,
  p_brand char(10) not null,
  p_type varchar(25) not null,
  p_size int4 not null,
  p_container char(10) not null,
  p_retailprice numeric(12,2) not null,
  p_comment varchar(23) not null,
  PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
  ps_partkey int8 not null,
  ps_suppkey int4 not null,
  ps_availqty int4 not null,
  ps_supplycost numeric(12,2) not null,
  ps_comment varchar(199) not null,
  Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
  r_regionkey int4 not null,
  r_name char(25) not null ,
  r_comment varchar(152) not null,
  Primary Key(R_REGIONKEY)
) ;

CREATE TABLE supplier (
  s_suppkey int4 not null,
  s_name char(25) not null,
  s_address varchar(40) not null,
  s_nationkey int4 not null,
  s_phone char(15) not null,
  s_acctbal numeric(12,2) not null,
  s_comment varchar(101) not null,
  Primary Key(S_SUPPKEY)
);

Run an AWS Glue workflow for data migration

When you're ready to start the data migration, complete the following steps:

  1. On the AWS Glue console, choose Workflows in the navigation pane.
  2. Select the workflow to run (<stack name>snowflake-to-redshift-migration).
  3. On the Actions menu, choose Run.
    Glue Workflow Run
  4. To check the status of the workflow, choose the workflow and on the History tab, select the Run ID and choose View run details.
    Glue Workflow Status
  5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
  6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the table_schema value accordingly if you're not migrating the sample data.
SELECT tab.table_schema,
       tab.table_name,
       nvl(tinf.tbl_rows,0) tbl_rows,
       nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf 
          on tab.table_schema = tinf.schema 
          and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
      and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;

Redshift Editor

  1. Run the following query in Snowflake to compare and validate the data:
USE DATABASE snowflake_sample_data;
SELECT  TABLE_CATALOG,
        TABLE_SCHEMA,
        TABLE_NAME,
        ROW_COUNT,
        BYTES AS SIZE,
        COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;

Snowflake Editor

Clean up

To avoid incurring future costs, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.

Conclusion

In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can often be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you're interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.

Appendix: Resources deployed by AWS CloudFormation

The CloudFormation stack deploys the following resources in your AWS account:

  • Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
  • Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
  • AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store credentials for Snowflake and Amazon Redshift.
  • VPC endpoints – Two VPC endpoints are deployed to establish a private connection from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
  • IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so they're not as permissive as presented in this post (which were set for simplicity and demonstration). In particular, AWS Glue and Amazon Redshift don't require all the actions granted in the *FullAccess policies, which would be considered overly permissive.
  • Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn't publicly accessible.
  • AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
  • AWS Glue jobs – Two AWS Glue Python shell jobs are created:
    • <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is available in S3. The Snowflake job accepts two parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
      • SECRET – The Secrets Manager ARN that stores Snowflake connection details.
    • <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script link is available in S3. The Amazon Redshift job accepts three parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide a custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
      • SECRET – The Secrets Manager ARN that stores Amazon Redshift connection details.
      • PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided separated by a comma (see the sketch after this list).
  • AWS Glue workflow – The orchestration of the Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed through an AWS Glue workflow. The workflow <stack name>snowflake-to-redshift-migration is run later for the actual migration of data.
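The following sketch shows one way (an assumed pattern, not the packaged script) the load job can read its job arguments and resolve the stored Amazon Redshift credentials, assuming the secret is stored as JSON key-value pairs:

import json
import sys

import boto3
from awsglue.utils import getResolvedOptions

# Resolve the arguments passed to the Python shell job
args = getResolvedOptions(sys.argv, ["SQLSCRIPT", "SECRET", "PARAMS"])

# Fetch the Amazon Redshift connection details stored in Secrets Manager
secrets_client = boto3.client("secretsmanager")
secret_value = secrets_client.get_secret_value(SecretId=args["SECRET"])
credentials = json.loads(secret_value["SecretString"])

# PARAMS carries the COPY IAM role and the staging bucket, separated by a comma
iam_role, data_bucket = [p.strip() for p in args["PARAMS"].split(",")]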

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She helps customers validate analytics solutions by architecting proof of concept workloads designed to meet their specific needs.
