
How William Hill migrated NoSQL workloads at scale to Amazon Keyspaces


Social gaming and online sports betting are competitive environments. The platform must be able to handle large volumes of unpredictable traffic while simultaneously promising zero downtime. In this domain, user retention is no longer just desirable, it's essential. William Hill is a global online gambling company based in London, England, and a founding member of the UK Betting and Gaming Council. They share the mission to champion the betting and gaming industry and set world-class standards to make sure of an enjoyable, fair, and safe betting and gambling experience for all of their customers. In sports betting, William Hill is an industry-leading brand, awarded prestigious industry titles like the IGA Awards Sports Betting Operator of the Year in 2019, 2020, and 2022, and the SBC Awards Racing Sportsbook of the Year in 2019. William Hill was acquired by Caesars Entertainment, Inc (NASDAQ: CZR) in April 2021, the largest casino-entertainment company in the US and one of the world's most diversified casino-entertainment providers. At the heart of the William Hill gaming platform is a NoSQL database that maintains 100% uptime, scales in real time to handle millions of users or more, and provides users with a responsive and personalized experience across all of their devices.

In this post, we discuss how William Hill moved their workload from Apache Cassandra to Amazon Keyspaces (for Apache Cassandra) with zero downtime using AWS Glue ETL.

William Hill was facing challenges regarding scalability, cluster instability, high operational costs, and manual patching and server maintenance. They were looking for a NoSQL solution that was scalable, highly available, and fully managed, letting them focus on providing a better user experience rather than maintaining infrastructure. William Hill Limited decided to move forward with Amazon Keyspaces, since it can run Apache Cassandra workloads on AWS using the same Cassandra application code and developer tools used today, without the need to provision, patch, or manage servers, or to install, maintain, or operate software.

Solution overview

William Hill Limited wanted to migrate their existing Apache Cassandra workloads to Amazon Keyspaces with a replication lag of minutes, with minimal migration costs and development effort. Therefore, AWS Glue ETL was used to deliver the desired outcome.

AWS Glue is a serverless data integration service that provides multiple benefits for migration:

  • No infrastructure to maintain; it allocates the necessary computing power and runs multiple migration jobs concurrently.
  • All-in-one pricing model that includes infrastructure and is 55% cheaper than other cloud data integration options.
  • No lock-in with the service; you can develop data migration pipelines in open-source Apache Spark (Spark SQL, PySpark, and Scala).
  • The migration pipeline can be scaled fearlessly with Amazon Keyspaces and AWS Glue.
  • Built-in pipeline monitoring to ensure in-migration continuity.
  • AWS Glue ETL jobs make it possible to perform bulk data extraction from Apache Cassandra and ingest it into Amazon Keyspaces.

In this post, we take you through William Hill's journey of building the migration pipeline from scratch to migrate the Apache Cassandra workload to Amazon Keyspaces using AWS Glue ETL with the DataStax Spark Cassandra Connector.

For the purpose of this post, let's look at a typical Cassandra network setup on AWS and the mechanism used to establish the connection with AWS Glue ETL. The migration solution described also works for Apache Cassandra hosted on on-premises clusters.

Architecture overview

The architecture demonstrates the migration environment, which requires Amazon Keyspaces, AWS Glue, Amazon Simple Storage Service (Amazon S3), and the Apache Cassandra cluster. To avoid high CPU utilization or saturation on the Apache Cassandra cluster during the migration process, you might want to deploy another Cassandra datacenter to isolate your production from the migration workload and make the migration process seamless for your customers.

Amazon S3 is used for staging while migrating data from Apache Cassandra to Amazon Keyspaces, to make sure that the I/O load on the Cassandra cluster serving live production traffic is minimized in case the data upload to Amazon Keyspaces fails and a retry must be performed.

Prerequisites

The Apache Cassandra cluster is hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances, spread across three Availability Zones, and hosted in private subnets. AWS Glue ETL is hosted on Amazon Virtual Private Cloud (Amazon VPC) and therefore needs an AWS Glue Studio custom connector and connection to be set up to communicate with the Apache Cassandra nodes hosted in the private subnets in the customer VPC. This enables the connection to the Cassandra cluster hosted in the VPC. The DataStax Spark Cassandra Connector must be downloaded and saved to an Amazon S3 bucket: s3://$MIGRATION_BUCKET/jars/spark-cassandra-connector-assembly_2.12-3.2.0.jar.

Let's create an AWS Glue Studio custom connector named cassandra_connection and its corresponding connection named conn-cassandra-custom for the AWS Region us-east-1.

For the connector created, create an AWS Glue Studio connection and populate it with network information: the VPC and a subnet that allow AWS Glue ETL to establish a connection with Apache Cassandra.

  • Name: conn-cassandra-custom
  • Network Options

Let's begin by creating a keyspace and table in Amazon Keyspaces using the Amazon Keyspaces console or CQLSH: a target keyspace named target_keyspace and a target table named target_table.

CREATE KEYSPACE target_keyspace WITH replication = {'class': 'SingleRegionStrategy'};

CREATE TABLE target_keyspace.target_table (
    userid      uuid,
    level       text,
    gameid      int,
    description text,
    nickname    text,
    zip         text,
    email       text,
    updatetime  text,
    PRIMARY KEY (userid, level, gameid)
) WITH default_time_to_live = 0 AND CUSTOM_PROPERTIES = {
	'capacity_mode':{
		'throughput_mode':'PROVISIONED',
		'write_capacity_units':76388,
		'read_capacity_units':3612
	}
} AND CLUSTERING ORDER BY (level ASC, gameid ASC);

After the table has been created, switch the table to on-demand mode to pre-warm the table and avoid AWS Glue ETL job throttling failures. The following script updates the throughput mode.

ALTER TABLE target_keyspace.target_table
WITH CUSTOM_PROPERTIES = {
	'capacity_mode':{
		'throughput_mode':'PAY_PER_REQUEST'
	}
};

Let's go ahead and create two Amazon S3 buckets to support the migration process. The first bucket (s3://your-spark-cassandra-connector-bucket-name) should store the Spark Cassandra Connector assembly JAR file, along with the Cassandra and Keyspaces configuration files.

The second bucket (s3://your-migration-stage-bucket-name) will be used to store intermediate Parquet files that identify the delta between the Cassandra cluster and the Amazon Keyspaces table, in order to track changes between subsequent executions of the AWS Glue ETL jobs.

In the following KeyspacesConnector.conf, set your contact points to connect to Amazon Keyspaces, and replace the username and the password with your AWS credentials.

Using the RateLimitingRequestThrottler, we can make sure that requests don't exceed the configured Keyspaces capacity. The G.1X DPU creates one executor per worker. The RateLimitingRequestThrottler in this example is set to 1,000 requests per second. With this configuration, and a G.1X DPU, you'll achieve 1,000 requests per second per AWS Glue worker. Adjust max-requests-per-second accordingly to fit your workload. Increase the number of workers to scale throughput to a table.

datastax-java-driver {
  basic.request.consistency = "LOCAL_QUORUM"
  basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
  advanced.reconnect-on-init = true
  basic.load-balancing-policy {
    local-datacenter = "us-east-1"
  }
  advanced.auth-provider = {
    class = PlainTextAuthProvider
    username = "user-at-sample"
    password = "S@MPLE=PASSWORD="
  }
  advanced.throttler = {
    class = RateLimitingRequestThrottler
    max-requests-per-second = 1000
    max-queue-size = 50000
    drain-interval = 1 millisecond
  }
  advanced.ssl-engine-factory {
    class = DefaultSslEngineFactory
    hostname-validation = false
  }
  advanced.connection.pool.local.size = 1
}
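The sizing arithmetic behind the throttler setting can be sketched as a quick back-of-the-envelope calculation. This is a rough sketch only (it assumes one write request per row and one request per WCU; larger rows consume more capacity), and the numbers are the example values from this post:

```python
def required_glue_workers(target_wps: int, max_requests_per_second: int = 1000) -> int:
    """Workers needed to sustain a target write rate.

    With G.1X (one executor per worker) and the throttler above, each worker
    issues at most max_requests_per_second requests. Round up (ceiling
    division) so the aggregate rate covers the target.
    """
    return -(-target_wps // max_requests_per_second)

# Example: the provisioned table above has 76,388 WCUs; matching that rate
# at 1,000 requests per second per worker needs this many workers:
print(required_glue_workers(76388))  # → 77
```

Scaling throughput is then a matter of adjusting `--number-of-workers` on the Glue job rather than touching the driver profile.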

Similarly, create a CassandraConnector.conf file, set the contact points to connect to the Cassandra cluster, and replace the username and the password respectively.

datastax-java-driver {
  basic.request.consistency = "LOCAL_QUORUM"
  basic.contact-points = ["127.0.0.1:9042"]
  advanced.reconnect-on-init = true
  basic.load-balancing-policy {
    local-datacenter = "datacenter1"
  }
  advanced.auth-provider = {
    class = PlainTextAuthProvider
    username = "user-at-sample"
    password = "S@MPLE=PASSWORD="
  }
}

Build the AWS Glue ETL migration pipeline with Amazon Keyspaces

To build a reliable, consistent delta-upload Glue ETL pipeline, let's decouple the migration process into two AWS Glue ETL jobs.

  • CassandraToS3 Glue ETL: Reads data from the Apache Cassandra cluster and transfers the migration workload to Amazon S3 in the Apache Parquet format. To identify incremental changes in the Cassandra tables, the job stores separate Parquet files containing the primary keys and an updated timestamp.
  • S3toKeyspaces Glue ETL: Uploads the migration workload from Amazon S3 to Amazon Keyspaces. During the first run, the ETL uploads the complete data set from Amazon S3 to Amazon Keyspaces; for subsequent runs, it calculates the incremental changes by comparing the updated timestamp across two consecutive runs. The job also takes care of inserting new records, updating existing records, and deleting records based on the incremental difference.
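The delta logic of the second job can be illustrated with a small, self-contained sketch. Here plain Python dicts stand in for two Parquet snapshots keyed by primary key; the actual Glue job does the equivalent with Spark dataframe joins:

```python
def compute_delta(previous: dict, current: dict):
    """Classify rows by comparing two snapshots.

    previous / current map primary key -> updated timestamp (write time).
    Returns the primary keys to insert, update, and delete in the target.
    """
    inserts = [pk for pk in current if pk not in previous]
    updates = [pk for pk in current
               if pk in previous and current[pk] > previous[pk]]
    deletes = [pk for pk in previous if pk not in current]
    return inserts, updates, deletes

# Hypothetical snapshots: (userid, level, gameid) -> write time
prev = {("u1", "l1", 1): 100, ("u2", "l1", 2): 100}
curr = {("u1", "l1", 1): 150, ("u3", "l2", 3): 120}
ins, upd, dele = compute_delta(prev, curr)
# u3 is new, u1 changed, u2 disappeared
```

The first run has an empty `previous` snapshot, so every row classifies as an insert, which matches the full-dataset upload described above.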

In this example, we use Scala to write the AWS Glue ETL, but you can also use PySpark.

Let's go ahead and create an AWS Glue ETL job named CassandraToS3 with the following job parameters:

aws glue create-job \
    --name "CassandraToS3" \
    --role "GlueKeyspacesMigration" \
    --description "Offload data from Cassandra to S3" \
    --glue-version "3.0" \
    --number-of-workers 2 \
    --worker-type "G.1X" \
    --connections "conn-cassandra-custom" \
    --command "Name=glueetl,ScriptLocation=s3://$MIGRATION_BUCKET/scripts/CassandraToS3.scala" \
    --max-retries 0 \
    --default-arguments '{
        "--job-language":"scala",
        "--KEYSPACE_NAME":"source_keyspace",
        "--TABLE_NAME":"source_table",
        "--S3_URI_FULL_CHANGE":"s3://$MIGRATION_BUCKET/full-dataset/",
        "--S3_URI_CURRENT_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/current/",
        "--S3_URI_NEW_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/new/",
        "--extra-files":"s3://$MIGRATION_BUCKET/conf/CassandraConnector.conf",
        "--conf":"spark.cassandra.connection.config.profile.path=CassandraConnector.conf",
        "--class":"GlueApp"
    }'

The CassandraToS3 Glue ETL job reads data from the Apache Cassandra table source_keyspace.source_table and writes it to the S3 bucket in the Apache Parquet format. The job rotates the Parquet files to help identify delta changes in the data between consecutive job executions. To identify inserts, updates, and deletes, you need to know the primary key and the columns' write times (updated timestamps) in the Cassandra cluster up front. Our primary key consists of several columns (userid, level, gameid) and a write time column, updatetime. If you have multiple updated columns, then you need to use more than one write time column with an aggregation function. For example, for email and updatetime, take the maximum value between the write times for email and updatetime.
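The write-time aggregation just described can be sketched as follows. The column names and timestamps are illustrative; in the real job the per-column values come from Cassandra's writetime() function:

```python
def effective_write_time(column_write_times: dict) -> int:
    """Take the maximum write time across the tracked columns, so a row
    counts as changed if any one of them was updated."""
    return max(column_write_times.values())

# email was updated after updatetime, so the row's effective write time
# is email's write time:
row = {"email": 1700000100, "updatetime": 1700000050}
print(effective_write_time(row))  # → 1700000100
```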

The following AWS Glue Spark code offloads data to Amazon S3 using the spark-cassandra-connector. The script takes five parameters: KEYSPACE_NAME, TABLE_NAME, S3_URI_FULL_CHANGE, S3_URI_CURRENT_CHANGE, and S3_URI_NEW_CHANGE.

To upload the data from Amazon S3 to Amazon Keyspaces, you need to create an S3toKeyspaces Glue ETL job using Glue Spark code that reads the Parquet files from the Amazon S3 bucket created as an output of the CassandraToS3 Glue job, identifies inserts, updates, and deletes, and executes the requests against the target table in Amazon Keyspaces. The code sample provided takes the same five parameters: KEYSPACE_NAME, TABLE_NAME, S3_URI_FULL_CHANGE, S3_URI_CURRENT_CHANGE, and S3_URI_NEW_CHANGE.

Let's go ahead and create our second AWS Glue ETL job, S3toKeyspaces, with the following job parameters:

aws glue create-job \
    --name "S3toKeyspaces" \
    --role "GlueKeyspacesMigration" \
    --description "Push data to Amazon Keyspaces" \
    --glue-version "3.0" \
    --number-of-workers 2 \
    --worker-type "G.1X" \
    --command "Name=glueetl,ScriptLocation=s3://amazon-keyspaces-backups/scripts/S3toKeyspaces.scala" \
    --default-arguments '{
        "--job-language":"scala",
        "--KEYSPACE_NAME":"target_keyspace",
        "--TABLE_NAME":"target_table",
        "--S3_URI_FULL_CHANGE":"s3://$MIGRATION_BUCKET/full-dataset/",
        "--S3_URI_CURRENT_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/current/",
        "--S3_URI_NEW_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/new/",
        "--extra-files":"s3://$MIGRATION_BUCKET/conf/KeyspacesConnector.conf",
        "--conf":"spark.cassandra.connection.config.profile.path=KeyspacesConnector.conf",
        "--class":"GlueApp"
    }'

Job scheduling

The final step is to configure AWS Glue triggers or Amazon EventBridge, depending on your scheduling needs, to trigger the S3toKeyspaces Glue ETL job when the CassandraToS3 job has succeeded. If you want to run CassandraToS3 on a schedule, then the following example shows how to schedule CassandraToS3 to run every 15 minutes.
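A scheduled trigger of this kind could look like the following sketch, which builds the request payload for a 15-minute schedule. The trigger name is hypothetical, and the `boto3` call at the end is shown only as a comment since it requires AWS credentials:

```python
import json

def cassandra_to_s3_trigger(every_minutes: int = 15) -> dict:
    """Build a Glue create_trigger payload that runs CassandraToS3 on a
    fixed schedule. AWS cron expressions have six fields; '?' is required
    in the day-of-week position when day-of-month is '*'."""
    return {
        "Name": f"CassandraToS3-every-{every_minutes}min",  # illustrative name
        "Type": "SCHEDULED",
        "Schedule": f"cron(0/{every_minutes} * * * ? *)",
        "Actions": [{"JobName": "CassandraToS3"}],
        "StartOnCreation": True,
    }

payload = cassandra_to_s3_trigger(15)
print(json.dumps(payload, indent=2))
# In a real environment you would then call:
#   boto3.client("glue").create_trigger(**payload)
```

For the dependency between the two jobs, a second trigger of type CONDITIONAL (firing when CassandraToS3 reaches SUCCEEDED) can start S3toKeyspaces.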

Job tuning

The following Spark settings are recommended starting points for Amazon Keyspaces; they can be increased later as appropriate for your workload.

  • Use a Spark partition size (which groups multiple Cassandra rows) smaller than 8 MB to avoid replaying large Spark tasks during a job failure.
  • Use a low number of concurrent writes per DPU with a large number of retries. Add the following options to the job parameters: --conf spark.cassandra.query.retry.count=500 --conf spark.cassandra.output.concurrent.writes=3.
  • Set spark.task.maxFailures to a bounded value. For example, you can start from 32 and increase as needed. This option can help you increase the number of task retries during the table pre-warm stage. Add the following option to the job parameters: --conf spark.task.maxFailures=32
  • Another recommendation is to turn off batching to improve random access patterns. Add the following options to the job parameters:
    spark.cassandra.output.batch.size.rows=1
    spark.cassandra.output.batch.grouping.key=none
    spark.cassandra.output.batch.grouping.buffer.size=100
  • Randomize your workload. Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order, because you can take advantage of the built-in load balancing of writing to different partitions. To spread the writes across the partitions evenly, you need to randomize the data in the dataframe. You can use a rand function to shuffle rows in the dataframe.
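The randomization in the last bullet can be sketched in plain Python with hypothetical rows; in the Glue job itself you would apply the Spark equivalent, something like df.orderBy(rand()), before writing:

```python
import random

# Hypothetical rows keyed by partition key (userid): sequential order would
# hammer one partition at a time.
rows = [{"userid": f"user-{i % 4}", "level": str(i)} for i in range(20)]

shuffled = rows[:]        # copy so the source order is preserved
random.shuffle(shuffled)  # randomize write order across partition keys

# Same rows, new order: writes now hit partitions in a random sequence,
# letting the service's load balancing absorb them evenly.
```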

Summary

William Hill was able to migrate their workload from Apache Cassandra to Amazon Keyspaces at scale using AWS Glue, without the need to make any changes to their application tech stack. The adoption of Amazon Keyspaces has given them the headroom to focus on their application and customer experience, because with Amazon Keyspaces there's no need to manage servers: they get performance at scale and a highly scalable, secure solution with the ability to handle sudden spikes in demand.

In this post, you saw how to use AWS Glue to migrate the Cassandra workload to Amazon Keyspaces while keeping your Cassandra source databases completely functional during the migration process. When your applications are ready, you can choose to cut over to Amazon Keyspaces with a minimal, sub-minute replication lag between the Cassandra cluster and Amazon Keyspaces. You can also use a similar pipeline to replicate the data back to the Cassandra cluster from Amazon Keyspaces to maintain data consistency, if needed. Here you can find the documentation and code to help accelerate your migration to Amazon Keyspaces.


About the Authors

Nikolai Kolesnikov is a Senior Data Architect and helps AWS Professional Services customers build highly scalable applications using Amazon Keyspaces. He also leads Amazon Keyspaces ProServe customer engagements.

Kunal Gautam is a Senior Big Data Architect at Amazon Web Services. With experience building his own startup and working alongside enterprises, he brings a unique perspective on getting people, business, and technology to work in tandem for customers. He is passionate about helping customers on their digital transformation journey and enabling them to build scalable data and advanced analytics solutions to gain timely insights and make critical business decisions. In his spare time, Kunal enjoys marathons, tech meetups, and meditation retreats.
