Saturday, December 9, 2023
HomeBig DataCatastrophe restoration issues with Amazon EMR on Amazon EC2 for Spark workloads

Catastrophe restoration issues with Amazon EMR on Amazon EC2 for Spark workloads


Amazon EMR is a cloud massive knowledge platform for working large-scale distributed knowledge processing jobs, interactive SQL queries, and machine studying (ML) purposes utilizing open-source analytics frameworks similar to Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the identical Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone to enhance efficiency. Throughout an Availability Zone failure or because of any surprising interruption, Amazon EMR will not be accessible, and we’d like a catastrophe restoration (DR) technique to mitigate this downside.

A part of architecting a resilient, extremely out there Amazon EMR answer is the consideration that failures do happen. These surprising interruptions might be attributable to pure disasters, technical failures, and human interactions leading to an Availability Zone outage. The EMR cluster may additionally turn out to be unreachable because of failure of essential companies working on the EMR grasp node, community points, or different points.

On this publish, we present you how one can architect your Amazon EMR atmosphere for catastrophe restoration to keep up enterprise continuity with minimal Restoration Time Goal (RTO) throughout Availability Zone failure or when your EMR cluster is inoperable.

Though varied catastrophe restoration methods can be found within the cloud, we focus on active-active and active-passive DR methods for Amazon EMR on this publish. We concentrate on a use case for Spark batch workloads the place persistent storage is decoupled from Amazon EMR and the EMR cluster is working with a single grasp node. If the EMR cluster is used for persistent storage, it requires an extra technique to copy knowledge from the EMR cluster, which we’ll cowl in subsequent posts.


To observe together with this publish, you must have a data of Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and an understanding of Community Load Balancers.

Resolution overview

The next diagram illustrates the answer structure.

Clients typically use Amazon MWAA to submit Spark jobs to an EMR cluster utilizing an Apache Livy REST interface. We are able to configure Apache Livy to make use of a Community Load Balancer hostname as a substitute of an Amazon EMR grasp hostname, in order that we don’t must replace Livy connections from Amazon MWAA at any time when a brand new cluster is created or stopped. You may register Community Load Balancer goal teams with a number of EMR cluster grasp nodes for an active-active setup. Within the case of an active-passive setup, we will create a brand new EMR cluster when a failure is detected and register the brand new EMR grasp with the Community Load Balancer goal group. The Community Load Balancer routinely performs well being checks and distributes requests to wholesome targets. With this answer, we will preserve enterprise continuity when an EMR cluster isn’t reachable because of Availability Zone failure or when the cluster is unhealthy because of another purpose.

Energetic-active DR technique

An active-active DR setup focuses on working two EMR clusters with an identical configuration in two totally different Availability Zones. To scale back the working prices of two lively EMR clusters, we will launch each clusters with minimal capability, and managed scaling routinely scales the cluster based mostly on the workload. EMR managed scaling solely launches cases when there may be demand for assets and stops the unneeded cases when the work is completed. With this technique, we will scale back our restoration time to close zero with optimum value. This active-active DR technique is appropriate when companies need to have near-zero downtime with automated failover to your analytics workloads.

Within the following part, we stroll by means of the steps to implement the answer and supply references to associated assets that present extra detailed steering.

Create EMR clusters

We create two EMR clusters in several Availability Zones inside the similar Area of your selection. Use the next AWS Command Line Interface (AWS CLI) command and modify or add required configurations as per your wants:

aws emr create-cluster 
  --name "<emr-cluster-az-a>" 
  --release-label emr-6.4.0 
  --log-uri "s3://<your-log-bucket>" 
  --applications Identify=Spark Identify=Livy 
  --ec2-attributes "KeyName=<your-key-name>,SubnetId=<private-subnet-in-az-a>" 
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.massive InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.massive 

We are able to create the cluster with EMR managed scaling, which helps you to routinely enhance or lower the variety of cases or models in your cluster based mostly on workload. Amazon EMR repeatedly evaluates cluster metrics to make scaling choices that optimize your clusters for value and pace.

Create and configure a Community Load Balancer

You may create a Community Load Balancer utilizing the AWS CLI (see Create a Community Load Balancer utilizing the AWS CLI) or the AWS Administration Console (see Create a Community Load Balancer). For this publish, we accomplish that on the console.

  • Create a goal group (emr-livy-dr) and register each EMR clusters’ grasp IP addresses within the goal group.

  • Create an inner Community Load Balancer in the identical VPC or Area as your EMR clusters, and select two totally different Availability Zones and choose the non-public subnets.
    These subnets don’t have to be in the identical subnets because the EMR clusters, however the clusters should enable the site visitors from the Community Load Balancer, which is mentioned in subsequent steps.

  • Create a TCP listener on port 8998 (the default EMR cluster Livy port) to ahead requests to the goal group you created.

  • Modify the EMR clusters’ grasp safety teams to permit the Community Load Balancer’s non-public IP addresses to entry port 8998.

You will discover the Community Load Balancer’s non-public IP deal with by looking out the elastic community interfaces for the Community Load Balancer’s identify. For entry management directions, confer with How do I connect a safety group to my Elastic Load Balancer.

When the goal teams turn out to be wholesome, the Community Load Balancer forwards requests to registered targets when it receives requests on Livy port 8998.

  • Get the DNS identify of the Community Load Balancer.

We are able to additionally use an Amazon Route 53 alias file to make use of our personal area identify to route site visitors to the Community Load Balancer DNS identify. We use this DNS identify in our Amazon MWAA Livy connection.

Create and configure Amazon MWAA

Full the next steps:

  • Be certain that the execution function you’re utilizing with Amazon MWAA has correct entry to EMR clusters and different required companies.
  • Replace the Amazon MWAA Livy connection (livy_default) host with the Community Load Balancer hostname you created.
  • Create a brand new Livy connection ID if it’s not already out there.

  • Use the next pattern DAG to submit a pattern Spark utility utilizing LivyOperator. We assign the livy_default connection to the livy_conn_id within the DAG code.
  • Allow the DAG and confirm if the Spark utility is profitable on one of many EMR clusters.
from datetime import timedelta, datetime
from airflow.utils.dates import days_ago
from airflow import DAG
from airflow.suppliers.apache.livy.operators.livy import LivyOperator

default_args = {
    'proprietor': 'airflow',
    "retries": 1,
    "retry_delay": timedelta(minutes=5),

dag_name = "livy_spark_dag"
# Exchange S3 bucket identify
# You should use pattern spark jar from EMR cluster grasp node
# /usr/lib/spark/examples/jars/spark-examples.jar
s3_bucket = "artifacts-bucket"
jar_location = "s3://{}/spark-examples.jar".format(s3_bucket)

dag = DAG(
    dag_id = dag_name,
    schedule_interval="@as soon as",
    start_date = days_ago(1),
    tags=['emr', 'spark', 'livy']

livy_spark = LivyOperator(
    "spark.submit.deployMode": "cluster",
    "": dag_name


Take a look at the DR plan

We are able to check our DR plan by creating eventualities that might be attributable to actual disasters. Carry out the next steps to validate if our DR technique works routinely throughout a catastrophe:

  1. Run the pattern DAG a number of occasions and confirm if Spark purposes are randomly submitted to the registered EMR clusters.
  2. Cease one of many clusters and confirm if jobs are routinely submitted to the opposite cluster in a distinct Availability Zone with none points.

Energetic-passive DR technique

Though the active-active DR technique has advantages of sustaining near-zero restoration time, it’s complicated to keep up two environments as a result of each environments require patching and fixed monitoring. In instances the place Restoration Time Goal (RTO) and Restoration Level Goal (RPO) aren’t essential to your workloads, we will undertake an active-passive technique. This method provides a extra economical and operationally much less complicated method.

On this method, we use a single EMR cluster as an lively cluster and in case of catastrophe (because of Availability Zone failures or another purpose the EMR cluster is unhealthy), we launch a second EMR cluster in a distinct Availability Zone and redirect all our workloads to the newly launched cluster. Finish-users might discover some delay as a result of launching a second EMR cluster takes time.

The high-level structure of the active-passive DR answer is proven within the following diagram.

Full the next steps to implement this answer:

  • Create an EMR cluster in a single Availability Zone.
  • Create goal teams and register the EMR cluster grasp node IP deal with. Create goal group for Useful resource Supervisor(8088), Identify Node(9870) and Livy(8998) companies. Change the port numbers if companies are working on totally different ports.

  • Create a Community Load Balancer and add TCP listeners and ahead requests to the respective goal teams.

  • Create an Amazon MWAA atmosphere with correct entry to the EMR cluster in the identical Area.
  • Edit the Amazon MWAA Livy connection to make use of the Community Load Balancer DNS identify.
  • Use the up to date Livy connection in Amazon MWAA DAGs to submit Spark purposes.
  • Validate if we will efficiently submit Spark purposes by way of Livy to the EMR cluster.
  • Arrange a DAG on Amazon MWAA or comparable scheduling device that repeatedly displays the prevailing EMR cluster well being.
  • Monitor the next key companies working on the Amazon EMR grasp host utilizing REST APIs or instructions offered by every service. Add extra well being checks as required.
  • If the well being examine course of detects a failure of the primary EMR cluster, create a brand new EMR cluster in a distinct Availability Zone.
  • Routinely register the newly created EMR cluster grasp IP deal with to the Community Load Balancer goal teams.
  • When the Community Load Balancer well being checks are profitable with the brand new EMR cluster grasp IP, delete the unhealthy EMR cluster grasp IP deal with from the goal group and cease the previous EMR cluster.
  • Validate the DR plan.

Observe the steps talked about within the active-active DR technique to create the next assets:

  • Amazon EMR
  • Amazon MWAA
  • Community Load Balancer

The next pattern script gives the performance described on this part. Use this as reference and modify it accordingly to suit your use case.


utilization() {
	cat <<EOF
   Utilization: ./ j-2NPQWXK1U4E6G

   This script takes present EMR cluster id as argument and displays the cluster well being and
   creates new EMR cluster in several AZ if present cluster is unhealthy/unreachable

	exit 1

[[ $# -lt 1 ]] && {
	echo Specify cluster id as argument to the script

#Set NLB DNS identify and area


#Relying on the use case carry out beneath well being checks for a couple of time in a loop and if cluster state continues to be unhealthy then solely carry out remaining steps
#Ports and SSL properties for curl command might differ relying on how companies are arrange
rm_state=$(curl -s --connect-timeout 5 --max-time 10 http://$hostname:8088/ws/v1/cluster | jq -r .clusterInfo.state)
if [[ $? -ne 0 || "$rm_state" != "STARTED" ]]; then
	echo "ResourceManager port not reachable or service not working"

nn_state=$(curl -s --connect-timeout 5 --max-time 10 http://$hostname:9870/jmx?qry=Hadoop:service=NameNode,identify=NameNodeStatus | jq -r .beans[0].State)
if [[ $? -ne 0 || "$nn_state" != "active" ]]; then
	echo "NameNode port not reachable or service not working"

livy_state=$(curl -s --connect-timeout 5 --max-time 10 http://$hostname:8998/periods)
if [[ $? -ne 0 ]]; then
	echo "Livy port not reachable"

cluster_name=$(aws emr describe-cluster --cluster-id $cluster_id | jq -r ".Cluster.Identify")

update_target_groups() {

	nlb_arn=$(aws elbv2 describe-load-balancers --query "LoadBalancers[?DNSName==`$hostname`].[LoadBalancerArn]" --output textual content)
	target_groups=$(aws elbv2 describe-target-groups --load-balancer-arn $nlb_arn --query "TargetGroups[*].TargetGroupArn" --output textual content)
	IFS=" " learn -a tg_array <<<$target_groups
	for tg in "${tg_array[@]}"; do
		echo "Registering new EMR grasp IP with goal group $tg"
		aws elbv2 register-targets --target-group-arn $tg --targets Id=$new_master_ip,AvailabilityZone=all

		echo "De-registering previous/unhealthy EMR grasp IP from goal group $tg"
		aws elbv2 deregister-targets --target-group-arn $tg --targets Id=$current_master_ip,AvailabilityZone=all

if [[ $cluster_status == "unhealthy" ]]; then
	echo "Cluster standing is $cluster_status, creating new EMR cluster"
	current_az=$(aws emr describe-cluster --cluster-id $cluster_id | jq -r ".Cluster.Ec2InstanceAttributes.Ec2AvailabilityZone")
	new_az=$(aws ec2 describe-availability-zones --output json --filters "Identify=region-name,Values=$area" --query "AvailabilityZones[?ZoneName!=`$current_az`].ZoneName|[0]" --output=textual content)
	current_master_ip=$(aws emr list-instances --cluster-id $cluster_id --instance-group-types MASTER --query "Cases[*].PrivateIpAddress" --output textual content)
	echo "Present/unhealthy cluster id $cluster_id, cluster identify $cluster_name,AZ $current_az, Grasp non-public ip $current_master_ip"

	echo "Creating new EMR cluster in $new_az"
	emr_op=$(aws emr create-cluster 
		--name "$cluster_name-$new_az" 
		--release-label emr-6.4.0 
		--applications Identify=Spark Identify=Livy 
		--ec2-attributes "AvailabilityZone=$new_az" 
		--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.massive InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.massive 
		--region $area)

	new_cluster_id=$(echo $emr_op | jq -r ".ClusterId")

	#anticipate cluster provisioning to get grasp ip deal with
	sleep 2m

	new_master_ip=$(aws emr list-instances --cluster-id $new_cluster_id --instance-group-types MASTER --query "Cases[*].PrivateIpAddress" --output textual content)
	echo "New EMR cluster id $new_cluster_id and Grasp node IP $new_master_ip"

	echo "Terminating unhealthy cluster $cluster_id/$cluster_name in $current_az"
	aws emr modify-cluster-attributes --cluster-id $cluster_id --no-termination-protected
	aws emr terminate-clusters --cluster-ids $cluster_id

	echo "Register new EMR grasp IP deal with with NLB goal teams and de-register unhealthy EMR grasp"
	update_target_groups $new_master_ip $current_master_ip $current_az
	echo "Present cluster $cluster_id/$cluster_name is wholesome"


On this publish, we shared some options and issues to enhance DR implementation utilizing Amazon EMR on Amazon EC2, Community Load Balancer, and Amazon MWAA. Primarily based in your use case, you may decide the kind of DR technique you need to deploy. We’ve got offered the steps required to create the mandatory environments and arrange a profitable DR technique.

For extra particulars concerning the techniques and processes described on this publish, confer with the next:

In regards to the Writer

Bharat Gamini is a Information Architect targeted on Large Information & Analytics at Amazon Net Companies. He helps clients architect and construct extremely scalable, sturdy and safe cloud-based analytical options on AWS.




Please enter your comment!
Please enter your name here

Most Popular

Recent Comments