AWS Step Functions and EMR Clusters — bedrocks for a Data Analytics-as-a-Service ecosystem

Chinmohan Biswas
8 min read · Apr 20, 2022


Co-Author:

A Data Analytics/AIML-as-a-Service platform is increasingly becoming a new revenue stream in B2B scenarios. Irrespective of the type of business an organisation is in, more and more organisations want to offer such an as-a-Service platform to their customers. For example, telecom service providers are building these platforms for their small, medium and moderately large CPG or retail clients to understand their customer segmentation; advertising agencies are building this service so that their brands can check the effectiveness of new campaigns; and similar B2B scenarios exist in other industries. Developing such a platform requires a self-service, auto-scalable and dynamic architecture. IBM and AWS have built one such architecture pattern leveraging AWS Step Functions, Elastic MapReduce (EMR) clusters, DynamoDB and Lambda functions, combined with IBM's architecture expertise and deep capability on the AWS platform.

Objective

The objective of this paper is to outline the motivations behind implementing dynamic AWS EMR cluster management, the considerations involved, and one standard solution pattern for this problem.

However, before we get into the technicalities, here is the use case: as part of establishing a Data Analytics/AIML-as-a-Service platform, we need the capability to manage AIML jobs, with automated resource provisioning and lifecycle management to keep resource usage and metering optimal.

Without such a solution in place, there will be a significant increase in TCO and a high risk of missing service levels.

Why is there a need for dynamic EMR cluster management?

EMR is an industry-leading cloud big data platform for processing vast amounts of data. Amazon EMR makes it easy to set up, operate, and scale big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. However, the jobs that will run on these EMR clusters can vary considerably in terms of:

  • Criticality of the job (training/scoring/predicting) — training jobs may need to be short-lived, yet run frequently
  • Size of the dataset
  • Length of the job — this determines how long an EMR cluster remains unavailable to take any new job
  • Variability and volume of load, and auto-scaling

And with a self-serve architecture, there will be very little control over:

  • Nature of job — e.g., size and cycle time of job
  • Volume of job request at a given point of time
  • Demand pattern of jobs

Coupled with the above uncertainty, an as-a-Service platform must strictly comply with service level agreements in terms of:

  • Guaranteed delivery of jobs
  • Response time for acknowledging receipt of a job request or reporting its status
  • Throughputs & Cycle Time

And then, above all, demonstrating the ROI of the platform.

While some of the above variations can be handled by keeping a fixed number of clusters always running and manually managing the demand, that is neither the most efficient nor an automated solution. Spinning up a new cluster for every incoming job does not address the cycle time challenge either, and clusters kept running in anticipation of new jobs add to the overall usage cost.

A potential approach to solve the above requirements

Looking at the above problem statement, it can be broken down as follows:

  1. How to spin-up, down and manage lifecycle of EMR clusters dynamically?
  2. How to ensure optimum resource availability to address steady demand?
  3. How to manage incoming job flows to optimally use available clusters with guaranteed level of service?
  4. How to ensure no job request gets lost?
  5. How to ensure critical jobs get the resources on priority?

The patterns below can be deployed as a potential approach to address these challenges.

Job Orchestration

To address the above questions/requirements, the solution can be divided into two sub-domains: a) a Job Orchestrator sub-domain and b) a Cluster Management sub-domain.

The objective of the Job Orchestrator sub-domain is to manage the demand for 'jobs' and their lifecycle. Once it receives a new job request, it checks for the availability of a matching EMR cluster; if it finds a free EMR cluster, it allocates the cluster to the job. When no available cluster is found, it checks whether it can request a new cluster spin-up, and once the new cluster becomes ready, it allocates it to the job. If neither is possible (as there is a limit on the number of clusters that can be spun up at a given point in time), it queues the job for retry.

The purpose of the Cluster Management sub-domain is to handle the lifecycle of cluster requests and their states. It also monitors whether a free cluster should be decommissioned or a new cluster spun up. This sub-domain integrates with AWS's resource layer using the SDK and APIs to dynamically manage the EMR cluster lifecycle. The section below outlines the solution in detail.
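The allocation flow described above can be sketched in Python. This is a minimal sketch with hypothetical helper names (find_free, spin_up, allocate), not the actual implementation:

```python
import queue

class JobOrchestrator:
    """Sketch of the Job Orchestrator's allocation decision."""

    def __init__(self, cluster_manager, max_clusters):
        self.cluster_manager = cluster_manager   # Cluster Management sub-domain
        self.max_clusters = max_clusters         # spin-up limit at any point in time
        self.retry_queue = queue.Queue()         # jobs waiting for a cluster

    def submit(self, job):
        # 1. Look for a free cluster matching the job's profile.
        cluster = self.cluster_manager.find_free(job["profile"])
        if cluster is not None:
            return self.cluster_manager.allocate(cluster, job)
        # 2. No free cluster: request a new spin-up if under the limit.
        if self.cluster_manager.active_count() < self.max_clusters:
            cluster = self.cluster_manager.spin_up(job["profile"])
            return self.cluster_manager.allocate(cluster, job)
        # 3. Neither is possible: queue the job for retry.
        self.retry_queue.put(job)
        return None
```

In practice the retry queue would be a durable store (e.g., SQS or DynamoDB) rather than an in-memory queue, so that queued jobs survive process restarts.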

At the heart of the solution approach are two information models:

  • Job Profile — this entity acts as metadata for the incoming job in terms of type, size and priority. This metadata is managed by the Job Orchestrator sub-domain.
  • Cluster Profile — this entity captures the metadata of the EMR cluster for a given job profile. For each cluster profile, there is a mapped CloudFormation template. This metadata is used by the Cluster Management sub-domain.

Both models are persisted in AWS DynamoDB, which allows flexible data models for job and cluster profiles.
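To make the two models concrete, here are illustrative item shapes as they might be stored in DynamoDB. All attribute names and values are assumptions for illustration, not the actual schema:

```python
# Hypothetical Job Profile item: type, size and priority metadata,
# owned by the Job Orchestrator sub-domain.
job_profile = {
    "job_profile_id": "jp-training-small",
    "job_type": "training",          # training / scoring / predicting
    "dataset_size": "small",
    "priority": 1,                   # lower value = more critical
}

# Hypothetical Cluster Profile item: EMR sizing for a given job profile,
# plus the mapped CloudFormation template, owned by Cluster Management.
cluster_profile = {
    "cluster_profile_id": "cp-small",
    "job_profile_id": "jp-training-small",        # mapping to a job profile
    "cloudformation_template": "emr-small.yaml",  # template for this profile
    "master_instance_type": "m5.xlarge",
    "core_instance_type": "m5.xlarge",
    "core_instance_count": 3,
}
```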

The diagram below shows the high-level solution components:

  • Job Orchestrator — captures and validates incoming job processing requests, stores/queues the jobs, and submits them for processing based on job priority. Ensures guaranteed and prioritised job processing.
  • Cluster Management — maintains a pool of EMR clusters for each profile of jobs to be processed. The profiles can typically be based on cluster size, number of core instances, etc. The scaling of each profile of EMR clusters should be controlled from Cluster Management.
  • System Monitor — captures health heartbeats from the various compute components.

In the sections below, the Job Orchestrator and Cluster Management sub-domains are covered in detail.

As part of dynamic cluster management, three key questions need to be answered:

  1. How to spin-up, down and manage lifecycle of EMR clusters dynamically?
  2. How to ensure optimum resource availability to address steady demand?
  3. How to manage incoming job flows to optimally use available clusters with guaranteed level of service?

AWS CloudFormation templates, along with rich APIs and SDKs, can be used to implement 'Infrastructure as Code' to meet the first requirement; the section below describes the detailed solution approach.

Solution Design For EMR Cluster Management

The AWS CloudFormation service uses templates to spin up a stack of AWS services. The templates are infrastructure as code and follow a declarative approach to specifying resource options. Pre-defined EMR cluster profiles should be defined in separate templates. Each profile should have its own resource capacity, such as MasterInstanceType, CoreInstanceType, number of core instances, etc.

The AWS SDK / APIs can be used to trigger the stack creation operation of a CloudFormation template to dynamically spin up the EMR clusters. The diagram below depicts the high-level component model.
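As a sketch of this step, the boto3 CloudFormation client's `create_stack` call can be driven from a cluster profile. The parameter names and the profile dictionary below are assumptions for illustration; only the `create_stack` API itself is real:

```python
def build_stack_request(profile):
    """Translate a cluster profile (sketched attributes) into
    CloudFormation create_stack keyword arguments."""
    return {
        "StackName": f"emr-{profile['cluster_profile_id']}",
        "TemplateURL": profile["template_url"],
        "Parameters": [
            {"ParameterKey": "MasterInstanceType",
             "ParameterValue": profile["master_instance_type"]},
            {"ParameterKey": "CoreInstanceType",
             "ParameterValue": profile["core_instance_type"]},
            {"ParameterKey": "CoreInstanceCount",
             "ParameterValue": str(profile["core_instance_count"])},
        ],
    }

def spin_up_cluster(profile):
    import boto3  # imported here so the builder above stays dependency-free
    cfn = boto3.client("cloudformation")
    # create_stack is asynchronous; poll DescribeStacks (or use the
    # "stack_create_complete" waiter) until CREATE_COMPLETE before
    # allocating any job to the new cluster.
    return cfn.create_stack(**build_stack_request(profile))
```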

Component Details:

For the second and third requirements, based on demand forecasting, _MINIMUM and _MAXIMUM threshold levels of clusters should be maintained for each job profile. The _MINIMUM number of clusters reduces the cycle time of request processing. The pool can scale up to the _MAXIMUM number of clusters based on the load of processing requests. The minimum and maximum cluster levels for each profile should be configurable values; AWS Parameter Store or DynamoDB can be used to store them.
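The scaling decision reduces to clamping demand between the two thresholds. A minimal sketch, assuming one cluster per pending job as the demand signal:

```python
def desired_cluster_count(pending_jobs, minimum, maximum):
    """Clamp the cluster pool for one job profile between the configured
    _MINIMUM and _MAXIMUM thresholds, scaling with pending demand."""
    # Never fall below the warm minimum (keeps cycle time low) and
    # never exceed the configured ceiling (caps cost).
    return max(minimum, min(maximum, pending_jobs))
```

In a deployment, `minimum` and `maximum` would be read per profile from Parameter Store or DynamoDB rather than passed as literals.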

Job Orchestrator

As part of job orchestration, the remaining key questions to be answered are:

4. How to ensure no job request gets lost?

5. How to ensure critical jobs get the resources on priority?

For the fourth requirement, one approach is a queue-based architecture to manage the incoming demand; for the fifth, a priority queue should be implemented. The Job Profile metadata would have Priority as one of its required attributes.
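The priority queue can be sketched with Python's heapq, ordering on the Priority attribute from the Job Profile (the job dictionary shape is an assumption):

```python
import heapq
import itertools

class PriorityJobQueue:
    """Sketch of the priority queue: a lower 'priority' value means a more
    critical job. The counter preserves FIFO order among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, job):
        heapq.heappush(self._heap, (job["priority"], next(self._counter), job))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

For guaranteed delivery across restarts, the in-memory heap would be replaced by a durable queue (for example, one SQS queue per priority level), but the ordering logic is the same.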

As part of the self-serve architecture, the Job Orchestrator should expose APIs to capture and track the job lifecycle. Once a job request is received, it moves through the lifecycle states shown in the state model below.
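A plausible encoding of such a state model is a transition table checked by the orchestrator before every state change. The state names and transitions below are assumptions sketched from the flow described earlier, not the actual model:

```python
# Hypothetical job lifecycle states and allowed transitions.
JOB_TRANSITIONS = {
    "RECEIVED":  {"VALIDATED", "REJECTED"},
    "VALIDATED": {"QUEUED"},
    "QUEUED":    {"ALLOCATED"},
    "ALLOCATED": {"RUNNING"},
    "RUNNING":   {"SUCCEEDED", "FAILED"},
    "FAILED":    {"QUEUED"},   # retry path: no job request gets lost
}

def can_transition(current, target):
    """Return True if the job may move from `current` to `target`."""
    return target in JOB_TRANSITIONS.get(current, set())
```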

This is a best-fit use case for AWS Step Functions, as all the actions can be implemented using Lambdas. Using Lambdas keeps the action taken at each step nimble and efficient and addresses the SLA compliance concerns. AWS Step Functions also eliminates the development of additional orchestration code, reducing TCO.
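An orchestration of this shape would be expressed in the Amazon States Language, wiring each step to a Lambda task. The state names, Lambda function names, and truncated ARNs below are illustrative assumptions:

```python
import json

# Hypothetical ASL definition for the orchestration flow sketched above.
definition = {
    "Comment": "Job orchestration sketch",
    "StartAt": "ValidateJob",
    "States": {
        "ValidateJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:validate-job",
            "Next": "FindFreeCluster",
        },
        "FindFreeCluster": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:find-free-cluster",
            "Next": "ClusterAvailable",
        },
        "ClusterAvailable": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.clusterFound",
                         "BooleanEquals": True,
                         "Next": "RunJob"}],
            "Default": "QueueForRetry",
        },
        "QueueForRetry": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:queue-job",
            "End": True,
        },
        "RunJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:...:function:run-job",
            "End": True,
        },
    },
}

# The definition is passed to CreateStateMachine as a JSON string.
state_machine_json = json.dumps(definition)
```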

Below is the high-level solution view of Job Orchestrator:

In summary, with the new 'Big Data/Analytics-as-a-Service' use cases, dynamic management of AWS EMR will be key. Along with a job orchestration layer, the above approach provides a potential solution that addresses variability of demand and variation of workloads while still conforming to service levels. Adopting this approach can lower TCO by at least 30%, while the risk of missing SLAs becomes negligible, as AWS provides 99.9%+ availability across its resources.

AWS's highly consumable services, the availability of rich resource lifecycle management APIs and CloudFormation templates, combined with IBM's deep expertise on the AWS cloud platform from architecture to build (https://partners.amazonaws.com/partners/001E000001IlLnmIAF/IBM), made it possible to deliver this platform in just about 12 weeks.

