Migrating Legacy ETL Pipelines to Distributed Spark Ecosystems: Challenges & Strategies

Pavan Kumar Mantha

doi:10.71097/IJAIDR.v10.i1.1677

Author(s)	Pavan Kumar Mantha
Country	United States
Abstract	Between 2015 and 2019, aggressive growth in structured and semi-structured enterprise data significantly changed the landscape of large-scale data engineering. Conventional Extract-Transform-Load (ETL) systems, such as SAS batches, mainframe COBOL ETL programs, Informatica mappings, Ab Initio graphs, and Oracle PL/SQL, were originally designed for vertically scaled systems with deterministic workloads and limited concurrency. As data volumes in the financial, retail, telecommunications, and even government sectors grew to tens of terabytes per day, the monolithic nature of these ETL environments led to structural performance bottlenecks, rigid schema dependence, and rapidly rising operating costs driven by licensing models and special-purpose hardware. At the same time, the release of Apache Spark 2.x (2.2 through 2.4) introduced a data processing paradigm offering horizontally scalable computation on commodity clusters. In response, businesses launched large-scale modernization programs to migrate historical ETL workloads onto cloud-native, Spark-based data platforms. This paper gives a comprehensive account of the challenges, strategies, and architectural changes involved in migrating legacy ETL pipelines to distributed Spark architectures. It focuses on the 2018-2019 enterprise landscape, in which Hadoop 2.x-based YARN clusters, Hive Metastore, Kafka ingestion layers, and S3/ADLS/GCS storage backends were at the heart of new big data frameworks. The study identifies the main driving forces as cost optimization, scalability requirements, unified compute, and the need to deliver near-real-time analytics with Spark Structured Streaming.
We discuss several serious migration challenges: (a) skill gaps as legacy developers move from SAS, COBOL, or PL/SQL to Scala, PySpark, and distributed-processing concepts; (b) code translation issues when converting deterministic ETL logic into equivalent Spark DataFrame or Spark SQL transformations; (c) data fidelity when moving data between parallel systems; (d) cluster performance tuning in the face of shuffle, skew, and partitioning; and (e) orchestration migration from Control-M, TWS, and Autosys to Airflow. The paper proposes a stepwise best-practice approach to migration comprising workload assessment, ingestion architecture standardization, reusable code libraries, a data quality framework, and performance optimization techniques. It also presents a phased cutover model based on dual-run validation and golden dataset comparison. Several case studies show tangible gains: an 8-hour SAS credit-risk batch reduced to 50 minutes with Spark; mainframe ETL replaced by real-time Kafka-Spark ingestion; and a multi-source customer service workload modernized into partitioned, Parquet-based Spark SQL models. Overall, the results show that Spark-based ecosystems can improve scalability, reduce operating costs, improve reliability, and support advanced machine-learning and real-time analytics applications. The work serves as a reference for organizations undertaking ETL modernization projects and contributes to the body of knowledge on distributed data engineering practice in pre-2020 enterprise computing.
Keywords	Apache Spark 2.x, ETL Migration, Hadoop 2.x, SAS Modernization, Data Engineering, Structured Streaming, Airflow, YARN Optimization, Legacy ETL, Big Data Architecture.
Field	Engineering
Published In	Volume 10, Issue 1, January-June 2019
Published On	2019-06-07
DOI	https://doi.org/10.71097/IJAIDR.v10.i1.1677
Short DOI	https://doi.org/hbm797
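
The abstract mentions a phased cutover based on dual-run validation and golden dataset comparison. The paper's own procedure is not reproduced on this page, so the following is only a minimal sketch of the general idea, written in plain Python rather than Spark so that it runs anywhere. The names `DualRunReport`, `compare_runs`, the `id` key column, and the sample rows are all hypothetical, and the numeric tolerance is an assumption.

```python
from dataclasses import dataclass, field
from math import isclose


@dataclass
class DualRunReport:
    """Differences between the legacy (golden) output and the new pipeline output."""
    missing_in_new: list = field(default_factory=list)  # keys only in golden
    extra_in_new: list = field(default_factory=list)    # keys only in new
    mismatched: dict = field(default_factory=dict)      # key -> differing columns

    @property
    def passed(self):
        return not (self.missing_in_new or self.extra_in_new or self.mismatched)


def _values_match(a, b, rel_tol):
    """Compare numbers with a relative tolerance, everything else exactly."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if is_number(a) and is_number(b):
        return isclose(a, b, rel_tol=rel_tol)
    return a == b


def compare_runs(golden_rows, new_rows, key, rel_tol=1e-9):
    """Compare two row sets (lists of dicts) keyed by a unique column."""
    golden = {row[key]: row for row in golden_rows}
    new = {row[key]: row for row in new_rows}

    report = DualRunReport()
    report.missing_in_new = sorted(golden.keys() - new.keys())
    report.extra_in_new = sorted(new.keys() - golden.keys())

    for k in sorted(golden.keys() & new.keys()):
        # Check every column present on either side; a missing column reads as None.
        columns = sorted(set(golden[k]) | set(new[k]))
        diffs = [c for c in columns
                 if not _values_match(golden[k].get(c), new[k].get(c), rel_tol)]
        if diffs:
            report.mismatched[k] = diffs
    return report


# Hypothetical sample data: legacy batch output versus new pipeline output.
golden = [
    {"id": 1, "exposure": 100.0},
    {"id": 2, "exposure": 250.5},
    {"id": 3, "exposure": 75.0},
]
new = [
    {"id": 1, "exposure": 100.0},
    {"id": 2, "exposure": 250.9},
    {"id": 4, "exposure": 10.0},
]

report = compare_runs(golden, new, key="id")
print(report.passed, report.missing_in_new, report.extra_in_new, report.mismatched)
```

In a real dual run, the two row sets would come from the legacy job and the Spark job processing the same input window, and cutover would proceed only once the report passes for an agreed number of consecutive runs.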

All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.

Journal of Advances in Developmental Research

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal