AWS Data Pipeline Architecture for Machine Learning and Data Analytics Projects

Nhat Tran
5 min read · Jul 18, 2021


Welcome to my new blog. Today I will share my experience and a sample architecture for building a data pipeline for machine learning and data analytics projects. My role in this project is AI engineer and AWS architect, and I work with my team to build the data pipeline and the production architecture for the system. We started by building everything from scratch with a small dataset; the solution was a monolithic architecture that processed data on an EC2 instance. The challenge came when customers started paying to use the system: we hit a lot of problems from the prototype phase that blocked the production release, most of them being:
- Data processing problems: data comes in large volumes, the system cannot work fast at that scale, and the manual data processing workflow is not fast enough (it takes a few days to complete the pipeline for a new dataset)
- Manual human mistakes: engineers and scientists process new data by hand, with a lot of missing data, so the data quality cannot be trusted for production
- Security risk with production data: this is a common challenge when working on any machine learning and data analytics project, because the development team works with the real dataset for exploration

With these challenges, new requirements came to the engineering team to improve the data processing pipeline. We started with the idea of building an MLOps culture for our team, beginning with the data pipeline as the foundation of the data source for machine learning and data analytics.

Data pipeline architecture considerations

  1. Performance efficiency
    The big challenge of data processing in our system is that we run processing jobs without knowing in advance how much compute and storage is enough for the job, and that requirement grows with the runtime. Very complex image-processing jobs take a really long time and need GPUs and a lot of CPU. → Scalability and distributed computing for parallel jobs are required to resolve this
  2. Security
    Our data includes customer-sensitive information, so we need to protect it very carefully: we keep all data in a private network and separate multiple environments to protect production data and limit user access to it. → We trust machines and use role-based access control so machines can run automation jobs without human involvement
  3. Cost efficiency
    Processing a large-scale dataset means we need to spend a lot of money on it. Cost saving always has to come together with a good solution. We chose Fargate Spot to run the system with the most cost saving in our situation, and we also use an AWS Savings Plan to get a good offer from AWS → No idle resources on standby, and pay as we go
  4. Resiliency
    We run parallel jobs and every job is independent of the others. A job can be terminated and restarted at any time, and it is automatically re-run, which makes the system fault-tolerant and improves its resilience. With this high resilience it is easy for us to adopt Spot instances in the production environment for large-scale data processing at the best cost. → We use a queue and a batch job management system to make sure all jobs are distributed and highly resilient (see the sketch after this list).
  5. Flexibility
    When everything is in the cloud, developers and scientists are challenged by the development environment; more than once a job and its code ran smoothly locally but did not work on the cloud 💥
    We want a solution that is easy for both development and deployment, scales the system easily, and reduces the impact of inconsistent environments. → We use containers for most of the processing jobs, with a clean architecture and structure that makes changes and new requirements easy to adopt.
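
To make the cost and resiliency points above concrete, here is a minimal sketch in Python (boto3) of how a processing job can be submitted to an AWS Batch queue with an automatic retry strategy, assuming the queue is backed by a Fargate Spot compute environment. The region, queue, job-definition, and command names are placeholders for illustration, not the real ones from our project.

```python
# Minimal sketch: submit a processing job to an AWS Batch queue with retries.
# Assumes the queue is backed by a managed FARGATE_SPOT compute environment,
# so interrupted Spot tasks are simply re-queued and re-run.
import boto3

batch = boto3.client("batch", region_name="ap-northeast-1")  # placeholder region

response = batch.submit_job(
    jobName="process-new-dataset",            # placeholder job name
    jobQueue="data-pipeline-spot-queue",      # placeholder queue name
    jobDefinition="image-processing-job:1",   # placeholder job definition
    containerOverrides={
        "command": ["python", "process.py", "--dataset", "batch-2021-07"],
    },
    # Re-run automatically (up to 3 attempts) if the Spot task is interrupted
    # or the job fails for a transient reason.
    retryStrategy={"attempts": 3},
)
print("Submitted job:", response["jobId"])
```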

AWS Architecture for Data Pipeline

We architected the system on the AWS cloud and use managed AWS services as much as possible to reduce operational effort and develop the system quickly.

Roles and responsibilities of the services/components

  1. GitOps:
    This is the Git version control system and CI/CD pipeline that helps build sources and containers and deliver jobs with more agility and higher quality control through automated tests. Every change affects the system, so we control the system with code version control and trigger changes from Git events
  2. AWS Airflow:
    We use the AWS managed Airflow service (Amazon MWAA) to deploy Airflow. Airflow is a job scheduling and orchestration management system. We can easily develop dynamic, complex job pipelines with Python code, and one more thing: a nice UI for more visibility (see the DAG sketch after this list).
  3. AWS Batch:
    For the jobs built as Docker containers, we use AWS Batch to manage job queues and machine runners for scalability. AWS Batch balances jobs and machines through its queue service: it can prioritize important jobs, buffer the queue when too many jobs start at once to reduce the load on the database, etc. AWS Batch plays a different role than Airflow: it provisions the resources to run the jobs so that jobs can run autonomously and with high resiliency on Spot instance machines.
  4. AWS Fargate
    This is the AWS managed service for running Docker containers without managing dedicated VMs. Fargate is a serverless container service, so we can start a new container quickly and scale faster compared with EC2. Security is a strong point of Fargate because each task runs independently of other jobs, which improves security when a job needs a dedicated container. We need a solution that scales fast and is cheap, with Fargate Spot pricing and Savings Plan support (see the job-definition sketch after this list).
  5. AWS EFS
    We need to store large-scale data for processing, and multiple teams need access to the raw and processed data for exploration. With EFS, we don't need to scale dedicated storage such as EBS, and multiple machines can mount the same EFS file system. EFS makes this easy and saves a lot of data-transfer effort compared with S3.
  6. AWS S3
    We use S3 to store large and complex datasets for the long term, which is cheaper than EFS. S3 also connects more easily to web applications over HTTP connections, and access policies are easier to control on S3 than on EFS.
  7. AWS RDS
    AWS RDS is the application storage for structured SQL data. The jobs connect to the primary and standby RDS instances to collect data for processing.
  8. AWS Sagemaker
    SageMaker helps data scientists and engineers easily explore datasets with Jupyter notebooks and makes it easier to train machine learning models with its built-in algorithms.
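
To show how Airflow and AWS Batch work together in this pipeline, here is a minimal DAG sketch in Python, assuming Airflow 2.x as provided by MWAA. It uses a plain PythonOperator with boto3 to submit a Batch job on a daily schedule; the Amazon provider package also ships a dedicated Batch operator you could use instead. All names (DAG id, region, queue, job definition) are placeholders for illustration.

```python
# Minimal Airflow DAG sketch: submit an AWS Batch job once a day.
# Queue and job-definition names are placeholders, not real resources.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def submit_processing_job(**context):
    """Submit the containerized processing job to AWS Batch."""
    batch = boto3.client("batch", region_name="ap-northeast-1")  # placeholder region
    response = batch.submit_job(
        jobName=f"daily-processing-{context['ds_nodash']}",
        jobQueue="data-pipeline-spot-queue",      # placeholder queue name
        jobDefinition="image-processing-job:1",   # placeholder job definition
        retryStrategy={"attempts": 3},
    )
    print("Submitted Batch job:", response["jobId"])


with DAG(
    dag_id="daily_data_processing",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_job = PythonOperator(
        task_id="submit_batch_job",
        python_callable=submit_processing_job,
    )
```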

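And here is a hedged sketch of the Fargate side: registering a Batch job definition that runs on Fargate and mounts the shared EFS file system, so every job sees the same data. The image URI, role ARNs, file-system ID, and region are placeholders, and the exact parameters may differ in your account.

```python
# Sketch: register an AWS Batch job definition that runs on Fargate and
# mounts a shared EFS file system. All ARNs, IDs, and names are placeholders.
import boto3

batch = boto3.client("batch", region_name="ap-northeast-1")  # placeholder region

batch.register_job_definition(
    jobDefinitionName="image-processing-job",
    type="container",
    platformCapabilities=["FARGATE"],
    containerProperties={
        "image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/processing:latest",
        "command": ["python", "process.py"],
        "executionRoleArn": "arn:aws:iam::123456789012:role/batch-execution-role",
        "jobRoleArn": "arn:aws:iam::123456789012:role/batch-job-role",
        # Stay within the Fargate per-task limits noted below (4 vCPUs / 30 GB).
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "30720"},
        ],
        # Mount the shared EFS file system so every job reads/writes the same data.
        "volumes": [
            {
                "name": "shared-data",
                "efsVolumeConfiguration": {"fileSystemId": "fs-12345678"},
            }
        ],
        "mountPoints": [
            {"sourceVolume": "shared-data", "containerPath": "/mnt/data"}
        ],
        "networkConfiguration": {"assignPublicIp": "DISABLED"},
        "fargatePlatformConfiguration": {"platformVersion": "1.4.0"},
    },
    retryStrategy={"attempts": 3},
)
```
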
Architecture Limitation
- The architecture has become more complex compared with the legacy architecture
- AWS Fargate limits the number of vCPUs and the memory for each task (max: 4 vCPUs, 30 GB memory). You need to keep each job small enough

Conclusion

That is all I want to share with you about my choices, the reasons behind them, and the motivation. I hope you got something from this post; leave me a comment below if you have a suggestion or concern.
… My next post will be a basic template you can get hands-on with in a development environment with Docker Compose, or something more advanced on K8s (my interest).

Take care guys. Thank you!


Nhat Tran

Working as an AI Engineer at MTI Technology. Focused on software development, cloud solutions and AI technologies. Passionate about book reading, writing and sharing.