AWS Data Engineering
Course Overview
This comprehensive AWS Data Engineering course equips learners with the skills to design, build, and optimize data pipelines using AWS cloud services. Focusing on S3, AWS Glue, Athena, Redshift, and Databricks, the course covers data ingestion, transformation, storage, querying, and analytics. Through hands-on labs, real-world projects, and industry-aligned use cases, participants will learn to architect scalable data solutions and prepare for roles in data engineering.
Level: Intermediate (basic knowledge of Python, SQL, and cloud concepts recommended)
Delivery: Online/in-person with live instruction, labs, and projects
Learning Outcomes:
- Master AWS services for data engineering workflows.
- Build end-to-end data pipelines from ingestion to analytics.
- Optimize performance and cost in cloud data architectures.
- Gain hands-on experience with real-world datasets and scenarios.
Section 1: Introduction to AWS and Data Engineering
Objective: Understand the fundamentals of data engineering and AWS cloud services.
Duration: 4 hours
Topics:
- Overview of data engineering: Roles, responsibilities, and tools
- Introduction to AWS: Regions, Availability Zones, and IAM
- AWS data engineering ecosystem: S3, Glue, Athena, Redshift, Databricks
- Setting up an AWS account and CLI/SDK for data engineering
- Best practices for security and cost management
Hands-On Lab:
- Create an AWS account, configure IAM roles, and set up CloudTrail for monitoring.
- Install and configure AWS CLI for programmatic access.
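Example (Python):
Once the CLI is configured, programmatic access can be sanity-checked from Python with boto3 (the AWS SDK). This is a minimal sketch; the profile name and region below are placeholders, not course requirements.
```python
# Sanity-check programmatic access after running `aws configure`.
import boto3

session = boto3.Session(profile_name="default", region_name="us-east-1")

# STS reports which account and IAM identity the credentials resolve to.
identity = session.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Identity ARN:", identity["Arn"])

# Listing S3 buckets doubles as a quick permissions check.
for bucket in session.client("s3").list_buckets()["Buckets"]:
    print("Bucket:", bucket["Name"])
```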
Section 2: Amazon S3 for Data Storage
Objective: Learn to use S3 as the foundation for data lakes and storage.
Duration: 6 hours
Topics:
- S3 fundamentals: Buckets, objects, and storage classes (Standard, Glacier, Intelligent-Tiering)
- S3 security: IAM policies, bucket policies, and encryption (SSE, KMS)
- Data organization: Partitioning, prefixing, and lifecycle policies
- S3 event notifications and integration with other AWS services
- Performance optimization: Multipart uploads and transfer acceleration
Hands-On Lab:
- Create an S3 bucket, upload sample datasets (e.g., CSV, JSON), and configure lifecycle rules.
- Implement bucket policies to restrict access and enable server-side encryption.
- Set up S3 event notifications to trigger a Lambda function.
Project:
- Design an S3-based data lake structure for a retail dataset with partitioning by year/month.
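Example (Python):
A minimal boto3 sketch of the lab and project steps above: create a bucket, upload a file under a year/month prefix, add a lifecycle rule, and enable default encryption. The bucket name, file, and prefixes are illustrative placeholders.
```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "retail-data-lake-demo"  # placeholder bucket name

# In us-east-1 no CreateBucketConfiguration is needed; other regions require one.
s3.create_bucket(Bucket=bucket)

# Land raw data under a Hive-style year/month prefix for later partitioned queries.
s3.upload_file("sales.csv", bucket, "raw/sales/year=2024/month=01/sales.csv")

# Lifecycle rule: archive raw objects to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)

# Default server-side encryption (SSE-S3); a KMS key could be used instead.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)
```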
Section 3: AWS Glue for ETL and Data Cataloging
Objective: Build ETL pipelines and manage metadata using AWS Glue.
Duration: 8 hours
Topics:
- AWS Glue overview: Components (Crawlers, Jobs, Triggers)
- Glue Data Catalog: Metadata management for S3 and other sources
- Writing ETL jobs: PySpark, Python, and Glue Studio
- Glue crawlers: Auto-discovering schema and populating the Data Catalog
- Scheduling and monitoring Glue jobs
- Optimizing Glue jobs: Partitioning, pushdown predicates, and job bookmarks
Hands-On Lab:
- Create a Glue crawler to catalog an S3 dataset (e.g., sales data).
- Write a PySpark Glue job to transform CSV data into Parquet format.
- Schedule a Glue job using triggers and monitor execution in CloudWatch.
Project:
- Build an ETL pipeline to clean and transform a raw customer dataset in S3, storing results in a curated zone.
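Example (PySpark/Glue):
A minimal sketch of the kind of Glue job written in this section's lab: read the table a crawler registered in the Data Catalog and write it back to S3 as partitioned Parquet. The database, table, and S3 path names are placeholders.
```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV table the crawler cataloged from the raw zone.
raw_sales = glue_context.create_dynamic_frame.from_catalog(
    database="retail_db", table_name="raw_sales"
)

# Write it to the curated zone as Parquet, partitioned by year and month.
glue_context.write_dynamic_frame.from_options(
    frame=raw_sales,
    connection_type="s3",
    connection_options={
        "path": "s3://retail-data-lake-demo/curated/sales/",
        "partitionKeys": ["year", "month"],
    },
    format="parquet",
)

job.commit()
```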
Section 4: Amazon Athena for Serverless Querying
Objective: Query data lakes using Athena’s serverless SQL engine.
Duration: 6 hours
Topics:
- Athena overview: Use cases and architecture
- Querying S3 data with Athena: SQL syntax and best practices
- Integration with Glue Data Catalog for schema-on-read
- Partitioning and compression for cost and performance optimization
- Workgroups, query execution, and cost management
- Visualizing Athena results with QuickSight (optional integration)
Hands-On Lab:
- Query a partitioned S3 dataset using Athena and Glue Data Catalog.
- Optimize query performance by adding partitions and converting data to Parquet.
- Set up a workgroup and monitor query costs in the Athena console.
Project:
- Analyze a sales dataset in S3 using Athena to generate insights (e.g., top-selling products by region).
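Example (Python):
The project's "top-selling products by region" question can also be run programmatically with boto3. This is a sketch; the database, table, workgroup, and results location below are placeholders.
```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT region, product, SUM(amount) AS total_sales
FROM curated_sales
WHERE year = '2024' AND month = '01'
GROUP BY region, product
ORDER BY total_sales DESC
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "retail_db"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://retail-data-lake-demo/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```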
Section 5: Amazon Redshift for Data Warehousing
Objective: Design and manage data warehouses with Redshift for analytics.
Duration: 8 hours
Topics:
- Redshift fundamentals: Clusters, nodes, and distribution styles (KEY, EVEN, ALL)
- Data loading: COPY command, S3 integration, and Redshift Spectrum
- Table design: Sort keys, distribution keys, and compression
- Query optimization: Workload Management (WLM) and query tuning
- Redshift Spectrum: Querying external S3 data without loading
- Backup, restore, and scaling Redshift clusters
Hands-On Lab:
- Create a Redshift cluster and load data from S3 using the COPY command.
- Design a star schema for a sales dataset and apply distribution/sort keys.
- Query external S3 data using Redshift Spectrum and compare performance.
Project:
- Build a data warehouse in Redshift for a retail dataset, creating dashboards with sample SQL queries.
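Example (Python):
A sketch of the lab's table design and COPY load using the Redshift Data API (boto3 "redshift-data" client). The cluster, database, user, IAM role, and S3 path are placeholders.
```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Fact table with explicit distribution and sort keys.
create_sql = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id    BIGINT,
    product_id INT,
    region     VARCHAR(32),
    sale_date  DATE,
    amount     DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_date);
"""

# Bulk-load the curated Parquet files from S3.
copy_sql = """
COPY sales_fact
FROM 's3://retail-data-lake-demo/curated/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;
"""

# Run both statements in order as one batch.
response = rsd.batch_execute_statement(
    ClusterIdentifier="retail-dw",
    Database="dev",
    DbUser="awsuser",
    Sqls=[create_sql, copy_sql],
)
# Progress can be polled with rsd.describe_statement(Id=response["Id"]).
print("Submitted batch:", response["Id"])
```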
Section 6: Databricks on AWS for Advanced Analytics
Objective: Use Databricks on AWS for scalable data processing and advanced analytics.
Duration: 8 hours
Topics:
- Databricks overview: Architecture and integration with AWS (S3, Redshift)
- Setting up a Databricks workspace on AWS
- PySpark fundamentals: DataFrames, transformations, and actions
- Delta Lake: ACID transactions, versioning, and time travel
- Collaborative notebooks: Sharing and visualizing data
- Performance optimization: Caching, partitioning, and cluster sizing
Hands-On Lab:
- Configure a Databricks cluster and mount an S3 bucket.
- Create a Delta table from an S3 dataset and perform upserts.
- Build a notebook to process and visualize a dataset using PySpark and Databricks visualizations.
Project:
- Develop a Databricks pipeline to process streaming data from S3, storing results in Delta Lake.
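Example (PySpark on Databricks):
A minimal sketch of the lab's Delta upsert: merge newly arrived S3 records into an existing Delta table, then read an earlier version with time travel. The S3 paths and key column are placeholders.
```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` is already defined; this keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Newly arrived records from the raw zone.
updates = spark.read.json("s3://retail-data-lake-demo/raw/customers/2024/01/")

# Existing Delta table in the curated zone.
customers = DeltaTable.forPath(spark, "s3://retail-data-lake-demo/delta/customers/")

# Upsert: update rows whose customer_id matches, insert everything else.
(
    customers.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as of an earlier version.
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://retail-data-lake-demo/delta/customers/")
)
previous.show(5)
```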
Section 7: End-to-End Data Pipeline and Architecture
Objective: Integrate AWS services to build a scalable data pipeline.
Duration: 6 hours
Topics:
- Medallion architecture: Bronze, Silver, Gold layers
- Orchestrating pipelines: AWS Step Functions, Lambda, and EventBridge
- Real-time vs. batch processing: Use cases and tools
- Monitoring and logging: CloudWatch, Glue job metrics, and Databricks alerts
- Cost optimization: Right-sizing clusters, spot instances, and storage tiers
- Best practices for production-ready pipelines
Hands-On Lab:
- Build a pipeline using Glue to ingest data, Databricks to transform it, and Redshift to store curated data.
- Query the final dataset with Athena and visualize results.
- Set up CloudWatch alarms for pipeline monitoring.
Project:
- Architect an end-to-end pipeline for a customer analytics use case, from raw S3 data to a Redshift dashboard.
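Example (Python):
One common orchestration pattern from this section is a Lambda function that starts the Glue ingestion job when a new object lands in S3 (via an S3 event notification or an EventBridge rule). This is a sketch; the Glue job name and argument key are placeholders.
```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 event notifications deliver the new object's bucket and key in Records.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Kick off the Glue job, passing the new object's location as a job argument.
    run = glue.start_job_run(
        JobName="ingest-raw-sales",
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": run["JobRunId"]}
```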
Section 8: Capstone Project and Career Preparation
Objective: Apply skills to a real-world project and prepare for data engineering roles.
Duration: 6 hours
Topics:
- Review of key concepts and tools
- Building a portfolio-worthy project
- Career preparation: Resume building, LinkedIn optimization, and interview tips
- Common interview questions: PySpark, SQL, and AWS architecture
- Certification paths: AWS Certified Data Analytics – Specialty, Databricks Certified Data Engineer Associate
Capstone Project:
- Design and implement a complete data pipeline for a fictional e-commerce company:
- Ingest raw data into S3.
- Use Glue to catalog and transform data.
- Store curated data in Redshift and Delta Lake.
- Query with Athena and visualize insights.
- Document the architecture and present findings.
Career Prep:
- Mock interview with PySpark coding and AWS architecture questions.
- Create a GitHub repository with project code and documentation.
Additional Resources
- Tools Covered: AWS S3, AWS Glue, Amazon Athena, Amazon Redshift, Databricks, PySpark, Delta Lake, AWS CLI, CloudWatch
- Sample Datasets: Retail sales, customer data, and streaming logs
- Supplementary Materials: Cheat sheets for PySpark, SQL, and AWS services; access to recorded sessions
- Community Support: Slack/Discord channel for peer collaboration and instructor Q&A
Assessment and Certification
- Quizzes: Weekly quizzes to reinforce concepts
- Labs and Projects: Graded hands-on assignments
- Capstone Evaluation: Assessed on pipeline functionality, optimization, and presentation
- Certificate: Awarded upon successful completion of the course and capstone project
Requirements for This Course
- Computer or mobile device
- Internet connection
- Paper and pencil
