Databricks Data Engineering
Course Overview
This Databricks Data Engineering course equips learners with the skills to design, build, and optimize scalable data pipelines using the Databricks platform. Covering PySpark, Delta Lake, Delta Live Tables, MLflow, and advanced Spark optimizations, the course focuses on hands-on learning through real-world projects and industry-aligned use cases. Participants will master big data processing, data lake management, and pipeline orchestration, preparing them for data engineering roles and certifications like Databricks Data Engineer Associate.
Level: Intermediate (Basic knowledge of Python, SQL, and data processing concepts recommended)
Delivery: Online/In-Person with live instruction, labs, and projects
Learning Outcomes:
- Build and optimize data pipelines using Databricks and PySpark.
- Implement Delta Lake for reliable, scalable data lakes.
- Orchestrate ETL workflows with Delta Live Tables and Databricks Jobs.
- Apply Spark optimizations (e.g., salting, static and dynamic partition pruning) for performance.
- Prepare for data engineering interviews with hands-on projects and PySpark coding practice.
Section 1: Introduction to Databricks and Data Engineering
Objective: Understand the Databricks platform and its role in data engineering.
Duration: 4 hours
Topics:
- Overview of data engineering: Roles, responsibilities, and modern tools
- Introduction to Databricks: Lakehouse architecture, workspaces, and clusters
- Databricks ecosystem: Notebooks, Delta Lake, MLflow, and Databricks SQL
- Setting up a Databricks environment: Community Edition or enterprise workspace
- Best practices: Cluster management, cost optimization, and security (RBAC, secret scopes)
Hands-On Lab:
- Create a Databricks workspace and configure a cluster with auto-scaling.
- Set up a secret scope for secure credential management.
- Run a sample PySpark notebook to explore the Databricks UI.
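The secret-scope and sample-notebook steps might look roughly like the sketch below; the scope and key names are placeholders, and dbutils, spark, and display are available by default in Databricks notebooks.
```python
# Read a credential from the secret scope created in the lab ("demo-scope" and
# "storage-key" are hypothetical names); the value is redacted in notebook output.
storage_key = dbutils.secrets.get(scope="demo-scope", key="storage-key")
# In practice this value would be passed into a cloud storage configuration.

# Quick sanity check that the cluster and Spark session are working.
df = spark.range(10).withColumnRenamed("id", "n")
display(df)  # Databricks-native table rendering
```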
Section 2: PySpark Fundamentals in Databricks
Objective: Master PySpark for data processing and transformation.
Duration: 6 hours
Topics:
- PySpark overview: SparkSession, DataFrames, and lazy evaluation
- Core DataFrame operations: Filtering, grouping, joining, and aggregations
- RDDs vs. DataFrames: Use cases and performance considerations
- Data I/O: Reading/writing CSV, Parquet, JSON, and Delta formats
- Spark SQL: Creating views, querying with SQL, and UDFs
- Schema management: Defining, inferring, and enforcing schemas
Hands-On Lab:
- Read a JSON dataset into a DataFrame with a defined schema.
- Perform transformations (e.g., filter, groupBy, join) on a retail dataset.
- Create a temporary view and query it with Spark SQL.
Project:
- Process a customer dataset in PySpark, generating a summary report of sales by region.
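A minimal sketch of how the lab and project pieces fit together, assuming a hypothetical retail JSON file and column names:
```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema and path; adjust to the dataset used in class.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("region", StringType()),
    StructField("amount", DoubleType()),
])

orders = spark.read.schema(schema).json("/FileStore/retail/orders.json")

# Core transformations: filter, then aggregate sales by region.
sales_by_region = (
    orders
    .filter(F.col("amount") > 0)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales"))
)

# Expose the result to Spark SQL via a temporary view.
sales_by_region.createOrReplaceTempView("sales_by_region")
spark.sql("SELECT region, total_sales FROM sales_by_region ORDER BY total_sales DESC").show()
```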
Section 3: Delta Lake for Scalable Data Lakes
Objective: Implement Delta Lake for reliable data management.
Duration: 6 hours
Topics:
- Delta Lake fundamentals: ACID transactions, schema enforcement, and storage
- Creating Delta tables: Write, update, and delete operations
- Advanced features: Time travel, versioning, and Z-order indexing
- Upserts and merges: Incremental updates and SCD Type 2
- Data quality: Enforcing constraints (NOT NULL, CHECK) on Delta tables
- Optimizing Delta tables: Compaction, vacuuming, and partitioning
Hands-On Lab:
- Create a Delta table from a Parquet dataset.
- Perform a merge operation to upsert new transaction data.
- Use time travel to query a previous version and vacuum old files.
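A minimal sketch of the merge, time-travel, and vacuum steps using the Delta Lake Python API; the paths and key column (txn_id) are placeholders:
```python
from delta.tables import DeltaTable

delta_path = "/FileStore/delta/transactions"  # hypothetical location

# Upsert: merge new transactions into the existing Delta table on the key column.
target = DeltaTable.forPath(spark, delta_path)
updates = spark.read.parquet("/FileStore/raw/transactions_new.parquet")

(target.alias("t")
    .merge(updates.alias("s"), "t.txn_id = s.txn_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
print(v0.count())

# Remove files older than the retention window (default 7 days / 168 hours).
target.vacuum(168)
```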
Project:
- Build a Delta Lake pipeline to ingest and update a product inventory dataset, enforcing data quality with constraints.
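For the data-quality requirement, a CHECK constraint can be added to the (hypothetical) inventory table; writes that violate it then fail instead of landing silently:
```python
# Add a data-quality constraint to a Delta table ("inventory" is a placeholder name).
spark.sql("""
    ALTER TABLE inventory
    ADD CONSTRAINT quantity_non_negative CHECK (quantity >= 0)
""")

# spark.sql("INSERT INTO inventory VALUES ('SKU-1', -5)")  # would now raise an error
```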
Section 4: Delta Live Tables for Simplified ETL
Objective: Streamline ETL pipelines with Delta Live Tables.
Duration: 6 hours
Topics:
- Delta Live Tables (DLT) overview: Declarative ETL and continuous processing
- Defining DLT pipelines: Python and SQL syntax
- Pipeline modes: Batch and streaming
- Data quality: Expectations and quarantine rules
- Monitoring DLT: Pipeline status, lineage, and alerts
- Integration with storage systems: File systems, cloud storage, and databases
Hands-On Lab:
- Create a DLT pipeline to process streaming log data from a file source.
- Apply expectations to filter invalid records and quarantine them.
- Monitor the pipeline and visualize lineage in the DLT UI.
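A sketch of a two-table DLT pipeline with expectations, assuming a placeholder log schema and file path; a quarantine table could be defined the same way by inverting the expectation conditions:
```python
import dlt
from pyspark.sql import functions as F

# Bronze: continuously ingest raw log files (path and schema are placeholders).
@dlt.table(comment="Raw web logs ingested as a stream")
def logs_bronze():
    return (spark.readStream
            .format("json")
            .schema("ts TIMESTAMP, user_id STRING, url STRING, status INT")
            .load("/FileStore/dlt/raw_logs/"))

# Silver: drop records that fail expectations; DLT records drop counts in the event log.
@dlt.table(comment="Validated logs")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")
@dlt.expect_or_drop("valid_status", "status BETWEEN 100 AND 599")
def logs_silver():
    return dlt.read_stream("logs_bronze").withColumn("date", F.to_date("ts"))
```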
Project:
- Develop a DLT pipeline to process real-time clickstream data, storing curated results in Delta Lake.
Section 5: Spark Performance Optimization
Objective: Optimize Spark jobs for performance and scalability.
Duration: 8 hours
Topics:
- Spark execution model: Stages, tasks, shuffles, and DAGs
- Caching and persistence: Memory vs. disk, when to cache
- Partition management: Repartition, coalesce, and optimal partition sizing
- Data skew mitigation: Salting (e.g., floor(rand() * salt_factor).cast("string"))
- Static and dynamic pruning: Partition pruning and predicate pushdown
- Adaptive Query Execution (AQE): Dynamic coalescing and skew handling
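The salting expression above might be applied to a skewed join as follows; sales and customers are hypothetical DataFrames sharing a customer_id key:
```python
from pyspark.sql import functions as F

salt_factor = 8  # number of salt buckets; tune to the observed skew

# Add a random salt to the skewed (large) side of the join...
sales_salted = sales.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("customer_id"),
                F.floor(F.rand() * salt_factor).cast("string"))
)

# ...and replicate the small dimension so every salt value has a matching row.
salts = spark.range(salt_factor).select(F.col("id").cast("string").alias("salt"))
customers_salted = customers.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("customer_id"), F.col("salt"))
)

# The join key is now spread across salt_factor partitions instead of one hot key.
joined = sales_salted.join(customers_salted, "salted_key")
```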
Hands-On Lab:
- Optimize a skewed join on a sales dataset using salting.
- Apply static pruning by partitioning data by FiscalYear and filtering.
- Enable AQE and measure performance improvements with caching.
Project:
- Optimize a PySpark job processing a large dataset (e.g., 50GB), applying salting, partitioning, and AQE, and document performance gains.
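A sketch of the AQE and static-pruning steps from the lab, with placeholder paths and a hypothetical sales DataFrame:
```python
from pyspark.sql import functions as F

# Adaptive Query Execution: coalesce shuffle partitions and handle skewed joins at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Static partition pruning: write partitioned by FiscalYear, then filter on the
# partition column so Spark reads only the matching directories.
(sales.write
    .mode("overwrite")
    .partitionBy("FiscalYear")
    .parquet("/FileStore/opt/sales_partitioned"))

fy2024 = (spark.read.parquet("/FileStore/opt/sales_partitioned")
          .filter(F.col("FiscalYear") == 2024))
fy2024.explain()  # the plan's PartitionFilters entry confirms pruning
```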
Section 6: Advanced Databricks Features and MLflow
Objective: Leverage advanced Databricks tools for analytics and ML.
Duration: 6 hours
Topics:
- Medallion architecture: Bronze, Silver, Gold layers for structured pipelines
- Structured Streaming: Real-time processing with Delta Lake
- MLflow: Tracking experiments, models, and runs
- Databricks SQL: Building dashboards and analytics queries
- Unity Catalog: Data governance, lineage, and access control
- Collaborative workflows: Notebook sharing, comments, and version control
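As a reference point for the Bronze layer of the medallion architecture, a minimal Structured Streaming write to Delta might look like this; the schema and paths are placeholders:
```python
# Bronze layer: append raw events to a Delta table as they arrive.
events = (spark.readStream
          .format("json")
          .schema("device_id STRING, temperature DOUBLE, event_time TIMESTAMP")
          .load("/FileStore/stream/raw_events/"))

query = (events.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/FileStore/stream/_checkpoints/bronze_events")
         .start("/FileStore/stream/bronze_events"))
```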
Hands-On Lab:
- Build a streaming pipeline using Structured Streaming to process event data.
- Use MLflow to track a simple regression model experiment.
- Create a Databricks SQL dashboard for a sales dataset.
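A minimal MLflow tracking sketch for the lab's regression experiment, using a synthetic scikit-learn dataset:
```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score

# Synthetic data stands in for the lab dataset.
X, y = make_regression(n_samples=500, n_features=3, noise=0.2, random_state=42)

with mlflow.start_run(run_name="simple-regression"):
    model = LinearRegression().fit(X, y)
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("r2", r2_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # model artifact appears under the run
```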
Project:
- Implement a medallion-based streaming pipeline for IoT sensor data, tracking experiments with MLflow and visualizing results in Databricks SQL.
Section 7: Pipeline Orchestration and Monitoring
Objective: Orchestrate and monitor Databricks workflows.
Duration: 6 hours
Topics:
- Databricks Workflows: Scheduling and managing jobs
- Error handling: Retries, alerts, and failover strategies
- Monitoring: Job logs, cluster metrics, and alerting
- Version control: Integrating notebooks with Git
- Best practices: Modularity, Data Vault modeling, and logging
Hands-On Lab:
- Create a Databricks Workflow to schedule a multi-step pipeline.
- Set up email alerts for job failures and monitor cluster utilization.
- Commit a notebook to a Git repository for version control.
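Databricks Workflows handles retries and alerts declaratively per task; purely for illustration, a driver notebook could wrap child notebooks in a simple retry loop like this (paths and retry counts are placeholders):
```python
import time

def run_with_retries(path, timeout_seconds=3600, max_retries=2, args=None):
    """Run a child notebook, retrying on failure before giving up."""
    for attempt in range(max_retries + 1):
        try:
            return dbutils.notebook.run(path, timeout_seconds, args or {})
        except Exception as err:
            if attempt == max_retries:
                raise
            print(f"{path} failed (attempt {attempt + 1}): {err}; retrying...")
            time.sleep(30)

run_with_retries("/Repos/demo/pipelines/01_ingest")
run_with_retries("/Repos/demo/pipelines/02_transform")
run_with_retries("/Repos/demo/pipelines/03_publish")
```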
Project:
- Build an orchestrated pipeline using Databricks Workflows, applying Data Vault modeling, and monitor performance with custom alerts.
Section 8: Capstone Project and Career Preparation
Objective: Apply skills to a real-world project and prepare for data engineering roles.
Duration: 6 hours
Topics:
- Review of key concepts: PySpark, Delta Lake, DLT, and optimizations
- Building a portfolio-worthy project with medallion architecture
- Career preparation: Resume building that highlights Databricks and Apache Spark expertise, LinkedIn optimization, and interview tips
- Common interview questions: PySpark coding (e.g., DataFrame operations, skew mitigation), Delta Lake, and pipeline design
- Certification paths: Databricks Certified Data Engineer Associate
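A typical interview-style exercise is "top N per group" with window functions; orders here is a hypothetical DataFrame with region, product, and revenue columns:
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank products by revenue within each region and keep the top three.
w = Window.partitionBy("region").orderBy(F.col("revenue").desc())

top3_per_region = (orders
                   .withColumn("rank", F.row_number().over(w))
                   .filter(F.col("rank") <= 3))
```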
Capstone Project:
- Design and implement a complete data pipeline for a fictional logistics company:
- Ingest raw data into a file system or cloud storage.
- Transform data using PySpark, Delta Lake, and Delta Live Tables, applying optimizations like salting and static pruning.
- Store curated data in Delta tables.
- Build a Databricks SQL dashboard for analytics.
- Document the architecture with Data Vault principles and present findings in a notebook.
Career Prep:
- Mock interview with PySpark coding (e.g., optimizing joins, handling skew) and Databricks architecture questions.
- Create a GitHub repository with project code, notebooks, and pipeline documentation, versioned using Git.
Additional Resources
- Tools Covered: Databricks, PySpark, Delta Lake, Delta Live Tables, MLflow, Databricks SQL, Unity Catalog, Databricks Workflows
- Sample Datasets: Retail sales, logistics data, IoT streams, and clickstream logs
- Supplementary Materials: Cheat sheets for PySpark, Delta Lake, and Spark optimizations; access to recorded sessions
- Community Support: Slack/Discord channel for peer collaboration and instructor Q&A
Assessment and Certification
- Quizzes: Weekly quizzes to reinforce concepts
- Labs and Projects: Graded hands-on assignments
- Capstone Evaluation: Assessed on pipeline functionality, optimization, and presentation
- Certificate: Awarded upon successful completion of the course and capstone project
Requirements for This Course
- Computer / Mobile
- Internet Connection
- Paper / Pencil
