Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi


Introduction

Uber has built a transactional data lake on Apache Hudi to improve data freshness across its ETL pipelines. Rather than recomputing all data on every run, modeled tables are updated through incremental processing, enabled by Apache Hudi's incremental data processing capabilities.

Challenges with Batch Processing

Traditional batch pipelines handle late-arriving data by recomputing entire tables or time windows several times a day. Incremental processing extends streaming semantics to batch pipelines: only new or changed data is processed, and results are updated incrementally. In a massive data lake this matters, because a full recompute would otherwise touch every affected partition whenever late data arrives.
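
To make this concrete, here is a minimal sketch of an incremental read in Spark; the table path, commit instant, and checkpoint handling are illustrative assumptions, not Uber's actual pipeline code. Hudi's incremental query type returns only the records committed after a given instant, so a scheduled batch run processes just the delta instead of the whole table.

```scala
import org.apache.spark.sql.SparkSession

object IncrementalReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-incremental-read")
      .getOrCreate()

    // Hypothetical source table and checkpoint; a real pipeline would persist
    // the last processed commit instant between runs.
    val basePath = "s3://warehouse/raw/trips"
    val lastProcessedInstant = "20230801000000"

    // Incremental query: only records committed after lastProcessedInstant.
    val delta = spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", lastProcessedInstant)
      .load(basePath)

    delta.createOrReplaceTempView("trips_delta")
    spark.sql("SELECT COUNT(*) FROM trips_delta").show()
  }
}
```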

Read and Join Strategies

Incremental ETL pipelines combine several read and join strategies on Apache Hudi tables: the primary source table is read incrementally, while the tables it joins against can be read as full snapshots. Snapshot reads are also required when backfilling one or more tables.
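
One common pattern, sketched below with hypothetical table paths and join keys, is to read the fact table incrementally and join the resulting delta against a snapshot read of a dimension table, so only the new fact rows are enriched on each run.

```scala
import org.apache.spark.sql.SparkSession

object IncrementalJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-incremental-join")
      .getOrCreate()

    // Incremental read of the fact table: only rows committed since the checkpoint.
    val tripsDelta = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20230801000000")
      .load("s3://warehouse/raw/trips")

    // Snapshot read of a dimension table: the latest state of every row.
    val cities = spark.read.format("hudi")
      .load("s3://warehouse/dim/cities")

    // Join the small delta against the full dimension snapshot.
    val enriched = tripsDelta.join(cities, Seq("city_id"), "left")
    enriched.show()
  }
}
```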

Backfill Strategies

Incremental data pipelines also need a way to backfill tables when business logic changes. At Uber, an Apache Spark ETL framework is used to manage and operate these pipelines at scale, with scheduling handled by Piper.
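
As a hedged sketch of what such a backfill could look like with the Spark DataSource API (paths, partition range, and column names are hypothetical), the affected partitions can be re-read as a snapshot, run through the updated logic, and written back with Hudi's insert_overwrite operation so the rest of the table is left untouched.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object BackfillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-backfill")
      .getOrCreate()

    // Snapshot read limited to the partitions affected by the logic change.
    val affected = spark.read.format("hudi")
      .load("s3://warehouse/raw/trips")
      .where("datestr BETWEEN '2023-07-01' AND '2023-07-31'")

    // Re-apply the (hypothetical) updated business logic.
    val recomputed = affected.withColumn("fare_usd", col("fare_cents") / 100.0)

    recomputed.write.format("hudi")
      .option("hoodie.table.name", "trips_model")
      .option("hoodie.datasource.write.recordkey.field", "trip_id")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.datasource.write.partitionpath.field", "datestr")
      // insert_overwrite replaces only the partitions present in the incoming
      // data, leaving all other partitions as-is.
      .option("hoodie.datasource.write.operation", "insert_overwrite")
      .mode(SaveMode.Append)
      .save("s3://warehouse/model/trips_model")
  }
}
```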

Performance and Cost Savings

Using Apache Hudi's DeltaStreamer for incremental reads and upserts showed large performance gains for Uber's batch ETL pipelines.

Conclusion

Apache Hudi and incremental ETL make it possible to read updates incrementally and run computation logic only on the delta changes, resulting in more efficient and cost-effective data processing.