Apache Hudi
Streaming data lakehouse platform with incremental processing and record-level updates.
About Apache Hudi
Apache Hudi brings incremental processing and record-level updates to the data lakehouse. Explore how it integrates with the agentic data stack ecosystem and supports autonomous data operations.
Key Features
- ACID transactions with record-level upserts and deletes on data lakes
- Two table types: Copy-on-Write (read-optimized) and Merge-on-Read (write-optimized)
- Incremental and snapshot queries plus Change Data Capture query support
- Built-in DeltaStreamer ingestion tool for Kafka, DFS, and database CDC sources
- Advanced indexing (bloom filters, HFile, bucket index) for fast record lookups
- Automatic file sizing, clustering, and compaction for performance optimization
- Multi-engine compatibility (Spark, Flink, Presto, Trino, Hive)
- Comprehensive admin CLI with 40+ commands for table management and diagnostics
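The upsert behavior in the feature list can be illustrated with a small sketch. This is plain Python, not the Hudi API: Hudi identifies each record by a record key and, when the same key arrives more than once, uses a "precombine" field (typically an event timestamp) to decide which version wins. The field names `id` and `ts` below are illustrative assumptions, not Hudi defaults.

```python
# Conceptual sketch of record-level upsert semantics (not the Hudi API).
# Records are dicts; "id" is the record key, "ts" is the precombine field.

def upsert(table, incoming, key="id", precombine="ts"):
    """Merge incoming records into table; the record with the
    highest precombine value wins for each key."""
    merged = {rec[key]: rec for rec in table}
    for rec in incoming:
        existing = merged.get(rec[key])
        if existing is None or rec[precombine] >= existing[precombine]:
            merged[rec[key]] = rec
    return sorted(merged.values(), key=lambda r: r[key])

table = [
    {"id": 1, "ts": 100, "val": "a"},
    {"id": 2, "ts": 100, "val": "b"},
]
incoming = [
    {"id": 2, "ts": 200, "val": "b2"},  # newer version of id=2 replaces the old row
    {"id": 3, "ts": 150, "val": "c"},   # brand-new record is inserted
]
print(upsert(table, incoming))
```

In a real Hudi table this merge happens inside the write path: Copy-on-Write rewrites the affected file on each upsert, while Merge-on-Read appends the change to a log file and merges at read or compaction time.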
Agent Integration
CLI — hudi-cli
Download the hudi-cli-bundle JAR and hudi-cli-with-bundle.sh script from Maven/GitHub
External Links
Apache Hudi GitHub
Main repository — upserts, deletes, and incremental processing on big data
hudi-rs (Rust/Python SDK)
Native Rust implementation with Python bindings for reading Hudi tables without Spark/JVM
CLI Documentation
Official docs for the Hudi CLI tool for table management, inspection, and maintenance
Python/Rust Quick Start
Getting started guide for the native Python binding provided by hudi-rs
Awesome Lakehouse Guide
Curated repo covering open table formats including Hudi architecture