Apache Hudi
Streaming data lakehouse platform with incremental processing and record-level updates.
About Apache Hudi
Apache Hudi brings incremental processing and record-level updates to the data lakehouse. Explore how it integrates with the agentic data stack ecosystem and supports autonomous data operations.
Key Features
- ACID transactions with record-level upserts and deletes on data lakes
- Two table types: Copy-on-Write (read-optimized) and Merge-on-Read (write-optimized)
- Incremental and snapshot queries plus Change Data Capture query support
- Built-in DeltaStreamer ingestion tool for Kafka, DFS, and database CDC sources
- Advanced indexing (bloom filters, HFile, bucket index) for fast record lookups
- Automatic file sizing, clustering, and compaction for performance optimization
- Multi-engine compatibility (Spark, Flink, Presto, Trino, Hive)
- Comprehensive admin CLI with 40+ commands for table management and diagnostics
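The upsert behavior in the feature list can be illustrated with a small sketch. This is plain Python, not the Hudi API: Hudi identifies each record by a record key and, when the same key arrives more than once, uses a "precombine" field (typically an event timestamp) to decide which version wins. The field names `id` and `ts` below are illustrative assumptions, not Hudi defaults.

```python
# Conceptual sketch of record-level upsert semantics (not the Hudi API).
# Records are dicts; "id" is the record key, "ts" is the precombine field.

def upsert(table, incoming, key="id", precombine="ts"):
    """Merge incoming records into table; the record with the
    highest precombine value wins for each key."""
    merged = {rec[key]: rec for rec in table}
    for rec in incoming:
        existing = merged.get(rec[key])
        if existing is None or rec[precombine] >= existing[precombine]:
            merged[rec[key]] = rec
    return sorted(merged.values(), key=lambda r: r[key])

table = [
    {"id": 1, "ts": 100, "val": "a"},
    {"id": 2, "ts": 100, "val": "b"},
]
incoming = [
    {"id": 2, "ts": 200, "val": "b2"},  # newer version of id=2 replaces the old row
    {"id": 3, "ts": 150, "val": "c"},   # brand-new record is inserted
]
print(upsert(table, incoming))
```

In a real Hudi table this merge happens inside the write path: Copy-on-Write rewrites the affected file on each upsert, while Merge-on-Read appends the change to a log file and merges at read or compaction time.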
Agent Integration
CLI — hudi-cli
Download the hudi-cli-bundle JAR and hudi-cli-with-bundle.sh script from Maven/GitHub
External Links
Apache Hudi GitHub
Main repository — upserts, deletes, and incremental processing on big data
hudi-rs (Rust/Python SDK)
Native Rust implementation with Python bindings for reading Hudi tables without Spark/JVM
CLI Documentation
Official docs for the Hudi CLI tool for table management, inspection, and maintenance
Python/Rust Quick Start
Getting started guide for the native Python binding provided by hudi-rs
Awesome Lakehouse Guide
Curated repo covering open table formats including Hudi architecture