Apache Spark
Unified analytics engine for large-scale data processing with SQL, streaming, and ML.
About Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing with SQL, streaming, and ML. Explore how it integrates with the agentic data stack ecosystem and supports autonomous data operations.
Key Features
- Unified analytics engine for batch processing, streaming, machine learning, and graph processing
- In-memory computation delivering up to 100x faster processing than disk-based MapReduce
- Multi-language APIs in Python, Scala, Java, and R, including a pandas-compatible API (pandas API on Spark)
- Spark SQL engine for structured data processing with full SQL support and DataFrame APIs
- Structured Streaming for incremental, fault-tolerant stream processing
- MLlib, a built-in machine learning library with classification, regression, and clustering algorithms
- Deploys on standalone clusters, Hadoop YARN, Kubernetes, or locally
- Spark Connect client-server architecture for remote connectivity
Agent Integration
MCP Server
kubeflow/mcp-apache-spark-history-server
CLI — spark-sql / spark-submit / pyspark
$ brew install apache-spark OR pip install pyspark
Agent Skills
External Links
Official Kubeflow MCP server enabling AI agents to analyze Spark job performance and identify bottlenecks
English SDK that compiles natural language instructions into PySpark DataFrames
MCP server for PySpark focused on query optimization using AI systems
Official ML Pipelines documentation for classification, regression, clustering, and feature engineering
CLI reference for spark-submit, the primary interface for launching Spark applications
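Since spark-submit is the primary interface for launching applications, a minimal invocation might look like the following. The script path, application name, and Kubernetes API server URL are placeholders, not values from this page:

```shell
# Run a hypothetical PySpark script locally on all available cores.
spark-submit --master "local[*]" my_job.py

# Submit the same script to a Kubernetes cluster (URL is a placeholder);
# local:// points at a path inside the driver pod's container image.
spark-submit \
  --master k8s://https://example-cluster:6443 \
  --deploy-mode cluster \
  --name my-job \
  local:///opt/app/my_job.py
```

The `--master` URL selects the cluster manager (standalone, YARN, Kubernetes, or local), matching the deployment options listed under Key Features.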