Apache Spark™ - Unified Engine for large-scale data analytics



Download
Libraries
SQL and DataFrames
Spark Streaming
MLlib (machine learning)
GraphX (graph)
Third-Party Projects
Documentation
Latest Release
Older Versions and Other Resources
Frequently Asked Questions
Examples
Community
Mailing Lists & Resources
Contributing to Spark
Improvement Proposals (SPIP)
Issue Tracker
Powered By
Project Committers
Project History
Privacy Policy
Developers
Useful Developer Tools
Versioning Policy
Release Process
Security
Apache Software Foundation
Apache Homepage
License
Sponsorship
Thanks
Security
Event
Unified engine for large-scale data analytics
Get Started
What is Apache Spark™?
Apache Spark™ is a multi-language engine for executing data engineering,
data science, and machine learning on single-node machines or clusters.
Simple. Fast. Scalable. Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
Python
SQL
Scala
Java
Run now
Install with 'pip'
$ pip install pyspark
$ pyspark
Use the official Docker image
$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
QuickStart
Machine Learning
Analytics & Data Science
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
from pyspark.ml.regression import RandomForestRegressor
# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])
# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)
# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)
# Fit the model to the training data
model = rf.fit(train_df)
# Generate predictions on the test dataset
model.transform(test_df).show()
df = spark.read.csv("accounts.csv", header=True)
# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")
# Generate summary statistics
filtered_df.summary().show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-sql
spark-sql>
SELECT
name.first AS first_name,
name.last AS last_name,
age
FROM json.`logs.json`
WHERE age > 21;
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
val df = spark.read.json("logs.json")
df.where("age > 21")
.select("name.first").show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
.select("name.first").show();
Run now
$ docker run -it --rm spark:r /opt/spark/bin/sparkR
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
The most widely-used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. Over 2,000 contributors to the open source project from industry and academia.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark™ is built on an advanced distributed SQL engine
for large-scale data
Adaptive Query Execution
Spark SQL adapts the execution plan at runtime, such as automatically setting the number of reducers and join algorithms.
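Adaptive Query Execution is controlled by Spark SQL configuration properties. A minimal sketch of a `spark-defaults.conf` fragment enabling AQE and its two most common optimizations (these keys exist in Spark 3.x; defaults may vary by version):

```
# Re-optimize query plans at runtime using runtime statistics
spark.sql.adaptive.enabled                      true
# Coalesce shuffle partitions to an appropriate number after each stage
spark.sql.adaptive.coalescePartitions.enabled   true
# Split skewed partitions to balance join workloads
spark.sql.adaptive.skewJoin.enabled             true
```

The same keys can be set per session with `spark.conf.set("spark.sql.adaptive.enabled", "true")`.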
Support for ANSI SQL
Use the same SQL you鈥檙e already comfortable with.
Structured and unstructured data
Spark SQL works on structured tables and unstructured data such as JSON or images.
TPC-DS 1TB No-Stats With vs. Without Adaptive Query Execution
Accelerates TPC-DS queries up to 8x
Join the community
Spark has a thriving open source community, with
contributors from around the globe building features, documentation and assisting other users.
Mailing list
Source code
News and events
How to contribute
Issue tracking
Committers
Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries. See guidance on use of Apache Spark trademarks. All other marks mentioned may be trademarks or registered trademarks of their respective owners. Copyright © 2018 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.