Apache Spark 2.3

The Databricks team has been busy.  They’ve recently announced Apache Spark 2.3 on Databricks:

Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone for Structured Streaming by introducing low-latency continuous processing and stream-to-stream joins; boosts PySpark by improving performance with pandas UDFs; and runs on Kubernetes clusters by providing native support for Apache Spark applications.

In addition to extending new functionality to SparkR, Python, MLlib, and GraphX, the release focuses on usability, stability, and refinement, resolving over 1400 tickets. Other salient features from Spark contributors include:

Anirudh Ramanathan and Palak Bathia also get into Kubernetes support in Spark 2.3:

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes 1.7+ cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks. Apache Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging. Best of all, it requires no changes or new installations on your Kubernetes cluster; simply create a container image and set up the right RBAC rolesfor your Spark Application and you’re all set.

Concretely, a native Spark Application in Kubernetes acts as a custom controller, which creates Kubernetes resources in response to requests made by the Spark scheduler. In contrast with deploying Apache Spark in Standalone Mode in Kubernetes, the native approach offers fine-grained management of Spark Applications, improved elasticity, and seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like Istio.

Stream to stream joins looks particularly interesting.

Related Posts

HDP 3.0 Released

Roni Fontaine and Saumitra Buragohain announce Hortonworks Data Platform version 3.0: Other additional capabilities include: Scalability and availability with NameNode federation, allowing customers to scale to thousands of nodes and a billion files. Higher availability with multiple name nodes and standby capabilities allow for the undisrupted, continuous cluster operations if a namenode goes down. Lower […]

Read More

Metacat: Federated Metadata Discovery

Ajoy Majumdar and Zhen Li walk us through Metacat: The core architecture of the big data platform at Netflix involves three key services. These are the execution service (Genie), the metadata service, and the event service. These ideas are not unique to Netflix, but rather a reflection of the architecture that we felt would be […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031