Press "Enter" to skip to content

Category: Spark

Subqueries In Spark 2.0

Davies Liu and Herman van Hövell discuss SQL subqueries in Apache Spark 2.0:

In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls and future expansions, and through a notebook, we will explore both the scalar and predicate type of subqueries, with short examples that you can try yourself.

A subquery is a query that is nested inside of another query. A subquery as a source (inside a SQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. And within scalar and predicate queries, there are uncorrelated scalar and correlated scalar queries and nested predicate queries respectively.

They also link to a Notebook which you can use to follow along.  If you’re interested in window functions, here are notes from Spark 1.4.
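
If you want a quick taste before opening the notebook, here is a minimal sketch of the subquery forms described above, written for a Spark 2.0 spark-shell; the tables, columns, and data are invented for illustration.

```scala
import spark.implicits._

// Two throwaway tables to query against (names and data are made up).
Seq((1, "widget", 10.0), (2, "gadget", 25.0)).toDF("id", "name", "price")
  .createOrReplaceTempView("products")
Seq((100, 1, 3), (101, 2, 1)).toDF("order_id", "product_id", "qty")
  .createOrReplaceTempView("orders")

// Uncorrelated scalar subquery: evaluated once, returns a single value.
spark.sql("""
  SELECT name, price FROM products
  WHERE price > (SELECT avg(price) FROM products)
""").show()

// Correlated scalar subquery: references the outer query and must be aggregated.
spark.sql("""
  SELECT p.name,
         (SELECT sum(o.qty) FROM orders o WHERE o.product_id = p.id) AS total_qty
  FROM products p
""").show()

// Predicate subqueries: IN and EXISTS in the WHERE clause.
spark.sql("SELECT name FROM products p WHERE p.id IN (SELECT product_id FROM orders)").show()
spark.sql("SELECT name FROM products p WHERE EXISTS (SELECT 1 FROM orders o WHERE o.product_id = p.id)").show()
```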

Analyze Fantasy Sports With Spark

Jordan Volz is back with part two of his series on fantasy sports analysis using Apache Spark:

We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in what ages players joined the league over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.

We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
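
As a loose illustration of that step (not Jordan’s actual code), here is one way the year-over-year deltas and the two key-value views might look in a spark-shell session; the case class, field names, and sample rows are all assumptions.

```scala
// A loose sketch of the step described above, suitable for a spark-shell session.
// The case class, field names, and sample rows are invented, not Jordan's code.
case class Season(player: String, year: Int, age: Int, exp: Int, zTot: Double, nTot: Double)

val seasons = sc.parallelize(Seq(
  Season("Player A", 1984, 21, 1, 9.5, 0.88),
  Season("Player A", 1985, 22, 2, 11.1, 0.93),
  Season("Player B", 1985, 23, 2, 7.4, 0.61)
))

// Per player, sort seasons by year and pair consecutive seasons to get the deltas.
val deltas = seasons.groupBy(_.player).flatMap { case (_, ss) =>
  ss.toSeq.sortBy(_.year).sliding(2).collect {
    case Seq(prev, cur) => (cur, cur.zTot - prev.zTot, cur.nTot - prev.nTot)
  }
}

// Two key-value views of the same deltas: one keyed by age, one by experience.
val byAge = deltas.map { case (s, dz, dn) => (s.age, (dz, dn)) }
val byExp = deltas.map { case (s, dz, dn) => (s.exp, (dz, dn)) }
```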

Jordan also looks at player performance over time and makes data analysis look pretty easy.

Databricks Community Edition

Databricks has released an IDE for Spark:

We are excited to announce the General Availability (GA) of Databricks Community Edition (DCE). As a free version of the Databricks service, DCE enables everyone to learn and explore Apache Spark, by providing access to a simple, integrated development environment for data analysts, data scientists and engineers with high quality training materials and sample application notebooks.

Less than four months ago, at Spark Summit New York, we introduced Databricks Community Edition (DCE) beta. Its introduction generated tremendous interest with thousands of people requesting accounts. Today, we are delighted to report that more than 8,000 users have signed on DCE, many of them using the service heavily. The top 10% active users are averaging over 6 hours per week, and are executing over 10,000 commands on average.

They also just started an EdX course on an introduction to Spark yesterday.  If you’re interested in Spark but haven’t had the time to learn, this might be a good course to take.

Spark Security

Dave Wang discusses the Databricks Enterprise Security framework:

The Databricks just-in-time data platform takes a holistic approach to solving the enterprise security challenge by building all the facets of security — encryption, identity management, role-based access control, data governance, and compliance standards — natively into the data platform with DBES.

  • Encryption: Provides strong encryption at rest and inflight with best-in-class standards such as SSL and keys stored in AWS Key Management System (KMS).
  • Integrated Identity Management: Facilitates seamless integration with enterprise identity providers via SAML 2.0 and Active Directory.
  • Role-Based Access Control: Enables fine-grain management access to every component of the enterprise data infrastructure, including files, clusters, code, application deployments, dashboards, and reports.
  • Data Governance: Guarantees the ability to monitor and audit all actions taken in every aspect of the enterprise data infrastructure.
  • Compliance Standards: Achieves security compliance standards that exceed the high standards of FedRAMP as part of Databricks’ ongoing DBES strategy.

In short, DBES will provide holistic security in every aspect of the entire big data…

As enterprises come to depend on technologies like Spark and Hadoop, they need to have techniques and technologies to ensure that data remains secure.  This is a sign of a maturing platform.

Automate Spark Jobs Using Oozie

Mike Grimes shows how to use Oozie to automate Hadoop and Spark jobs:

This problem is easy to solve, right? You can write scripts that run jobs in sequence, and use the output of one program as the input to another—no problem. But what if your workflow is complex and requires specific triggers, such as specific data volumes or resource constraints, or must meet strict SLAs? What if parts of your workflow don’t depend on each other and can be run in parallel?

Building your own infrastructure around this problem can seem like an attractive idea, but doing so can quickly become laborious. If, or rather when, those requirements change, modifying such a tool isn’t easy. And what if you need monitoring around these jobs? Monitoring requires another set of tools and headaches.

This is a pretty detailed look at the basics of Oozie.

Adding An Index To A Spark RDD

Arijit Tarafdar gives us a good method for adding an index column to a Spark data frame based on a non-unique value:

The basic idea is to create a lookup table of distinct categories indexed by unique integer identifiers. The approach to avoid is to collect the unique categories to the driver, loop through them to add the corresponding index to each to create the lookup table (as Map or equivalent) and then broadcast the lookup table to all executors. The amount of data that can be collected at the driver is controlled by the spark.driver.maxResultSize configuration which by default is set at 1 GB for Spark 1.6.1. Both collect and broadcast will eventually run into the physical memory limits of the driver and the executors respectively beyond a certain number of distinct categories, resulting in a non-scalable solution.

The solution is pretty interesting:  build out a new RDD of unique results, and then join that set back.  If you’re using SQL (including Spark SQL), I would use the DENSE_RANK() window function.
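
Here is a small spark-shell sketch of that join-back pattern, with invented identifiers and sample values; the trailing comment shows what the DENSE_RANK() route looks like in SQL.

```scala
// Sketch of the join-back approach described above (names and values are made up).
val rows = sc.parallelize(Seq(("a1", "fruit"), ("a2", "fruit"), ("a3", "veg")))

// Build the lookup as an RDD instead of collecting the categories to the driver...
val categoryIndex = rows.map(_._2).distinct().zipWithIndex()   // (category, index)

// ...then join it back to the original data on the category value.
val indexed = rows
  .map { case (rowId, cat) => (cat, rowId) }
  .join(categoryIndex)
  .map { case (cat, (rowId, catIdx)) => (rowId, cat, catIdx) }

indexed.collect().foreach(println)

// The DENSE_RANK() window-function alternative, expressed in SQL:
//   SELECT id, category,
//          DENSE_RANK() OVER (ORDER BY category) - 1 AS category_index
//   FROM some_table
```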

Aggregating Clickstream Data

Ofer Habushi solves a clickstream aggregation problem using Spark:

At this point, an interesting question came up for us: How can we keep the data partitioned and sorted? 

That’s a challenge. When we sort the entire data set, we shuffle in order to get sorted RDDs and create new partitions, which are different than the partitions we got from Step 1. And what if we do the opposite?

Sort first by creation time and then partition the data? We’ll encounter the same problem. The re-partitioning will cause a shuffle and we’ll lose the sort. How can we avoid that?

Partition→sort = losing the original partitioning

Sort→partition = losing the original sort

There’s a solution for that in Spark. In order to partition and sort in Spark, you can use repartitionAndSortWithinPartitions. 

This is an interesting solution to an ever-more-common problem.
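
To make the idea concrete, here is a bare-bones spark-shell sketch of repartitionAndSortWithinPartitions, routing records by user while keeping each partition sorted by a (userId, timestamp) composite key; the key layout and sample values are invented and are not taken from Ofer’s pipeline.

```scala
// Partition by userId while sorting each partition by the full (userId, timestamp) key.
import org.apache.spark.Partitioner

// Key = (userId, timestamp); value = page visited.
val clicks = sc.parallelize(Seq(
  ((42L, 1005L), "/cart"),
  ((42L, 1000L), "/home"),
  ((7L,  1001L), "/home")
))

// Route each record by the userId half of the key only.
class UserPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (userId: Long, _) => (userId.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// One shuffle: each record lands in its user's partition already sorted by the
// composite key, so per-user events come out in time order.
val sessions = clicks.repartitionAndSortWithinPartitions(new UserPartitioner(4))
sessions.glom().collect().foreach(p => println(p.mkString(" | ")))
```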

Analysis Of Fantasy Sports Using Spark

Jordan Volz knows how to get to my heart:

Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.

Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.

Baseball was my introduction to statistics, and I think that fantasy sports is a great way of driving interest in stats and machine learning.  I’m looking forward to the other two parts of this series.

MapR Goes Spark-First

MapR has introduced a new version of their platform which is based on Spark:

With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode either programmatically (using Scala, Java, Python, or R) or with procedural SQL (using Spark SQL or Hive QL).

With MapR converging the data management platform, you can now take a preferential Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark as part of your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).

MapReduce is so 2012…

Resilient Distributed Datasets

Spark is built around the concept of Resilient Distributed Datasets.  If you have not read Matei Zaharia et al.’s paper on the topic, I highly recommend it:

Spark exposes RDDs through a language-integrated API similar to DryadLINQ [31] and FlumeJava [8], where each dataset is represented as an object and transformations are invoked using methods on these objects.

Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system). Like DryadLINQ, Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations.

In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. Finally, users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
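
The whole model fits in a few lines of spark-shell code. This toy walkthrough (with made-up data and an arbitrary output path) exercises the pieces the excerpt names: transformations, actions, and persistence.

```scala
// Transformations build RDDs lazily, actions force execution, persist controls reuse.
// The log lines and output path below are made up for the example.
import org.apache.spark.storage.StorageLevel

val lines = sc.parallelize(Seq("INFO ok", "ERROR disk full", "ERROR timeout"))

// Transformations: nothing executes yet.
val errors = lines.filter(_.startsWith("ERROR")).map(_.toUpperCase)

// Ask Spark to keep the result around; MEMORY_AND_DISK spills to disk if RAM is short.
errors.persist(StorageLevel.MEMORY_AND_DISK)

// Actions: these trigger the (pipelined) computation.
println(errors.count())                    // number of elements
errors.collect().foreach(println)          // the elements themselves
errors.saveAsTextFile("/tmp/error-lines")  // hypothetical output location
```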

The link also has a video of their initial presentation at NSDI.  Check it out.
