We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and HiveQL to extract women’s age distribution. We could then compare the two distributions and see if they show any difference.
But the main question, as usual, is: Will SparkSQL queries and HiveQL queries blend?
Topic: Age distribution for men and women in the U.S. state of Maine.
Challenge: Blend results from Hive SQL and Spark SQL queries.
Access mode: Apache Spark and Apache Hive nodes for SQL processing.
Using KNIME, the authors are able to blend together data from different sources.
If you are using the
dplyrpackage with a database or with
Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a
dplyr::mutate()statement inside the same
dplyr::mutate()statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more
I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.
Thanks to John for reporting, and hopefully the dplyr team can fix it.
We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. For PySpark developers who value productivity of Python language, VSCode HDInsight Tools offer you a quick Python editor with simple getting started experiences, and enable you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights.
Click through to see how it’s done.
Suppose we needed to un-pivot this data into a row oriented representation. Often big data transform steps can achieve a much higher degree of parallelization with “tall data”. With the
cdatapackage this transform is easy and performant, as we show below.
Read the whole thing.
Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure. It delivers one-click set up, streamlined workflows, and an interactive workspace all integrated with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory, and Power BI. More info
Azure Cosmos DB with Apache Cassandra API: In preview, this enables Cassandra developers to simply use the Cassandra API in Azure Cosmos DB and enjoy the benefits of Azure Cosmos DB with the familiarity of the Cassandra SDKs and tools, with no code changes to their application. More info. See all Cosmos DB announcements
Microsoft joins the MariaDB Foundation: Microsoft is a platinum sponsor – MariaDB is a community of the MySQL relational database management system and Microsoft will be actively contributing to MariaDB and the MariaDB community. More info
Click through for more. And if you want more info on Azure Databricks, Matei Zaharia and Peter Carlin have more information:
So how is Azure Databricks put together? At a high level, the service launches and manages worker nodes in each Azure customer’s subscription, letting customers leverage existing management tools within their account.
Specifically, when a customer launches a cluster via Databricks, a “Databricks appliance” is deployed as an Azure resource in the customer’s subscription. The customer specifies the types of VMs to use and how many, but Databricks manages all other aspects. In addition to this appliance, a managed resource group is deployed into the customer’s subscription that we populate with a VNet, a security group, and a storage account. These are concepts Azure users are familiar with. Once these services are ready, users can manage the Databricks cluster through the Azure Databricks UI or through features such as autoscaling. All metadata (such as scheduled jobs) is stored in an Azure Database with geo-replication for fault tolerance.
I’ve been a huge fan of the Databricks Community Edition. We’ll see if there will be a Community Edition version for Azure as well.
There are 3 interpreter modes available in Zeppelin.
1) Shared Mode
In Shared mode, a SparkContext and a Scala REPL is being shared among all interpreters in the group. So every Note will be sharing single SparkContext and single Scala REPL. In this mode, if NoteA defines variable ‘a’ then NoteB not only able to read variable ‘a’ but also able to override the variable.
2) Scoped Mode
In Scoped mode, each Note has its own Scala REPL. So variable defined in a Note can not be read or overridden in another Note. However, still single SparkContext serves all the Interpreter Groups. And all the jobs are submitted to this SparkContext and fair scheduler schedules the job. This could be useful when user does not want to share Scala session, but want to keep single Spark application and leverage its fair scheduler.
3) Isolated Mode
In Isolated mode, each Note has its own SparkContext and Scala REPL.
The default mode of %spark interpreter is ‘Globally Shared’.
This is mostly a step-by-step on installing Zeppelin, but does go into some detail on how Zeppelin works.
To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.
Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.
This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.
No streaming events are free of duplicate entries. Dropping duplicate entries in record-at-a-time systems is imperative—and often a cumbersome operation for a couple of reasons. First, you’ll have to process small or large batches of records at time to discard them. Second, some events, because of network high latencies, may arrive out-of-order or late, which may force you to reiterate or repeat the process. How do you account for that?
Structured Streaming, which ensures exactly once-semantics, can drop duplicate messages as they come in based on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and ensure that duplicates, when encountered, are discarded.
Just as other stateful processing APIs in Structured Streaming are bounded by declaring watermarking for late data semantics, so is dropping duplicates. Without watermarking, the maintained state can grow infinitely over the course of your stream.
In this scenario, you would still want some sort of de-duplication code at the far end of your process if you can never have duplicates come in across the lifetime of the application. This sounds like it’s more about preventing bursty duplicates from sensors.
At Databricks, we used Databricks Notebooks and cluster management to set up a reproducible benchmarking harness that compares the performance of Apache Spark’s Structured Streaming, running on Databricks Unified Analytics Platform, against other open source streaming systems such as Apache Kafka Streams and Apache Flink. In particular, we used the following systems and versions in our benchmarks:
The Yahoo Streaming Benchmark is a well-known benchmark used in industry to evaluate streaming systems. When setting up our benchmark, we wanted to push each streaming system to its absolute limits, yet keep the business logic the same as in the Yahoo Streaming Benchmark. We shared some of the results we achieved from these benchmarks during Spark Summit West 2017 keynote showing that Spark can reach 5x or higher throughput over other popular streaming systems. In this blog, we discuss in more detail about how we performed this benchmark, and how you can reproduce the results yourselves.
Standard vendor-based metric warnings aside, you can get the benchmark details at their GitHub repo.
As a recent client requirement I needed to propose a solution in order to add spark2 as interpreter to zeppelin in HDP (Hortonworks Data Platform) 2.5.3
The first hurdle is, HDP 2.5.3 comes with zeppelin 0.6.0 which does not support spark2, which was included as a technical preview. Upgrade the HDP version was not an option due to the effort and platform availability. At the end I found in the HCC (Hortonworks Community Connection) a solution, which involves installing a standalone zeppelin which does not affect the Ambari managed zeppelin delivered with HDP 2.5.3.
I want to share how I did it with you.
Read on to see how Paul did it. It’s not trivial but Paul lays out the process step-by-step.