To use this post to play around with streaming data, you need an AWS account and the AWS CLI configured on your machine. The entire pattern can be implemented in a few simple steps:
1. Create an Amazon Kinesis stream.
2. Spin up an EMR cluster with the Hadoop, Spark, and Zeppelin applications from the advanced options.
3. Use a simple Java producer to push random IoT event data into the Amazon Kinesis stream (sketched below).
4. Connect to the Zeppelin notebook.
5. Import the Zeppelin notebook from GitHub.
6. Analyze and visualize the streaming data.
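For step 3, the post links to a Java producer; here is a minimal sketch of the same idea in Scala against the AWS SDK for Java, where the stream name and event shape are my assumptions, not the post's:

```scala
import java.nio.ByteBuffer
import scala.util.Random
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object IoTEventProducer {
  def main(args: Array[String]): Unit = {
    // Uses credentials and region from your AWS CLI configuration.
    val kinesis = AmazonKinesisClientBuilder.defaultClient()
    val streamName = "iot-events" // hypothetical name: use the stream created in step 1

    for (_ <- 1 to 1000) {
      val deviceId = s"device-${Random.nextInt(10)}"
      // A fake IoT reading; the post's producer generates events along these lines.
      val event = s"""{"deviceId":"$deviceId","temperature":${20 + Random.nextInt(15)}}"""
      kinesis.putRecord(new PutRecordRequest()
        .withStreamName(streamName)
        .withData(ByteBuffer.wrap(event.getBytes("UTF-8")))
        .withPartitionKey(deviceId)) // partition key spreads devices across shards
    }
  }
}
```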
This is a good way of getting started with streaming data. I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.
Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand which metrics really matter: the ones that affect the performance of your application as well as your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.
In Spark, it is well known that memory-related issues are typical if you haven’t paid attention to memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically the executors and the driver. Garbage collection stalls or abnormal patterns can increase back pressure.
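A first step is simply making GC visible. A minimal sketch of the relevant settings (in practice the driver’s options must be passed via spark-submit or spark-defaults.conf, since its JVM is already running by the time SparkConf is read):

```scala
import org.apache.spark.SparkConf

// Verbose GC logging for executors and the driver, so collection pauses and
// abnormal patterns show up in the logs your monitoring system already gathers.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .set("spark.driver.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```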
There are a few metrics of note here. Check it out.
Spark provides metrics for each of the above components through different endpoints. For example, if you want to look at the Spark driver details, you need to know the exact URL, which keeps changing over time; Spark keeps you guessing. The typical problem arises when you start your driver in cluster mode: how do you detect on which worker node the driver was started? Once there, how do you identify the port on which the Spark driver exposes its UI? This is a common annoyance for developers and DevOps professionals who manage Spark clusters. In fact, most end up running their driver in client mode as a workaround, so they have a fixed URL endpoint to look at; however, this comes at the cost of losing failover protection for the driver. Your monitoring solution should automatically figure out where the driver for your application is running, find the port on which the application exposes its UI, and configure itself to start collecting metrics.
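One building block for such a solution is Spark’s monitoring REST API (available since Spark 1.4), which serves JSON from the driver UI. A minimal sketch, assuming you have already located the driver host and the UI is on its default port 4040:

```scala
import scala.io.Source

// List running applications via the monitoring REST API.
// "driver-host" is a placeholder: in cluster mode this is exactly the
// address you would first have to discover.
val driverHost = "driver-host"
val apps = Source.fromURL(s"http://$driverHost:4040/api/v1/applications").mkString
println(apps)
```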
For a dynamic infrastructure like Spark, your cluster can get resized on the fly. You must ensure your newly spawned components (workers, executors) are automatically configured for monitoring; there is no room for manual intervention here. You shouldn’t miss monitoring new processes that show up on the cluster. On the other hand, you shouldn’t generate false alerts when executors get moved around. A generic monitoring solution will typically start alerting you if an executor gets killed and starts up on a new worker, because generic solutions just check whether a port is up or down. With a real-time streaming system like Spark, the core idea is that things can move around all the time.
Spark does add a bit of complexity to monitoring, but there are solutions in place. Read the whole thing.
meinestadt.de web servers generate up to 20 million user sessions per day, which can easily result in up to several thousand HTTP GET requests per second during peak times (and this is expected to scale to much higher volumes in the future). Although there is a permanent fraction of bad requests, at times the number of bad requests jumps.
The meinestadt.de approach is to use a Spark Streaming application to feed an Impala table every n minutes with the current counts of HTTP status codes within the n-minute window. Analysts and engineers query the table via standard BI tools to detect bad requests.
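The windowed-count core of such a job is compact. Here is a minimal sketch (not meinestadt.de’s actual code), assuming log lines arrive over a socket with the status code as the ninth whitespace-separated field; the real pipeline writes each window’s counts to an Impala-backed table instead of printing them:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatusCodeCounts")
val ssc = new StreamingContext(conf, Seconds(60))

// Assumption: one combined-format log line per record, status code in field 9.
val lines = ssc.socketTextStream("loghost", 9999)
val statusCounts = lines
  .map(line => (line.split(" ")(8), 1L))
  .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(300)) // 5-minute tumbling window

// Stand-in for the Impala sink.
statusCounts.print()

ssc.start()
ssc.awaitTermination()
```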
What follows is a fairly detailed architectural walkthrough as well as configuration and implementation work. It’s a fairly long read, but if you’re interested in delving into Hadoop, it’s a good place to start.
In the upcoming Apache Spark 2.0 release, we have substantially expanded the SQL standard capabilities. In this brief blog post, we will introduce subqueries in Apache Spark 2.0, including their limitations, potential pitfalls and future expansions, and through a notebook, we will explore both the scalar and predicate type of subqueries, with short examples that you can try yourself.
A subquery is a query that is nested inside another query. A subquery as a source (inside a SQL FROM clause) is technically also a subquery, but it is beyond the scope of this post. There are basically two kinds of subqueries: scalar and predicate subqueries. Within scalar subqueries there are uncorrelated and correlated scalar queries, and within predicate subqueries there are nested predicate queries.
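For a flavor of what the notebook covers, here is a minimal sketch of an uncorrelated scalar subquery, assuming a hypothetical sales table registered in a Spark 2.0 session:

```scala
// The inner query returns a single (scalar) value, which the outer
// predicate compares each row against.
val aboveAverage = spark.sql("""
  SELECT item, amount
  FROM sales
  WHERE amount > (SELECT avg(amount) FROM sales)
""")
aboveAverage.show()
```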
They also link to a Notebook which you can use to follow along. If you’re interested in window functions, here are notes from Spark 1.4.
We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in the ages at which players joined the league over the timespan we are considering. It used to be rare for players to skip college; then it wasn’t; now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
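The year-over-year delta computation might look like this in sketch form (the names and structure are mine, not the post’s code), assuming an RDD of per-season stats:

```scala
import org.apache.spark.rdd.RDD

case class Season(playerId: String, year: Int, zTot: Double, nTot: Double)

// Key each season twice: once by (player, year) and once shifted back a year,
// so the join pairs every season with the same player's following season.
def yearOverYear(stats: RDD[Season]): RDD[(String, (Double, Double))] = {
  val current = stats.map(s => ((s.playerId, s.year), s))
  val next    = stats.map(s => ((s.playerId, s.year - 1), s))
  current.join(next).map { case ((player, _), (cur, nxt)) =>
    (player, (nxt.zTot - cur.zTot, nxt.nTot - cur.nTot))
  }
}
```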
Jordan also looks at player performance over time and makes data analysis look pretty easy.
We are excited to announce the General Availability (GA) of Databricks Community Edition (DCE). As a free version of the Databricks service, DCE enables everyone to learn and explore Apache Spark, by providing access to a simple, integrated development environment for data analysts, data scientists and engineers with high quality training materials and sample application notebooks.
Less than four months ago, at Spark Summit New York, we introduced Databricks Community Edition (DCE) beta. Its introduction generated tremendous interest, with thousands of people requesting accounts. Today, we are delighted to report that more than 8,000 users have signed up for DCE, many of them using the service heavily. The top 10% of active users are averaging over 6 hours per week, and are executing over 10,000 commands on average.
They also just started an EdX course on an introduction to Spark yesterday. If you’re interested in Spark but haven’t had the time to learn, this might be a good course to take.
The Databricks just-in-time data platform takes a holistic approach to solving the enterprise security challenge by building all the facets of security — encryption, identity management, role-based access control, data governance, and compliance standards — natively into the data platform with DBES.
- Encryption: Provides strong encryption at rest and in flight with best-in-class standards such as SSL, with keys stored in AWS Key Management Service (KMS).
- Integrated Identity Management: Facilitates seamless integration with enterprise identity providers via SAML 2.0 and Active Directory.
- Role-Based Access Control: Enables fine-grained management of access to every component of the enterprise data infrastructure, including files, clusters, code, application deployments, dashboards, and reports.
- Data Governance: Guarantees the ability to monitor and audit all actions taken in every aspect of the enterprise data infrastructure.
- Compliance Standards: Achieves security compliance standards that exceed the high standards of FedRAMP as part of Databricks’ ongoing DBES strategy.
In short, DBES will provide holistic security in every aspect of the entire big data stack.
As enterprises come to depend on technologies like Spark and Hadoop, they need to have techniques and technologies to ensure that data remains secure. This is a sign of a maturing platform.
This problem is easy to solve, right? You can write scripts that run jobs in sequence, and use the output of one program as the input to another—no problem. But what if your workflow is complex and requires specific triggers, such as specific data volumes or resource constraints, or must meet strict SLAs? What if parts of your workflow don’t depend on each other and can be run in parallel?
Building your own infrastructure around this problem can seem like an attractive idea, but doing so can quickly become laborious. If, or rather when, those requirements change, modifying such a tool isn’t easy. And what if you need monitoring around these jobs? Monitoring requires another set of tools and headaches.
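This is the gap Oozie fills. A minimal sketch of a workflow that runs two independent shell actions in parallel via fork/join, which is exactly the case ad hoc scripts handle poorly (all names and scripts here are hypothetical):

```xml
<workflow-app name="parallel-example" xmlns="uri:oozie:workflow:0.5">
  <start to="split"/>
  <!-- Fork: both paths below run concurrently. -->
  <fork name="split">
    <path start="step-a"/>
    <path start="step-b"/>
  </fork>
  <action name="step-a">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>step_a.sh</exec>
      <file>step_a.sh</file>
    </shell>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <action name="step-b">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>step_b.sh</exec>
      <file>step_b.sh</file>
    </shell>
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <!-- Join: waits for both forked paths before continuing. -->
  <join name="merge" to="end"/>
  <kill name="fail">
    <message>One of the parallel steps failed.</message>
  </kill>
  <end name="end"/>
</workflow-app>
```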
This is a pretty detailed look at the basics of Oozie.
The basic idea is to create a lookup table of distinct categories indexed by unique integer identifiers. The approach to avoid is to collect the unique categories to the driver, loop through them to add the corresponding index to each to create the lookup table (as a Map or equivalent), and then broadcast the lookup table to all executors. The amount of data that can be collected at the driver is controlled by the spark.driver.maxResultSize configuration, which by default is set at 1 GB for Spark 1.6.1. Both collect and broadcast will eventually run into the physical memory limits of the driver and the executors respectively at some point beyond a certain number of distinct categories, resulting in a non-scalable solution.
The solution is pretty interesting: build out a new RDD of unique results, and then join that set back. If you’re using SQL (including Spark SQL), I would use the DENSE_RANK() window function.
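A minimal sketch of the join-based approach (variable names are mine, assuming an RDD of records with a category field):

```scala
import org.apache.spark.rdd.RDD

case class Record(category: String, value: Double)

// Build the lookup table as a distributed RDD rather than collecting it to
// the driver: distinct() finds the categories, zipWithIndex() assigns each a
// unique Long id, and the join carries the ids back onto the full dataset.
def indexCategories(data: RDD[Record]): RDD[(Record, Long)] = {
  val lookup = data.map(_.category).distinct().zipWithIndex()
  data.map(r => (r.category, r))
      .join(lookup)
      .map { case (_, (record, id)) => (record, id) }
}
```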