Next, we’ll define a DataFrame by loading data from a CSV file, which is stored in HDFS.
facebook_combined.txtcontains two columns to represent links between network nodes. The first column is called source (
src), and the second is the destination (
dst) of the link. (Some other systems, such as Gephi, use “source” and “target” instead.)
First we define a custom schema, and than we load the DataFrame, using
It sounds like Spark graph database engines are early in their lifecycle, but they might already be useful for simple analysis.
Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://” implemented an inode-style filesystem atop S3: it could support bigger files than S3 could then support, some its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop to be run in Amazon’s EMR infrastructure, using S3 as the persistent store of work. This piece of open source code predated Amazon’s release of EMR, “Elastic MapReduce” by over two years. It’s also notable as the piece of work which gained Tom White, author of “Hadoop, the Definitive Guide”, committer status.
It’s interesting to see how this project has matured over the past decade.
First, it’s interesting to note that the Polybase engine uses “pdw_user” as its user account. That’s not a blocker here because I have an open door policy on my Hadoop cluster: no security lockdown because it’s a sandbox with no important information. Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129. These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv. Then, the Polybase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission. Then…nothing. It waits for 20-30 seconds and tries again. After four failures, it gives up. This is why it’s taking about 90 seconds to return an error message: it tries four times.
Aside from this audit log, there was nothing interesting on the Hadoop side. The YARN logs had nothing in them, indicating that whatever request happened never made it that far.
Here’s hoping there’s a solution in the future.
Overview of Spark Streaming.
Fault-tolerance Semantics & Performance Tuning.
Spark Streaming Integration with Kafka.
Click through for the slide deck. Combine that with the AWS blog post on the same topic and you get a pretty good intro.
Stream processing walkthrough
The entire pattern can be implemented in a few simple steps:
Set up Kafka on AWS.
Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.
Create a Kafka topic.
Run the Spark Streaming app to process clickstream events.
Use the Kafka producer app to publish clickstream events into Kafka topic.
Explore clickstream events data with SparkSQL.
This is a pretty easy-to-follow walkthrough with some good tips at the end.
2016 and beyond – this is an interesting timing for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonwork’s valuation has dropped by almost 40%, forcing them to cut the professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is enterprise customers that started adoption of technology in 2014-2015. After a couple of years playing around with “Big Data” they has finally understood that Hadoop is only an instrument for solving specific problems, it is not a turnkey solution to take over your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale. All of this has caused a big wave of priorities re-evaluation by enterprises, shrinking their investments into “Big Data” and focusing on solving specific business problems.
There are some good points around product saturation and a general skills shortage, but even if you look at it pessimistically, this is a product with 30% market penetration, and which is currently making the move from being a large batch data processing product to a streaming + batch processing product.
In the previous blog, we looked at on converting the CSV format into Parquet format using Hive. It was a matter of creating a regular table, map it to the CSV data and finally move the data from the regular table to the Parquet table using the Insert Overwrite syntax. In this blog we will look at how to do the same thing with Spark using the dataframes feature.
Most of the code is basic setup; writing to Parquet is really a one-liner.
According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity. (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).
There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.
Hello geeks, we have discussed how to start programming with Spark in Scala. In this blog, we will discuss how we can use Hive with Spark 2.0.
When you start to work with Hive, you need HiveContext (inherits SqlContext), core-site.xml,hdfs-site.xml, and hive-site.xml for Spark. In case you don’t configure hive-site.xml then the context automatically creates metastore_db in the current directory and creates warehousedirectory indicated by HiveConf(which defaults user/hive/warehouse).
Rahul has made his demo code available on GitHub.
At a high level, Kudu is a new storage manager that enables durable single-record inserts, updates, and deletes, as well as fast and efficient columnar scans due to its in-memory row format and on-disk columnar format. This architecture makes Kudu very attractive for data that arrives as a single record at a time or that may need to be modified at a later time.
Today, many users try to solve this challenge via a Lambda architecture, which presents inherent challenges by requiring different code bases and storage for the necessary batch and real-time components. Using Kudu and Impala together completely avoids this problematic complexity by easily and immediately making data inserted into Kudu available for querying and analytics via Impala. (For more technical details on how Impala and Kudu work together for analytical workloads, see this post.)
I’d jokingly say “Someday, somebody’s going to reinvent the relational database inside of Hadoop.” But it seems like that’s less of a joke than a medium-term prediction.