I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.
Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.
But sometimes, Pig has some limitations that makes it a poor programming paradigm to fit your needs.
Philippe includes a couple of examples in Pig, PySpark, and SparkSQL. Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.
Actually I’m talking about .NET Core. That Docker image is built on Debian Jessie, and it installs Hadoop and .NET Core. You can run compiled .NET Core assemblies just by running
dotnet /path/to/my/assembly.dll. The same image can run as Hadoop master and worker nodes, so all workers have .NET Core installed and can run .NET Core apps.
Which means you can run .NET Core apps as mappers or reducers in Hadoop jobs, thanks to theHadoop Streaming interface. Hadoop Streaming lets you shell out to any program the host can execute, to run the tasks for your job. Your code integrates with Hadoop at a basic level – through the standard input and output streams.
Elton has some good courses on other Hadoop-and-.NET topics (like his HDInsight deep dive course). It looks like running containerized Hadoop + .NET Core is getting closer, but still has room for improvement.
Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop. That’s hardly a fair setup for a distributed processing system. Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible. This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important. I’m using the same hardware in all three cases, so in that sense this is a fair comp.
Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face. I have several ideas for how to tune and want to continue down this track, showing various ways to optimize Polybase and Hive queries.
As soon as I kick this off, I get an error:
Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
Well, that’s not good… Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which seems very useful to me (as opposed to a Microsoft support tech).
I don’t have any good answers in this blog post, so I’m hoping to learn more and report back later.
The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment. It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.
Check out the link for videos and additional resources.
So the takeaway that many DB developers would have you believe is ‘Hadoop Good’, ‘RDBMS Bad’.
But wait. RDBMS EDW hasn’t gone away and won’t. That’s where we keep our single version of the truth, the business data that record legal transactions with customers, suppliers, and employees. We also get strong SLAs, strong fault tolerance, and highly curated data based on strong ETL, provenance, and governance. Those are all things that are missing in our Data Lake.
Anybody who sells you on one technology to solve all problems is shilling snake oil. Bill’s answer is an Adjunct Data Warehouse, which sits separate from the Enterprise Data Warehouse. You go to the EDW when you risk getting fired or going to jail if the data’s wrong; you go to the ADW when you need data not in your EDW, or when you need larger-scale analytics in which it’s okay to be 1% off.
One of the most interesting use cases of Polybase is the ability to store historical data from relational databases into a Hadoop File System. The storage costs could be reduced while keeping the data accessible and still can be joined with the regular relational tables. So let`s do the first steps for our new archiving solution.
Archival is a very good use case for external table insertion, and if you don’t have a Hadoop cluster, you could insert into Azure blob storage.
The stream of user click events is considered to be a record stream, where each data record represents a self-contained datum. In contrast, the stream of user geo-location updates is interpreted as a changelog stream, where each data record represents an update (i.e. any previous data records having the same record key will be replaced by the latest update). In Kafka Streams, a record stream is represented via the so-called KStream interface and a changelog stream via the KTable interface. Going from the high-level view to the technical view, this means that our streaming application will demonstrate how to perform a join operation between a KStream and a KTable, i.e. it is an example of a stateful computation. This KStream-KTable join also happens to be Kafka Streams’ equivalent of performing a table lookup in a streaming context, where the table is updated continuously and concurrently. Specifically, for each user click event in the KStream, we will lookup the user’s region (e.g. “europe”) in the KTable in order to subsequently compute the total number of user clicks per region.
Let’s showcase the beginning (input) and the end (expected output) of this data pipeline with some example data.
This article is fairly detailed, but it covers a rather interesting topic in a good way.
Here are a few best practices for exporting with Sqoop:
Options file—As commands with Sqoop export and Sqoop import tend to be bigger in size, I recommend storing the commands in an options file. By keeping it in an options file, you can even make it part of a version control pipeline to monitor changes to the command.
Field termination—With Sqoop export, I recommend providing field termination metadata using the “–fields-terminated-by” option. Also, other formatting options such as “lines-terminated-by”, “enclosed-by”, “escaped-by”, etc., can be used as required.
Mapper tuning—When an export job is submitted, Sqoop creates a Java class and submits a MapReduce job based on input splits; then, each mapper connects to the database to export the data. The default number of mappers is 4, so I recommend tuning the number of mappers depending on the availability of processors on the cluster. Too many mappers might cause the load to increase on the database. We recommend that you monitor the number of connections and keep track of processlist on MySQL.
Staging table—The Sqoop export job is broken down into multiple transactions based on the mappers. Each transaction is therefore atomic and does not have any dependencies on other transactions. I recommend using the “–staging-table” option that acts as the buffer table for the separate transactions. After all transactions have been committed, a single transaction move is made to move the data to the final destination. Use the “–clear-staging-table” option to clean up the staging table after the export job.
There’s a lot in here which is Amazon-specific and there are a couple of things you’d have to change to deploy to SQL Server, but there’s a lot of useful information here. I like that Sai shows how to use the Hadoop credential API instead of doing something silly like saving your password in plaintext.
LLAP KEY BENEFITS
LLAP enables as fast as sub-second query in Hive by keeping all data and servers running and in-memory all the time, while retaining the ability to scale elastically within a YARN cluster.
LLAP, along with Apache Ranger enables fine-grained security for the Hadoop ecosystem, including data masking and filtering, by providing interfaces for external clients like Spark to read.
LLAP is great for cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.
This sounds very much like a response to Spark.