Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop; that’s hardly a fair setup for a distributed processing system. Also, I have done nothing to optimize the files: I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible. This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important. I’m using the same hardware in all three cases, so in that sense this is a fair comparison.
Despite my hemming and hawing, Polybase still performed as well as Hive and kicked sand in the linked server’s face. I have several tuning ideas and want to continue down this track, showing various ways to optimize Polybase and Hive queries.
As soon as I kick this off, I get an error:
Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
Well, that’s not good… Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which seems very useful to me (though they might be to a Microsoft support tech).
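If you want to dig through those errors yourself, a minimal query against that DMV might look like this (assuming a SQL Server 2016 instance with Polybase installed):

```sql
-- Pull the ten most recent Polybase compute node errors.
-- Selecting all columns because the useful bits (such as the error details)
-- vary from error to error.
SELECT TOP (10) *
FROM sys.dm_exec_compute_node_errors
ORDER BY create_time DESC;
```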
I don’t have any good answers in this blog post, so I’m hoping to learn more and report back later.
The HDInsight Tool for Eclipse extends the Eclipse development environment so that you can create and develop HDInsight Spark applications and submit Spark jobs to Microsoft Azure HDInsight Spark clusters without leaving the IDE. It integrates seamlessly with Azure, letting you navigate HDInsight Spark clusters and view their associated Azure storage accounts. To further boost productivity, the tool also offers the ability to view Spark job history and display detailed job logs.
Check out the link for videos and additional resources.
So the takeaway, as many DB developers would have you believe, is ‘Hadoop good, RDBMS bad.’
But wait. The RDBMS EDW hasn’t gone away and won’t. That’s where we keep our single version of the truth: the business data that records legal transactions with customers, suppliers, and employees. We also get strong SLAs, strong fault tolerance, and highly curated data based on strong ETL, provenance, and governance. Those are all things missing from our Data Lake.
Anybody who sells you on one technology to solve all problems is peddling snake oil. Bill’s answer is an Adjunct Data Warehouse, which sits separate from the Enterprise Data Warehouse. You go to the EDW when you risk getting fired or going to jail if the data’s wrong; you go to the ADW when you need data that isn’t in your EDW, or when you need larger-scale analytics in which it’s okay to be 1% off.
One of the most interesting use cases for Polybase is the ability to store historical data from relational databases in a Hadoop file system. Storage costs are reduced while the data remains accessible and can still be joined with the regular relational tables. So let’s take the first steps toward our new archiving solution.
Archival is a very good use case for external table insertion, and if you don’t have a Hadoop cluster, you can insert into Azure Blob Storage instead.
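As a sketch of what that archiving pattern might look like in T-SQL (all object names here are hypothetical, and this assumes you have already created an external data source and file format):

```sql
-- Polybase export must be enabled before INSERT INTO an external table works.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- Hypothetical external table pointing at HDFS (or Azure Blob Storage).
CREATE EXTERNAL TABLE dbo.SalesArchive
(
    SaleID INT NOT NULL,
    SaleDate DATE NOT NULL,
    Amount DECIMAL(12, 2) NOT NULL
)
WITH
(
    LOCATION = '/archive/sales/',
    DATA_SOURCE = HDFSArchive,    -- hypothetical external data source
    FILE_FORMAT = CsvFileFormat   -- hypothetical external file format
);
GO

-- Archive old rows by writing them out to the external table.
INSERT INTO dbo.SalesArchive
SELECT SaleID, SaleDate, Amount
FROM dbo.Sales
WHERE SaleDate < '2015-01-01';
```

One design note: Polybase can insert into external tables but cannot update or delete from them, which suits append-only archival just fine.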
The stream of user click events is considered a record stream, where each data record represents a self-contained datum. In contrast, the stream of user geo-location updates is interpreted as a changelog stream, where each data record represents an update (i.e., any previous data records having the same record key will be replaced by the latest update). In Kafka Streams, a record stream is represented via the KStream interface and a changelog stream via the KTable interface. Going from the high-level view to the technical view, this means that our streaming application will demonstrate how to perform a join operation between a KStream and a KTable, i.e., it is an example of a stateful computation. This KStream-KTable join also happens to be Kafka Streams’ equivalent of performing a table lookup in a streaming context, where the table is updated continuously and concurrently. Specifically, for each user click event in the KStream, we will look up the user’s region (e.g., “europe”) in the KTable in order to compute the total number of user clicks per region.
Let’s showcase the beginning (input) and the end (expected output) of this data pipeline with some example data.
This article is fairly detailed, but it covers a rather interesting topic well.
Here are a few best practices for exporting with Sqoop:
Options file—Because Sqoop export and import commands tend to be long, I recommend storing them in an options file. Keeping a command in an options file also lets you make it part of a version control pipeline to monitor changes to the command.
Field termination—With Sqoop export, I recommend providing field termination metadata using the “--fields-terminated-by” option. Other formatting options, such as “--lines-terminated-by”, “--enclosed-by”, and “--escaped-by”, can be used as required.
Mapper tuning—When an export job is submitted, Sqoop creates a Java class and submits a MapReduce job based on input splits; each mapper then connects to the database to export the data. The default number of mappers is 4, so I recommend tuning the number of mappers depending on the availability of processors on the cluster. Too many mappers might increase the load on the database, so I recommend monitoring the number of connections and keeping track of the processlist on MySQL (see the sketch after this list).
Staging table—The Sqoop export job is broken down into multiple transactions based on the mappers. Each transaction is therefore atomic and has no dependencies on the others. I recommend using the “--staging-table” option, which acts as a buffer table for the separate transactions; after all transactions have been committed, a single transaction moves the data to the final destination. Use the “--clear-staging-table” option to clean up the staging table after the export job.
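To put the mapper-tuning advice into practice, here is a minimal monitoring sketch on the MySQL side (the account name is hypothetical):

```sql
-- MySQL: count how many connections the Sqoop mappers currently hold.
-- 'sqoop_user' is a hypothetical account name.
SELECT COUNT(*) AS sqoop_connections
FROM information_schema.PROCESSLIST
WHERE USER = 'sqoop_user';

-- Or eyeball the full list, including the statement each mapper is running.
SHOW FULL PROCESSLIST;
```

If the connection count climbs toward the database’s limit, dial back the -m/--num-mappers value on the export job.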
There’s a lot in here that is Amazon-specific, and there are a couple of things you’d have to change to deploy to SQL Server, but it still contains plenty of useful information. I like that Sai shows how to use the Hadoop credential API instead of doing something silly like saving your password in plaintext.
LLAP KEY BENEFITS
LLAP enables sub-second queries in Hive by keeping all data and servers running in memory all the time, while retaining the ability to scale elastically within a YARN cluster.
LLAP, along with Apache Ranger, enables fine-grained security for the Hadoop ecosystem, including data masking and filtering, by providing interfaces for external clients like Spark to read.
LLAP is great for the cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.
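For a sense of what using LLAP looks like from the Hive side, here is a hedged HiveQL sketch (the settings are from Hive 2.x and exact names may vary by distribution; the flights table is hypothetical):

```sql
-- Route this session's queries through the long-lived LLAP daemons.
SET hive.execution.engine=tez;      -- LLAP runs on top of Tez
SET hive.execution.mode=llap;       -- execute in LLAP daemons, not plain containers
SET hive.llap.execution.mode=all;   -- push as much work as possible into LLAP

-- With daemons warm and data cached in memory, an aggregation like this
-- can come back at interactive speed.
SELECT dest, COUNT(*) AS num_flights
FROM flights
GROUP BY dest;
```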
This sounds very much like a response to Spark.
During a data center failover like the example above, we could have a “late arrival”: the stream processor might see the AdClickEvent a few minutes after the AdViewEvent. A poorly written stream processor might deduce that the ad was low-quality when it might actually have been good. Another anomaly is that the stream processor might see the AdClickEvent before it sees the corresponding AdViewEvent. To ensure that the output of the stream processor is correct, there has to be logic to handle this “out of order message arrival.”
In the example above, the geo-distributed nature of the data centers makes the delays easy to explain. However, delays can exist even within a single data center due to GC pauses, Kafka cluster upgrades, partition rebalances, and other naturally occurring distributed systems phenomena.
This is a pretty long article and absolutely worth a read if you are working with streaming data.
Let’s walk through this one step at a time and understand what the DMV is telling us. Unfortunately, the DMV documentation is a little sparse, so some of this is guesswork on my part.
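For reference, the step-level view in this walkthrough comes from querying a DMV along these lines (the execution_id value is a placeholder; you would pull the real one from sys.dm_exec_distributed_requests):

```sql
-- Per-step breakdown of a single Polybase query.
-- 'QID1234' is a placeholder execution_id.
SELECT
    step_index,
    operation_type,
    location_type,
    status,
    total_elapsed_time,
    command
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1234'
ORDER BY step_index;
```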
A RandomIDOperation appears to create a temporary table. In this case, the table (whose name is randomly generated) is named TEMP_ID_53. I’m not sure where that name comes from; the session I ran this from was 54, so it wasn’t a session ID.
After the table gets created, each Compute node is told to create a table called TEMP_ID_53 in tempdb whose structure matches our external table’s structure. One thing you can’t see from the screenshot is that this table is created with DATA_COMPRESSION = PAGE. I have to wonder whether that would be the same if my Compute node were on Standard Edition.
We add an extended property on the table, flagging it as IS_EXTERNAL_STREAMING_TABLE.
We then update the statistics on that temp table based on expected values; 629 rows are expected here.
Then, we create the dest stat, meaning that the temp table now has exactly the same statistics as our external table.
The next step is that the Head node begins a MultiStreamOperation, which tells the Compute nodes to begin working. This operator does not show up in the documentation, but we can see that the elapsed time is 58.8 seconds, which is just about as long as my query took. My guess is that this is where the Head node passes code to the Compute nodes and tells them what to do.
We have a HadoopRoundRobinOperation on DMS, which stands for “Data Movement Step” according to the location_type documentation. What’s interesting is that, according to the DMV, that operation is still going: even when I checked 40 minutes later, it still claimed to be running. If you check the full query, it’s basically a SELECT * from our external table.
Next is a StreamingReturnOperation, which includes our predicate WHERE dest = ‘ORD’ in it. This is a Data Movement Step and includes all of the Compute nodes (all one of them, that is) sending data back to the Head node so that I can see the results.
Finally, we drop TEMP_ID_53 because we’re done with the table.
This post was about 70% legwork and 30% guesswork. That’s a higher percentage of guesswork than I’d ideally like, but there isn’t much information readily available yet, so I’m trying (in my own small way) to fix that.