Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000. This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it. The thing to remember, though, is that if you store this in Azure Blob Storage, you will need to pull down the entire table’s worth of data to do any processing.
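As a rough sketch of the idea, a view can union several external tables together, with each branch carrying a filter that tells the optimizer which slice of data lives where. Every name below (the three Flights external tables, the view, and the columns) is hypothetical and stands in for whatever your own partitions look like.

```sql
--Assumption: dbo.Flights2006, dbo.Flights2007, and dbo.Flights2008 are
--hypothetical external tables, each pointing at one year's worth of files.
--The column names are hypothetical as well.
CREATE VIEW dbo.AllFlights AS
SELECT f.[Year], f.[Month], f.Origin, f.Dest
FROM dbo.Flights2006 f
WHERE f.[Year] = 2006
UNION ALL
SELECT f.[Year], f.[Month], f.Origin, f.Dest
FROM dbo.Flights2007 f
WHERE f.[Year] = 2007
UNION ALL
SELECT f.[Year], f.[Month], f.Origin, f.Dest
FROM dbo.Flights2008 f
WHERE f.[Year] = 2008;
GO

--A literal filter on Year contradicts the 2006 and 2007 branches,
--so the optimizer can skip those external tables.
SELECT COUNT(*) AS NumberOfFlights
FROM dbo.AllFlights af
WHERE af.[Year] = 2008;
```

The WHERE clause in each branch is what does the work: without it, a query against the view would have to read every external table.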
This leads to a concept I first heard from Ginger Grant: pseudo-StretchDB. Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price. If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month. For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine. But if you need to query this data frequently, performance might be a killer.
For Polybase tables without the ability to perform external pushdown, coming up with a good partitioning strategy is probably one of the two best ways to improve performance; the other is creating a Polybase scale-out cluster.
Even for a simple query, I’m not going to expect you to read 174 lines of XML; I’m not a sadist, after all…
What follows is a look at significant lines and my commentary.
Don’t listen to me there; that guy really is a sadist who wants you to read 174 lines of XML.
As a reminder, predicate pushdown has three prerequisites. First, we need to hit a Hadoop cluster; we can’t use predicate pushdown on other systems like Azure Blob Storage. Second, we need to have a resource manager link set up in our external data source. Third, we need to make sure that everything is configured correctly on the Polybase side. Once you have those items in place, it’s possible to use the FORCE EXTERNALPUSHDOWN query hint like so:
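Here is a minimal sketch of the hint in use. It assumes an external table named dbo.SecondBasemen defined over a Hadoop data source with a resource manager location; the table name and the columns are hypothetical.

```sql
--Assumption: dbo.SecondBasemen is an external table over a Hadoop data source
--with a resource manager location configured. Column names are hypothetical.
SELECT
    sb.FirstName,
    sb.LastName,
    sb.Age
FROM dbo.SecondBasemen sb
WHERE
    sb.Age > 30
OPTION (FORCE EXTERNALPUSHDOWN);

--The opposite hint keeps the work on the SQL Server side,
--so no MapReduce job gets created for this query.
SELECT
    sb.FirstName,
    sb.LastName,
    sb.Age
FROM dbo.SecondBasemen sb
WHERE
    sb.Age > 30
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without either hint, the optimizer decides for itself whether pushing the predicate down to Hadoop is worth the cost of spinning up a MapReduce job.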
There’s also discussion of preventing MapReduce job creation as well as a pushdown-related error I had received in the past.
Notice how 3bd shows up for pretty much all of these services. This is not what you’d want to do in a real production environment, but because we want to use Docker and easily pass ports through, it’s the simplest way for me to set this up. If you knew beforehand which node would host which service, you could modify the run.sh shell script that we discussed earlier and open those specific ports.
After assigning masters, we next have to define which nodes are clients in which clusters.
Click through for a screenshot-laden walkthrough.
This is a very interesting set of results. First, 7Zip archived files do not work with the default encoding. I’m not particularly surprised by this result, as 7Zip support is relatively scarce across the board and it’s a niche file format (though a very efficient format).
The next failure case is tar. Tar is a weird case because it missed the first row in the file but was able to collect the remaining 776 records. The same goes for .tar.gz. I unpacked the .tar file, and the constituent SecondBasemen.csv file did in fact have all 777 records, so it’s something weird about the codec.
Stick to BZip2 and GZip if you’re using flat files.
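If you would rather state the compression explicitly than rely on the default, an external file format can name the codec. This is a sketch only: the format name and the comma field terminator are assumptions, not something taken from the tests above.

```sql
--Sketch only: the format name and field terminator are assumptions.
CREATE EXTERNAL FILE FORMAT CsvGzipFileFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    ),
    --Tells Polybase that the underlying files are GZip-compressed.
    DATA_COMPRESSION = N'org.apache.hadoop.io.compress.GzipCodec'
);
```

An external table would then reference this format through its FILE_FORMAT option.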
There are a couple of things I want to point out here. First, the Type is HADOOP, one of the three types currently available: HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).
Second, the Location is my name node on port 8020. If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs. In the Advanced tab, you can see the name node:
There are different options available for different sources, but this post is focused on Hadoop.
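Putting the type and location together, a data source definition might look something like the following sketch. The data source name is hypothetical, and the host name assumes the name node lives at the same sandbox address as Ambari. The resource manager line is optional and only matters for predicate pushdown; 8050 is the port Microsoft documents for Hortonworks on Linux, so treat it as an assumption to verify against your own cluster.

```sql
--Sketch only: the data source name is hypothetical, and the host name
--is assumed to match the sandbox address used for Ambari.
CREATE EXTERNAL DATA SOURCE HDP WITH
(
    TYPE = HADOOP,
    --The name node, listening on port 8020.
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    --Optional: only needed for predicate pushdown.
    --Port 8050 is an assumption; check your YARN configuration.
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);
```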
We’re doing a bunch of setup work here, so let’s take it from the top. First, I declare a consumer group, which I’m calling “Airplane Enricher.” Kafka uses the concept of consumer groups to allow consumers to work in parallel. Imagine that we have ten separate servers available to process messages from the Flights topic. Each flight message is independent, so it doesn’t matter which consumer gets it. What does matter, though, is that multiple consumers don’t get the same message, as that’s a waste of resources and could lead to duplicate data processing, which would be bad.
The way Kafka works around this is to use consumer groups: within a consumer group, only one consumer will get a particular message. That way, I can have my ten servers processing messages “for real” and maybe have another consumer in a different consumer group just reading through the messages getting a count of how many records are in the topic. Once you treat topics as logs rather than queues, consumer design changes significantly.
This is a fairly lengthy read, but directly business-applicable, so I think it’s well worth it.
There are three important things here: first, our Zookeeper port is 2181. Zookeeper is great for centralized configuration and coordination; if you want to learn more, check out this Sean Mackrory post.
The second bit of important information is how long our retention period is. Right now, it’s set to 7 days, and that’s our default. Remember that messages in a Kafka topic don’t go away simply because some consumer somewhere accessed them; they stay in the log until we say they can go.
Finally, we have a set of listeners. For the sandbox, the only listener is on port 6667. We connect to listeners from our outside applications, so knowing those addresses and ports is vital.
This is still quick-start level stuff, but I’m building up to custom development, honest!
First, it’s interesting to note that the Polybase engine uses “pdw_user” as its user account. That’s not a blocker here because I have an open door policy on my Hadoop cluster: no security lockdown because it’s a sandbox with no important information. Second, my IP address on the main machine is 192.168.58.1 and the name node for my Hadoop sandbox is at 192.168.58.129. These logs show that my main machine runs a getfileinfo command against /tmp/ootp/secondbasemen.csv. Then, the Polybase engine asks permission to open /tmp/ootp/secondbasemen.csv and is granted permission. Then…nothing. It waits for 20-30 seconds and tries again. After four failures, it gives up. This is why it’s taking about 90 seconds to return an error message: it tries four times.
Aside from this audit log, there was nothing interesting on the Hadoop side. The YARN logs had nothing in them, indicating that whatever request happened never made it that far.
Here’s hoping there’s a solution in the future.
Now it’s time to take some backups. First, let’s back up the various keys and certificates:
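--Back up the service master key
--Note that the password here is the FILE password and not the KEY password!
--File paths in this script are placeholders; substitute your own backup location.
BACKUP SERVICE MASTER KEY
    TO FILE = '<BackupPath>\ServiceMasterKey.key'
    ENCRYPTION BY PASSWORD = 'Service Master Key Password';
--Back up the database master key
--Again, the password here is the FILE password and not the KEY password.
BACKUP MASTER KEY
    TO FILE = '<BackupPath>\DatabaseMasterKey.key'
    ENCRYPTION BY PASSWORD = 'Database Master Key Password';
--Back up the TDE certificate we created.
--We could create a private key with password here as well.
BACKUP CERTIFICATE [TDECertificate]
    TO FILE = '<BackupPath>\TDECertificate.cer'
    WITH PRIVATE KEY
    (
        FILE = '<BackupPath>\TDECertificate.key',
        ENCRYPTION BY PASSWORD = 'Some Private Key Password'
    );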
Click through for the details.