To use this post to play around with streaming data, you need an AWS account and AWS CLI configured on your machine. The entire pattern can be implemented in few simple steps:
Create an Amazon Kinesis stream.
Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.
Use a Simple Java producer to push random IoT events data into the Amazon Kinesis stream.
Connect to the Zeppelin notebook.
Import the Zeppelin notebook from GitHub.
Analyze and visualize the streaming data.
This is a good way of getting started with streaming data. I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.
Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table). We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.
Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats. The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.
The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.
Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand the metrics that you should really care about that affects the performance of your application as well as your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.
In Spark, it is well known that Memory related issues are typical if you haven’t paid attention to the memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically, the executors and the driver. Garbage collection stalls or abnormality in patterns can increase back pressure.
There are a few metrics of note here. Check it out.
meinestadt.de web servers generate up to 20 million user sessions per day, which can easily result in up to several thousand HTTP GET requests per second during peak times (and expected to scale to much higher volumes in the future). Although there is a permanent fraction of bad requests, at times the number of bad requests jumps.
The meinestadt.de approach is to use a Spark Streaming application to feed an Impala table every n minutes with the current counts of HTTP status codes within the n minutes window. Analysts and engineers query the table via standard BI tools to detect bad requests.
What follows is a fairly detailed architectural walkthrough as well as configuration and implementation work. It’s a fairly long read, but if you’re interested in delving into Hadoop, it’s a good place to start.
Any Select query fails with the following error.
Msg 106000, Level 16, State 1, Line 1
Java heap space
Illegal input may cause the java out of memory error. In this particular case the file was not in UTF8 format. DMS tries to read the whole file as one row since it cannot decode the row delimiter and runs into Java heap space error.
Convert the file to UTF8 format since PolyBase currently requires UTF8 format for text delimited files.
I imagine that this page will get quite a few hits over the years, as there currently exists limited information on how to solve these issues if you run into them, and some of the error messages (especially the one quoted above) have nothing to do with root causes.
Polybase offers the ability to create statistics on tables, the same way that you would on normal tables. There are a few rules about statistics:
Stats are not auto-created. You need to create all statistics manually.
Stats are not auto-updated. You will need to update all statistics manually, and currently, the only way you can do that is to drop and re-create the stats.
When you create statistics, SQL Server pulls the data into a temp table, so if you have a billion-row table, you’d better have the tempdb space to pull that off. To mitigate this, you can run stats on a sample of the data.
Round one did not end on a high note, so we’ll see what round two has to offer.
We’ll look at both zTot and nTot, and consider the player’s age and experience.The latter is potentially important because there have been shifts in what ages players joined the league over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
Jordan also looks at player performance over time and makes data analysis look pretty easy.
An interesting thing about FIELD_TERMINATOR is that it can be multi-character. MSDN uses ~|~ as a potential delimiter. The reason you’d look at a multi-character delimiter is that not all file formats handle quoted identifiers—for example, putting quotation marks around strings that have commas in them to indicate that commas inside quotation marks are punctuation marks rather than field separators—very well. For example, the default Hive SerDe (Serializer and Deserializer) does not handle quoted identifiers; you can easily grab a different SerDe which does offer quoted identifiers and use it instead, or you can make your delimiter something which is guaranteed not to show up in the file.
You can also set some defaults such as date format, string format, and data compression codec you’re using, but we don’t need those here. Read the MSDN doc above if you’re interested in digging into that a bit further.
It’s a bit of a read, but the end result is that we can retrieve data from a Hadoop cluster as though it were coming from a standard SQL Server table. This is easily my favorite feature in SQL Server 2016.
Microsoft’s next recommendation is to make sure that predicate pushdown is enabled. To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there. To do that, cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text reader you want). Copy the value for yarn.application.classpath, which should be a pretty long string. Mine looks like:
Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml. Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.
This is part one of a series on using Polybase.
The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set. Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.
In this use case we will configure a Time-partitioned File Set that stores data in Avro format by usingTPFSAvro as the sink.
I like the fact that there’s a UI for this. Between this tool and NiFi, the Hadoop ecosystem is getting some tools to make data migration easier to understand, and I think that will help adoption.