Press "Enter" to skip to content

Day: August 2, 2016

Don’t Use Cron For Scheduling Hadoop Jobs

Matthew Rathbone explains why cron is not a great choice for scheduling Hadoop and Spark jobs:

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.

Comments closed

Plotting Variables Against One Another

Simon Jackson shows how to plot multiple variables against one another using R:

This post is an extension of a previous one that appears here:https://drsimonj.svbtle.com/quick-plot-of-all-variables.

In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use purrr::keep() if you want only variables of a particular type).

Read on for code, including a good bit of tidyr.

Comments closed

Multi-Column, Auto-Created Statistics

Shaun J. Stuart looks into a scenario in which it appears that multi-column, auto-created statistics were generated:

Wow.. That sure looks like three auto-created, multi-column statistics! We have three stats: stats_ids 3, 4, and 5. The sys.stats_column table contains one row for each column that is in a statistic, so multiple rows for a single statistic (i.e., a single stats_id value), indicate multiple columns in that stat. Indeed, the column_id values indicate the table columns contained the stat. So stats_id 3 contains columns VersionMajor and ApplicationID (column_ids 3 and 1), stats_id 4 contains columns VersionMinor and ApplicationID (column_ids 4 and 1), and stats_id 5 contains columns VersionRevision and ApplicationID (column_ids 5 and 1). And, clearly, the auto_created flag is true, so these three stats were auto-created. What’s going on?

Read on for the answer.

Comments closed

Structured Streaming

Andrew Ray explains streaming solutions using Spark 2.0:

If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.

A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn’t be only on processing time anymore, we can rearrange events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a 1 minute window:

Right now, there are some pretty strict limitations on this new streaming, but I imagine they’ll loosen up quite soon.

Comments closed

RetainSameConnection

I explain what the RetainSameConnection property on an Integration Services connection does:

My co-worker had set up a dynamic connection (see Rafael Salas or Hari Bagra for details on how to do this), but something weird was happening:  the package was trying to push everything to the same server.  I confirmed that if all relevant customers loaded were for the same server, that the process would work correctly, and that I could run each server load one at a time, so there weren’t any problems connecting to particular servers or parameters overriding this choice.  It’s like the connection was “sticky,” connecting successfully to the first server and then ignoring the later changes.

RetainSameConnection allows for certain benefits, but has specific limitations.  Click through to see those details.

Comments closed

Shredding Event Data

Jason Brimhall has a script to shred extended events:

In the following script, I have tried to accomplish just that – a single script to create the entire XML parser for me, for you, and for anybody wishing to use it. I don’t want to have to remember the subtle nuances of how to parse each of the events each time I need to parse the session data. I want something that is quick, easy, and repeatable.

With all of that said, here is the script that I now use to parse my session data. You should notice that it has been simplified and is more extensive now.

Jason also has sample usage.  Check this out for sure.

Comments closed

Impersonation

Kenneth Fisher shows how to use impersonation to perform tasks without being explicitly granted permissions:

A developer wants to be able to truncate a table.

This isn’t an unreasonable request right? She’s writing a piece of code that loads a bunch of data into a staging table. She want’s to be able to truncate the table once the load is over. Unfortunately the permission required to do this is ALTER on the table. That’s not just going to let her truncate the table, it’s going to let her change the structure of the table. Not acceptable in a production environment. A development environment sure. Not a production one. So what do we do?

We use impersonation.

Check out the post to see how to do this.

Comments closed

Differences In Type

Grant Fritchey explains the differences between table variables, temporary tables, and common table expressions:

Don’t go getting all excited. I recognize that these two plans look similar, but they are different. First, let me point out that we have more reads with 1546 and an increase in duration to 273ms. This comes from two places. First, we’re creating statistics on the data in the temporary table where none exist on the table variable. Second, because I want to run this script over and over, I’m including the DROP TABLE statement, which is adding overhead that I wouldn’t see if I treated it like the table variable (which I could, but not here). However, breaking down the to the statement level, I get 250ms duration, just like with the table variable, but, I see 924 reads.

What’s going on?

There certainly are differences between the three.  Read on for more details.

Comments closed

Polybase Data Loading

Meagan Longoria explains that loading data using Polybase can be finicky:

First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.

One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.

This is definitely something to keep in mind.  I haven’t dealt with data with newlines within attributes, so I haven’t run into this yet, but don’t let it bite you.

Comments closed