Press "Enter" to skip to content


Data Lake Organization Tips

Melissa Coates has some great advice for people working with data lakes:

Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?

Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.

Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv

Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv

Click through for all of Melissa’s advice in FAQ form.


Calculating Median In SQL Server 2019

Tomaz Kastrun shows that batch mode window aggregation finally allows PERCENTILE_CONT to become useful:

The next query, for median calculation, was a window function query.


SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1)
        OVER (PARTITION BY (SELECT 1)) AS MedianCont
FROM t1;


To my surprise, the performance was even worse. At this time, I have to say, I was running this on SQL Server 2017 with CU7. But luckily, I also had SQL Server 2019 CTP 2.0 installed, and here, with no further optimization, the query ran in a little over 1 second.

I’ve warned people away from this function for all but tiny data sets because of how poorly it performs. With SQL Server 2019, I might be able to recommend it.


Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package:

The above code assumes you have the wrapr package attached via already having run library('wrapr').

Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t to make R more like Python, but to adapt a good idea from Python to improve R.
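
If you want a quick taste of the inline-operator style before clicking through, here’s a minimal sketch using wrapr’s dot-arrow pipe, %.>%. That’s just one of the package’s operators, and not necessarily the ones John defines in the post:

library(wrapr)

# The dot-arrow pipe: "." is the placeholder for the left-hand value
# as it flows through each expression.
16 %.>% sqrt(.) %.>% log(., base = 2)
# [1] 2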

Also, it’s a little late to pick up the discount (though Manning has discounts pretty much every day, so be patient and you’ll find 40+ percent off), but check out the second edition of Practical Data Science with R by John Mount and Nina Zumel. I’ve held off on reading it so far because I want to wait until it’s closer to completion, but it is on my to-read list.


Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way:

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3 MB zipped and contains 2000 reviews in total.

Unlike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.

Assuming I ever get a chance to do this talk again, I’m probably going to change the data sets in the example given how overplayed iris is.
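
If you want the quickest possible taste of the algorithm in R, here’s a minimal sketch using the e1071 package on (yes) iris as a stand-in; the movie-review classifier in the post layers a document-term matrix on top of the same idea:

library(e1071)  # provides naiveBayes()

set.seed(106)
train_rows <- sample(nrow(iris), 0.7 * nrow(iris))

# Fit per-class feature distributions, then predict the holdout set.
model <- naiveBayes(Species ~ ., data = iris[train_rows, ])
preds <- predict(model, iris[-train_rows, ])
mean(preds == iris$Species[-train_rows])  # holdout accuracy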


An Explanation Of Convolutional Neural Networks

Shirin Glander explains some of the mechanics behind Convolutional Neural Networks:

Convolutional Neural Nets are usually abbreviated either CNNs or ConvNets. They are a specific type of neural network that has very particular differences compared to MLPs. Basically, you can think of CNNs as working similarly to the receptive fields of photoreceptors in the human eye. Receptive fields in our eyes are small connected areas on the retina where groups of many photoreceptors stimulate far fewer ganglion cells. Thus, each ganglion cell can be stimulated by a large number of receptors, so that a complex input is condensed into a compressed output before it is further processed in the brain.
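
To make the receptive field analogy concrete, here’s a minimal sketch, assuming the keras R package with a working TensorFlow backend, in which each 3x3 convolutional filter sees only a small local patch of the image and pooling condenses the result:

library(keras)

model <- keras_model_sequential() %>%
  # Each 3x3 filter looks at a small local patch, like a receptive field.
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  # Pooling condenses the local responses into a compressed output.
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")

summary(model)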

If you’re interested in understanding why a CNN will classify the way it does, chapter 5 of Deep Learning with R is a great reference.


Getting CSV Row Counts

Dave Mason shares a few techniques for getting row counts of CSV files:

I was reminded of this recently as I was working with R, trying to read a nearly 2 GB data file. I wanted to read in 5% of the data and output it to a smaller file that would make the test code run faster. The particular function I was working with needed a row count as one of its parameters. For me, that meant I had to determine the number of rows in the source file and multiply by 0.05. I tied the code for all of those tasks into one script block.
Now, not at all to my surprise, it was slow. In my short experience, I’ve found R isn’t particularly snappy, even when the data can fit comfortably in memory. I was pretty sure I could beat R’s record count performance handily with C#. And I did. I found some related questions on StackOverflow. A small handful of answers discussed the efficiency of various approaches. I only tried two C# variations: my original attempt, and a second version that was supposed to be faster (the improvement was nominal).

To fact-check Dave (because this blog is about nothing other than making sure Dave is right), I checked the source code to the wc command. That command also streams through the entire file, so Dave’s premise looks good.
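
If you’d rather stay in R than drop down to C#, streaming the file in chunks at least keeps memory flat while still touching every line. A minimal base-R sketch (the file name is hypothetical):

# Counts physical lines by streaming the file in chunks, so memory use
# stays constant no matter how large the file is. Note: quoted CSV fields
# with embedded newlines would need a real CSV parser instead.
count_rows <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    batch <- readLines(con, n = chunk_size)
    if (length(batch) == 0L) break
    n <- n + length(batch)
  }
  n
}

count_rows("reviews.csv")  # hypothetical 2 GB file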


CosmosDB Time To Live Support

Hasan Savran explains the Time To Live (TTL) counter in CosmosDB:

Another great feature of Cosmos DB is TTL (Time To Live) support. This is a great option to have if you need a database system with a caching option, or if you need to purge your data and don’t want to develop a function to remove data from your dataset. Cosmos DB’s TTL feature is pretty simple: all you need to do is turn the feature on and declare when data should be removed from your dataset. The best part about TTL in Cosmos DB is that Cosmos DB does not charge you when it removes the data from your containers, so you can use your Request Units for other transactions!

There are two ways to set the Time to Live value. You can set the TTL value on a container, or you can set it on a specific item by using the Cosmos DB SDK. The TTL value must be in seconds, and the TTL timer resets if data gets modified for any reason.
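
As a rough illustration of the item-level option (the document below is hypothetical, rendered with jsonlite), the override is just a ttl property, in seconds, on the item itself; the container-level equivalent is its DefaultTimeToLive setting, where -1 enables TTL with no default expiry:

library(jsonlite)

# Hypothetical document: this item expires 300 seconds after its last
# write, overriding whatever DefaultTimeToLive the container has set.
item <- list(
  id   = "session-42",
  cart = c("widget", "gadget"),
  ttl  = 300L
)
cat(toJSON(item, auto_unbox = TRUE, pretty = TRUE))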

Click through for an example of it in action.


Minimal Rights For Bulk Inserts

Timothy Smith takes us through least privilege while allowing bulk insert operations:

While this file path serves as a useful location for us to load flat files, we should consider that the user account executing the underlying insert statement must be able to read (and possibly write to) that file location. The writing part of the equation comes in when logging is involved: even if the permissions on the written logging data are strictly tied down, the user doesn’t control what gets written, only that errors are written. At the least, we want to ensure that a separate folder with strict permissions exists for any flat file import, to restrict the account’s access. Notice that we’re not reading off the root drive, as we’ve seen that we can insert an entire file of data; think about using SQL bulk insert to view files through SQL Server by inserting the file’s data and reviewing it.

It’s more than just “check the box for the server-level role.”


Bubbling Up HTTP Status Errors In Power Query

Tony McGovern takes us through a Power Query and M technique for giving end users useful information when a web request fails:

So how does this relate to error tables? Like most well-documented APIs, the U.S. Census Bureau API has a page devoted to listing and describing all the possible response codes that can be returned by their service. I take this information and build an internal table within the query that defines and describes these response codes in my own words. I’m now able to throw custom messages that make the difference between a 400 response code and a 404 response code more obvious.

For example, in the code below, I use the Error.Record function to create individual records that allow me to catch these unsuccessful requests and throw my own custom error messages to the user. I then create an extra field in each record called ‘Status’, which maps each HTTP response code returned by the API to a corresponding error message of my choosing:

There’s a bit of work, but the end result is a fairly simple explanation for end users.


Auditing SQL Agent Jobs

Jason Brimhall has some clever techniques for auditing SQL Agent Jobs with Extended Events:

Once upon a time, I was in the position of trying to figure out why a job failed. After a bunch of digging and troubleshooting, it was discovered that the job had changed but nobody knew when or why. Because of that, I was asked to provide a low-cost audit solution to try to at least provide answers to the when and who of the change.

Tracking who made a change to an agent job should be a task in each database professional’s checklist / toolbox. Being caught off guard by a change to a system under your purview isn’t necessarily a fun conversation, and it’s no more pleasant to be the one to find that somebody changed your jobs without notice, two weeks after the fact! Usually, that means there is little to no information about the change and you find yourself getting frustrated.

Click through to see how Jason does it.
