Azure Data Lake Updates

Michael Rys has the October updates for Azure Data Lake:

We seem to be just cranking out new stuff :). Here are the October 2016 Updates for Azure Data Lake U-SQL!

The main take away is that the October refresh has now removed the old deprecated syntax of the items we have announced over the last couple of release notes!

Thanks for those who volunteered to test the new version of more scalable file set. Please contact us if you want to try it and help us validate it.

Click through for the release notes.

Machine Learning Algorithms In R

Ginger Grant has a list of machine learning algorithms and their implementations in R:

Often times determining which algorithm to use can take a while.  Here is a pretty good flowchart for determining which algorithm should be used given some examples of what the desired outcomes and data contain. The diagram lists the algorithms, which are implemented in Azure ML.  The same algorithms can be implemented in R.  In R there are libraries to help with nearly every task.  Here’s a list of libraries and their accompanying links which can be used in Machine Learning.  This list is no means comprehensive as there are libraries and functions other than the ones listed here, but if you are trying to write a Machine Learning Experiment in R, and are looking at the flowchart, these R functions and Libraries will provide the tools to do the types of Machine Learning Analysis listed.

I think algorithm determination is one of the most difficult parts of machine learning.  Even if you don’t mean to go there, the garden of forking paths is dangerous.

Custom Sorts

Kevin Feasel

2016-10-18

T-SQL

Rob Farley looks at ways of sorting data more efficiently:

Another option, which is more longwinded (some might suggest that would suit me – and if you thought that: Oi! Don’t be so rude!), and uses more reads, is to consider what we’d do in real life if we needed to do this.

If I had a pile of 73,595 orders, sorted by Salesperson order, and I needed to return them with a particular Salesperson first, I wouldn’t disregard the order they were in and simply sort them all, I’d start by diving in and finding the ones for Salesperson 7 – keeping them in the order they were in. Then I’d find the ones that weren’t the ones that weren’t Salesperson 7 – putting them next, and again keeping them in the order they were already in.

My first inclination is to think that this is a fragile solution—what about parameterization?  Will that affect the execution plan in unexpected ways?  I like the approach, however, and will have to add it to the toolbox for those cases in which it makes sense.

Kafka Consumer

Kevin Feasel

2016-10-18

Hadoop

I build a consumer and aggregator of Kafka data:

From here, I hook into the OnMessage event just like before, and like before we decode the Kafka payload and turn it into a string.  Unlike before, however, I call Newtonsoft’s DeserializeObject method and return a Flight type, which I’ve defined above.  This is the same definition as in the Producer, so in a production-quality environment, I’d pull that out to a single location rather than duplicating it.

Going back to the main function, I call the consumer.Start() method and let ‘er rip.  When I’m ready to aggregate, I’ll hit the enter key and that’ll call consumer.Stop().  When that happens, I’m going to have up to 7 million records in a list called flights.  Out of all of this information, I only need two attributes:  the destination state and the arrival delay in minutes.  I get those by using the map function on my sequence of flights, taking advantage of F#’s match syntax to get all relevant scenarios safely and put the result into a tuple.  The resulting sequence of tuples is called flightTuple.  I pass that into the delaysByState function.

By the time I give this presentation, I’m going to change the way I aggregate just a little bit to cut down on the gigs of RAM necessary to do this operation.  But hey, at least it works…

Computed Columns And Indexes

Derik Hammer looks at computed columns with a bonus section on unique indexes (or maybe vice versa):

What can an index do that a constraint cannot?

  • Set FILL FACTOR.
  • Add non-key (INCLUDED) columns.
  • Data compression.
  • Lock management settings.

What can a constraint do that an index cannot?

The only potential benefit I was able to find, or think of, is that constraints can be disabled. If you could enable and disable a unique constraint, that could be one feature that the unique index does not. This is counter-intuitive knowing that the unique constraint is enforced with a unique index behind the scenes.

For me, the big benefit of unique constraints over unique indexes is that they provide a logical separation.  Unique constraints show up in the Keys section in Management Studio and let you say, “Yes, I made this thing because the data model requires uniqueness here.”  By contrast a unique index can tell the same thing, but could also say “for the subset of data in a filtered index, I can assume uniqueness” or “for performance reasons, this combination is unique, but its uniqueness is not important to the logical data model.”  I will happily admit that I’m splitting the hair pretty thin on this one and that in practice, there are benefits to using unique indexes across the board.

DMVs For Azure SQL Database

Arun Sirpal looks at some Azure-specific DMVs, followed by the wait stats DMV:

Sys.geo_replication_links

This query will return primary and secondary databases in a geo-replication partnership. Here you will get the status of replication which could be in 1 of 3 states – pending, seeding or catch-up. You should run this under the master database.

Read on for the rest.

Advanced Time Slicer

Devin Knight’s custom visuals series moves on to the next visual:

  • Great way to visualize and filter time data

  • Filtering the visual will provide a miniaturized preview of the data for the provided time period

I like this visual; you get the benefit of focus while still keeping the full picture on a time series.

Validating Views

Kevin Feasel

2016-10-18

Biml, T-SQL

Ben Weissman has a Biml snippet to check each view to make sure it is still valid:

As I recently got asked for it in a talk, this piece of code gives you all the Views in a database that are currently broken.
This could be useful for “what if”-scenarios when playing with your metadata.

Click through for the code.  This is another in Ben’s enjoyable ongoing series of non-ETL things you can do with Biml.

Snippets

Kenneth Fisher has a demonstration of snippets in Management Studio:

A while ago I talked about Templates. This is an easy way to get a, well, template of a piece of code. But a much faster way to get a simple template is to use asnippet. With a simple key command (ctrl-K, ctrl-X) you can open up the snippet picker and quickly navigate to the snippet you need.

Note: You can also open the snippet picker by right clicking and selectingInsert Snippet… or from the menu Edit -> Intellisense -> Insert Snippet…

Further note: snippet picker sounds really odd but it was the name in BOL so we’ll go with it.

Click through for a demo.

Page And Key WaitResources For Deadlocks

Kendra Little explains page and key information in deadlock graphs and blocking chains:

1.4) Can I see the data on the page that was locked?

Well, yes. But … do you really need to?

This is slow even on small tables. But it’s kinda fun, so… since you read this far… let’s talk about %%physloc%%!

%%physloc%% is an undocumented piece of magic that will return the physical record locator for every row. You can  use %%physloc%% with sys.fn_PhysLocFormatter in SQL Server 2008 and higher.

This was a very interesting read; check it out.

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31