Month: July 2017

Copying and pasting is a feature that’s available in nearly every text editor (“nearly” — anyone remember the days before iOS had a clipboard?).

However, SQL Server Management Studio goes above and beyond the regular copy and paste feature set — it has a clipboard ring.

What’s a clipboard ring, you ask?

The clipboard ring lets you cycle through the last 20 things you copied onto your clipboard when you go to paste in SSMS.

Read on to learn how to use the clipboard ring. When I remember that it exists, it’s quite useful.
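
To make the idea concrete, here is a minimal Python sketch of a clipboard ring: a bounded history of the last 20 copied items that a paste command can cycle through. The `ClipboardRing` class and its method names are hypothetical; this only illustrates the concept and is not how SSMS implements it.

```python
from collections import deque


class ClipboardRing:
    """Hypothetical bounded clipboard history; not SSMS's actual implementation."""

    def __init__(self, capacity=20):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._items = deque(maxlen=capacity)
        self._cursor = 0  # 0 means "the most recent copy"

    def copy(self, text):
        self._items.append(text)
        self._cursor = 0  # a new copy resets cycling to the newest item

    def paste(self):
        """Return the item at the cursor (the newest one right after a copy)."""
        if not self._items:
            return None
        return self._items[-1 - self._cursor]

    def cycle(self):
        """Step back to the next-older item, wrapping around, and return it."""
        if not self._items:
            return None
        self._cursor = (self._cursor + 1) % len(self._items)
        return self.paste()


ring = ClipboardRing()
for snippet in ["SELECT 1;", "SELECT 2;", "SELECT 3;"]:
    ring.copy(snippet)

first = ring.paste()   # newest item
second = ring.cycle()  # one step older
```

Using a fixed-size deque means the twenty-first copy simply pushes the oldest entry out, which matches the "last 20 things" behavior described above.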

Top Jobs Report

Published 2017-07-26 by Kevin Feasel

Kenneth Fisher only looks at the best SQL Agent jobs:

But sometimes what you want is an overall history. Maybe you want to see your longest running jobs? Or the most frequent jobs? Or jobs that are failing? There is a great pre-built report just for that.

Now, be warned. The report and the history view both pull from the system views and they only hold so much history. You need to make s

With that warning stated, this is a good report if you just want to get some idea of what’s happening on your server.

Prioritizing Rows In A Union

Published 2017-07-26 by Kevin Feasel

Daniel Hutmacher shows several ways to combine data from multiple sources, prioritized by source:

For the sake of simplicity, I’ll assume that our example data has a clustered index on the “primary key” that we want to use to determine which rows have already been included in the set – in our example, the primary key is (FirstName, LastName). The data I’m using is a fictional example, but here’s the gist:

#Employees has about 33 000 rows.

#Customers has about 44 000 rows.

#Passengers has about 500 000 rows.

The data is constructed in a way that these queries should return 530 000 rows, so we’re looking at some overlap but far from totally overlapping rows.

Example #492,806 that T-SQL is not a true fourth-generation language: how you write the query can matter greatly for performance.
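
As a rough illustration of what "prioritized by source" means, the sketch below uses Python's built-in sqlite3 module with three tiny tables and the UNION ALL plus NOT EXISTS pattern. This is only one of several possible approaches and is not necessarily the one the linked post recommends. The sample rows and the `Source` column are made up for the example, and plain tables stand in for the temp tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees  (FirstName TEXT, LastName TEXT, Source TEXT,
                             PRIMARY KEY (FirstName, LastName));
    CREATE TABLE Customers  (FirstName TEXT, LastName TEXT, Source TEXT,
                             PRIMARY KEY (FirstName, LastName));
    CREATE TABLE Passengers (FirstName TEXT, LastName TEXT, Source TEXT,
                             PRIMARY KEY (FirstName, LastName));

    -- Made-up sample rows: Ann appears everywhere, Bob in two sources.
    INSERT INTO Employees  VALUES ('Ann', 'Lee', 'employee');
    INSERT INTO Customers  VALUES ('Ann', 'Lee', 'customer'),
                                  ('Bob', 'Ray', 'customer');
    INSERT INTO Passengers VALUES ('Ann', 'Lee', 'passenger'),
                                  ('Bob', 'Ray', 'passenger'),
                                  ('Cy',  'Orr', 'passenger');
""")

# Priority order: Employees first, then Customers, then Passengers.
# Each lower-priority source only contributes keys not already seen.
query = """
    SELECT FirstName, LastName, Source
    FROM Employees
    UNION ALL
    SELECT c.FirstName, c.LastName, c.Source
    FROM Customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM Employees AS e
                      WHERE e.FirstName = c.FirstName
                        AND e.LastName = c.LastName)
    UNION ALL
    SELECT p.FirstName, p.LastName, p.Source
    FROM Passengers AS p
    WHERE NOT EXISTS (SELECT 1 FROM Employees AS e
                      WHERE e.FirstName = p.FirstName
                        AND e.LastName = p.LastName)
      AND NOT EXISTS (SELECT 1 FROM Customers AS c
                      WHERE c.FirstName = p.FirstName
                        AND c.LastName = p.LastName)
    ORDER BY FirstName, LastName
"""
rows = conn.execute(query).fetchall()
```

Each person comes back exactly once, labeled with the highest-priority source that contains them. How well a given formulation performs on SQL Server depends on the plan the optimizer picks, which is the point of the original post.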

Kafka Streams Basics

Published 2017-07-25 by Kevin Feasel

Anuj Saxena walks through Kafka Streams and provides a quick example:

The features provided by Kafka Streams:

Highly scalable, elastic, distributed, and fault-tolerant application.
Stateful and stateless processing.
Event-time processing with windowing, joins, and aggregations.
We can use the most common transformation operations, already defined in the Kafka Streams DSL, or the lower-level processor API, which allows us to define and connect custom processors.
Low barrier to entry, which means it does not take much configuration and setup to run a small scale trial of stream processing; the rest depends on your use case.
No separate cluster requirements for processing (integrated with Kafka).
Employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time based windowing operations with the late arrival of records.
Supports Kafka Connect to connect to different applications and databases.
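
To give a feel for one-record-at-a-time, event-time windowing with late arrivals, here is a small Python sketch. It does not use Kafka or the Kafka Streams API. The window size, the `GRACE_MS` allowance, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # tumbling window size (assumed for illustration)
GRACE_MS = 10_000   # how long after a window ends late records are still accepted


def window_start(event_time_ms):
    """Align an event timestamp to the start of its tumbling window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)


def count_by_window(records):
    """Process (key, event_time_ms) records one at a time, in arrival order."""
    counts = defaultdict(int)
    stream_time = 0  # highest event time observed so far
    dropped = []
    for key, event_time_ms in records:
        stream_time = max(stream_time, event_time_ms)
        start = window_start(event_time_ms)
        # A record is too late once its window end plus grace has passed.
        if start + WINDOW_MS + GRACE_MS <= stream_time:
            dropped.append((key, event_time_ms))
            continue
        counts[(key, start)] += 1
    return dict(counts), dropped


records = [
    ("a", 5_000),
    ("a", 65_000),
    ("a", 58_000),   # out of order, but still within the grace period
    ("a", 130_000),
    ("a", 59_000),   # arrives after its window has closed
]
counts, dropped = count_by_window(records)
```

Records are assigned to windows by the timestamp they carry rather than by when they arrive, so an out-of-order record can still update an earlier window until that window is considered closed.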

Read on for more details as well as a sample script to get started.

Random Forests In R

Published 2017-07-25 by Kevin Feasel

Anish Sing Walia explains the basics of random forests and provides sample code in R:

Random Forests are similar to a famous Ensemble technique called Bagging but have a different tweak in it. In Random Forests the idea is to decorrelate the several trees which are generated on the different bootstrapped samples from training Data. And then we simply reduce the Variance in the Trees by averaging them.
Averaging the Trees helps us to reduce the variance and also improve the Performance of Decision Trees on Test Set and eventually avoid Overfitting.

The idea is to build lots of Trees in such a way to make the Correlation between the Trees smaller.
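
Why decorrelating the trees matters can be seen from a standard result: the average of B identically distributed estimates, each with variance σ² and pairwise correlation ρ, has variance ρσ² + (1 − ρ)σ²/B. The Python sketch below just evaluates that formula; the function name and the sample values are illustrative assumptions, not taken from the linked post.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees equally correlated estimates.

    sigma2  -- variance of a single tree's prediction
    rho     -- pairwise correlation between trees (0 = independent)
    n_trees -- number of trees being averaged
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# With 100 trees, the correlation term is what limits the benefit of averaging.
for rho in (0.0, 0.3, 0.9):
    print(f"rho={rho}: variance of the average = {ensemble_variance(1.0, rho, 100):.4f}")
```

Adding more trees shrinks only the second term, so lowering the correlation between trees is what lets averaging keep paying off.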

Random forests frequently give a good answer to classification problems, enough so as to make them a nice starting point.

Using seplyr Instead Of dplyr

Published 2017-07-25 by Kevin Feasel

John Mount explains seplyr and why it can be better for certain use cases than dplyr:

seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Click through for a great example, and also read John’s comment on the Pascal-style assignment operator he uses.

Data Lake Analysis With Excel And Power BI

Published 2017-07-25 by Kevin Feasel

Sachin C Sheth announces support for Azure Data Lake Store within Excel and Power BI:

Until now, if you had to analyze data stored in ADLS with Excel, you would have to copy it into a relational data store like Azure SQL Data Warehouse or download the data onto a machine, and then use Excel to analyze that data. This was rather cumbersome involving additional cost and time. With this new support, you can now access files stored in ADLS with Excel in-place, without having to copy them to other stores or locations. You can quickly get advanced insights into raw or prepared data. Models and queries you have created using Excel that ran against local data, can be run seamlessly against data stored in ADLS.

Security capabilities of ADLS allow administrators to control access to the data stored in ADLS in a discretionary manner. With this you can limit the access that Excel users have for the data in ADLS. In this manner, data in the ADLS-based data lake continues to be the single source of truth with no redundant copies and can be analyzed by analytics tools of your own choice.

Click through for a demo video.

Diving Into The Data Lake

Published 2017-07-25 by Kevin Feasel

Jesse Gorter explains the data lake metaphor:

A data lake is a concept that opposes the idea of a data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge data collection that is unstructured and raw. You could also say that a data mart is a bottle of clean water whereas the data lake is the lake with (not so clean) water. 🙂

Now why would you want a data lake? Imagine you are generating huge logfiles, for example in airplanes. Machines that track air pressure, temperature etc. If something goes wrong, you definitely want to be alerted. That is event-driven: “if A and B happen, alert pilot, or do C” and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore right?

Well, some people would say: “Wrong”. You might need that data later for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data, including data they are not sure they will ever need. This data could later prove to be valuable for AI, machine learning or for something else.

Read the whole thing. The data lake concept is powerful, but it requires at least as much data governance as prior models. Just because you can dump a bunch of files without thinking about it doesn’t mean you’ll get back something useful later.

Mixing VARCHAR And NVARCHAR

Published 2017-07-25 by Kevin Feasel

Solomon Rutzky walks through some of the nuance of mixing VARCHAR and NVARCHAR data types with respect to your collation:

There are two types of Collations in SQL Server: SQL Server Collations and Windows Collations:

SQL Server Collations (those starting with SQL_) are older Collations that were the only ones available prior to SQL Server 2000. These Collations use simplistic sort orders, and do not handle the great variety of linguistic rules defined by Unicode. In fact, they do not have any Unicode rules defined at all, so N[VAR]CHAR data in these Collations will actually use OS-level Collation rules.
Windows Collations (those not starting with SQL_) were introduced in SQL Server 2000. These Collations not only have the Unicode rules defined, but they also apply those same linguistic rules to [VAR]CHAR data. While this does come at a slight cost to performance, it also allows for consistency of behavior. And it is this consistency that helps out greatly when there is a mismatch of datatypes (as we will see in a moment).

The simple advice to avoid mixing NVARCHAR and VARCHAR data types is still sound, but do read the whole thing.

Reading Drive Files Recursively From T-SQL

Published 2017-07-25 by Kevin Feasel

Slava Murygin shows how to use a common table expression to read a recursive drive listing:

In this blog I’ll use the undocumented SQL Server extended stored procedure “xp_dirtree”.
That procedure is well described in Patrick Keisler’s blog.

Briefly, the “xp_dirtree” extended procedure returns a tree of sub-directories from a given folder and has three parameters:
1. Starting or root folder.
2. Depth level – determines how deep SQL Server will dig into a sub-directory structure. If it is set to zero or omitted, the whole tree will be returned.
3. File flag – if omitted or set to zero, only folders will be returned. If set to “1”, files will be returned as well.
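
The three parameters described above can be mimicked in a few lines of Python, which may help when reasoning about what the procedure returns. This is a loose analogue built on `os.scandir`, not a wrapper around xp_dirtree. The `dir_tree` function and its `(name, level, is_file)` output shape are assumptions for illustration.

```python
import os


def dir_tree(root, depth=0, include_files=0):
    """Return (name, level, is_file) tuples for everything under root.

    depth         -- 0 (or omitted) walks the whole tree; N stops after N levels
    include_files -- 0 (or omitted) lists folders only; 1 lists files as well
    """
    results = []

    def walk(folder, level):
        if depth and level > depth:
            return
        for entry in sorted(os.scandir(folder), key=lambda e: e.name):
            # Symlinks are not followed, so a link back to a parent
            # directory cannot cause infinite recursion here.
            if entry.is_dir(follow_symlinks=False):
                results.append((entry.name, level, 0))
                walk(entry.path, level + 1)
            elif include_files:
                results.append((entry.name, level, 1))

    walk(root, 1)
    return results
```

For example, `dir_tree("C:\\Temp", 2, 1)` would list folders and files down to two levels below the starting folder, assuming that path exists.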

If you want to be really mean to somebody, create a shortcut to the parent directory within a directory.
