Press "Enter" to skip to content

Month: September 2017

Trigram Search In SQL Server

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

  1. Persist three-character substrings (trigrams) of the target data.
  2. Split the search term(s) into trigrams.
  3. Match search trigrams against the stored trigrams (equality search).
  4. Intersect the qualified rows to find strings that match all trigrams.
  5. Apply the original search filter to the much-reduced intersection.

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read. N-gram storage in SQL Server is an example of a non-obvious data architecture that performs much better than the obvious alternative, at least when the conditions are right.
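To make the shape of the technique concrete, here is a minimal T-SQL sketch (not Paul's implementation; the table and column names are hypothetical, and trigram generation and maintenance are left out):

    -- Hypothetical target table: dbo.Examples (ID int, Val varchar(50)).
    -- Persist every three-character substring of Val, keyed for
    -- equality seeks on the trigram.
    CREATE TABLE dbo.ExampleTrigrams
    (
        Trigram char(3) NOT NULL,
        ID int NOT NULL,
        PRIMARY KEY (Trigram, ID)
    );

    -- Searching for LIKE '%abcd%': split 'abcd' into its trigrams
    -- ('abc' and 'bcd'), intersect the qualifying IDs, then apply the
    -- original filter to the much smaller set of candidate rows.
    SELECT e.ID, e.Val
    FROM dbo.Examples AS e
    JOIN
    (
        SELECT t.ID FROM dbo.ExampleTrigrams AS t WHERE t.Trigram = 'abc'
        INTERSECT
        SELECT t.ID FROM dbo.ExampleTrigrams AS t WHERE t.Trigram = 'bcd'
    ) AS q
        ON q.ID = e.ID
    WHERE e.Val LIKE '%abcd%';

Paul's post covers exactly what this sketch waves away: generating and maintaining the trigrams, and getting the optimizer to cooperate.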

Transit Data Visualization In R

Gonçalo Trincão Cunha shows us how to plot General Transit Feed Specification data in R:

GTFS (General Transit Feed Specification) defines a data format for public transportation routes, stops, schedules, and associated geographic information.

In this post, we’ll use R with ggplot2 and ggmap to visualize GTFS route and schedule information on a map.

This post uses a GTFS feed from CARRIS, a public bus operator in the city of Lisbon.

Click through for code and a few interesting maps of Lisbon, Portugal.

Deleting SSAS Cube Partitions With PowerShell

Richie Lee shows how to remove Analysis Services cube partitions using PowerShell:

One such example of ad-hoc DBA tasks was when I had to delete about 600 partitions from a measure group that had thousands of partitions. Doing this manually would be ridiculous, so at the time I created a SQL script that used some dynamic T-SQL to create the delete commands in XMLA. XMLA has no “delete if exists” type syntax, so if I needed to run this again, this dynamic SQL output wouldn’t work. And so I decided that if I had to run the same task again I would write a PowerShell script that would run DSC-style and drop the partitions that were no longer required. And funnily enough, that is exactly what I had to do.

I knew I would be able to create a PowerShell script that used AMO to check if a partition exists and drop it if it did. I also wanted the script to take into account any other partitions in other measure groups that might also need to be dropped. So I made sure the script uses PowerShell switches that can be included when calling the function, and if they are included then the corresponding partitions in that measure group will be deleted. So you can run the script for one, some, or all of the measure groups in a cube.

Click through for the script.

Instant Log Initialization In Azure

Dimitri Furman shows a benefit of creating database files with Azure Blob Storage:

Recently, we were working on a performance testing exercise using a SQL Server database with files in Azure Blob Storage. After creating the database using the default 8 MB size for data and log file (as in the example above), we wanted to increase the size of all files to be sufficient for the expected workload. Instant file initialization (IFI) was not yet enabled for the SQL Server instance we were working with, and growing the data file from 8 MB to 1 TB took about one minute (using Premium Storage). This was expected, since the data file had to be fully initialized. We expected that the log growth to 1 TB would take about as much time, for the same reason. It was very surprising then that the same operation on the log file completed in less than one second.

It turns out that this is due to differences between Azure Blob Storage and traditional storage systems.
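For reference, here is a sketch of the setup Dimitri describes, following the documented “SQL Server Data Files in Microsoft Azure” pattern; the storage account, container, and database names are made up:

    -- The credential name must match the container URL; the secret is a
    -- shared access signature for that container.
    CREATE CREDENTIAL [https://myaccount.blob.core.windows.net/data]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = '<SAS token>';

    -- Database files living directly in blob storage.
    CREATE DATABASE BlobDB
    ON (NAME = BlobDB_data,
        FILENAME = 'https://myaccount.blob.core.windows.net/data/BlobDB_data.mdf')
    LOG ON
       (NAME = BlobDB_log,
        FILENAME = 'https://myaccount.blob.core.windows.net/data/BlobDB_log.ldf');

    -- The growth operation in question: against blob storage, this
    -- finished in under a second for the log file.
    ALTER DATABASE BlobDB
    MODIFY FILE (NAME = BlobDB_log, SIZE = 1TB);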

Starting Out With PowerShell

Rob Sewell has his own introduction to PowerShell:

PowerShell is another language; you are not going to be as proficient in a new language as you are in the language you spend all day working with. You will have to go through the learning curve, just as with any language, and you will have to understand how to interpret errors. Undoubtedly you will get frustrated at times. Reach out for help. Use Twitter, use the #powershellhelp channel in the SQL Server Community Slack. There are many other places and plenty of people who will be glad to help you.

Rob has a pretty detailed introduction to the topic, so it’s well worth the read for new PowerShell users.

Power BI Drillthrough

Reid Havens shows how to use the new drillthrough functionality in Power BI:

Report Drill Through enables users to create a report page, filtered to a single entity (e.g. Customer, Employee, Store, Product). Reports often have a summary landing page where there might be a lot of information, but not much detail about a specific item. Traditionally in Power BI we’d create a separate reporting page for further detail breakouts on an entity. However, that would require its own set of SLICERS that you would have to re-select if you wanted it to mirror the reporting page you were coming from…that’s too many clicks!

Selecting an item in Drill Through in a table from one reporting page will take you to another page, FILTERED to the entity you selected! This feature essentially lets us create detail sub-pages that are linked to whatever primary reporting page the report uses. Features like this have been available for YEARS in Excel using linked cells, so I’m super happy we finally have this as a feature in Power BI Desktop. It only takes a few steps to set up, but I’ll leave the instructions for that over at the Power BI September release page. With that said, let me quickly run you through what this looks like in a sample report.

Click through for an example, showing how useful drillthrough can be.

Getting Started With PowerShell

Rob Farley gives us a basic introduction to PowerShell:

I want to take a few minutes to get you started with PowerShell, in a way that I think is less overwhelming than how I found most PowerShell material back in the day. Because you’ve told me you keep hearing about PowerShell, and that you can’t tell how to get into it.

The first thing to know is that it feels like Command Prompt on steroids, but a lot more useful than Command Prompt. You’ll see what I mean as soon as you open it. Let’s do that, but let’s open it using ‘Run as Administrator’, because that just helps for some of the demos. In real life, you’ll only open it with Admin rights if you really need to.

So click on the start button and start typing PowerShell. I don’t want you to open the ISE (yet), just find the “Windows PowerShell” shortcut, and right-click on it to choose ‘Run as Administrator’.

This T-SQL Tuesday is all about PowerShell, and Rob’s is one of the first posts you should read if you’re not familiar with it.

The Use And Misuse Of P Values

John Mount and Nina Zumel explain what p-values are and how people routinely misuse them:

The many things I happen to have issues with in the common misuse of p-values include:

  1. p-hacking. This includes censored data bias, repeated measurement bias, and even outright fraud.

  2. “Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity). For example: saying p instead of saying what you are testing, such as “significance of a null hypothesis”.

  3. Logical fallacies. This is the (false) claim that p being low implies that the probability that your model is good is high. At best a low p eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of the fallacy of the excluded middle).

  4. Confusion of population and individual statistics. This is the use of deviation of sample means (which typically decreases as sample size goes up) when deviation of individual differences (which typically does not decrease as sample size goes up) is what is appropriate. This is one of the biggest scams in data science and marketing science: showing that you are good at predicting aggregates (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean or population”) and “standard deviation” (instead of “standard deviation of individual cases”); with some luck somebody won’t remember which is which and will be too afraid to ask.

Even if you know what p-values are, this is definitely worth reading, as it’s so easy to misuse p-values (even when I’m not on my Bayesian post hurling tomatoes at frequentists).
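To put a number on point 4: the standard error of the mean shrinks as the sample grows (σ/√n for independent samples), while the spread of individual cases does not. If individual outcomes have standard deviation σ = 20, then with n = 10,000 the standard error of the mean is 20/√10,000 = 0.2, so a model can nail the aggregate to within a fraction of a unit while its per-individual predictions are still off by ±20 or so.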

Using Multiple Cosmos DB APIs

Vincent-Philippe Lauzon shows how to access graph data stored in Cosmos DB using the DocumentDB API:

Now here’s a little secret: although we choose the “model” (e.g. Gremlin) at the Cosmos DB account level, we can use other models to query the data.

Not all combinations are possible, but many are. Specifically, we can query a Gremlin graph using the DocumentDB / SQL query language.

The graph is then projected into documents.

We will explore that in this article.

Why is that interesting? Because there are a lot of tools out there that we might be familiar with for manipulating DocumentDB (or MongoDB). Having the possibility to look at a graph with other APIs extends our toolset beyond Gremlin-based ones.

That is interesting.
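To give a flavor of the projection (the property names here are hypothetical, and the exact document shape can vary): a vertex added through Gremlin, say g.addV('person').property('name', 'Alice'), surfaces as a JSON document in which vertex properties typically land as arrays of value objects with an _value field, hence the [0]._value access below. A DocumentDB / SQL query over the graph might then look like:

    SELECT c.id, c.label, c.name[0]._value AS name
    FROM c
    WHERE c.label = 'person'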
