Author: Kevin Feasel

Sit a spell and let Grandpa Andy tell yall a story about some data integratin’.

Suppose for a minute that you’ve read and taken my advice about writing small, unit-of-work SSIS packages. I wouldn’t blame you for taking this advice. It’s not only online, it’s written in a couple books (I know, I wrote that part of those books). One reason for building small, function-y SSIS packages is that it promotes code re-use. For example, SSIS packages that perform daily incremental loads can be re-used to perform monthly incremental loads to a database that serves as a data mart by simply changing a few parameters.

Change the parameter values and the monthly incremental load can load both quarterly and yearly data marts.

You want better performance out of the daily process, so you read and implement the parallel execution advice* you’ve found online. For our purposes let’s assume you’ve designed a star schema instead of one of those pesky data vaults (with their inherent many-to-many relationships and the ability to withstand isolated and independent loads and refreshes…).

You have dependencies. The dimensions must be loaded before the facts. You decide to manage parallelism by examining historical execution times. Since you load data in chronological order and use a brute-force change detection pattern, the daily dimension loads always complete before the fact loads reach the latest data. You decide to fire all packages at the same time and your daily execution time drops by half, monthly executions time drops to 40% of its former execution time, and everyone is ecstatic…

…until the quarterly loads.

This is a great post.

Comments closed

The Value Of Schemabinding

Published 2018-02-15 by Kevin Feasel

Vitaly Bruk explains what schemabinding is and why we sometimes need WITH SCHEMABINDING in our code:

In SQL Server, when we use the “WITH SCHEMABINDING” clause in the definition of an object (view or function), we bind the object to the schema of all the underlying tables and views. This means that the underlying tables and views cannot be modified in a way that would affect the definition of the schema-bound object. It also means that the underlying objects cannot be dropped. We can still modify those tables or views, as longs as we don’t affect the definition of the schema-bound object (view or function).

If we reference a view or function in a schema-bound object (view or function), then the underlying view or function must also be schema-bound. And we cannot change the collation of a database, if it contains schema-bound objects.

I’ve only used schemabinding when mandated (e.g., using row-level security or creating an indexed view), but I can see the value behind using it with normal development.

Comments closed

Loading Excel Files With Powershell

Published 2018-02-15 by Kevin Feasel

Garry Bargsley walks us through a solution he implemented to load Excel file data into Powershell:

Recently one of our development teams has increased the request of importing an Excel file with 20 sheets in to 20 tables in a database from about once a quarter to multiple times a week and this past Monday was three times in one day. I have been the lucky DBA to get these requests as of late and after Monday I was determined to fix the process. The current procedure is to use the good ol’ Import/Export Wizard since this was a rare request. (This included a lot of point and click and possibility for manual error) With increased requests and increased table counts I knew there had to be a better way to get this accomplished without grimacing each time I see the request.

Garry has a script which he uses, but which can be tailored for other uses pretty easily.

Comments closed

Executing Powershell Against Multiple Servers

Published 2018-02-15 by Kevin Feasel

Stuart Moore shows an easy way to execute a Powershell script against multiple servers:

We setup new PsSessions using New-PsSession, I set ErrorAction to SilentlyContinue just in case a host isn’t available for some reason (if I was being good I’d try/catch here).

As we’re just using PS standard functionality here with Get-Service there’s no need to build a a new function, we can just call this directly here. By calling Invoke-Command against a session pointed at numerous hosts we can PowerShell handle all the connection management here and just assume the command will be ran against each host. if we were running against a lot of hosts then we would want to look into using the -ThrottleLimit parameter to limit the number of concurrent hosts we’re hitting. The one little trick here is using the using scope modifier here so PS pulls in the variable defined in our main scope (gory details on scoping here

Click through for the script, and do check out the comments, where Stuart gives a bit of advice when you’re trying to execute against a large number of servers.

Comments closed

GROUP BY vs DISTINCT

Published 2018-02-15 by Kevin Feasel

Rob Farley looks at how GROUP BY and DISTINCT and lead you down different execution plan paths:

What I want to explore in this post is the particular example that we both used… to bring an important point that could be missed because of the similarity of our examples.

You see, we both happened to use a FOR XML concatenation query, looking back at the same table. We did this to simulate a practical GROUP BY – somewhere that you might feel like GROUP BY is useful, but you know that you’re not using an aggregate function like SUM or MAX, but there isn’t one available. Ok, for Aaron he could’ve used the really new STRING_AGG, but for the old-timer like me, having to use SQL Server 2005, that wasn’t available.

In this post, Rob looks at a different sort of example and sees a more complicated scenario unfold.

Comments closed

Creating Modal Dialogues In Shiny

Published 2018-02-14 by Kevin Feasel

Dean Attali announces a new shiny package:

shinyalert uses the sweetalert JavaScript library to create simple and elegant modals in Shiny. Modals can contain text, images, OK/Cancel buttons, an input to get a response from the user, and many more customizable options. A modal can also have a timer to close automatically.

Simply call shinyalert() with the desired arguments, such as a title and text, and a modal will show up. In order to be able to call shinyalert() in a Shiny app, you must first call useShinyalert() anywhere in the app’s UI.

It does look nice. Check out Dean’s GitHub repo for more information. H/T R-Bloggers

Comments closed

Visualizing Cholesterol Data With ggplot2

Published 2018-02-14 by Kevin Feasel

Anisa Dhana uses the National Health and Nutrition Examination Survey and visualizes results with ggplot2:

From the plots above I find that regardless the different levels of diastolic and systolic blood pressure there is no substantial correlation between cholesterol and blood pressure. However, it is better to build the correlation line with geom_smooth or to calculate the Spearman correlation, although in this post we focus only on the visualization.

Lets build the correlation line.

Click through for several examples of visuals.

Comments closed

Microsoft + R

Published 2018-02-14 by Kevin Feasel

David Smith points out a bunch of the ways that Microsoft integrates R into products:

You can call R from within some data oriented Microsoft products, and apply R functions (from base R, from packages, or R functions you’ve written) to the data they contain.

SQL Server (the database) allows you to call R from SQL, or publish R functions to a SQL Server for database adminstrators to use from SQL.
Power BI (the reporting and visualization tool) allows you to call R functions to process data, create graphics, or apply statistical models to data.
Visual Studio (the integrated development environment) includes R as a fully-supported language with syntax highlighting, debugging, etc.
R is supported in various cloud-based services in Azure, including the Data Science Virtual Machine and Azure Machine Learning Studio. You can also publish R functions to Azure with the AzureML package, and then call those R functions from applications like Excel or apps you write yourself.

They’re pretty well invested in both R and Python, which is a good thing.

Comments closed

Using cowplot With ggplot2

Published 2018-02-14 by Kevin Feasel

I have a post on extending ggplot2’s functionality with cowplot:

Notice that I used geom_path(). This is a geom I did not cover earlier in the series. It’s not a common geom, though it does show up in charts like this where we want to display data for three variables. The geom_line() geom follows the basic rules for a line: that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in. The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable. The trick is, though, that we don’t define this third variable—it’s implicit in the data set order. In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp). Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set. So let’s look at a place where that’s not true.

The cowplot library gives you an easier way of linking together different plots of different sizes in a couple lines of code, which is much easier than using ggplot2 by itself.

Comments closed

Web Scraping With Power BI

Published 2018-02-14 by Kevin Feasel

Imke Feldmann shows how to use Power BI to scrape multiple tables from a webpage:

I will present 2 methods here:

Append-method: This is the obvious one and is fast for just a few tables.

Add-Column-method: A bit more complicated but will be faster for a large number of tables and is also suitable for a dynamic number of tables.

You will also find 2 options at the end of this article:

Use custom functions for multi-step table transformations
Use dynamic filters to select the desired tables

Read the whole thing.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30