Press "Enter" to skip to content

Author: Kevin Feasel

Training, Validation, and Test Data Sets with SAS Viya

Beth Ebersole takes us through creating training, validation, and test data sets using SAS Viya:

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.

Read on for some good notes, including the difference between mean squared error and average squared error.

Comments closed

Shaded Ranges in Excel

Elizabeth Ricks shows how to create shaded ranges in Excel:

We can see there’s clear seasonality in this business—overall volume is highest in the summer and each outing type generally follows the same monthly pattern. Let’s say you manage the Family rentals and you’d like to compare your monthly volume to what you’re seeing across the entire fleet. 

For the purpose of this tactical illustration, let’s assume the shape of the data—relative peaks and valleys—is more important than the specifics of each category individually. If that’s the case, I can simplify by showing a shaded region to depict the range of absolute passengers each month.

This technique is excellent when you have a large number of lines but only care about one versus the norm, and individual lines would be too distracting.

Comments closed

Running Oracle on Azure

Kellyn Pot’vin-Gorman takes us through various options on running Oracle in Azure:

Running Oracle on Azure VM environments aren’t that different from running Oracle on VMs in your on-premises for a DBA.  The DBAs and developers that I work with still have their jobs, still work with their favorite tools and also get the chance to learn new skills such as cloud administration.

Click through for more, including a setup script.

Comments closed

MSDTC and Availability Groups

Ryan Adams provides guidance on using distributed transactions against Availability Groups:

A paramount concept to understand is how to make the DTC highly available.  We can see from the precedence order that SQL Server will use the local DTC out of the box.  This makes it appear that everything is working, and it is, but it is not exactly highly available.

I see a lot of customers leave it configured this way because they either don’t know the ramifications or do not realize they are using the MSDTC (Linked Servers). Since it simply works out of the box, things get left this way until they end up with a suspect database and error messages that look like this:

“SQL Server detected a DTC/KTM in-doubt transaction with UOW  {598B7EDD-F7A1-9DC1-8D3E-303A4C93AAB4}.Please resolve it following the guideline for Troubleshooting DTC Transactions.”

Read the whole thing. There are a lot of small areas between processes where things can fail, and the combination of DTC + AGs is no different.

Comments closed

Comparing Azure SQL DB and RDS on Price and Performance

Joey D’Antoni summarizes a recent publication:

Yesterday, GigaOm published a benchmark of Azure SQL Database as compared to Amazon’s RDS service. It’s an interesting test case that tries to compare the performance of these platform as a service database offerings. One of the many challenges of this kind of a study is that the product offerings are not exactly analogous. However, GigaOm was able to build a test case where they used similar sized offerings as shown in the image below.

Read on for the summary and Joey’s thoughts.

Comments closed

Query Performance With and Without Stored Procedures

Bert Wagner takes on the idea that stored procedures are faster (or slower) than ad hoc SQL:

A few months ago I was presenting for a user group when someone asked the following question:

Does a query embedded in a stored procedure execute faster than that same query submitted to SQL Server as a stand alone statement?

The room was pretty evenly split on the answer: some thought the stored procedures will always perform faster while others thought it wouldn’t really matter.

Bert explains the answer. For me, performance isn’t even on the radar for how I try to convince people to use procedures. Instead, explain it to developers as an interface: developers program against a contract, where they can send in a specified set of inputs (procedure parameters), get back a specified output (the shape of the result set), and not have to care about the fine details. Programming to interfaces is extremely common in business development, and this is just one more interface.

Comments closed

The Benefits of Delta Lake

Kaushik Nath explains what a Delta Lake is and why it is beneficial:

Data lakes have generated a large amount of publicity as the new storage technology for our big data era. Because something new is always better, right? 

All this hype around data lakes has ignored their inherent drawbacks and limitations. Well, I’m Not Here to create a debate by saying that no one should ever use data lakes. But I am saying that companies should enter into the data lake investment with eyes wide open. Otherwise it might lead to some serious complications.

Delta Lake is a concept intended to mitigate some of the issues with data lakes in general, turning them into data swamps.

Comments closed

Family History with SQL Graph

Mala Mahadevan takes us through family histories in a graph database:

I have been working a lot of SQL Graph related queries and applications of the graph data concept to the extent possible within SQL Server’s graph capabilities. Genealogy, or querying family trees is an important graph data application. A lot of us may not have work related applications that are genealogy related, necessarily. But conceptually, this can apply to many similar tree/hierarchy type structures. I was looking into some data to play with in this regard. Sometime ago – we were discussing novels by famed novelist James Michener. My friend Buck Woody made a tweet-remark that it would need a graph database to keep track of the characters and relationships in some of Michener’s novels. I am a big fan of Michener’s novels, and the most recent one I have read is ‘Hawaii’. It is based on history and evolution of the Hawaiian islands, and has a rather complex network of characters, with many ethnicities and several interwoven relationships. I decided to use the characters in Hawaii as my test data to understand how to query geneological data, stored in graph database format.

Read on to see Mala’s table and a procedure to retrieve this data.

Comments closed

Mapping Usernames within the Power BI Gateway

Jeff Pries takes us through a difficult scenario:

With some data sources, such as Analysis Services, you want to pass the username of the person running the report back to the server executing the query (such as in a Row Level Security configuration). Adam Saxton of Guy in a Cube does a great job of explaining how this works at a high level in the video linked here.

In the video, Adam mentions that if our Power BI login does not match a UPN in our local Active Directory, then the lookup will fail…which is a problem if you just don’t have the option of having your Power BI login match a UPN in your local Active Directory and don’t want to manage static user mappings over time. The following will cover a method of allowing the lookup to use a different Active Directory property, such as “mail” to perform the matching.

Read on for the steps.

Comments closed