Summary Improvements In R

Kevin Feasel

2017-06-05

R

John Mount points out a nice quasi-bugfix in R 3.4.0:

In older versions of R (say R 3.3.1) the above code gave the following undesirable result:

summary(15555)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 15560 15560 15560 15560 15560 15560 

This was always very confusing and hard to explain to beginners. To justify this you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication has to how many significant digits are in fact being displayed.” To add insult to injury summary()picked a different number of sigfigs than the default numeric presentation. One could type “median(15555)” and get the expected presentation “15555“.

I like this change.

Jupyter And Kubernetes

David Crook shows how to use Jupyter notebooks inside Kubernetes:

We start with a 16.04 image, we run some upgrades, install python, upgrade pip, install our requirements and expose port 8888 (jupyter’s default port).

Here is our requirements.txt file

1
2
3
4
5
6
7
8
9
numpy
pandas
scipy
jupyter
azure_common
azure-storage
scikit-learn
nltk
plotly

Notice how Jupyter is in there, I also added a few other things that I very commonly use including numpy, pandas, plotly, scikit-learn and some azure stuff.

The big benefit to doing this is that your installation of Jupyter can exist independently from your notebooks, so if you accidentally mess up Jupyter, you kill and reload from the image in a couple commands.

Using Hive As A Power BI Data Source

Ust Oldfield shows how to use Hive via Azure HDInsight as a data source for Power BI:

As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note HDFS in brackets beside the Azure HDInsight and Hadoop File options as this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures.

But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in PowerBI.

Read the whole thing.  Connecting to Hive is pretty easy.

Trimming The Fat From Power BI Data Models

Philip Seamark shows how things like getting rid of ID columns can reduce a Power BI data model’s size significantly:

Once saved, the Power BI file size was 289MB!  Is this good for 10 million rows?  It’s certainly better than the 360MB CSV file but not by much.  Certainly not close to the 10:1 compression claimed to be achievable using the SSAS Tabular engine used by Power BI.

I think we can do better than that….

Read on to see the specific optimizations, turning this from a 289 MB data model into a 9 MB data model.

Graph Database In SQL Server

Kevin Feasel

2017-06-05

Graph

Niko Neugebauer has started a new series, this time on graph database functionality in SQL Server 2017:

Graph databases are based on graph theory, and employ nodes, edges, and properties. The graph theory is the study of the graphs that are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of nodes, edges which are connected by edges, arcs, or lines.
A graph can be directed or undirected (uni or bi-directional) that might point the direction of the relationship between the edges.
Graph databases can be compared to the Network Model Databases, that were focusing on solving the same problem the interconnected world.

The most popular graph database in the world currently is NEO4J, which is implemented in Java and is using CQL (Cypher Query Language), a language that has definitely inspired the SQL Graph T-SQL language extension.

Niko notes that this is not a fully mature product yet, but it’s an interesting start.

Limitations In sys.dm_exec_query_plan

Brent Ozar shows a limitation in the sys.dm_exec_query_plan DMV:

The query with six joins has a cached plan that you can click on – because it’s from sys.dm_exec_query_plan, a management object that returns execution plans as XML.

The query with seven joins does not. Its cached plan is too large or complex for that DMV. Microsoft quickly realized this problem when 2005 came out, so in 2005 Service Pack 2, they introduced sys.dm_exec_text_query_plan. That function returns bigger data – but it only comes back in text format.

There are a couple of workarounds, at least, but they aren’t ideal.

Discovering Orphaned Users

Adrian Buckman troubleshoots an access scenario:

First lets check that the User Does actually exist, we know the Server login exists otherwise the user would be complaining that they cannot connect to the SQL server instance.

Sure enough – there is the user ‘SQLUndercoverUser’ lets check out the permissions:

No problems there – the user has [db_datareader], [db_datawriter] and [db_owner] so we know there is not a permissions issue, so lets test this login by connecting to SQL server with the user credentials:

Connected to the Server with no issues, lets open a new query against SQLUnderCoverDB:

Hmm so despite having  permissions to access the database  we are receiving this error – we know that the password is correct too otherwise we wouldn’t be able to access the Server at all….

Adrian does a nice job of walking through the troubleshooting process, going from simple problems (does the user actually exist? does the user have permissions?) and into the real cause, which was orphaned SQL authenticated users.  Read the whole thing.

Multi-Database Query Store Data Retrieval

Tracy Boggiano wants to get Query Store data for a large number of databases on the same instance and has written a Powershell script to this effect:

In SQL Server 2016, Microsoft introduced to us the Query Store.  This has been proven to be a very beneficial feature especially at my job, saved the day on Christmas Eve. One of the limitations if see is when you have server with several databases on it trying to find the one that has the problem. So until Microsoft gives use a better tool or someone builds one I wrote a quick PowerShell function to query across all the databases and return data to a grid so you can sort and find that troublesome query.

To call it simply provide the server, which metric you want, the top number of queries to return, who many hours to look back, and the minimum execution count.

This is one of two issues I have with Query Store, the other being that its data gets saved to the primary filegroup without any recourse.  Click through for the script.

Scraping Session Data

Amy Herold has scraped PASS Summit 2017 submissions using Powershell:

Never having done a web scrape before, this was the perfect subject for my first time – grabbing all the sessions submitted to PASS Summit 2017…and doing it with PowerShell! Here is the script I used for this. I have accounted for the following:

  • Apostrophes (aka single quote). They will break your insert unless you have two of them, and for some reason, people seem to use them all over the place.

  • Formatting the string data for insert. No, your data will not magically come out right in your insert with single quotes so you need to add them.

  • Additional ID and deleted fields.

  • Speaker URL and ID. Will be using this to scrape speaker details later.

  • Accurate lower and upper bounds. These were arrived at by trial and error (you’re welcome), as well as the clean up of the data I scraped. More on this later.

Powershell probably wouldn’t be my first language for web scrapes—that’d be Python—but Amy shows how to get a scrape going.

Using OtterTune To Tune Databases

Dana Van Aken, Geoff Gordon, and Any Pavlo show off OtterTune, which uses machine learning techniques to tune database management systems like MySQL and Postgres:

OtterTune, a new tool that’s being developed by students and researchers in the Carnegie Mellon Database Group, can automatically find good settings for a DBMS’s configuration knobs. The goal is to make it easier for anyone to deploy a DBMS, even those without any expertise in database administration.

OtterTune differs from other DBMS configuration tools because it leverages knowledge gained from tuning previous DBMS deployments to tune new ones. This significantly reduces the amount of time and resources needed to tune a new DBMS deployment. To do this, OtterTune maintains a repository of tuning data collected from previous tuning sessions. It uses this data to build machine learning (ML) models that capture how the DBMS responds to different configurations. OtterTune uses these models to guide experimentation for new applications, recommending settings that improve a target objective (for example, reducing latency or improving throughput).

In this post, we discuss each of the components in OtterTune’s ML pipeline, and show how they interact with each other to tune a DBMS’s configuration. Then, we evaluate OtterTune’s tuning efficacy on MySQL and Postgres by comparing the performance of its best configuration with configurations selected by database administrators (DBAs) and other automatic tuning tools.

This is potentially a very interesting technology and is not the only one of its kind—we’ve seen Microsoft enter this space as well for SQL Server index and tuning recommendations.

Categories

June 2017
MTWTFSS
« May Jul »
 1234
567891011
12131415161718
19202122232425
2627282930