Press "Enter" to skip to content

Month: June 2017

Watch Those Aggregates

Ronald Dameron warns us about graphs in the Azure portal:

I would expect to see the CPU spike with the same value no matter what time range I selected. But, to see the spike that fired the alert, I had to to “Edit” the chart and select different time ranges to see the differences. It wasn’t until I selected a narrow custom time range that the CPU graph would display the CPU spike that corresponded to the alert firing. The alert fires if the CPU percentage exceeds 80% over 15 minutes. So, if you “know” something happened, try different time ranges but especially the custom range to find what you are looking for.

For your highbrow reference of the quarter, FA Hayek on John Maynard Keynes’s Treatise of Money:  “Mr. Keynes’s aggregates conceal the most fundamental mechanisms of change.”

This isn’t an Azure-specific problem; it’s something we have to think about whenever we aggregate data.

Comments closed

Misleading Perfmon Counters

Kendra Little describes five Performance Monitor counters whose metrics aren’t quite as straightforward as you’d think:

SQLServer: Locks – Average Wait Time (ms); Lock Wait Time (ms);  Lock Waits/sec

The idea: Alert if you have high lock waits

The problem: These counters update when the lock wait ends, not while it’s ongoing. Let’s say you have a query that’s blocked for five minutes. You’ll only see this in the perf monitor when it becomes un-blocked.

These counters can cause confusion and make you look for blocking queries at the wrong times.

Kendra provides better solutions for each of the five perfmon counters, so read the whole thing.

Comments closed

Build Your Own Perfmon Counter

Daniel Janik shows us two ways to create our own Perfmon counters, the good way and the not-as-good way:

Whether you’re just now finding out about these or you’ve been reminded that these exist, I’m sure your asking: “How can I use them?”

Some many years ago, I used the procedures in SQL 2000 to monitor blocked processes. I was onsite at a customer in San Jose, Costa Rica and I figured pictures would help bridge the gap since I didn’t speak Spanish.

The customer was experiencing outages that they could see with blocking. While investigating I found that blocking was directly related to disk latency. More Latency = More Blocking.

I was unaware that this even existed, but it could be very useful in some circumstances.

Comments closed

Endpoints In SQL Server

Jason Brimhall explains what SQL Server endpoints are:

You may have heard the term endpoints thrown around in technical discussion and wondered “what the heck is an endpoint?” Well, that is a good question. An endpoint in the simplest form is a connection or point of entry into SQL server.

Another way of thinking about an endpoint is to look at the ends of a line. Generally speaking, there is a stop point at each end of the line. At this stopping point, the line may connect to something else and therefore be a point of entry into that “something else”.

Read on for more information, as well as a script to show who owns each endpoint.

Comments closed

OLEDB Waits In CHECKTABLE

Lonny Niederstadt has dropped a boatload of information on us, with a four-part series looking at the OLEDB wait type when running DBCC CHECKTABLE.  Part 1 sets up the problem:

So… yeah.  Definitely a relationship between the number of pages in the heap and the number of OLEDB waits.  Something(s) increase the number of OLEDB waits beyond the number of data pages, the number of used pages and the number of reserved pages in a way that I haven’t deciphered yet, though.

Part 2 looks at what happens when logical checks run and not just physical checks:

On this very simple heap, with nearly the same number of rows as pages in the heap, the number of OLEDB wait events is nearly the same – only increased by one – when “upgrading” the checktable operation from physical_only to a checktable with logical checks.

Part 3 switches to using a clustered index, and things start getting wacky:

So how do the results match with expectations?  Whoa!!  25745 OLEDB wait events as the starting point for MAXDOP 1 and 2.  Then single-stepping up to 25752 OLEDB wait events for MAXDOP 8.  25752 is exactly 4 times the used page count of 6438 for the clustered index.  So the OLEDB wait events went from a small number more than the count of pages/rows with the heap… to almost 4 times as many with the clustered index.  *That* was a bit unexpected.

Part 4 looks at a heap with a non-clustered index:

All right.  What if we use a heap with a nonclustered include index?  Based on the table and index definition below, I expect the nonclustered include index nci_OLEDB_TEST_HEAP_NCI_INCLUDE to be remarkably similar to the clustered index we previously used.

Interesting reading so far in this series.

Comments closed

Summary Improvements In R

John Mount points out a nice quasi-bugfix in R 3.4.0:

In older versions of R (say R 3.3.1) the above code gave the following undesirable result:

summary(15555)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15560   15560   15560   15560   15560   15560 

This was always very confusing and hard to explain to beginners. To justify this you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication has to how many significant digits are in fact being displayed.” To add insult to injury summary()picked a different number of sigfigs than the default numeric presentation. One could type “median(15555)” and get the expected presentation “15555“.

I like this change.

1 Comment

Jupyter And Kubernetes

David Crook shows how to use Jupyter notebooks inside Kubernetes:

We start with a 16.04 image, we run some upgrades, install python, upgrade pip, install our requirements and expose port 8888 (jupyter’s default port).

Here is our requirements.txt file

1
2
3
4
5
6
7
8
9
numpy
pandas
scipy
jupyter
azure_common
azure-storage
scikit-learn
nltk
plotly

Notice how Jupyter is in there, I also added a few other things that I very commonly use including numpy, pandas, plotly, scikit-learn and some azure stuff.

The big benefit to doing this is that your installation of Jupyter can exist independently from your notebooks, so if you accidentally mess up Jupyter, you kill and reload from the image in a couple commands.

Comments closed

Using Hive As A Power BI Data Source

Ust Oldfield shows how to use Hive via Azure HDInsight as a data source for Power BI:

As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note HDFS in brackets beside the Azure HDInsight and Hadoop File options as this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures.

But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in PowerBI.

Read the whole thing.  Connecting to Hive is pretty easy.

Comments closed

Trimming The Fat From Power BI Data Models

Philip Seamark shows how things like getting rid of ID columns can reduce a Power BI data model’s size significantly:

Once saved, the Power BI file size was 289MB!  Is this good for 10 million rows?  It’s certainly better than the 360MB CSV file but not by much.  Certainly not close to the 10:1 compression claimed to be achievable using the SSAS Tabular engine used by Power BI.

I think we can do better than that….

Read on to see the specific optimizations, turning this from a 289 MB data model into a 9 MB data model.

Comments closed

Graph Database In SQL Server

Niko Neugebauer has started a new series, this time on graph database functionality in SQL Server 2017:

Graph databases are based on graph theory, and employ nodes, edges, and properties. The graph theory is the study of the graphs that are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of nodes, edges which are connected by edges, arcs, or lines.
A graph can be directed or undirected (uni or bi-directional) that might point the direction of the relationship between the edges.
Graph databases can be compared to the Network Model Databases, that were focusing on solving the same problem the interconnected world.

The most popular graph database in the world currently is NEO4J, which is implemented in Java and is using CQL (Cypher Query Language), a language that has definitely inspired the SQL Graph T-SQL language extension.

Niko notes that this is not a fully mature product yet, but it’s an interesting start.

Comments closed