Press "Enter" to skip to content

Curated SQL Posts

Azure SQL DW Statistics

Emma Stewart looks at how statistics are created in Azure SQL Data Warehouse:

In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!) however with SQL DW we need to make sure we think about how to include it in our process in order to make sure we take advantage of the benefits of working with Azure DW.

The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimize queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost.

Azure SQL Data Warehouse is a bit of a strange animal, with differences in statistics being one of the smaller changes versus “classic” SQL Server.

Comments closed

Details On Azure SSAS

James Serra breaks down what Azure Analysis Services has to offer:

  • Developers can use SQL Server Data Tools (SSDT) in Visual Studio for creating models and deploying them to the service.  Administrators can manage the models using SQL Server Management Studio (SSMS) and investigate issues using SQL Server Profiler

  • Business users can consume the models in any major BI tool.  Supported Microsoft tools include Power BI, Excel, and SQL Server Reporting Services.  Other MDX compliant BI tools can also be used, after downloading and installing the latest drivers

  • The service currently supports tabular models (compatibility level 1200 only).  Support for multidimensional models will be considered for a future release, based on customer demand

Between tabular-only support and the max size being 100 GB (if I’m reading this correctly), they’re not yet ready to push the product hard.  Given that it just came out, that makes sense, and hopefully the training wheels come off.

Comments closed

Sparklines

Devin Knight continues his custom visuals series with the sparkline:

Key Takeaways

  • Shows trends in data most often by time.

  • It can only visualize a single measure.

  • The Sparkline can be configured to animate across time.

In the right circumstances, I love sparklines.  My circumstances are as follows:  when you are viewing time series data for relatively few elements in which the trend is more important than the levels.  In that scenario, sparklines are efficient and tell the story without extraneous numbers or clutter getting in the way.

Comments closed

Latency Vs Throughput

I have a post up on understanding latency versus throughput:

The primary method in which data moves from one process to another is through buffers.  We break up data into smaller portions and push them to their destination.  In Integration Services, we have buffers.  When passing data through TCP, we use packets.

Okay, so what’s the trade-off?  The trade-off is between latency and throughput.  Let’s take TCP packets as an example.  Say you have a series of 50-byte messages you want to send from a source to a destination.  We have two primary options:  push messages as fast as possible, or hold off until you have the most data you can store in a packet and send it along.  For simplicity’s sake, we’ll say that we can fit about 1350 bytes in a packet, so we can store 27 messages in a packet.  We’ll also assume that it takes 10 milliseconds to send a packet from the source to the destination (regardless of packet size, as we’re using powerful connections) and 1 millisecond to produce a message.

We use pipes as metaphors in IT, especially around data transfer, and I think it’s a solid metaphor because it intuitively includes most of the important concepts we need to worry about with data.  We have latency (how long it takes something to go from one end of the pipe to the other), throughput (how much we can move at any point in time, which is determined by things like the diameter of the pipe), back pressure (in the pipe scenario, resistance caused by pipe directional changes; in the data world, when downstream operators are slower than upstream operators), etc.

Comments closed

Visualizing Checkins

Rob Sewell uses Power BI to map where he’s been:

I am using the swarm API but the principle is the same for any other API that provides you with data. For example, I used the same principles to create the embedded reports on the PASS PowerShell Virtual Chapter page showing the status of the cards suggesting improvements to the sqlserver module for the product team to work on. Hopefully, this post will give you some ideas to work on and show you that it is quite easy to get excellent data visualisation from APIs

First up we need to get the data. I took a look at the Swarm developers page ( The Trello ishere by the way) I had to register for an app, which gave me a client id and a secret. I then followed the steps here to get my user token I was only interested in my own check ins so I used the steps under Token flow Client applications to get my access token which I used in an URL like this.

This post includes some Powershell and quite a few animated GIFs, making it easy to follow.

Comments closed

View Performance

Grant Fritchey looks at view performance vis-a-vis stored procedures:

The difference in the performance including compile time for the procedure alone is 700mc better on average than the view. That’s an 8% difference. It was almost that high for the view that used the procedure at 7%.

If we’re just talking compile time then, there is a significant win if we avoid the view. This is no doubt because of the extra work involved in unpacking the view and going through the simplification process within the optimizer. Plus, the view alone in our query was parameterized by the optimizer in order to assist it’s performance over time (as we saw in the average results without the recompile). All that extra work explains the 8% difference.

Read the whole thing.

Comments closed

Perfmon And SQL Server Memory

Lonny Niederstadt looks at using Perfmon to understand what’s happening with memory allocations on your SQL Server instance:

Lets look at stolen memory a bit.  The relationship between memory grants and stolen memory is probably the least intuitive relationship.  Remember – if a query gets a memory grant the grant happens at the beginning of query execution.  Its just a promise of sort/hash memory to be made available when the query needs it.  The grant memory isn’t stolen immediately – rather its stolen in small allocations over time as needed by the query.

In the graph immediately below, the outstanding grants are shown over time.  There are no pending grants during the observation period.  Granted memory and reserved memory are both shown as areas, with reserved memory in front of granted memory.  Granted memory is consistently greater than reserved memory (in this case, no resource pools have been added beyond the pre-existing default and internal pools).  This is how we can determine that the reserved memory is granted memory which hasn’t been stolen yet.

This is a great explanation of what stolen memory is and why it’s important.

Comments closed

Basics Of Spark

Jen Underwood gives a quick explanation of Spark as well as an introduction to SparkSQL and PySpark:

Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDD. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations” with map, filter, and groupBy clauses. They can persist in memory for rapid reuse. If an RDD data does not fit in memory, Spark will overflow it to disk.

If you’re not familiar with Spark, now’s as good a time as any to learn.

Comments closed

Building Metadata With Biml

Ben Weissman provides a set of Biml scripts to load a metadata table based off of existing tables and columns:

For each member of that collection, we follow some simple rules:

– Our table’s original name is the name of the table in the staging area without our connectionname prefix
– If our tablename still includes an underscore, we will split the name and assign the table- and schemaname respectively. Otherwise, our schema will be DBO.
– Create a DELETE statement towards our metadata store
– Create an INSERT statement towards our metadata store

Admittedly, I would have seen this as a one-time process and would have just written some scripts against sys.tables and sys.columns to generate this metadata, but “one-time processes” tend to happen over and over.

Comments closed

Alert On SQL Jobs Missing Schedules

Brian Hansen wraps up a three-part series on scheduled job alerts:

The first two parts of this series addressed the general approach that I use in an SSIS script task to discover and alert on missed SQL Agent jobs. With apologies for the delay in producing this final post in the series, here I bring these approaches together and present the complete package.

To create the SSIS, start with an empty SSIS package and add a data flow task. In the task, add the following transformations.

Regardless of how you do it, knowing when jobs fail is important enough to build some infrastructure around answering this question.

Comments closed