Press "Enter" to skip to content

Curated SQL Posts

How Meltdown And Spectre Have Affected Spark Performance

Chris Stevens, et al., show how Databricks customers have fared in a post-Meltdown+Spectre world:

On AWS, we have observed a small performance degradation up to 5% since January 4th. On i3-series instance types, where we cache data on the local NVMe SSDs (Databricks Cache), we have observed a degradation up to 5%. On r3-series instance types, in which the benchmark jobs read data exclusively from remote storage (S3), we have observed a smaller increase of up to 3%. The greater percentage slowdown for the i3 instance type is explained by the larger number of syscalls performed when reading from the local SSD cache.

The chart below shows before and after January 3rd in AWS for a r3-series (memory optimized) and i3-series (storage optimized) based cluster.  Both tests fixed to the same runtime version and cluster size. The data represents the average of the full benchmark’s runtime per day, for a total of 7 days prior to January 3 (before is in blue) and 7 days after January 3 (after is in red). We exclude January 3rd to prevent partial results.  As mentioned, the i3-series has the Databricks Cache enabled on the local SSDs, resulting in roughly half of the total execution time (faster) compared to the r3-series results.

Overall, they’re seeing a degradation of 2-5%.  Click through for some more information on how they collected their metrics.


Comparing Data Lake Job Runs

Yanan Cai shows how to compare stats on different executions of a job:

Troubleshooting issues in recurring job is a time-consuming task. It starts with searching through the Job Browser to find instances of a recurring job and identifying both baseline and anomalous performance. This is followed by multi-way comparisons between job instances to figure out what has been changed in the query, data or environment. This is followed by analysis to discover which changes may have performance impact. While this is happening production workloads continue to under-perform or go down.

Azure Data Lake Tools for Visual Studio now makes it easy to spot anomalies and quickly trace the key characteristics across recurring job instances allowing for an efficient debugging experience. The Pipeline Browser automatically groups recurring jobs to simplify discovery of all runs. The Related Job View collects data about inputs, outputs and execution across multiple runs into a single visualization.

Read on for more.


Out Of Memory Conditions And In-Memory OLTP

Ned Otter explains how a memory-optimized database can run out of its most critical resource:

What can cause a memory-optimized database to run out of memory? It could be that resource consumption (memory) exceeded:

  • the relevant percentage of committed_target_kb from the sys.dm_os_sys_info DMV (explained in a moment)
  • MAX_MEMORY_PERCENT value of a Resource Pool that the database is bound to (if running Enterprise Edition and using Resource Governor)

or:

  • garbage collection is not operational (the purpose of GC is to reclaim memory consumed by stale row versions)
  • updates to memory-optimized table variables caused row versions to be created, and because GC does not operate on table variables, you ran out of memory (for table variables that have a very large amount of rows)

The only thing that can prevent GC from working is a long running transaction.

This is the final post of Ned’s resource issues miniseries and it’s been a good one.
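
If you want to eyeball the first condition from Ned’s list yourself, here is a minimal T-SQL sketch. It pulls committed_target_kb from sys.dm_os_sys_info and then sums the memory held by memory-optimized objects in one database; the database name InMemDemo is a placeholder, and the exact percentage of committed_target_kb available to In-Memory OLTP varies with the target size, so treat this as a rough gauge rather than a hard limit.

    -- Instance-wide commit target from which In-Memory OLTP memory limits are derived.
    SELECT committed_kb, committed_target_kb
    FROM sys.dm_os_sys_info;

    -- Memory currently held by memory-optimized objects in the database of interest.
    USE InMemDemo;   -- placeholder database name
    GO
    SELECT SUM(allocated_bytes) / 1048576.0 AS allocated_mb,
           SUM(used_bytes)      / 1048576.0 AS used_mb
    FROM sys.dm_db_xtp_memory_consumers;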


Visual Principles And Dashboards

I continue my series on dashboard visualization by looking at pictures:

In a bit more detail, you can make a dashboard glanceable by following these guidelines:

  1. Ensure that there is clear purpose in your metric design and display.  In other words, think about which metrics you want to show, how you want to show them, and where you put metrics in relation to one another.

  2. Group metrics by function into sections.  Look at the dashboard above.  It has four clusters of metrics:  those around revenue, new customers, revenue per customer, and customer acquisition cost.  All of the revenue metrics are clustered in the top-left quadrant of the dashboard.  Furthermore, all revenue-related metrics (that is, revenue metrics and revenue per customer metrics) are on the left-hand side of the dashboard, so the CEO can focus on that half and learn about revenue and revenue per customer.  She doesn’t need to look in the top-left corner for one revenue measure and in the bottom right for another; she can focus down to a portion of the dashboard and get an answer.

  3. It should be easy to see and differentiate those clusters of metrics.  Our natural instinct might be to put borders around the clusters, but whitespace is your friend—remember, less is more.  If you add a bit more whitespace between clusters of measures, you’ll make it easy for people to see that there’s a difference without distracting them with unnecessary lines.

I cover the Rule of Thirds, Glanceability, and Color Vision Deficiency, three important considerations for a designer.


SQL Server Backups To Azure Blob Storage

Kevin Hill shows how to configure SQL Server to back up a database to Azure Blob Storage:

Note the “no blobs found” in the container.  After a successful backup, you will see it here.

Click on ‘Container Properties’ to get the URL for this specific container…this will be used in Backup and Restore statements.  Click the button next to the URL to copy it.  For now just remember where this is or copy it to Notepad, Query window etc.  When we start to build our T-SQL statements, we will need both the Access key from earlier and the URL.

Kevin gives clear, step-by-step instructions on the process.
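
For reference, the T-SQL he builds up ends up shaped roughly like this. This is a sketch using the access-key style of credential; the storage account name, container, and key below are placeholders, and newer SQL Server versions also support SAS-token credentials named for the container URL.

    -- Placeholders: mystorageaccount, mycontainer, and the access key are not real values.
    CREATE CREDENTIAL AzureBackupCredential
    WITH IDENTITY = 'mystorageaccount',              -- storage account name
         SECRET   = '<storage account access key>';  -- access key copied from the portal

    BACKUP DATABASE MyDatabase
    TO URL = 'https://mystorageaccount.blob.core.windows.net/mycontainer/MyDatabase.bak'
    WITH CREDENTIAL = 'AzureBackupCredential',
         COMPRESSION,
         STATS = 10;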


Specifying IP Address On A Point To Site VPN

John Morehouse shows how to force a particular IP address when building an Azure point-to-site VPN:

Recently, I got to work with a client on something interesting. We implemented transactional replication to send data to an Azure virtual machine.  This was being done to perform some testing for a project.

Given that the two machines were NOT within the same Active Directory domain, we wanted to make sure our client’s data was protected, so we utilized a Point-to-Site VPN to facilitate this.  With the client using a VPN connection, this helps to ensure that any data transmitted to the virtual machine is encrypted and secured.  Note, the process on how to configure and implement the VPN connection is for another blog post.

SQL Server replication requires the use of a server name rather than just the IP addresses. This meant that the virtual machine in Azure had to use an entry in the local host file that was pointed back to the client’s machine.

The down side?  When the VPN connection drops (it happens), the client machine obtains a new IP address upon reconnecting.  Potentially now the host file would have the incorrect IP address and needs to be updated.

Read on to see how John was able to solve this.


The Joy Of OUTPUT

Mark Wilkinson shows off some fun stuff you can do with the OUTPUT clause:

A common command in the Linux world is the tee command. What tee allows you to do is pipe the output of a command to a file as well as the console. This same functionality can be implemented using multiple OUTPUT clauses in a T-SQL statement. In this example we are going to update a few hundred records. When the update statement is run, not only will it update the MyGuid table but it will update a log table and also return the result of the update. This is accomplished by using two OUTPUT clauses.

Check it out.  I don’t use OUTPUT that often, but it can be quite useful when in a pinch or if you want to prevent scanning a table twice.
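
A minimal sketch of the “tee” pattern Mark describes might look like the following; the column names and the MyGuidLog table are hypothetical stand-ins rather than his exact schema. The first OUTPUT ... INTO persists the change to the log table, while the second OUTPUT (without INTO) returns the same rows to the caller.

    -- Hypothetical schema: dbo.MyGuid (Id, GuidValue) and dbo.MyGuidLog (Id, OldGuid, NewGuid).
    UPDATE dbo.MyGuid
    SET    GuidValue = NEWID()
    OUTPUT inserted.Id, deleted.GuidValue, inserted.GuidValue
        INTO dbo.MyGuidLog (Id, OldGuid, NewGuid)              -- the "file": persisted change log
    OUTPUT inserted.Id, deleted.GuidValue, inserted.GuidValue  -- the "console": rows returned to the caller
    WHERE  Id <= 500;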


Getting dbatools To Version 1.0

Simone Bizzotto explains what it’s going to take to get dbatools up to version 1.0:

We’re looking for contributors to help us finally reach version 1.0. Currently, we are on par with Gmail’s beta schedule: a whopping 4 years. But, we’re almost there and need your help finalizing our changes. If you’re interested in helping us bring 1.0 alive, we identified four areas with 5 primary contacts on the SQL Server Community Slack:

  • Standardize param names (@wsmelton)
  • Create tests for existing functions (@cl and @niphlod)
  • Review existing function documentation (@alevyinroc or @gbargsley)
  • Prepare for 1.0 with “code style” (Bill of Health, more on that later)

As you can see, a few of us are the main reference (on GitHub and Slack, mostly) for each area.

Read the whole thing and, if you’ve found dbatools to be helpful in the past, see if there’s anything you can do to help them out a little in return.


Zippy Base R

John Mount defends the honor of base R:

The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

  • Optimized Base R: a specialized “pre allocate and work with vectorized indices” method. This is fast as it is able to express our particular task in a small number of purely base R vectorized operations. We are hoping to build some teaching materials about this methodology.

  • Idiomatic Base R (shown dashed): an idiomatic R method using stats::aggregate() to solve the problem. This method is re-plotted in both graphs as a dashed line and works as a good division between what is fast versus what is slow.

  • data.table: a straightforward data.table solution (another possible demarcation between fast and slow).

  • dplyr (no grouped filter): a dplyr solution (tuned to work around some known issues).

Read the whole thing, including the comments section, where there’s a good bit of helpful back-and-forth.


Non-English Natural Language Processing

The folks at BNOSAC have announced a new natural language processing toolkit for R:

BNOSAC is happy to announce the release of the udpipe R package (https://bnosac.github.io/udpipe/en) which is a Natural Language Processing toolkit that provides language-agnostic ‘tokenization’, ‘parts of speech tagging’, ‘lemmatization’, ‘morphological feature tagging’ and ‘dependency parsing’ of raw text. Next to text parsing, the package also allows you to train annotation models based on data of ‘treebanks’ in ‘CoNLL-U’ format as provided at http://universaldependencies.org/format.html.

The package provides direct access to language models trained on more than 50 languages.

Click through to check it out.
