Press "Enter" to skip to content

Curated SQL Posts

Diamond: Solving Generalized Linear Models Using Python

Tim Sweetser and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models.  In a two-part series, they explain more.  Part 1 covers the mathematical principles behind it:

Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:

  • Convex problems have a unique global solution, i.e. there is one best answer
  • There are well-known, efficient, and reliable algorithms for finding it

One ubiquitous example of a convex problem in data science is finding the coefficients of an L2-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.
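For reference, the convex problem in question has a familiar form.  Here is a sketch of the standard L2-regularized logistic regression objective (my notation, not necessarily the exact formulation in the post):

```latex
% L2-regularized logistic regression: minimize the penalized negative
% log-likelihood over the coefficient vector \beta. The objective is convex,
% so there is a unique global minimizer.
\min_{\beta}\; -\sum_{i=1}^{n} \Big[ y_i \log \sigma(x_i^{\top}\beta)
  + (1 - y_i)\log\big(1 - \sigma(x_i^{\top}\beta)\big) \Big]
  + \frac{\lambda}{2}\,\lVert \beta \rVert_2^2,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```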

Part 2 looks at one interesting use case:

In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.
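To make that concrete, here is a sketch of the logistic GLMM structure being described (again my notation, not the authors'): a shared fixed-effect term plus a per-brand random effect that gets shrunk toward zero when a brand has little data or looks like the overall average.

```latex
% Logistic GLMM: \beta holds the fixed effects shared by every brand,
% u_{b(i)} is the random effect for the brand of observation i.
\log \frac{\Pr(y_i = 1)}{1 - \Pr(y_i = 1)} = x_i^{\top}\beta + z_i^{\top} u_{b(i)},
\qquad u_b \sim \mathcal{N}(0, \Sigma)
% The Gaussian prior on u_b acts as regularization: brands with few
% observations, or brands close to the average, get small estimates of u_b.
```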

Check it out.  Part 2 also contains a link to the GitHub repo if you want to try it on your own.

Generating U-SQL Extract Scripts From Visual Studio

Yanan Cai shows a GUI for creating U-SQL EXTRACT scripts via Azure Data Lake Tools for Visual Studio:

One of U-SQL’s core capabilities is to be able to schematize unstructured data on the fly without having to create a metadata object for it. This capability is provided by the EXTRACT expression, which will invoke either a user-defined extractor or a built-in extractor to process the input file or set of files specified in the FROM clause and produce a rowset whose schema is specified in the EXTRACT clause.

While using the built-in extractor to schematize semi-structured data, like data in a .csv file, writing the schema definition in U-SQL is slow and error prone, especially when the .csv file contains hundreds of columns.

Recently, we released a new feature in the latest version of Azure Data Lake Tools for Visual Studio to help you generate this U-SQL EXTRACT statement automatically.

Click through for an example as well as a video showing the process.
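For a sense of what the tool saves you from typing, here is a minimal hand-written EXTRACT of the sort it generates.  The file path and column list are made up for illustration; this uses the built-in Csv extractor:

```usql
// U-SQL: schematize a CSV file on the fly with the built-in extractor.
// Listing every column with a name and type is the slow, error-prone part
// that the new Visual Studio feature generates for you.
@orders =
    EXTRACT OrderId    int,
            CustomerId int,
            OrderDate  DateTime,
            OrderTotal decimal
    FROM "/input/orders.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

OUTPUT @orders
TO "/output/orders.tsv"
USING Outputters.Tsv();
```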

Contributing To Open Source: Understanding GitHub

Andy Levy has a great guide showing how to pull the dbatools repo from GitHub:

I’m putting this together here for my own reference and to hopefully write it up in a way that helps things “click” for some people who need that extra nudge to get into “aha!” territory. A number of the examples I’ve seen elsewhere have mixed the command-line and GUI clients, but the more I use git GUIs, the less I like them for the basic workflow. You only need to know a handful of commands to be productive and for that, the command line beats the GUI in my opinion.

So, here we go. My GitHub workflow for working on dbatools, with as much command-line work as possible. This walk-through assumes basic familiarity with source control concepts.

This is a great guide for people who are not familiar with Git.
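If you just want the flavor before clicking through, the fork-and-branch loop really is only a handful of commands.  Here is a generic sketch of that workflow, not necessarily Andy's exact steps (remote and branch names are illustrative):

```bash
# Generic GitHub fork workflow (illustrative, not Andy's exact walk-through).
# Assumes you have already forked dbatools to your own GitHub account.
git clone https://github.com/<your-account>/dbatools.git
cd dbatools

# Point a second remote at the original repo so you can pull in its changes.
git remote add upstream https://github.com/sqlcollaborative/dbatools.git

# Work on a topic branch rather than the default branch.
git checkout -b my-fix
git add .
git commit -m "Describe the change"

# Push the branch to your fork, then open a pull request on GitHub.
git push -u origin my-fix

# Later, keep your fork current with the original repo.
git fetch upstream
git merge upstream/master   # or whichever branch the project uses for development
```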

Pagination In Oracle Versus SQL Server

Daniel Janik is currently running an Oracle versus SQL Server series, looking at how the two database systems expose common functionality.  His latest topic is pagination:

Today’s topic is Pagination. Paging is a really important feature for web pages and applications. Without it you’d be passing large amounts of data to the application and expecting the application code to figure out which rows it needed to display.

Thankfully, someone smart came up with a way to do this on the database so you’re not returning gigs and gigs of data to the webserver to sort through.

Read on to see how the two platforms do this.
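For the short version of the syntax involved: both platforms now support the ANSI OFFSET/FETCH clause, while older Oracle versions fall back to ROWNUM.  A sketch with made-up table and column names, not necessarily the queries in Daniel's post:

```sql
-- SQL Server 2012+ and Oracle 12c+: fetch "page 3" of 25 rows.
-- A deterministic ORDER BY is required for stable paging.
SELECT OrderId, OrderDate, CustomerId
FROM dbo.Orders
ORDER BY OrderId
OFFSET 50 ROWS FETCH NEXT 25 ROWS ONLY;

-- Oracle before 12c: the classic nested ROWNUM pattern.
SELECT *
FROM (
    SELECT t.*, ROWNUM AS rn
    FROM (
        SELECT OrderId, OrderDate, CustomerId
        FROM Orders
        ORDER BY OrderId
    ) t
    WHERE ROWNUM <= 75
)
WHERE rn > 50;
```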

DevOps And The DBA

Kellyn Pot’Vin-Gorman gives a bit of DevOps 101 to database administrators:

Monitoring changes a bit with DevOps. It’s less about a simple tier and more about the entire infrastructure. Monitoring the application, host, database, and the availability between each is essential. As these different tiers rarely come from one vendor, and many may even be proprietary, there are requirements to monitor using multiple tools, scripts and interfaces.

Two of the main products for monitoring recognized in the DevOps community are New Relic and AppDynamics. Monitoring can be as simple as a suite of scripts that report the health and status of processes and orchestration, notifying if there is any failure. This choice normally has a scaling limit, and at some point a more robust solution is required, or gaps are felt in the monitoring process or failures occur at certain tiers. More enterprise solutions, such as New Relic and AppDynamics, are enhanced by logging suites like Splunk and Sumo Logic.

Read the whole thing.

Index Changes And Recompilation

Kendra Little explains how index modifications affect the plan cache:

Creating an index doesn’t remove a plan from cache

We run our Demo Query five times. Querying sys.dm_exec_query_stats and related DMVs, we see five executions, plan_generation_num = one.

Then we create an index on a table referenced by the query.

Querying the DMVs we still see five executions, plan_generation_num = one, and the query plan in cache. It hasn’t been removed!

Read on for more, including what happens when you run the query again, what happens when you rebuild indexes, and what happens when you drop an index.
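If you want to follow along, the cache check itself is a simple DMV query.  Here is a sketch of the kind of query involved (the text filter is illustrative):

```sql
-- Execution count and plan_generation_num for cached plans.
-- plan_generation_num increments when a cached plan is recompiled in place,
-- so a stable value of 1 means no recompile has happened yet.
SELECT
    qs.execution_count,
    qs.plan_generation_num,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE N'%Demo Query%';  -- illustrative filter for the demo query
```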

Deletes And Indexes

Jeff Schwartz looks at the performance cost of indexes when it comes to deleting rows:

Many articles concerning SQL Server discuss how record insertion overhead increases with each additional index. They discuss b-tree manipulations and page splits in addition to leaf and non-leaf levels. However, few discuss the fact that deletion overhead increases as well, especially when large numbers of records are deleted by individual queries. Recently, I was working with a client who regularly purged large numbers of records from tables that ranged in size from large to gigantic. For example, one table contained over 6.5 billion records. I added an index (4th overall) to one table expressly for the purpose of expediting the large deletion process, and the deletion run ran longer, despite using the new index! To determine how the numbers of indices and records to be deleted interact, I conducted an experiment to test several combinations. The specifics of the tests and their corresponding results are summarized below.

Check it out.  There’s certainly more to the story than “add indexes to improve performance.”
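The mechanics are easy to picture: every nonclustered index is one more structure each deleted row has to be removed from, so the same DELETE does more work as the index count grows.  A minimal sketch (table, indexes, and batch size are made up for illustration):

```sql
-- Each nonclustered index below must also be maintained for every deleted row,
-- so a large purge gets more expensive as indexes are added.
CREATE TABLE dbo.PurgeDemo
(
    Id         bigint IDENTITY PRIMARY KEY,
    CustomerId int,
    CreatedOn  datetime2,
    Payload    varchar(200)
);

CREATE INDEX IX_PurgeDemo_CustomerId ON dbo.PurgeDemo (CustomerId);
CREATE INDEX IX_PurgeDemo_CreatedOn  ON dbo.PurgeDemo (CreatedOn);

-- Purging in batches keeps locking and log growth manageable, but the
-- per-row index maintenance cost is paid regardless of batch size.
DELETE TOP (50000)
FROM dbo.PurgeDemo
WHERE CreatedOn < DATEADD(YEAR, -1, SYSUTCDATETIME());
```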

Cloudera Director And AWS Spot Instances

David Han shows off some new features in Cloudera Director 2.5 to help when building Hadoop clusters on AWS spot price instances:

You can configure Spot instances in Cloudera Director’s instance templates. These instance templates contain a flag indicating whether Spot instances should be used, as well as a field specifying the bid price for those instances.

Each instance group in the cluster template includes a field that indicates the minimum number of instances required in that group for the cluster to be considered successful. Cloudera Director will continue with bootstrapping or growing a cluster if the minimum count for each instance group is satisfied. Spot instances should not be used for instance groups that are required for the normal operation of the cluster, such as HDFS DataNodes. Instance groups configured to use Spot instances should set their minimum number to zero with the expectation that the instances may not be provisioned due to the Spot bid price being lower than the Spot price.

If you’re able to take advantage of spot instances, you can end up saving a pretty good amount of money.

Big Performance Tuning Mistakes

Erik Darling enumerates common performance tuning mistakes:

Coming in at number five and looking alive! Did you know that query plans can be different on busy servers? I bet not! And aside from that, your performance problem might not even be the query itself; it may be blocking, or a poison wait. This stuff may not show up in Dev, unless you spend a lot of time and money engineering load tests.

This is what safety groups call “Situational Awareness”, and this is the kind of stuff that you really want a monitoring tool in place for. Sure, that query ran slowly, but if that’s all you know, and you can’t reproduce it, then you need to start digging deeper.

There are a number of tips here, and that number is five.

Searching In Execution Plans

Pedro Lopes gives a reason to upgrade to SSMS 17.2:

But what if I could search for node id 30 while looking at the graphical showplan?

Starting with SSMS 17.2, just use CTRL+F to start a search in graphical showplan (or right-click on a blank area of the plan, and in the context menu click on Find Node option), and you can quickly see exactly where node id 30 is:

I will have to see whether that also lets you quickly find the origin of Expr#### columns.
