Press "Enter" to skip to content

Curated SQL Posts

Understanding Query Optimizer Timeouts

Joseph Pilov answers frequently asked questions about SQL Server’s query optimizer when it times out:

What Is Optimizer Timeout?

SQL Server uses a cost-based query optimizer. Therefore, it selects a query plan with the lowest cost after it has built and examined multiple query plans. One of the objectives of the SQL Server query optimizer (QO) is to spend a “reasonable time” in query optimization as compared to query execution. Therefore, QO has a built-in threshold of tasks to consider before it stops the optimization process. If this threshold is reached before QO has considered most, if not all, possible plans then it has reached the Optimizer TimeOut limit. An event is reported in the query plan as Time Out under “Reason For Early Termination of Statement Optimization.” It’s important to understand that this threshold isn’t based on clock time but on number of possibilities considered. In current SQL QO versions, over a half million possibilities are considered before a time out is reached.

Optimizer timeout is designed in Microsoft SQL Server and in many cases encountering it is not a factor affecting query performance. However, in some cases the SQL query plan choice may be affected by optimizer timeout and thus performance could be impacted. When you encounter such issues, if you understand optimizer timeout mechanism and how complex queries can be affected in SQL Server, it can help you to better troubleshoot and improve your performance issue.

Read the whole thing.

Comments closed

Packages For Testing R Packages

Maelle Salmon shows us how to test our R packages within R:

If you’re brand-new to unit testing your R package, I’d recommend reading this chapter from Hadley Wickham’s book about R packages.

There’s an R package called RUnit for unit testing, but in the whole post we’ll mention resources around the testthat package since it’s the one we use in our packages, and arguably the most popular one. testthat is great! Don’t hesitate to reads its docs again if you started using it a while ago, since the latest major release added the setup() and teardown() functions to run code before and after all tests, very handy.

To setup testing in an existing package i.e. creating the test folder and adding testthat as a dependency, run usethis::use_testthat(). In our WIP pRojects package, we set up the tests directory for you so you don’t forget. Then, in any case, add new tests for a function using usethis::use_test().

The testthis package might help make your testing workflow even smoother. In particular, test_this() “reloads the package and runs tests associated with the currently open R script file.”, and there’s also a function for opening the test file associated with the current R script.

This is an area where I know I need to get better, and Maelle gives us a plethora of tooling for tests.

Comments closed

Clients For Working With HDFS

Mark Litwintschik reviews several clients for working with the Hadoop Distributed Filesystem:

The Hadoop Distributed File System (HDFS) allows you to both federate storage across many computers as well as distribute files in a redundant manor across a cluster. HDFS is a key component to many storage clusters that possess more than a petabyte of capacity.

Each computer acting as a storage node in a cluster can contain one or more storage devices. This can allow several mechanical storage drives to both store data more reliably than SSDs, keep the cost per gigabyte down as well as go some way to exhausting the SATA bus capacity of a given system.

Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many that interact with HDFS directly it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I’ll walk through a few of these offerings.

Read on for reviews of those offerings.

Comments closed

Monitoring Apache NiFi With A Custom Dashboard

Tim Spann has started a new series on monitoring Apache NiFi:

In this little proof of concept work, we grab some of these flows process them in Apache NiFi and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and Druid for time series. We will see as this expands.

There are also other data access options including the NiFi REST API and the NiFi Python APIs.

Boostrap Notifier

  • Send notification when the NiFi starts, stops or died unexpectedly
  • Two OOTB notifications
  • Email notification service
  • HTTP notification service
  • It’s easy to write a custom notification service

Reporting Tasks

  • AmbariReportingTask (global, per process group)

  • MonitorDiskUsage(Flowfile, content, provenance repositories)

  • MonitorMemory

Much of this is an overview of the tools and measures available.

Comments closed

Reshaping Data Frames With tidyr

Anisa Dhana shows off some of the data reshaping functionality available in the tidyr package:

As it is shown above, the variable agegp has 6 groups (i.e., 25-34, 35-44) which has different alcohol intake and smoking use combinations. I think it would be interesting to transform this dataset from long to wide and to create a column for each age group and show the respective cases. Let see how the dataset will look like.

dt %>% spread(agegp, ncases) %>% slice(1:5)

Click through for a few additional transformations.

Comments closed

Querying Web API From Power BI

Paul Turley shows us how to hit secured Web API endpoints with Power BI:

Having recently worked-through numerous issues with API data feeds and deployed report configurations, I’ve learned a few important best practices and caveats – at least for some common use cases.  In one example, we have a client who expose their software-as-a-service (SaaS) customer data through several web API endpoints.  Each SaaS customer has a unique security key which they can use with Power BI, Power Query or Excel and other tools to create reporting solutions.  If we need a list of available products, it is a simple matter to create a long URL string consisting of the web address for the  endpoint, security key and other parameters; an then just pass this to Power Query as a web data source.  However, it’s not quite that easy for non-trivial reporting scenarios.

Thanks to Jamie Mikami from CSG Pro for helping me with the Azure function code for demonstrating this with demo data.  Thanks also to Chris Webb who has meticulously covered several facets of API data sources in great detail on his blog, making this process much easier.

Click through for the instructions.

Comments closed

Understanding ANY And ALL In SQL

Doug Kline explains the ANY and ALL operators in SQL:

-- note that this creates a single column of values
-- which could be used in something like IN
-- for example
SELECT 1
WHERE 12 IN ( SELECT tempField FROM (VALUES(11),(12),(7)) tempTable(tempField))
-- I could rephrase this as:
SELECT 1
WHERE 12 = ANY ( SELECT tempField FROM (VALUES(11),(12),(7)) tempTable(tempField))

I rarely see these operators in the wild and might have used them in production code a couple of times if that.

Comments closed

Big Data Clusters In SQL Server 2019

James Serra lays out some of the architecture behind SQL Server 2019 Big Data Clusters:

While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer.  The virtual data layer allows users to query data from many sources through a single, unified interface.  Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date.  Storage costs and data governance complexity are minimized.  See the pro’s and con’s of data virtualization via Data Virtualization vs Data Warehouse and  Data Virtualization vs. Data Movement.

SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:

James covers some of the reasoning behind this and the shift from using Polybase to integrate data with Hadoop + Azure Blob Storage to using SQL Server as a data virtualization engine.

Comments closed

Finding Databases With Multiple Data Or Log Files

Lori Brown has a couple of quick scripts to help find databases made up of several data or log files:

This might be kind of basic but since I am working on a comprehensive script to discover things that a DBA really needs to know about, I made a couple of queries that will produce a list of the databases that have multiple files along with the locations of the physical files.  One query finds multiple database files (mdf’s) and the other looks for multiple transaction log files (ldf’s).  This will also find the Filestream file locations.  Since I often have to take on instances without ever having seen them, it is good to know about little things like this.

This script might be helpful in finding minor performance gains by looking for places to add data files or remove log files.

Comments closed

112 Million Cab Rides In Azure SQL Data Warehouse

Derik Hammer wants a real test of Azure SQL Data Warehouse:

The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.

Read on to see how you can set it up yourself.  As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.

Comments closed