Press "Enter" to skip to content

Curated SQL Posts

Databricks Delta Now Available On Azure

Cihan Biyikoglu and Singh Garewal announce the availability of Databricks Delta on Azure Databricks:

Using an innovative new table design, Delta supports both batch and streaming use cases with high query performance and strong data reliability while requiring a simpler data pipeline architecture:

Increased query performance – Able to deliver 10 to 100 times faster performance than Apache Spark(™) on Parquet through the use of key enablers such as compaction, flexible indexing, multi-dimensional clustering and data caching.

Improved data reliability – By employing ACID (“all or nothing”) transactions, schema validation / enforcement, exactly once semantics, snapshot isolation and support for UPSERTS and DELETES.

Reduced system complexity – Through the unification of batch and streaming in a common pipeline architecture – being able to operate on the same table also means a shorter time from data ingest to query result. Schema evolution provides the ability to infer schema from input data making it easier to deal with changing business needs.

The Azure version of Databricks is quickly reaching parity with the classic AWS-hosed version.

Comments closed

Other Ignite Announcements

Denny Cherry gives us a quick roundup of Ignite announcements:

On the Azure Data Platform side of the world, we have the announcement that Azure SQL DB now supports databases up to 100 TB in size using the Hyperscale feature of Azure SQL DB which you’ll see coming on October 1st, 2018.  Hyperscale is an excellent move for customers, as many customers were blocked by the fact that they couldn’t move the database to Azure SQL DB simple because of size; and this limit is going away in just a few short days.

Along with the legacy database platform, we have Managed Instance which was in Public Preview.  The fact is that it is in preview is no-more; Managed Instance is being released in General Availability starting on October 1st, 2018.  Managed Instance will make migrations to Azure much more accessible for many clients that need support for a SQL Server instance because of features that aren’t available in Azure SQL DB. Managed Instance will bridge this gap for customers giving customers basically full SQL Server functionality within a PaaS service.

In the Azure SQL DB space, we see new features for optimization of query performance getting released to General Availability.  These features include three new features called row mode: memory grant feedback, approximate query processing, and table variable deferred compilation. With minimal effort, these features can collectively optimize your memory usage and improve overall query performance.

They’re throwing a lot of stuff our way, including a less expensive version of Azure SQL Data Warehouse.

Comments closed

New Use Hint In SQL Server 2017 CU10

Pedro Lopes shows us a new use hint introduced in SQL Server 2017 CU10:

In this scenario, you only have this one query that apparently does better in SQL Server 2014 than 2017. That’s all “New CE” – there’s no CE70 vs CE 120+ at issue here. Using any known trace flag, the FORCE_LEGACY_CARDINALITY_ESTIMATION hint or the FORCE_DEFAULT_CARDINALITY_ESTIMATION hint doesn’t help. Rewriting the query is an option, but in the interim, I need a quick fix. How?

In SQL Server 2017 CU10, we have introduced a few new USE HINTs: the QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_n, where n is a supported database compatibility level. This forces the query optimizer behavior at a query level, as if the query was compiled with database compatibility level. You can refer to sys.dm_exec_valid_use_hints for a list of currently supported values for n.

So to be clear, the new hint is not forcing only a specific CE model, it’s forcing the equivalent of the specific database compatibility level’s query optimizer behavior, including any query optimizer fixes that are enabled by default in that database compatibility level.

Something to keep in mind, though ideally not something you’d want to use regularly.

Comments closed

Batch Mode On Rowstore

Kevin Farlee announces another query processing improvement in SQL Server 2019:

In the SQL Sever 2019 preview, we are further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview, batch mode on rowstore. This feature unlocks the advantages of batch mode execution in cases where there is no columnstore participating in the query.

Batch mode is a different execution mode primarily targeted at analytics queries which are characterized as scanning many rows, and doing significant aggregations, sorts, and group-by operations across these rows.  Batch mode has been reserved for queries which involve columnstore  indexes until now.

Performing scans and calculations using batches of ~ 900 rows at a time rather than row by row is much more efficient for analytic-type queries.  For queries that can take advantage of it, batch mode can easily make queries execute many times faster than the same query against the same data in row mode.

To date, the workaround you could use was to create an empty filtered columnstore index on a rowstore table.  This solution is more architecturally pleasing and means one less hack to explain.

Comments closed

The Evolution Of Polybase

Asad Khan gets into improvements in SQL Server 2019:

  • Break down data silos and deliver one view across all of your data using data virtualization. Starting in SQL Server 2016, PolyBase has enabled you to run a T-SQL query inside SQL Server to pull data from your data lake and return it in a structured format—all without moving or copying the data. Now in SQL Server 2019, we’re expanding that concept of data virtualization to additional data sources, including Oracle, Teradata, MongoDB, PostgreSQL, and others. Using the new PolyBase, you can break down data silos and easily combine data from many sources using virtualization to avoid the time, effort, security risks and duplicate data created by data movement and replication. New elastically scalable “data pools” and “compute pools” make querying virtualized data lighting fast by caching data and distributing query execution across many instances of SQL Server.

Just in time for me to scramble to update Polybase slides for Conference Season…

Comments closed

What’s In SQL Server 2019 CTP 2.0?

Aaron Bertrand gives us the highlights:

  • Certificate Management in Config Manager View and validate all of your certificates from a single interface, and manage and deploy certificate changes across all of the replicas in an Availability Group or all of the nodes in a Failover Cluster Instance.

  • Built-in data classification A new ADD SENSITIVITY CLASSIFICATION statement helps you identify and automatically audit sensitive data, a huge step up from the previous SSMS wizard (which just used extended properties).

Aaron also digs into the engine a bit:

APPROX_COUNT_DISTINCT

This new aggregate function is designed for data warehouse scenarios, and is an equivalent for COUNT(DISTINCT()). Instead of performing expensive distinct sort operations to determine actual counts, it relies instead on statistics to get something relatively accurate. You should find that the margin of error is within 2% of the precise count, 97% of the time, which is usually fine for high-level analytics, values that populate a dashboard, or quick estimates.

On my system I created a table with integer columns ranging from 100 to 1,000,000 unique values, and string columns ranging from 100 to 100,000 unique values. There were no indexes other than a clustered primary key on the leading integer column. Here are the results of COUNT(DISTINCT()) vs. APPROX_COUNT_DISTINCT() against those columns, so you can see where it is off by a bit (but always well within 2%):

By the way, APPROX_COUNT_DISTINCT() is a really good idea, and I’m glad it’s here.

Comments closed

SQL Operations Studio Is Now Azure Data Studio

David Hiltenbrand notes a name change:

Will SQL Operations Studio upgrade automatically to Azure Data Studio? 

NO! Although they’re effectively the same thing currently, you do need to install Azure Data Studio separately from your existing sqlops install. You can install the new Azure Data Studio after downloading it from here: https://aka.ms/getazuredatastudio. The docs also include a helpful section, Move User Settings, that will help you migrate any custom settings you don’t want to lose from your sqlops configuration.

Personally, I’m not a big fan of the name change.  But Grant Fritchey clues us in on the reason behind it:

The core concept here is to have a development tool that gives you a common framework for working with data, not just SQL data, but CosmosDB and others. Further, a tool that you can run where you work. Do you have a Mac? Cool. Use Azure Data Studio. Running Linux? Cool. Use Azure Data Studio. Still on Windows with me? We also get Azure Data Studio.

I do get the benefit of a tool which can hit different data sources, including something which is not SQL-based.  But the “Azure” in the name throws me.  I’ll still connect to my on-prem and AWS-based SQL Servers with it though.

Comments closed

R From The Year 2000

Colin Gillespie takes us down memory lane with some old, old code:

Last week I spent some time reminiscing about my PhD and looking through some old R code. This trip down memory lane led to some of my old R scripts that amazingly still run. My R scripts were fairly simple and just created a few graphs. However now that I’ve been programming in R for a while, with hindsight (and also things have changed), my original R code could be improved.

I wrote this code around April 2000. To put this into perspective,

  • R mailing list was started in 1997
  • R version 1.0 was released in Feb 29, 2000
  • The initial release of Git was in 2005
  • Twitter started in 2006
  • StackOverflow was launched in 2008

Basically, sharing code and getting help was much more tricky than today – so cut me some slack!

It’s a good sign when an arbitrary task becomes easier to understand as a language evolves.  And I’m glad they dumped the underscore assignment operator.

Comments closed

Reticulate: Python-R Interop

Adnan Fiaz walks us through an example of using the reticulate library to call Python from R:

So what exactly does reticulate do? It’s goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session which enables you to call Python functionality from within R. I’m not going to go into the nitty gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead I’ll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don’t load functionality from a package through a call to librarybut instead you import a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.

This is a good intro to a package which is already useful but I think will be even better over time as R & Python interoperability becomes the norm.  H/T R-Bloggers

Comments closed

Working With Data Frames In R

Dave Mason has a couple of blog posts on data frames.  First, the basics:

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call “observations”, and of columns , which are called “variables”. (Observations may also be referred to as “instances”. Variables may also be referred to as “properties”.) The data frame in R is designed for data sets. As the R documentation tells us, data frames are “used as the fundamental data structure by most of R’s modeling software”.

The function we’ll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.

Then, subsetting data frames:

Adding columns to a data frame is easy–easy compared to adding rows. We’ll get to that. To add a column, first create a vector. The class doesn’t matter. But the number of elements does–it has to match the number of observations in the data frame. Now that we have our vector, here are some options to add it as a new column to a data frame: use the $ shortcut, use double brackets with the new column name, bind the vector to the dataframe with cbind().

The data frame (or tibble, if using the tidyverse version) is probably the single most important data type in R for getting work done.

Comments closed