
Curated SQL Posts

Batch Mode Processing On Rowstore Tables

Dmitry Pilugin shares some thoughts on the expansion of batch mode processing to rowstore tables:

The main advantages of Batch Mode are:

  • Algorithms optimized for the multi-core modern CPUs;
  • Better CPU cache utilization and increased memory throughput;
  • Reduced number of CPU instructions per processed row.

All these features make Batch Mode much faster than Row Mode (typically an order of magnitude, 10x-100x faster) for analytical queries with CS indexes. One major condition for Batch Mode is the presence of a CS index. If you don’t have a CS index on a table involved in a query, you won’t get Batch Mode.

However, some analytical queries may benefit from Batch Mode even when there is no CS index, or when a CS index cannot be created due to some limitation.

There are a few tricks that allow you to enable Batch Mode on a Rowstore table, for example with a dummy filtered CS index (see this post from Itzik Ben-Gan), but SQL Server 2019 may use Batch Mode on Rowstore without any extra effort on your side.

Dmitry dives into the debugger and teases out the specific circumstances which can help get a query considered for Batch Mode on Rowstore.  If you want a deep dive into what’s currently available, this is your post.
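
To make the pre-2019 workaround concrete, here is a rough sketch of the dummy filtered columnstore index trick, run through pyodbc; the connection string, table, and column names are placeholders, not anything from Dmitry's or Itzik's posts.

```python
import pyodbc

# Placeholder connection string, table (dbo.FactSales), and column (SaleID).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Sandbox;Trusted_Connection=yes;"
)
# The contradictory WHERE clause means the index never contains any rows, so it
# is essentially free to maintain, yet its mere presence lets the optimizer
# consider Batch Mode for queries against the table.
conn.execute("""
    CREATE NONCLUSTERED COLUMNSTORE INDEX ix_cs_dummy
    ON dbo.FactSales (SaleID)
    WHERE SaleID = -1 AND SaleID = -2;
""")
conn.commit()
```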


Deploying An Azure Container Within A Virtual Network

Andrew Pruski shows us that you can now deploy an Azure container running SQL Server within an Azure virtual network:

Up until now, Azure Container Instances only had one option to allow us to connect: assigning a public IP address that was directly exposed to the internet.

Not really great, as exposing SQL Server on port 1433 to the internet is generally a bad idea.

Now I know there’s a lot of debate about whether or not you should change the port that SQL is listening on to prevent this from happening. My personal opinion is that if someone wants to get into your SQL instance, changing the port isn’t going to slow them down much. However, a port change will stop opportunistic hacks (such as the above).

But now we have another option: the ability to deploy an ACI within a virtual network in Azure! So let’s run through how to deploy.

Click through for those instructions.
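
If you'd rather script it than click through the portal, the deployment boils down to a single az container create call. Here's a rough sketch via Python's subprocess, assuming the resource group, VNet, and subnet already exist; all of the names and the SA password below are placeholders, and the exact flags may differ from what Andrew shows.

```python
import subprocess

# All resource names and the SA password below are placeholders.
subprocess.run(
    [
        "az", "container", "create",
        "--resource-group", "rg-sql-aci",
        "--name", "sql-aci-demo",
        "--image", "mcr.microsoft.com/mssql/server:2017-latest",
        "--cpu", "2",
        "--memory", "4",
        # Deploy into an existing VNet/subnet instead of exposing a public IP.
        "--vnet", "vnet-sql",
        "--subnet", "subnet-aci",
        "--ports", "1433",
        "--environment-variables", "ACCEPT_EULA=Y", "SA_PASSWORD=<StrongPasswordHere>",
    ],
    check=True,
)
```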


Stateful Services With Kubernetes

Kevin Sookocheff explains some scenarios in which stateful Kubernetes services can work well:

With leader election, you begin with a set of candidates that wish to become the leader, and each of these candidates races to see who will be the first to be declared the leader. Once a candidate has been elected the leader, it continually sends a heartbeat signal to keep renewing its position as the leader. If that heartbeat fails, the other candidates again race to become the new leader. Implementing a leader election algorithm usually requires either deploying software such as ZooKeeper or etcd and using it to determine consensus, or alternatively implementing a consensus algorithm on your own. Neither of these is ideal: ZooKeeper and etcd are complicated pieces of software that can be difficult to operate, and implementing a consensus algorithm on your own is a road fraught with peril. Thankfully, Kubernetes already runs an etcd cluster that consistently stores Kubernetes cluster state, and we can leverage that cluster to perform leader election simply by leveraging the Kubernetes API server.

Kubernetes already uses the Endpoints resource to represent a replicated set of pods that comprise a service and we can re-use that same object to retrieve all the pods that make up your distributed system. Given this list of pods, we leverage two other properties of the Kubernetes API: ResourceVersions and Annotations. Annotations are arbitrary key/value pairs that can be used by Kubernetes clients, and ResourceVersions mark the unique version of every Kubernetes resource in the cluster. Given these two primitives, we can perform leader election in a fairly straightforward manner: query the Endpoints resource to get the list of all pods running your service, and set Annotations on those resources. Each change to an Annotation also updates the ResourceVersion metadata. Because the Kubernetes API server is backed by etcd, a strongly consistent datastore, you can use Annotations and the ResourceVersion metadata to implement a simple compare-and-swap algorithm.

Google has used this approach to implement leader election as a Kubernetes Service, and you can run that service as a sidecar to your application to perform leader election backed by etcd. For more on running a leader election algorithm in Kubernetes, refer to this blog post.

This is one of the problems that container services like Docker are striving to answer, but I don’t think they have it quite nailed down yet.
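
To see what that compare-and-swap looks like in practice, here is a minimal sketch using the official Python Kubernetes client. The namespace, service, and pod names are placeholders, and a production version would also renew the annotation as a heartbeat and expire stale leaders.

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()  # use load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()

NAMESPACE, SERVICE, MY_POD = "default", "my-service", "pod-a"  # placeholders

def try_acquire_leadership() -> bool:
    # Read the Endpoints object backing the service; it carries the current
    # resourceVersion and any annotations set by previous candidates.
    ep = v1.read_namespaced_endpoints(SERVICE, NAMESPACE)
    annotations = ep.metadata.annotations or {}
    leader = annotations.get("leader")
    if leader not in (None, "", MY_POD):
        return False  # another pod currently holds the lock

    annotations["leader"] = MY_POD
    ep.metadata.annotations = annotations
    try:
        # The replace call sends back the resourceVersion we just read; if anyone
        # modified the object in the meantime, the API server (backed by etcd)
        # rejects the write with 409 Conflict -- that is the compare-and-swap.
        v1.replace_namespaced_endpoints(SERVICE, NAMESPACE, ep)
        return True
    except ApiException as e:
        if e.status == 409:
            return False
        raise
```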


Multi-Class Classification With vtreat

John Mount has an example of using the vtreat package for multi-class classification in R:

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Click through for an example of this in action.
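
The post covers the R functions; for what it's worth, the Python port of vtreat exposes the same idea through a MultinomialOutcomeTreatment class, along the lines of the sketch below. The data frame and column names are made up, and the API details are worth double-checking against the package's documentation.

```python
import pandas as pd
import vtreat  # Python port of the R vtreat package

# Made-up frame: two messy features plus a three-class outcome "y".
d = pd.DataFrame({
    "x1": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "x2": ["a", "b", "a", None, "b", "c"],
    "y":  ["red", "green", "blue", "red", "green", "blue"],
})

# Analogue of mkCrossFrameMExperiment(): builds a cross-validated treatment
# plan for a multi-class outcome and returns the treated (numeric) frame.
plan = vtreat.MultinomialOutcomeTreatment(outcome_name="y")
d_treated = plan.fit_transform(d, d["y"])
print(d_treated.head())
```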


Azure Data Factory Or Integration Services?

Teo Lachev contrasts use cases for Integration Services versus Azure Data Factory V2:

So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately once that message made it out there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, of course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credit, ADF v2 has made significant enhancements in features, usability, and maintainability. There is also a “lift and shift” option to run SSIS inside ADF, but since this architecture requires a VM, I consider it a narrow-case scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?

All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.


Forcing A Plan Is A Temporary Solution

Erin Stellato explains when she forces plans—and that this is not a permanent solution to a performance problem:

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution.  I don’t expect you to have plans forced for years, let alone months.  The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production.  If you go the “set it and forget it” route, theoretically a manually forced plan could get used for a very long time.  In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query.  I would be checking every couple of weeks; once a month at most.  Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this).  When you force a plan manually, forcing can still fail.  For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in the plan in the same manner, then forcing will fail.  Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail!  If you’re forcing plans and not familiar with the reasons that it can fail, note the last_force_failure_reason values listed for sys.query_store_plan.  If you have manually forced a plan for a query, and forcing the plan fails, it remains forced.  You have to manually un-force it to stop SQL Server from trying to use that plan.  As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.

There is much sound advice in this post.
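
One easy way to follow Erin's advice about not forgetting forced plans is to check the Query Store catalog views on a schedule. A rough sketch via pyodbc (the connection string is a placeholder) that lists every forced plan along with the last reason forcing failed:

```python
import pyodbc

# Placeholder connection string; point it at the database whose Query Store
# you want to inspect.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=UserDB;Trusted_Connection=yes;"
)
rows = conn.execute("""
    SELECT p.query_id,
           p.plan_id,
           p.force_failure_count,
           p.last_force_failure_reason_desc,
           p.last_execution_time
    FROM sys.query_store_plan AS p
    WHERE p.is_forced_plan = 1
    ORDER BY p.last_execution_time DESC;
""").fetchall()

for r in rows:
    print(r.query_id, r.plan_id, r.force_failure_count, r.last_force_failure_reason_desc)
```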


Azure Data Factory V2 Dependencies

Meagan Longoria has important notes on how Azure Data Factory V2 dependencies differ from SQL Server Integration Services precedence constraints:

This sounds similar to SSIS precedence constraints, but there are a couple of big differences.

  1. SSIS allows us to define expressions to be evaluated to determine if the next task should be executed.
  2. SSIS allows us to choose whether we handle multiple constraints as a logical AND or a logical OR. In other words, do we need all constraints to be true, or just one?

ADF V2 activity dependencies are always a logical AND. While we can design control flows in ADF similar to how we might design control flows in SSIS, this is one of several differences. Let’s look at an example.

Meagan gives us three methods of replicating SSIS functionality using ADF V2, so check it out.
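
For reference, dependencies in ADF V2 are declared on the downstream activity in the pipeline definition. Here's a sketch of that structure, written as a Python dict that mirrors the pipeline JSON, with made-up activity names; because multiple entries in dependsOn are always combined as a logical AND, SendNotification below runs only when both copies succeed.

```python
# Made-up activity names; this mirrors one entry in the "activities" section
# of an ADF V2 pipeline JSON definition.
send_notification = {
    "name": "SendNotification",
    "type": "WebActivity",
    "dependsOn": [
        # Both conditions must hold -- ADF treats multiple dependencies as AND.
        {"activity": "CopyCustomers", "dependencyConditions": ["Succeeded"]},
        {"activity": "CopyOrders", "dependencyConditions": ["Succeeded"]},
    ],
    # typeProperties for the web call would go here.
}
```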


HDP 3.0 Updates To Hive And Druid

Nishant Bangarwa has some updates to Apache Druid in HDP 3.0:

There are numerous improvements that went into HDP 3.0 and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration:

    1. Druid Expressions Support – HIVE-18893 and CALCITE-2170 added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query and they can be evaluated by Druid.

    2. Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.

    3. GroupBy Query Improvements – Many optimizations were done to address the performance of GroupBy queries on the Druid side. The main ones are:

      1. #4660 Parallel sort for ConcurrentGrouper
      2. #4576 Array-based aggregation for GroupBy query
      3. #4668 Add IntGrouper to avoid unnecessary boxing/unboxing in array-based aggregation
      4. #4704 Parallel merging of intermediate groupby results on the broker nodes.
    4. Better column pruning – In some cases where Hive could not push any operator down to Druid, Hive ended up pulling all the columns from Druid, which led to lots of unnecessary data transfer between Druid and Hive. HIVE-15619 improved the column pruner logic to fetch only the columns from Druid which are required to answer a query.

    5. Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with the latest version of Druid, 0.12.2, which has many new features, performance enhancements, and bug fixes over the previous version.

Druid is still a specialty technology which doesn’t fit every use case, but if it does fit yours, you’ll get a lot of performance benefit out of it.


Analyzing Day-Over-Day Changes With Power Query

Dany Hoter shows how to use Power Query to join one row to the next in a data set (given similar criteria) and do day-by-day comparisons:

I want to analyze the daily prices of certain commodities and be able to show the patterns of daily changes side by side. I want to predictably calculate the differences between each row and the row before. Each row represents data for a day, so the difference between rows is the daily change or, in some cases, several days’ change.

I downloaded from Quandl 50 years of daily prices of gold and silver, and my goal is to calculate the daily changes in terms of dollars and percentage from day to day. Not all days are represented, so in case of a gap I calculate the number of days in the gap, and I divide the growth % by the number of days. I already imported and appended the data for both metals into a single table in Excel and we’ll start the process from this table.

Read on for the solution.  I’d just as soon LAG() the data in SQL Server, but if that’s not an option, this certainly works.
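
If you do have the data somewhere pandas can reach, the same gap-aware daily change is a short groupby-and-shift. A sketch with a made-up file and column layout (Date, Metal, Price), dividing the percentage change by the number of days in each gap as Dany describes:

```python
import pandas as pd

# Made-up file and columns: one row per metal per trading day.
prices = pd.read_csv("metal_prices.csv", parse_dates=["Date"])
prices = prices.sort_values(["Metal", "Date"])

grp = prices.groupby("Metal")
prices["PrevDate"] = grp["Date"].shift(1)
prices["PrevPrice"] = grp["Price"].shift(1)

gap_days = (prices["Date"] - prices["PrevDate"]).dt.days
prices["ChangeUSD"] = prices["Price"] - prices["PrevPrice"]
# Spread the percentage change evenly over the days in the gap.
prices["ChangePctPerDay"] = (prices["Price"] / prices["PrevPrice"] - 1) * 100 / gap_days
```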
