Press "Enter" to skip to content

Month: October 2018

SQL Server Management Studio 18.0 Public Preview

Dinakar Nethi announces the release of a public preview of SQL Server Management Studio 18.0:

Shell improvements

  • SSMS is based on the new VS 2017 Isolated Shell. This means a modern shell that unlocks all the accessibility features from both SSMS and VS 2017.

  • Smaller download size (~400 MB). This is less than half of what SSMS 17.x is.

  • SSMS can be installed in a custom folder. Currently, this is only available via the command-line setup: pass the extra argument SSMSInstallRoot=C:\MyFolder to SSMS-Setup-ENU.exe (see the sketch after this list).

  • High DPI enabled by default.

  • Better support for multiple monitors to ensure dialogs and windows pop up on the expected monitor.

  • Isolation from the SQL engine. SSMS no longer shares components with the SQL engine, and this greater isolation allows for more frequent updates.

  • Package IDs are no longer needed to develop SSMS extensions.

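For example, a custom-folder install from the command line would look something like this (the folder path is a placeholder; SSMSInstallRoot is the argument named above):

    SSMS-Setup-ENU.exe SSMSInstallRoot="C:\MyFolder"
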
18.0 will install alongside 17, so you can have both at the same time.

Batch Mode Processing On Rowstore Tables

Dmitry Pilugin shares some thoughts on the expansion of batch mode processing to rowstore tables:

The main advantages of Batch Mode are:

  • Algorithms optimized for modern multi-core CPUs;
  • Better CPU cache utilization and increased memory throughput;
  • Reduced number of CPU instructions per processed row.

All these features make Batch Mode much faster than Row Mode (typically an order of magnitude or two, 10x-100x) for analytical queries with CS indexes. One major condition for Batch Mode is the presence of a CS index; if you don’t have a CS index on a table involved in a query, you won’t get Batch Mode.

However, some analytical queries may benefit from Batch Mode without a CS index, or a CS index cannot be created due to some limitation.

There are a few tricks that allow you to enable Batch Mode on a Rowstore table, for example with a dummy filtered CS index (see this post from Itzik Ben-Gan), but SQL Server 2019 may use Batch Mode on Rowstore without any extra effort on your side.
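
That dummy-index trick amounts to creating an empty filtered columnstore index whose predicate can never be true; its mere presence lets the optimizer consider Batch Mode. A minimal sketch (table and column names are hypothetical):

    CREATE NONCLUSTERED COLUMNSTORE INDEX ix_cs_dummy
    ON dbo.FactSales (OrderID)
    -- contradictory predicate: the index stays empty,
    -- but the optimizer may now choose Batch Mode operators
    WHERE OrderID = -1 AND OrderID = -2;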

Dmitry dives into the debugger and teases out the specific circumstances which can get a query considered for Batch Mode on rowstore. If you want a deep dive into what’s currently available, this is your post.

Deploying An Azure Container Within A Virtual Network

Andrew Pruski shows us that you can now deploy an Azure container running SQL Server within an Azure virtual network:

Up until now Azure Container Instances only had one option to allow us to connect: assigning a public IP address that was directly exposed to the internet.

Not really great, as exposing SQL Server on port 1433 to the internet is generally a bad idea.

Now I know there’s a lot of debate about whether or not you should change the port that SQL is listening on to prevent this from happening. My personal opinion is that if someone wants to get into your SQL instance, changing the port isn’t going to slow them down much. However, a port change will stop opportunistic attacks.

But now we have another option: the ability to deploy an ACI within a virtual network in Azure! So let’s run through how to deploy.

Click through for those instructions.
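
As a rough sketch of the shape of the deployment (resource names, image tag, and sizes are placeholders, and the --vnet/--subnet switches are from the current Azure CLI, so the preview-era syntax in Andrew’s post may differ):

    # create a SQL Server container inside a virtual network
    az container create \
        --resource-group myResourceGroup \
        --name sql-aci \
        --image mcr.microsoft.com/mssql/server:2017-latest \
        --vnet myVnet \
        --subnet mySubnet \
        --cpu 2 --memory 4 \
        --environment-variables ACCEPT_EULA=Y \
        --secure-environment-variables SA_PASSWORD='Str0ngP@ssw0rd1'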

Stateful Services With Kubernetes

Kevin Sookocheff explains some scenarios in which stateful Kubernetes services can work well:

With leader election, you begin with a set of candidates that wish to become the leader, and each of these candidates races to see who will be the first to be declared the leader. Once a candidate has been elected the leader, it continually sends a heartbeat signal to keep renewing its position as the leader. If that heartbeat fails, the other candidates again race to become the new leader. Implementing a leader election algorithm usually requires either deploying software such as ZooKeeper or etcd and using it to determine consensus, or alternatively implementing a consensus algorithm on your own. Neither of these is ideal: ZooKeeper and etcd are complicated pieces of software that can be difficult to operate, and implementing a consensus algorithm on your own is a road fraught with peril. Thankfully, Kubernetes already runs an etcd cluster that consistently stores Kubernetes cluster state, and we can leverage that cluster to perform leader election simply by leveraging the Kubernetes API server.

Kubernetes already uses the Endpoints resource to represent a replicated set of pods that comprise a service and we can re-use that same object to retrieve all the pods that make up your distributed system. Given this list of pods, we leverage two other properties of the Kubernetes API: ResourceVersions and Annotations. Annotations are arbitrary key/value pairs that can be used by Kubernetes clients, and ResourceVersions mark the unique version of every Kubernetes resource in the cluster. Given these two primitives, we can perform leader election in a fairly straightforward manner: query the Endpoints resource to get the list of all pods running your service, and set Annotations on those resources. Each change to an Annotation also updates the ResourceVersion metadata. Because the Kubernetes API server is backed by etcd, a strongly consistent datastore, you can use Annotations and the ResourceVersion metadata to implement a simple compare-and-swap algorithm.
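
A minimal sketch of that compare-and-swap with kubectl (service name, annotation key, and version number are placeholders):

    # read the current resourceVersion of the Endpoints object
    kubectl get endpoints my-service \
        -o jsonpath='{.metadata.resourceVersion}'

    # attempt to claim leadership; --resource-version makes the write
    # conditional, so it fails if another candidate updated the object first
    kubectl annotate endpoints my-service leader=my-pod-name \
        --resource-version=12345 --overwrite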

Google has used this approach to implement leader election as a Kubernetes Service, and you can run that service as a sidecar to your application to perform leader election backed by etcd. For more on running a leader election algorithm in Kubernetes, refer to this blog post.

This is one of the problems that container services like Docker are striving to solve, but I don’t think they have it quite nailed down yet.

Multi-Class Classification With vtreat

John Mount has an example of using the vtreat package for multi-class classification in R:

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Click through for an example of this in action.
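
If you want a feel for the shape of the API first, here is a minimal sketch with synthetic data (argument and field names follow the package’s documentation):

    library(vtreat)

    # tiny synthetic training set with a three-class outcome
    d <- data.frame(
      x = c(1, 2, 3, 4, 5, 6),
      y = c("a", "b", "c", "a", "b", "c"),
      stringsAsFactors = FALSE
    )

    # build the cross-frame experiment for a multinomial outcome
    cfe <- mkCrossFrameMExperiment(d, varlist = "x", outcomename = "y")

    # cross-validated treated frame, safe to train on
    d_treated <- cfe$cross_frame

    # apply the same treatment to new data; this dispatches to the
    # prepare.multinomial_plan() S3 method mentioned above
    new_treated <- prepare(cfe$treat_m, d)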

Azure Data Factory Or Integration Services?

Teo Lachev contrasts use cases for Integration Services versus Azure Data Factory V2:

So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately, once that message made it out, there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud, things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, of course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credit, ADF v2 has made significant enhancements in features, usability, and maintainability. There is also a “lift and shift” option to run SSIS inside ADF, but since this architecture requires a VM, I consider it a narrow scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?

All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.

Forcing A Plan Is A Temporary Solution

Erin Stellato explains when she forces plans—and that this is not a permanent solution to a performance problem:

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution. I don’t expect you to have plans forced for years, let alone months. The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production. If you go the “set it and forget it” route, theoretically a manually forced plan could get used for a very long time. In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query. I would be checking every couple of weeks; once a month at most. Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans, because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this). When you force a plan manually, forcing can still fail. For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in the plan in the same manner, then forcing will fail. Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail! If you’re forcing plans and not familiar with the reasons that forcing can fail, note the last_force_failure_reason values listed for sys.query_store_plan. If you have manually forced a plan for a query and forcing fails, the plan remains forced. You have to manually un-force it to stop SQL Server from trying to use that plan. As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.

There is much sound advice in this post.
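
As a starting point for the periodic check Erin describes, something like this surfaces forced plans and any forcing failures (the ids in the un-force call are placeholders):

    -- list forced plans along with the reason for the last forcing failure
    SELECT p.query_id,
           p.plan_id,
           p.force_failure_count,
           p.last_force_failure_reason_desc
    FROM sys.query_store_plan AS p
    WHERE p.is_forced_plan = 1;

    -- un-force a plan once a proper code or schema fix has shipped
    EXEC sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;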

Azure Data Factory V2 Dependencies

Meagan Longoria has important notes on how Azure Data Factory V2 dependencies differ from SQL Server Integration Services precedence constraints:

This sounds similar to SSIS precedence constraints, but there are a couple of big differences.

  1. SSIS allows us to define expressions to be evaluated to determine if the next task should be executed.
  2. SSIS allows us to choose whether we handle multiple constraints as a logical AND or a logical OR. In other words, do we need all constraints to be true, or just one?

ADF V2 activity dependencies are always a logical AND. While we can design control flows in ADF similar to how we might design control flows in SSIS, this is one of several differences. Let’s look at an example.

Meagan gives us three methods of replicating SSIS functionality using ADF V2, so check it out.
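
For reference, a dependency in ADF V2 pipeline JSON looks roughly like this (activity names are hypothetical); when an activity lists more than one dependency, all of them must be satisfied before it runs:

    {
      "name": "LoadFact",
      "type": "Copy",
      "dependsOn": [
        { "activity": "LoadDimA", "dependencyConditions": [ "Succeeded" ] },
        { "activity": "LoadDimB", "dependencyConditions": [ "Succeeded" ] }
      ]
    }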

HDP 3.0 Updates To Hive And Druid

Nishant Bangarwa has some updates to Apache Druid in HDP 3.0:

There are numerous improvements that went into HDP 3.0, and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration:

    1. Druid Expressions Support – HIVE-18893 and CALCITE-2170 added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query, and they can be evaluated by Druid (see the sketch after this list).

    2. Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.

    3. GroupBy Query Improvements – Many optimizations were made to address the performance of GroupBy queries on the Druid side. The main ones are:

      1. #4660 Parallel sort for ConcurrentGrouper
      2. #4576 Array-based aggregation for GroupBy query
      3. #4668 Add IntGrouper to avoid unnecessary boxing/unboxing in array-based aggregation
      4. #4704 Parallel merging of intermediate groupby results on the broker nodes.
    4. Better column pruning – In some cases when Hive could not push any operator down to Druid, Hive ended up pulling all the columns from Druid. This led to lots of unnecessary data transfer between Druid and Hive. HIVE-15619 improved the column pruner logic to fetch only the columns from Druid which are required to answer a query.

    5. Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with the latest version of Druid, 0.12.2, which has many new features, performance enhancements, and bug fixes over the previous version.
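
As a hypothetical illustration of item 1, in a Hive query like the following over a Druid-backed table, the expression work can now be evaluated inside Druid rather than pulled back into Hive (table and column names are made up):

    -- the UPPER() and CAST() expressions can be pushed into the Druid query
    SELECT UPPER(channel) AS channel_upper,
           SUM(CAST(delta AS BIGINT)) AS total_delta
    FROM druid_wikipedia
    GROUP BY UPPER(channel);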

Druid is still a specialty technology which doesn’t fit every use case, but if it does fit yours, you’ll get a lot of performance benefit out of it.
