Press "Enter" to skip to content

Day: December 19, 2016

Partition Handling In Spark 2.1

Eric Liang, et al, discuss a change to Spark 2.1 which will make certain partitioned table access faster:

In Spark 2.1, we drastically improve the initial latency of queries that touch a small fraction of table partitions. In some cases, queries that took tens of minutes on a fresh Spark cluster now execute in seconds. Our improvements cut down on table memory overheads, and make the SQL experience starting cold comparable to that on a “hot” cluster with table metadata fully cached in memory.

This looks like a nice improvement in Spark.

Comments closed

Power BI On-Prem In 2017

Paul Turley points out a blog post from the Reporting Services team:

When will we have this next Technical Preview?

We’re targeting January 2017 to release this next Technical Preview.

What’s the release vehicle for a production-ready version?

We plan to release the production-ready version in the next SQL Server release wave. We won’t be releasing it in a Service Pack, Cumulative Update, or other form of update for SSRS 2016.

When will we have a production-ready version?

We’re targeting availability in mid-2017.

That makes it sound like they’re pushing it to coincide with the vNext release.

Comments closed

Multidplyr

Matt Dancho shows how to use multidplyr to perform parallel processing on data cleansing activities:

There’s nothing more frustrating than waiting for long-running R scripts to iteratively run. I’ve recently come across a new-ish package for parallel processing that plays nicely with the tidyverse: multidplyr. The package has saved me countless hours when applied to long-running, iterative scripts. In this post, I’ll discuss the workflow to parallelize your code, and I’ll go through a real world example of collecting stock prices where it improves speed by over 5X for a process that normally takes 2 minutes or so. Once you grasp the workflow, the parallelization can be applied to almost any iterative scripts regardless of application.

This is a longer article, but if you’re using dplyr with R today, it’s worth a read.

Comments closed

Power Query And M In Tabular

Chris Webb notes that Analysis Services Tabular will get Power Query and M support:

I’ve just argued why Microsoft was obliged to include this functionality in SSAS v.next but in fact there are many positive reasons for doing this too. The most obvious one is to do with support for more data sources. At the moment SSAS Tabular supports a pretty good range of data sources, but the world of BI is getting more and more diverse and in order to stay relevant SSAS needs to support far more than it does today. By using Power Query/M as its data access mechanism, SSAS v.next will immediately support a much larger number of data sources and this number is going to keep on growing: any investment that Microsoft or third parties make for Power BI in this area will also benefit SSAS. Also, because Power Query/M can query and fold to more than just relational databases, I suspect that in the future this will allow for DirectQuery connections to many of these non-relational data sources too.

Read the whole thing.

Comments closed

Getting To ZDT

Kendra Little gives thoughts on getting to zero-downtime releases:

I’ve worked in a bunch of environments where we rolled schema changes to production throughout the week, and worked hard to limit downtime. Our biggest challenges were:

  • The code released wouldn’t have the intended effect, and multiple hotfixes would sometimes need to be written quickly if the issue was urgent.

  • Problems would happen with replication in production – either replication would be delayed because a lot of modifications were occurring, or a change in an “upstream” system would cause a huge amount of data processing to happen in a “downstream” database

I’m in a ZDT environment.  There’s a pretty good amount of overhead work that goes into getting there; there are times in which a 20 minute window would have saved hours of development time, so it’s important to keep that trade-off in mind.

Comments closed

PowerQuery Contains Filtering

Reza Rad explains two different methods of filtering which may look similar but can result in quite different outputs:

The script tells the whole story. Despite the fact that you typed in “Dan” and Power Query showed you all FirstNames that has “Dan” in it. the script still use equity filters for every individual FirstName. For this data set there won’t be any issue obviously, because all FirstNames with “Dan” is already selected. However if new data rows coming in to this table in the future, and they have records with FirstNames that are not one of these values, for example Dandy, it won’t be picked! As a result the filter won’t work exactly as you expect. That’s why I say this is misleading.

Read on for notes on how to put the Contains operator to good use.

Comments closed

Thinking About Memory Latency

Joe Chang throws down the gauntlet:

Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4’s.

This is some very interesting research.  Joe has a some gaps in his research (meaning that there’s scope for people to expand upon this), but this is 100% worth the read.

Comments closed

Missing MDS Temp Directory

Koen Verbeeck ran into an error when reinstalling Master Data Services on his Windows 10 box:

The error this time: “The ‘tempDirectory’ attribute must be set to a valid absolute path”. If you can’t see the error, it’s possible you have to enable them in the web.config file of MDS. Typically you can find this configuration file in the folder “C:\Program Files\Microsoft SQL Server\130\Master Data Services\WebApplication”. The customErrors attribute should be changed to the following:

Read on for the solution.

Comments closed

Parameterizing Always Encrypted Statements

Jakub Szymaszek shows off Parameterizing for Always Encrypted in SSMS 17.0:

First thing to note is that SSMS has rewritten the query as a parameterized statement. The literal, used to initialize the @SSN variable in the original query, is being passed inside a parameter, with an auto-generated name (@pdf9f37d6e63c46879555e4ba44741aa6). This allows the .NET Framework Data Provider for SQL Server to automatically detect that the parameter needs to be encrypted. The driver achieves that by calling sp_describe_parameter_encryption that prompts SQL Server to analyze the query statement and determine which parameters should be encrypted and how. Then, the driver, transparently encrypts the parameter value, before submitting the query to SQL Server for execution via sp_executesql. SQL Server can now successfully execute the query.

Read the whole thing.  Setting this up does obviate part of a benefit to using Always Encrypted:  the ability completely to lock out a database administrator from certain pieces of data.

Comments closed

SQL Server vNext CTP 1.1

Denis Gobo notes that there’s a new CTP for SQL Server:

TRANSLATE
This acts like a bunch of replace statements, instead of REPLACE(REPLACE(REPLACE(REPLACE(SomeVal,'[‘,'(‘),’]’,,’)’),'{‘,'(‘),’}’,,’)’) you can do the following which is much cleaner
SELECT TRANSLATE('2*[3+4]/{7-2}', '[]{}', '()()');

Running that will return 2*(3+4)/(7-2)

The translate function looks very interesting.  Click through for a few more goodies and get ready for the never-ending release cycle.

Comments closed