Press "Enter" to skip to content

Curated SQL Posts

Power Query And M In Tabular

Chris Webb notes that Analysis Services Tabular will get Power Query and M support:

I’ve just argued why Microsoft was obliged to include this functionality in SSAS v.next but in fact there are many positive reasons for doing this too. The most obvious one is to do with support for more data sources. At the moment SSAS Tabular supports a pretty good range of data sources, but the world of BI is getting more and more diverse and in order to stay relevant SSAS needs to support far more than it does today. By using Power Query/M as its data access mechanism, SSAS v.next will immediately support a much larger number of data sources and this number is going to keep on growing: any investment that Microsoft or third parties make for Power BI in this area will also benefit SSAS. Also, because Power Query/M can query and fold to more than just relational databases, I suspect that in the future this will allow for DirectQuery connections to many of these non-relational data sources too.

Read the whole thing.

Getting To ZDT

Kendra Little gives thoughts on getting to zero-downtime releases:

I’ve worked in a bunch of environments where we rolled schema changes to production throughout the week, and worked hard to limit downtime. Our biggest challenges were:

  • The code released wouldn’t have the intended effect, and multiple hotfixes would sometimes need to be written quickly if the issue was urgent.

  • Problems would happen with replication in production – either replication would be delayed because a lot of modifications were occurring, or a change in an “upstream” system would cause a huge amount of data processing to happen in a “downstream” database.

I’m in a ZDT environment. There’s a good amount of overhead work that goes into getting there; there are times when a 20-minute maintenance window would have saved hours of development time, so it’s important to keep that trade-off in mind.
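A large part of that overhead is making every schema change backward compatible with the application code that is still running during the deploy. As a minimal sketch of the common “expand, migrate, contract” pattern (the table and column names here are made up for illustration, not taken from Kendra’s post):

-- Expand: add the new column as nullable so code that doesn't know about it keeps working.
ALTER TABLE dbo.Customers ADD EmailNormalized NVARCHAR(256) NULL;
GO

-- Migrate: backfill in small batches to limit blocking and to avoid the big bursts of
-- modifications that bog down replication.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (5000) c
    SET    c.EmailNormalized = LOWER(LTRIM(RTRIM(c.Email)))
    FROM   dbo.Customers AS c
    WHERE  c.EmailNormalized IS NULL
           AND c.Email IS NOT NULL;

    SET @rows = @@ROWCOUNT;
END;
GO

-- Contract: only in a later release, once every row is populated and all application code
-- reads the new column, tighten the constraint and drop anything obsolete.
ALTER TABLE dbo.Customers ALTER COLUMN EmailNormalized NVARCHAR(256) NOT NULL;

Each step ships on its own, so the application never sees a schema it cannot handle.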

Power Query Contains Filtering

Reza Rad explains two different methods of filtering which may look similar but can result in quite different outputs:

The script tells the whole story. Despite the fact that you typed in “Dan” and Power Query showed you all FirstNames that have “Dan” in them, the script still uses equality filters for every individual FirstName. For this data set there won’t be any issue, obviously, because all FirstNames with “Dan” are already selected. However, if new data rows come into this table in the future and they have records with FirstNames that are not one of these values, for example Dandy, they won’t be picked up! As a result the filter won’t work exactly as you expect. That’s why I say this is misleading.

Read on for notes on how to put the Contains operator to good use.
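The same distinction exists in T-SQL, which may make it easier to see why the generated filter is misleading. This is a rough analogue only (Reza’s post is about Power Query M, and dbo.People is a hypothetical table):

-- Equality-style filter: matches only the values that existed when the filter was built.
SELECT FirstName FROM dbo.People
WHERE  FirstName IN (N'Dan', N'Danny', N'Daniel');

-- Contains-style filter: also matches values that arrive later, such as N'Dandy'.
SELECT FirstName FROM dbo.People
WHERE  FirstName LIKE N'%Dan%';

The first query silently stops matching new names; the second keeps doing what the person who clicked “contains Dan” actually intended.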

Thinking About Memory Latency

Joe Chang throws down the gauntlet:

Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4’s.

This is some very interesting research.  There are a few gaps in it (meaning that there’s scope for people to expand upon this), but it is 100% worth the read.

Missing MDS Temp Directory

Koen Verbeeck ran into an error when reinstalling Master Data Services on his Windows 10 box:

The error this time: “The ‘tempDirectory’ attribute must be set to a valid absolute path”. If you can’t see the error details, it’s possible you have to enable them in the web.config file of MDS. Typically you can find this configuration file in the folder “C:\Program Files\Microsoft SQL Server\130\Master Data Services\WebApplication”. The customErrors attribute should be changed to the following:

Read on for the solution.

Parameterizing Always Encrypted Statements

Jakub Szymaszek shows off Parameterization for Always Encrypted in SSMS 17.0:

First thing to note is that SSMS has rewritten the query as a parameterized statement. The literal, used to initialize the @SSN variable in the original query, is being passed inside a parameter with an auto-generated name (@pdf9f37d6e63c46879555e4ba44741aa6). This allows the .NET Framework Data Provider for SQL Server to automatically detect that the parameter needs to be encrypted. The driver achieves that by calling sp_describe_parameter_encryption, which prompts SQL Server to analyze the query statement and determine which parameters should be encrypted and how. Then the driver transparently encrypts the parameter value before submitting the query to SQL Server for execution via sp_executesql. SQL Server can now successfully execute the query.

Read the whole thing.  Setting this up does obviate part of the benefit of using Always Encrypted: the ability to completely lock out a database administrator from certain pieces of data.
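To make the rewrite concrete, here is the general shape of what Jakub describes. Only the @SSN variable and the auto-generated parameter name come from the excerpt; the table, column, and sample value below are stand-ins:

-- What you type in SSMS with Parameterization for Always Encrypted enabled:
DECLARE @SSN CHAR(11) = '795-73-9838';
SELECT * FROM dbo.Patients WHERE SSN = @SSN;

-- Conceptually what the provider submits instead, after sp_describe_parameter_encryption
-- tells it which parameters to encrypt and it has encrypted the value client-side:
--
--   EXEC sp_executesql
--        N'DECLARE @SSN CHAR(11) = @pdf9f37d6e63c46879555e4ba44741aa6;
--          SELECT * FROM dbo.Patients WHERE SSN = @SSN;',
--        N'@pdf9f37d6e63c46879555e4ba44741aa6 CHAR(11)',
--        @pdf9f37d6e63c46879555e4ba44741aa6 = <ciphertext supplied by the driver>;

The literal you typed travels as an encrypted parameter value rather than as plaintext inside the batch, so SQL Server only ever sees ciphertext for that parameter.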

SQL Server vNext CTP 1.1

Denis Gobo notes that there’s a new CTP for SQL Server:

TRANSLATE
This acts like a bunch of REPLACE statements. Instead of REPLACE(REPLACE(REPLACE(REPLACE(SomeVal,'[','('),']',')'),'{','('),'}',')') you can do the following, which is much cleaner:
SELECT TRANSLATE('2*[3+4]/{7-2}', '[]{}', '()()');

Running that will return 2*(3+4)/(7-2)

The TRANSLATE function looks very interesting.  Click through for a few more goodies and get ready for the never-ending release cycle.
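One detail worth calling out (this example is mine, not from Denis’s post): TRANSLATE maps characters by position, so the second and third arguments have to be the same length.

-- Each character of the second argument is swapped for the character in the same
-- position of the third argument, one character at a time.
SELECT TRANSLATE('abc-def', 'ab-', 'xy_');   -- returns 'xyc_def'

-- This, by contrast, raises an error because the argument lengths differ:
-- SELECT TRANSLATE('abc', 'ab', 'xyz');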

Hadoop And Active Directory

RK Kuppala explains how to integrate a Hadoop cluster with Active Directory:

This post explains kerberizing an existing Hadoop cluster using Ambari. Kerberos helps with the Authentication part of enterprise security (with authorization, auditing, and data protection being the remaining parts).

HDP uses Kerberos, which is an industry standard for authenticating users and resources and providing strong identity for users. Apache Ambari can kerberize an existing cluster by using an existing MIT key distribution center (KDC) or Microsoft’s Active Directory.

This was a lot easier than I expected.

Understanding Naive Bayes

Ahmet Taspinar explains the Naive Bayes classification algorithm and writes Python code to implement it:

Within Machine Learning many tasks are – or can be reformulated as – classification tasks.

In classification tasks we are trying to produce a model which can give the correlation between the input data $X$ and the class $C$ each input belongs to. This model is formed with the feature-values of the input-data. For example, the dataset contains datapoints belonging to the classes Apples, Pears and Oranges, and based on the features of the datapoints (weight, color, size etc.) we are trying to predict the class.

Ahmet has his entire post saved as a Jupyter notebook.
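For reference, the model in the post boils down to Bayes’ rule plus the “naive” assumption that the features are conditionally independent given the class, so a new datapoint with feature values $x_1, \dots, x_n$ is assigned the class

$$\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

with the prior $P(C)$ and the per-feature likelihoods $P(x_i \mid C)$ estimated from the training data. (This summary is mine; see Ahmet’s notebook for the full walkthrough and the Python implementation.)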

Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.
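To get a sense of the interface before diving into Mark’s benchmarks: you point Athena at files in S3, define a table over them, and write ordinary SQL against it. The table and column names below are hypothetical (Mark’s actual schema may differ):

-- Aggregate a year of taxi trips straight off S3. Athena charges per terabyte scanned,
-- so columnar formats such as Parquet or ORC keep both the cost and the runtime down.
SELECT cab_type,
       count(*)           AS trips,
       avg(trip_distance) AS avg_distance,
       avg(total_amount)  AS avg_fare
FROM   trips_parquet
WHERE  year(pickup_datetime) = 2015
GROUP  BY cab_type
ORDER  BY trips DESC;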
