Press "Enter" to skip to content

Month: October 2016

T-SQL And R Performance Comparisons

Tomaz Kastrun does several performance comparisons between various R packages and T-SQL constructs:

A couple of packages I will mention for data manipulation are plyr, dplyr, and data.table; I will compare the execution time, simplicity, and ease of writing with general T-SQL code and the RevoScaleR package. For this blog post I will use the R package dplyr and T-SQL, with the possibility of RevoScaleR computation functions.

My initial query runs against the WideWorldImportersDW database. No other alterations have been made to the underlying tables (fact.sale or dimension.city).

Read on for code and conclusions.  I don’t think there are any shocking conclusions:  the upshot is to filter data as early as possible.
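
To make that concrete, here is a hypothetical query of my own (not Tomaz's) against WideWorldImportersDW, showing what filtering early in T-SQL looks like so that R, whether via dplyr or RevoScaleR, only ever receives the rows it needs:

    -- Hypothetical example: aggregate and filter in T-SQL first, so the R side
    -- (dplyr or RevoScaleR) only receives the rows it actually needs.
    SELECT
        c.[Sales Territory],
        c.City,
        SUM(s.Quantity) AS TotalQuantity,
        SUM(s.Profit)   AS TotalProfit
    FROM Fact.Sale AS s
        INNER JOIN Dimension.City AS c
            ON c.[City Key] = s.[City Key]
    WHERE c.[Sales Territory] = N'Southeast'   -- filter as early as possible
    GROUP BY
        c.[Sales Territory],
        c.City;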

Always Encrypted PowerShell Cmdlets

Sanjay Mishra alerts us to new PowerShell cmdlets for enabling Always Encrypted on columns:

The July 2016 release of SSMS (and later versions) introduced a set of PowerShell cmdlets through a new ‘SqlServer’ module. This page describes the various capabilities that these cmdlets bring to the table. Of most interest to the specific scenario described above is the Set-SqlColumnEncryption cmdlet. In the post below, we will walk through the steps required to use this – first from a PowerShell session to test the capability, and then finally from a C# application which is using PowerShell Automation to invoke the cmdlets from an application.

As a side note it is worth knowing that the cmdlets in the ‘SqlServer’ PowerShell module can also be used for automating key setup and management (and are, in many ways, more powerful than SSMS – they expose more granular tasks, and thus can be used to achieve role separation and to develop a custom key management workflow – but that is likely a topic for a separate post!)

Sanjay also includes a sample PowerShell script to show how it works.
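
For context, this is roughly what an Always Encrypted column definition looks like in T-SQL once encryption is in place (a generic sketch with made-up table and key names, not taken from Sanjay's post). The appeal of Set-SqlColumnEncryption is that it can turn an existing plaintext column into one like this, encrypting the data as it goes, without a hand-rolled migration:

    -- Hypothetical table with an Always Encrypted column; CEK_Auto1 is an
    -- illustrative column encryption key name.
    CREATE TABLE dbo.Patients
    (
        PatientID int IDENTITY(1,1) PRIMARY KEY,
        SSN       char(11) COLLATE Latin1_General_BIN2
                  ENCRYPTED WITH
                  (
                      COLUMN_ENCRYPTION_KEY = CEK_Auto1,
                      ENCRYPTION_TYPE = DETERMINISTIC,
                      ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
                  ) NOT NULL,
        LastName  nvarchar(50) NOT NULL
    );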

Starting Azure Stream Analytics Jobs From Code

Hylke Peek wants to kick off an Azure Stream Analytics job from a Universal Windows Platform application:

I had one of those feelings while working with Azure Stream Analytics (ASA). My solution worked, but there was one ‘elementary and simple’ thing I wanted: start the ASA jobs within my C# code. That shouldn’t be hard, and there’s some documentation. But no, I needed to combine several disparate solutions into a new one to make it possible.

In this post I briefly explain how you can start ASA jobs within your C# UWP application:

  • I explain which components you have in the authentication process and which parameters you need.

  • Example code is provided. You only need to enter your parameter values.

Click through for the code.

Sequentially Increasing Indexes

Joe Chang discusses benchmarking and looks at a particular scenario around maximizing insert performance:

The test environment here is a single socket Xeon E3 v3, quad-core, hyper-threading enabled. Turbo-boost is disabled for consistency. The software stack is Windows Server 2016 TP5, and SQL Server 2016 cu2 (build 2164). Some tests were conducted on a single socket Xeon E5 v4 with 10 cores, but most are on the E3 system. In the past, I used to maintain two-socket systems for investigating issues, but only up to the Core2 processor, which were not NUMA.

The test table has 8 fixed length not null columns, 4 bigint, 2 guids, 1 int, and a 3-byte date. This adds up to 70 bytes. With file and row pointer overhead, this works out to 100 rows per page at 100% fill-factor.

Both heap and clustered index organized tables were tested. The indexes tested were 1) single column key sequentially increasing and 2) two column key leading with a grouping value followed by a sequentially increasing value. The grouping value was chosen so that inserts go to many different pages.

The test was for a client to insert a single row per call. Note that the recommended practice is to consolidate multiple SQL statements into a single RPC, aka network roundtrip, and if appropriate, bracket multiple Insert, Update and Delete statements with a BEGIN and COMMIT TRAN. This test was contrived to determine the worst case insert scenario.

With that setup in mind, click through to learn his results.
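
For reference, a test table and index variants along the lines Joe describes might look something like this (a rough sketch based on his description, not his actual script):

    -- Roughly matching the description: 4 bigint, 2 uniqueidentifier, 1 int,
    -- and 1 date column, all fixed-length and NOT NULL.
    CREATE TABLE dbo.InsertTest
    (
        SeqID    bigint           NOT NULL,  -- sequentially increasing value
        GroupID  int              NOT NULL,  -- grouping value for the two-column key
        Val1     bigint           NOT NULL,
        Val2     bigint           NOT NULL,
        Val3     bigint           NOT NULL,
        Guid1    uniqueidentifier NOT NULL,
        Guid2    uniqueidentifier NOT NULL,
        SomeDate date             NOT NULL
    );

    -- Index variant 1: single-column, sequentially increasing clustered key.
    CREATE UNIQUE CLUSTERED INDEX CIX_InsertTest_Seq
        ON dbo.InsertTest (SeqID);

    -- Index variant 2: a grouping value followed by the sequential value,
    -- so that inserts land on many different pages (drop variant 1 first,
    -- since a table can have only one clustered index).
    DROP INDEX CIX_InsertTest_Seq ON dbo.InsertTest;
    CREATE UNIQUE CLUSTERED INDEX CIX_InsertTest_GroupSeq
        ON dbo.InsertTest (GroupID, SeqID);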

ISNULL And COALESCE Behavior Difference

Vladimir Oselsky notes an edge case where ISNULL and COALESCE can behave differently:

Even though we would expect to see both records returned, we only get one record. Huh? This is exactly what puzzled a coworker; of course, the query was not as simple as this one, but the same issue caused him to hit a roadblock.

In the case of the COALESCE and OR methods, the results are identical.

The underlying issue here is that the variable's data type differs from the column's data type, which exposes a difference in how COALESCE and ISNULL work.
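
A quick way to see the difference (my own illustration with hypothetical names, not Vladimir's repro): ISNULL returns the data type of its first argument, while COALESCE follows data type precedence across all of its arguments, so a too-short variable can silently truncate the other side of a comparison.

    DECLARE @Code char(2) = NULL;  -- deliberately shorter than the values compared below

    SELECT
        ISNULL(@Code, 'ABCDE')   AS IsNullResult,    -- 'AB'    (typed as char(2), silently truncated)
        COALESCE(@Code, 'ABCDE') AS CoalesceResult;  -- 'ABCDE' (typed by precedence across both arguments)

    -- In a WHERE clause against a hypothetical table, that truncation changes which rows match:
    --   WHERE SomeColumn = ISNULL(@Code, SomeColumn)   -- values longer than 2 characters no longer match
    --   WHERE SomeColumn = COALESCE(@Code, SomeColumn) -- all rows match, as expected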

Kafka Consumer Groups

David Brinegar discusses consumer groups and lag in Apache Kafka:

While the Consumer Group uses the broker APIs, it is more of an application pattern or a set of behaviors embedded into your application.  The Kafka brokers are an important part of the puzzle but do not provide the Consumer Group behavior directly.  A Consumer Group based application may run on several nodes, and when they start up they coordinate with each other in order to split up the work.  This is slightly imperfect because the work, in this case, is a set of partitions defined by the Producer.  Each Consumer node can read a partition and one can split up the partitions to match the number of consumer nodes as needed.  If the number of Consumer Group nodes is more than the number of partitions, the excess nodes remain idle. This might be desirable to handle failover.  If there are more partitions than Consumer Group nodes, then some nodes will be reading more than one partition.

Read the whole thing.  It’s part one of a series.

Power BI Row-Level Security With External Users

Patrick LeBlanc shows how to implement row-level security within Power BI for people without direct access to an underlying Analysis Services cube:

Before I explain how to fix this, let’s take a look at what’s happening behind the scenes.

  1. When jdoe@adventureworks.com opens the dashboard, a connection string is created including the EffectiveUserName property, which is expected behavior.

  2. The value specified for this property is jdoe@adventureworks.com.

  3. The connection string, including the queries, is sent via the On-Premises gateway to the SSAS server that hosts the data needed to view the report.

  4. Once the connection is established, using the username and password specified in the Data Source settings, all queries are executed using jdoe@adventureworks.com.

Read on for the solution.

Hive Going In-Memory

Carter Shanklin and Nita Dembla discuss Hive memory-handling optimizations:

Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed 4 reasons that LLAP delivers dramatically faster performance versus Hive on Tez. In that benchmark we saw 25+x performance boosts on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.

In most cases, datasets will be far too large to fit in RAM so we need to understand if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and we ran a number of far more complex queries.

Table 3 below shows how Hive LLAP is capable of running both At Speed and At Scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset, while the most complex query, Query 64, performed a total of 37 joins and ran for more than 20 minutes.

Given how much faster memory is than disk, and given Spark’s broad adoption, this makes sense as a strategy for Hive’s continued value.

Continuous Delivery With SSAS

Jens Vestergaard shows how to implement continuous delivery with Analysis Services cubes:

None of the above-mentioned scenarios appeals to Team Foundation Server (TFS), and in order to get into the no-sweat zone during release time, we need to build our deployments around TFS, the obvious choice when working with Microsoft.

Natively, Visual Studio (or more precisely, MSBuild) does not support dwproj files, which are used for Analysis Services (SSAS) projects. So obviously this has to involve some kind of magic. But as it turns out, it’s not all that magic. However, there is not much documentation on this particular scenario out there, but I managed to find one good resource, which is this. It gave me just enough assistance to complete the task.

This is a long post, but well worth reading.
