Kafka Consumer Groups

David Brinegar discusses consumer groups and lag in Apache Kafka:

While the Consumer Group uses the broker APIs, it is more of an application pattern or a set of behaviors embedded into your application.  The Kafka brokers are an important part of the puzzle but do not provide the Consumer Group behavior directly.  A Consumer Group based application may run on several nodes, and when they start up they coordinate with each other in order to split up the work.  This is slightly imperfect because the work, in this case, is a set of partitions defined by the Producer.  Each Consumer node can read a partition and one can split up the partitions to match the number of consumer nodes as needed.  If the number of Consumer Group nodes is more than the number of partitions, the excess nodes remain idle. This might be desirable to handle failover.  If there are more partitions than Consumer Group nodes, then some nodes will be reading more than one partition.

Read the whole thing.  It’s part one of a series.

Database-Level MAXDOP

Daniel Janik shows off setting Maximum Degree of Parallelism settings at the database level in SQL Server 2016:

The Maximum Degree of Parallelism (MAXDOP) can be defined in one of three ways:

  1. Instance Scoped via sp_configure
  2. Database Scoped via database properties
  3. Query Scoped via query option

Which of these trumps the other?

Click through for the answer.

Power BI Row-Level Security With External Users

Patrick LeBlanc shows how to implement row-level security within Power BI for people without direct access to an underlying Analysis Services cube:

Before I explain how to fix this, let’s take a look at what’s happening behind the scenes.

  1. When jdoe@adventureworks.com opens the dashboard a connection string is created including the effectiveusername property, which is expected behavior.

  2. The value specified for this property is jdoe@adventureworks.com.

  3. The connections string including the queries are sent via the On-Premises gateway to the SSAS server that hosts the data needed to view the report.

  4. Once the connection is established, using the username and password specified in the Data Source settings, all queries are executed usingjdoe@adventureworks.com.

Read on for the solution.

Hive Going In-Memory

Kevin Feasel

2016-10-07

Hadoop

Carter Shanklin and Nita Dembla discuss Hive memory-handling optimizations:

Let’s put this architecture to the test with a realistic dataset size and workload. Our previous performance blog, “Announcing Apache Hive 2.1: 25x Faster Queries and Much More”, discussed 4 reasons that LLAP delivers dramatically faster performance versus Hive on Tez. In that benchmark we saw 25+x performance boosts on ad-hoc queries with a dataset that fit entirely into the cluster’s memory.

In most cases, datasets will be far too large to fit in RAM so we need to understand if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets. To find out, we scaled the dataset up to 10 TB, 4x larger than aggregate cluster RAM, and we ran a number of far more complex queries.

Table 3 below shows how Hive LLAP is capable of running both At Speed and At Scale. The simplest query in the benchmark ran in 2.68 seconds on this 10 TB dataset while the most complex query, Query 64 performed a total of 37 joins and ran for more than 20 minutes.

Given how much faster memory is than disk, and given Spark’s broad adoption, this makes sense as a strategy for Hive’s continued value.

Continuous Delivery With SSAS

Jens Vestergaard shows how to implement continuous deliver with Analysis Services cubes:

None of the above mentioned scenarios appeals to Team Foundation Server(TFS) and in order to get into the no-sweat zone during release time, we need to build our deployments around TFS; The obvious choice when working with Microsoft.

Natively Visual Studio, or more precisely MSBuild, does not support dwprojfiles which are used for Analysis Services (SSAS) projects. So obviously this has to involve some kind of magic. But as it turns out, it’s not all that magic. However there is not much documentation on this particular scenario out there but I managed to find one good resource, which is this. It gave me just enough assistance to complete the task.

This is a long post, but well worth reading.

Aborting Index Rebuilds

Arun Sirpal shows how to use the ABORT_AFTER_WAIT attribute on an index rebuild command:

Looking into the locking you will see that ONLINE operation uses (Sch-M) on the corresponding table as part of the process (actually takes Shared Table Lock (S) at the beginning of the operation, and a Schema Modification Lock (Sch-M at the end)).

So to be granted a SCH-M lock you can’t have any conflicting locks, so what happens when / if you have a process that is updating the table and you want to use the ONLINE rebuild? Yes you will be blocked. With 2014 onwards we can control what happens if we get into this situation and for this post I am going to abort the other query causing me to wait.

Not sure I like the “Kick the other guy(s) off” part that much, but I can see uses.  It’s probably more likely to go the opposite route, cancelling the rebuild if the server’s too hot.

DBCC SHOW_STATISTICS Update

Erik Darling notes that his Connect item to replace DBCC SHOW_STATISTICS has been marked as resolved:

So what does it look like?

I have no idea. I don’t know if it’s a DMV or a function, I don’t know what it’s called, and I don’t know what information it exposes. I also don’t know how it will get joined to other DMVs. There were no details offered up when the status changed. And I’m fine with that! I’m pretty psyched that it got enough traction to get a fix to begin with. If anyone from MS feels like shooting me an email with details, I won’t complain.

But since we don’t know, we’re free to speculate. Like all those History Channel shows about aliens and fake animals and where the Templars secretly buried Jesus’ gold teeth in Arizona. It’ll be fun!

It’ll be interesting to see the results.

JSON Parsing In U-SQL

Kevin Feasel

2016-10-07

JSON, U-SQL

Ginger Grant pulls out everybody’s favorite .NET JSON parser:

In USQL there are built-in extractors for parsing text, comma delimited or tab delimined files. Once again, parsing JSON becomes problematic. There is a solution built into USQL, write some C# code to extend it or use someone else’s C# code to extend USQL. Since I wanted to parse JSON, fortunately there are libraries available on github containing the information required to do it. Download the github package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right clicked on the references and downloaded the missing parts again. Build the solution and check out the code in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These dlls then need to be registered in Data Lake Analytics and locally if you chose to run your USQL locally. As at some point the goal is to run from within Data Lake analytics, you will need to copy both of these dlls to the data lake. I created a folder for the dlls called Assemblies, and ran this command

It’s funny how often that library comes up…  Click through to see how to use it with U-SQL jobs.

Using Statistics For Index Design

Kendra Little argues that you should not use automatically created statistics as a guide for index creation:

We’ve talked a lot so far about how much statistics and indexes are related. This is why it seems like statistics might be useful for designing indexes!

But here’s the thing — SQL Server doesn’t track and report on how many times a statistic was used during optimization.

This is an interesting discussion.

Hurricane Matthew Tracking With Power BI

Chris Albrektson has a Power BI report tracking Hurricane Matthew:

We’ve got company and it’s not the type of company that you want! As most Floridians are preparing for Hurricane Matthew I thought it might be neat to track the storm using PowerBi. So I went out and found some public data online and brought that into PowerBi, created a couple calculations and some visualizations.

My goal for this was to create a report where I could track the storm no matter where I was. I also needed the ability to see the latest data without any manual intervention. PowerBi can handle all of this for me utilizing the PowerBi Mobile app and a few other cool features.

Good use of Power BI here.

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31