
Month: October 2016

Association Rules

Tomaz Kastrun discusses product variants:

To sum up, association rules is a great and powerful algorithm for finding the correlations between items and the fact that you can use this straight from SSMS, it just gives me goosebumps. Currently just the performance is a bit of a drawback. Also comparing this algorithm to Analysis services (SSAS) association rules, there are many advantages on R side, because of maneuverability and extracting the data to T-SQL, but keep in mind, SSAS is still very awesome and powerful tool for statistical analysis and data predictions.

Figuring out variations after the fact is an all-too-common task, and this is a good way of getting some ideas on how to do that.
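For a sense of what an association rule measures, here is a minimal Python sketch of support and confidence for single-item rules. It is not Tomaz's R code; the baskets, thresholds, and function names are all made up for illustration.

```python
from itertools import combinations

# Hypothetical market baskets; not data from the linked post.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            # confidence = P(rhs | lhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

rules = pair_rules(baskets)
```

Real implementations such as Apriori prune the search space for larger itemsets; this brute-force version only looks at pairs.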


Kafka Enrichment

I have an article on enriching data stored in a Kafka topic:

We’re doing a bunch of setup work here, so let’s take it from the top.  First, I declare a consumer group, which I’m calling “Airplane Enricher.”  Kafka uses the concept of consumer groups to allow consumers to work in parallel.  Imagine that we have ten separate servers available to process messages from the Flights topic.  Each flight message is independent, so it doesn’t matter which consumer gets it.  What does matter, though, is that multiple consumers don’t get the same message, as that’s a waste of resources and could lead to duplicate data processing, which would be bad.

The way Kafka works around this is to use consumer groups:  within a consumer group, only one consumer will get a particular message.  That way, I can have my ten servers processing messages “for real” and maybe have another consumer in a different consumer group just reading through the messages getting a count of how many records are in the topic.  Once you treat topics as logs rather than queues, consumer design changes significantly.

This is a fairly lengthy read, but directly business-applicable, so I think it’s well worth it.
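The consumer group behavior described above can be simulated in a few lines of plain Python. This is a toy model, not Kafka client code: the round-robin assignment, the server names, and the "Counter" group are assumptions for illustration, and real Kafka hands out partitions through a group coordinator with configurable assignment strategies.

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Round-robin each partition to exactly one consumer in a group."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

def deliver(messages, groups, partition_count):
    """Return {group: {consumer: [messages]}} for a simplified topic."""
    partitions = list(range(partition_count))
    received = {}
    for group, consumers in groups.items():
        # Within a group, every partition has a single owner.
        owner = {
            p: c
            for c, owned in assign_partitions(partitions, consumers).items()
            for p in owned
        }
        inbox = defaultdict(list)
        for offset, message in enumerate(messages):
            # Simplification: spread messages across partitions by position.
            inbox[owner[offset % partition_count]].append(message)
        received[group] = dict(inbox)
    return received

messages = ["flight-%d" % i for i in range(6)]
groups = {
    "Airplane Enricher": ["server-1", "server-2"],
    "Counter": ["counter-1"],
}
received = deliver(messages, groups, partition_count=4)
```

Each group sees every message once, but inside "Airplane Enricher" the work is split between the two servers with no overlap.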


Handling Large Data Modifications

Jeff Mlakar shows how to insert, update, and delete large numbers of records with T-SQL:

Using T-SQL to insert, update, or delete large amounts of data from a table will result in some unexpected difficulties if you’ve never taken it to task.

Let’s say you have a table in which you want to delete millions of records. If the goal was to remove all then we could simply use TRUNCATE. However, if we want to remove records which meet some certain criteria then executing something like this will cause more trouble than it is worth.

I do like the delete process.  The update process is going to run tableRows/batchSize full scans, so I’m not as fond of that one.  Do read the whole thing.
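To show the general shape of a batched delete, here is a rough sketch using Python's built-in sqlite3 module rather than T-SQL. It is not Jeff's script: the events table, the is_stale column, and the batch size are hypothetical. In SQL Server the usual equivalent is a loop around DELETE TOP (n).

```python
import sqlite3

def delete_in_batches(conn, batch_size):
    """Delete rows where is_stale = 1 in chunks; return the number of batches."""
    batches = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN ("
            "SELECT id FROM events WHERE is_stale = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing left that matches the criteria
        batches += 1
    return batches

# Hypothetical table: 1000 rows, every other row marked stale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, is_stale INTEGER)")
conn.executemany(
    "INSERT INTO events (is_stale) VALUES (?)",
    [(i % 2,) for i in range(1000)],
)
conn.commit()

batches = delete_in_batches(conn, 100)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The loop stops when a pass deletes nothing, so it does not need to know the row count up front.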


T-SQL Tuesday #83 Roundup

Andy Mallon handles T-SQL Tuesday duties this month:

For this month’s T-SQL Tuesday, I asked people to blog about the same old problems we’ve been dealing with for years. There were some great posts, including a batch of first-time contributors. You’ll notice some overlapping themes as you read through these responses–I think those themes represent some of the biggest, most important problems we have (like being able to restore backups). Thanks for everyone who contributed this month!

This was an interesting one with quite a few contributors.


Control Flow Package Parts

Todd McDermid explains a feature new to Integration Services 2016:

The basic idea behind package parts makes complete sense to a coder – they’re macros.  You take code you’ve used in several places, put it in a separate file that you then include and “expand” in multiple other files.

If you have multiple packages with parts of the Control Flow that are identical – setting up a database in a certain way, sending emails, calling a set of stored procedures, … – then Control Flow Package Parts can help.

The assistance isn’t just limited to the initial coding, either.  Yes – creating a new package with the “duplicate” code is much easier.  But the real gain of Control Flow Package Parts is when your “standard” code needs changes.  Instead of having to edit multiple packages to address the modifications – you only have to alter the package part.  Deploying the project(s) that depend on this part automatically incorporates those improvements.

I’d be a lot more interested in this if Biml weren’t already a better option.  Read on for Todd’s rundown.


Processing 2016 Tabular From SSIS 2014

Meagan Longoria shows how to process a Tabular Model with a compatibility level of 1200 in SQL Server Integration Services 2014:

Attempting to use the AS Processing Task results in the following error: “[Analysis Services Execute DDL Task] Error: This command cannot be executed on database ‘MySSASDB’ because it has been defined with StorageEngineUsed set to TabularMetadata. For databases in this mode, you must use Tabular APIs to administer the database”

The reason for keeping SSAS processing in an SSIS package was because it kept consistent logging throughout their data refresh process. So we set out to find another solution.

Read on for the explanation and the solution.


.NET Producer For Kafka

I build a simple .NET console app to push messages to a Kafka topic:

That’s the core of our code.  The main function instantiates a new Kafka producer and gloms onto the Flights topic.  From there, we call the loadEntries function.  The loadEntries function takes a topic and filename.  It streams entries from the 2008.csv file and uses the ParallelSeq library to operate in parallel on data streaming in (one of the nice advantages of using functional code:  writing thread-safe code is easy!).  We filter out any records whose length is zero—there might be newlines somewhere in the file, and those aren’t helpful.  We also want to throw away the header row (if it exists) and I know that that starts with “Year” whereas all other records simply include the numeric year value.  Finally, once we throw away garbage rows, we want to call the publish function for each entry in the list.  The publish function encodes our text as a UTF-8 bytestream and pushes the results onto our Kafka topic.

All this plus a bonus F# pitch.
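The filter-and-publish flow from the excerpt looks roughly like this when sketched in Python instead of F#. It runs sequentially rather than in parallel, the publish argument is a stand-in for a real Kafka producer call, and the sample rows are invented.

```python
def clean_rows(lines):
    """Yield data rows, skipping blank lines and the header row starting with 'Year'."""
    for line in lines:
        row = line.strip()
        if len(row) == 0 or row.startswith("Year"):
            continue
        yield row

def load_entries(lines, publish):
    """Encode each cleaned row as UTF-8 bytes and hand it to publish; return the count."""
    count = 0
    for row in clean_rows(lines):
        publish(row.encode("utf-8"))
        count += 1
    return count

# Hypothetical sample standing in for 2008.csv; sent.append stands in for the producer.
sent = []
sample = ["Year,Month,DayofMonth\n", "2008,1,3\n", "\n", "2008,1,4\n"]
published = load_entries(sample, sent.append)
```

Passing the publish function in as an argument keeps the cleaning logic testable without a broker.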


DAX Variables

Chris Webb shows how to define variables in DAX:

Variables are the best thing to happen to DAX since, well forever – they are so cool I’m almost ready to like DAX as much as I like MDX. There are already several good articles and blog posts out there describing how to use them (see here and here), but I was looking at a Profiler trace the other day and saw something I hadn’t yet realised about them: you can declare and use variables in the DEFINE clause of a DAX query. Since my series of posts on DAX queries still gets a fair amount of traffic, I thought it would be worth writing a brief post showing how this works.

There are some limitations, but Chris shows a way of getting around one of them.


Hive And Impala

Carter Shanklin and Nita Dembla run a performance comparison of Hive LLAP versus Impala:

Before we get to the numbers, an overview of the test environment, query set and data is in order. The Impala and Hive numbers were produced on the same 10 node d2.8xlarge EC2 VMs. To prepare the Impala environment the nodes were re-imaged and re-installed with Cloudera’s CDH version 5.8 using Cloudera Manager. The defaults from Cloudera Manager were used to setup / configure Impala 2.6.0. It is worth pointing out that Impala’s Runtime Filtering feature was enabled for all queries in this test.

Data: While Hive works best with ORCFile, Impala works best with Parquet, so Impala testing was done with all data in Parquet format, compressed with Snappy compression. Data was partitioned the same way for both systems, along the date_sk columns. This was done to benefit from Impala’s Runtime Filtering and from Hive’s Dynamic Partition Pruning.

I’m impressed with both of these projects.


String Trimming

Richie Lee has a PowerShell cmdlet to trim a string:

When building up urls from different parameters in something like TeamCity, or Octopus, it’s simple enough to get double “//” in urls if the parameters are not consistent. So little helper functions are always useful to have imported to manage such things. Below is an example of such a thing!

Click through for the function.
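The double-slash problem itself is easy to illustrate. This is a small Python sketch of one way to handle it, not Richie's PowerShell function; join_url and the example URL are hypothetical.

```python
def join_url(base, *parts):
    """Join URL pieces with exactly one '/' between them."""
    # Only trim the right side of the base so the '//' after the scheme survives.
    pieces = [base.rstrip("/")]
    # Trim both sides of every other piece and drop pieces that are empty.
    pieces += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(pieces)

url = join_url("https://example.com/", "/api/", "v1")
```

Trimming each piece before joining means callers do not have to agree on whether parameters carry leading or trailing slashes.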
