Curated SQL Posts

Bash For The PowerShell-Minded

Mark Wilkinson has started a new series on Bash.  His first post is an introduction to the scripting language:

Bash (the Bourne Again Shell) was created in 1989 for the GNU Project as a free replacement for the Unix Bourne shell. Most modern Linux systems use Bash as their default command line shell, so if you have ever dropped to a command line on a Linux system, you have probably used Bash. Just like PowerShell, Bash is both a scripting language and a command shell/interpreter. So not only can you execute commands in an interactive shell session, but you can also write scripts that incorporate multiple commands.

Once you get your hands dirty with Bash you’ll notice a lot of features that were incorporated into PowerShell. Command substitution, for example: PowerShell’s $(Get-Date) was pulled directly from Bash’s $(date). Other features will look familiar as well, like the ability to pipe multiple commands together.

One thing you need to understand right away is that Bash is string based, not object based like PowerShell. This means you’ll find yourself doing a lot more string processing to get tasks done. Things like string splitting will be much more common. Bash does support objects, like arrays, but few if any commands output an array. As we go through this series you’ll see that this might not be as limiting as it sounds.

The best part about learning Bash is that you can then get into arguments about Bash vs ksh vs zsh.
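
If you want to play with the string-versus-object distinction without leaving Python, here is a minimal sketch using the standard library's subprocess module (assuming a Unix-like system where the date and df commands exist): capturing a command's output is the moral equivalent of $(date), and getting individual values back out means splitting strings.

import subprocess

# Equivalent of Bash's today=$(date): capture the command's stdout as one string.
today = subprocess.run(["date"], capture_output=True, text=True).stdout.strip()
print(today)

# Because the output is plain text rather than objects, pulling out individual
# fields means string processing, such as splitting each line on whitespace.
df_output = subprocess.run(["df", "-h"], capture_output=True, text=True).stdout
for line in df_output.splitlines()[1:]:  # skip the header row
    filesystem, size, *rest = line.split()
    print(filesystem, size)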

Handling Permissions Changes With PowerShell

Drew Furgiuele has a process to store and then re-run rights grants on SQL Server databases:

Permission requirements for these environments can change over time, just like the code and data going into your databases. It’s hard to track permissions because a database permission is much more than just a user principal; database objects often contain permission definitions for GRANT and DENY states, and users may belong in certain database roles in one environment, but not another. This isn’t a big deal… until it is: sooner or later your data and code drift will be different than production, or maybe some new change really breaks an environment. Then, you’ll be asked to restore these environments to either an earlier version, or, more likely, you’ll be asked to “refresh” these editions to what is currently in production.

You probably already have a process for this, but how are you handling maintaining differences in permissions between environments? Wouldn’t it be nice if you had a way to quickly evaluate, store, and then re-apply permissions as part of refresh? Even better, wouldn’t it be cool if you could do this for all your databases on a given instance? Or what about all your instances in a given environment?

You can, and you can do it pretty easily with PowerShell.

My one problem with Drew’s otherwise-excellent post is that he approved far too many entry visas in the opening GIF.  100% deny, 0 problems.

Virtualize Data Or Move It?

James Serra contrasts data virtualization with traditional ETL moving data to a warehouse:

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

Read on for James’s thoughts.

Performance Problems With ADO.NET's AddWithValue

Dan Guzman inveighs against using AddWithValue in ADO.NET:

The nastiness with AddWithValue is that ADO.NET infers the parameter definition from the supplied object value. Parameters in SQL Server are inherently strongly-typed, including the SQL Server data type, length, precision, and scale. Types in .NET don’t always map precisely to SQL Server types, and are sometimes ambiguous, so AddWithValue has to make guesses about the intended parameter type.

The guesses AddWithValue makes can have huge implications when wrong because SQL Server uses well-defined data type precedence rules when expressions involve unlike data types; the value with the lower precedence is implicitly converted to the higher type. The implicit conversion itself isn’t particularly costly but is a major performance concern when it is the column value rather than the parameter value that must be converted, especially in a WHERE or JOIN clause predicate. The implicit column value conversion can prevent indexes on the column from being used with an index seek (i.e. a non-sargable expression), resulting in a full scan of every row in the table or index.

Read the whole thing.
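
The same kind of type inference bites in other client libraries too. As a rough parallel in Python's pyodbc (not Dan's ADO.NET examples; the DSN, the dbo.Orders table, and its indexed varchar(20) OrderCode column here are hypothetical), a Python string parameter is sent as nvarchar by default, so it is the column side of the predicate that gets converted:

import pyodbc

# Hypothetical DSN; adjust for your environment.
conn = pyodbc.connect("DSN=SqlServerDsn")
cur = conn.cursor()

# Python str parameters default to nvarchar. nvarchar outranks varchar in SQL Server's
# data type precedence, so the OrderCode column is implicitly converted, which can turn
# an index seek into a scan, the same effect as a bad AddWithValue guess.
cur.execute("SELECT OrderID FROM dbo.Orders WHERE OrderCode = ?", "ABC123")

# Declaring the parameter's SQL type and length up front keeps the predicate sargable.
cur.setinputsizes([(pyodbc.SQL_VARCHAR, 20, 0)])
cur.execute("SELECT OrderID FROM dbo.Orders WHERE OrderCode = ?", "ABC123")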

Dealing With Dates In R

Mathew McLean shows how to convert strings to dates using a couple well-known packages and introduces flipTime:

The package flipTime provides utilities for working with time series and date-time data. The package can be installed from GitHub using

require(devtools)
install_github("Displayr/flipTime")

I will discuss only two functions from the package in this post, AsDate() and AsDateTime(). These are used for the conversion of date and date-time strings, respectively. These functions build on the convenience and speed of the lubridate package. Furthermore, the flipTime functions provide additional functionality, making them easier to use. The functions are smart about identifying the proper format to use, so the user doesn’t need to specify the format(s) as inputs. At the same time, both AsDate() and AsDateTime() are careful not to convert any strings to dates when they are not formatted as dates. Additionally, they will warn the user when the date format is ambiguous.

Check it out.
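
For Python users, the closest analogue I know of to this "don't make me spell out the format" behavior is dateutil's parser, which likewise infers the format from the string itself (this is only an analogy, not flipTime):

from dateutil import parser

# dateutil works out the format from each string, much as AsDate() and AsDateTime()
# are described as doing, so no format string is required.
print(parser.parse("15 Jan 2018"))        # 2018-01-15 00:00:00
print(parser.parse("2018-01-15 09:30"))   # 2018-01-15 09:30:00

# Ambiguous day/month strings are resolved with a flag rather than a warning.
print(parser.parse("01/02/2018", dayfirst=True))  # 2018-02-01 00:00:00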

ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, to determine the order of the model to be fitted to the data, we need three variables: p, d, and q which are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k is the partial autocorrelation function, which describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregressive (AR) coefficients (the value of p) in an ARIMA model.

ARIMA feels like it should be too simple to work, but it does.
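
The walkthrough is in R, but the workflow translates almost one for one. Here is a minimal sketch in Python with statsmodels, assuming a CSV of sales data and a placeholder (1, 1, 1) order:

import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Assumed input: a CSV with a date column and a single value column.
sales = pd.read_csv("sales.csv", index_col=0, parse_dates=True).squeeze("columns")

# ACF and PACF plots guide the choice of q and p, just like acf() and pacf() in R.
plot_acf(sales)
plot_pacf(sales)

# Fit an ARIMA(p, d, q) model; the order here is only a placeholder.
model = ARIMA(sales, order=(1, 1, 1)).fit()
print(model.summary())
print(model.forecast(steps=12))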

Logs Are For Parsing

Tim Wilde shares an oft-forgotten truth:

How often have you found yourself contemplating some harebrained regex scheme in order to extract an inkling of value from a string and wishing the data had just arrived in a well-structured package without all the textual fluff?

So why do we insist on writing prose in our logs? Take “Exception while processing order 1234 for customer abc123” for example. There are at least four important pieces of information drowning in that one sentence alone:

  1. An exception was raised!
  2. During order processing
  3. Order number 1234
  4. Customer abc123

Being an exception log message, it’s more than likely followed by a stack trace, too. And stack traces certainly don’t conform to carefully crafted log layout patterns.

Logging is something we tend to forget about and slap in at the last minute.  We also think about it from the viewpoint of a developer looking at a single error message.  Those are both mistakes that lead to a huge amount of extra work later.
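
To make the structured alternative concrete, here is a small sketch in Python using only the standard library: the same failure goes out as one JSON record, with the order and customer as real fields and the stack trace riding along as another field instead of being glued into a sentence.

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as a single JSON object instead of a prose sentence.
    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    raise ValueError("card declined")  # stand-in for the real failure
except ValueError:
    logger.error("order_processing_failed", exc_info=True,
                 extra={"fields": {"order_id": 1234, "customer_id": "abc123"}})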

KSQL 0.4 Released

Apurva Mehta announces the release of KSQL 0.4:

The SHOW TOPICS command has been enhanced to include the number of active consumers and also the number of active consumer groups which are reading the topics.

Consumer groups are a feature of Apache Kafka which enable multiple consumer processes to divide the work of consuming a Kafka topic. You can learn more about them in the Kafka Consumer JavaDocs, and of course you should read the SHOW TOPICS documentation for more information.

Read on for the full set of changes.

Understanding Kafka Consumers And Offsets

Simarpreet Kaur Monga builds a simple Kafka consumer in Scala to demonstrate how offsets work:

The method endOffsets accepts a collection of TopicPartitions for which you want to find the end offsets.

As I want to find the endOffsets of the partitions that are assigned to my topic, I have passed the value of consumer.assignment() in the parameter of endOffsets. consumer.assignment gives the set of TopicPartitions that the Consumer has been assigned.

Note: You should call the method assignment only after calling poll on the consumer; otherwise, it will give null as the result. Additionally, the method endOffsets doesn’t change the position of the consumer, unlike seek methods, which do change the consumer position/offset.

Read the whole thing.
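
The post's examples are in Scala, but the same calls exist in Python's kafka-python client. A minimal sketch, assuming a broker on localhost and a topic named my-topic:

from kafka import KafkaConsumer

consumer = KafkaConsumer("my-topic",
                         bootstrap_servers="localhost:9092",
                         group_id="offset-demo",
                         enable_auto_commit=False)

# Poll first: until the group coordinator assigns partitions, assignment() is empty.
consumer.poll(timeout_ms=1000)

partitions = consumer.assignment()               # set of TopicPartition objects
end_offsets = consumer.end_offsets(partitions)   # does not move the consumer

for tp in partitions:
    print(tp.partition, "position:", consumer.position(tp),
          "end offset:", end_offsets[tp])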

Displaying Power BI Filter Values

Reid Havens shows how to display Power BI report filter values on the report itself:

The client and I brainstormed, and we decided to create card visuals to identify filter selections. The beautiful thing about DAX, as mentioned in many articles on our site, is that it can easily return text values. I actually did something similar to this with Dynamic Titles when I posted about Power BI’s new Drill Through feature last year; that article can be found here. So, I essentially wanted to do something similar here, but to call out the filter selection for each slicer. The end result looked something like this:

The end result was a new section of the report, dedicated to calling out slicer selections. The BIGGEST reason the client wanted this was for screenshots. They often took screenshots of this report and pasted them into emails or slides to use in presentations. The result works well, and uses a bit of clever DAX to always return the right selections, no matter the combination of selections among the slicers.

Read on to see a couple odd scenarios that Reid ran into and how to fix them.
