It starts off easy – you fill in your name, your street, your apartment number, and then you reach the field labelled “Post Town”. What is a “Post Town”? Huh. Next you see “County”. Well, you know what a county is, but since when do you list it in your address? Then there’s “Postal code”. You might recognize it as what the US calls a “zip code”, but it’s still confusing to see.
So now you don’t really know what to do, right? Where do you put your city, or your state? Do you just cram your address into the form however you can and hope that you get your order? Or do you abandon your cart and decide not to buy anything from this site?
This is more application-level code than data (and so glides over the underlying database work), but Danielle does link to some good resources for figuring out what non-US addresses look like.
A function does some computation on an object. A function call consists of:
- A function’s name
- 0 or more inputs
Each input is provided to an argument (also called a parameter) within the function.
These arguments have names, although you don’t often need to provide the names.
You can find out what arguments a function takes by using code completion and its help snippet, or by searching for the function in the RStudio Help tab.
When you’re inside the brackets of a function, you can get the list of available arguments and auto-complete them.
Ellen also includes some useful R libraries for working with and visualizing data.
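To make the positional-versus-named-argument idea concrete, here is a minimal sketch of the same convention in Python rather than R (the function and argument names are invented for illustration; R calls work the same way, with `name = value` inside the parentheses):

```python
# Hypothetical function with two named parameters, one of which has a default.
def trim_mean(values, trim=0.1):
    """Mean of `values` after dropping a fraction `trim` from each end."""
    n_drop = int(len(values) * trim)
    kept = sorted(values)[n_drop:len(values) - n_drop]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 100]

print(trim_mean(data, 0.2))              # inputs passed by position
print(trim_mean(values=data, trim=0.2))  # the same call with named arguments
print(trim_mean(data))                   # `trim` falls back to its default of 0.1
help(trim_mean)                          # one way to see what arguments it takes
```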
A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.
So, how many ways are there to implement a Kafka producer? Well, there are a lot! In this article, though, we’ll walk you through two of them:
- Kafka Command Line Tools
- Kafka Producer Java API
You can write producers in quite a few languages. Java is the example here, but there are several libraries, including a good one for .NET.
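The two approaches in the article are the console producer and the Java API; just to show how small a producer can be in another language, here is a rough sketch using the kafka-python client (the broker address and topic name are placeholders, not anything from the article):

```python
# Minimal Kafka producer sketch using the kafka-python client.
# The broker address and topic name below are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: v.encode("utf-8"),  # send plain strings as UTF-8 bytes
)

# Publish a handful of messages to one topic.
for i in range(3):
    producer.send("demo-topic", value=f"message {i}")

producer.flush()  # block until buffered records have actually been sent
producer.close()
```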
SANs have become a bit like the printer industry: you don’t pay a lot for the enclosure, the device itself (i.e., the SAN box and software), but you pay through the nose for ‘refills’, i.e., the drives that your SAN vendor gods deem worthy of their enclosure.
It’s frighteningly accurate. Ask your storage admin what it costs to add a single drive (or pair of drives, if you’re using something with built-in redundancy) to your SAN. Then compare that cost with the same exact drive off the retail market. It’s highway robbery. And we’re letting them get away with it because we can’t evolve fast enough to take advantage of storage virtualization tech (S2D, SOFS, RDMA) that effectively makes servers with locally attached SSDs a superior architecture. (As long as they’re not using a horribly outdated interface like SAS!)
Nate also includes several more interesting lessons. SQL Saturdays are great for picking up useful knowledge.
I’ve highlighted the interesting bit. “Actual Number of Locally Aggregated Rows” is part of aggregate pushdown, explained by the amazing Niko Neugebauer here and here. Basically, the aggregation occurs as part of the data access. So while there is a Hash Match operator for the aggregation, the real work of aggregating was performed within the columnstore. That’s why the Actual Number of Rows coming out of the columnstore index itself is 0, while the number of rows coming out of the Hash Match Aggregate is 441.
So… why not another aggregate pushdown when we used the hint? Because the hint says we MUST use a hash join. At that point, the optimizer has no choice about where, when, and how it processes the data. It must first ensure that a hash join is used, so it does. First thing out of the gate, a hash join; then a hash aggregate. This difference in behavior results in a 24% decrease in performance. The only interesting thing is that the reads remained consistent, which means it was just the processing of the join that added overhead.
Read the whole thing.
Today I want to share a scenario where a running total calculation in the query editor saved a model that ran out of memory when the same calculation was done in DAX:
The model couldn’t be refreshed and returned an out-of-memory error with a calculated column in the fact table of over 20 million rows (from a CSV file). A running total had to be calculated for each “JourneyID”, of which there were over 1 million in the table itself. This raised memory consumption during refresh by over 300% – until it finally errored out:
Click through for the solution.
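For context, the calculation itself is just a running total within each “JourneyID”; a toy version of that logic in Python (not the Power Query solution from the post, and with an invented value column) looks like this:

```python
# Toy illustration of a per-group running total; the post does this in Power
# Query over 20+ million rows. "Value" is an invented column name.
from collections import defaultdict

rows = [
    {"JourneyID": 1, "Value": 10},
    {"JourneyID": 1, "Value": 5},
    {"JourneyID": 2, "Value": 7},
    {"JourneyID": 1, "Value": 3},
]

totals = defaultdict(int)            # running total so far, per JourneyID
for row in rows:                     # assumes rows are already in the desired order
    totals[row["JourneyID"]] += row["Value"]
    row["RunningTotal"] = totals[row["JourneyID"]]

# rows now carry running totals of 10, 15, 7, and 18 respectively
```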
In a previous blog post, Duplicate Agent jobs – A good reason not to meddle with Msdb, I explained a situation where someone was updating msdb tables manually rather than using the supplied system stored procedures such as msdb.dbo.sp_update_job. It would seem that this was not the only occasion where I would find myself in the midst of the meddlers’ medley.
This time around, the meddler decided to disable the job using an UPDATE statement against the msdb.dbo.sysjobs table, setting enabled from 1 to 0. You have no idea how long it took me to work that out! But this led me on to discovering more about SQL Server Agent and its general behaviour.
Short form: use the built-in procedures to modify SQL Agent jobs rather than going off and updating tables on your own. Click through for the long form, with plenty of interesting details.
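As a quick sketch of the “use the supplied procedures” point: scripted from Python with pyodbc (connection string and job name are placeholders), disabling a job the supported way looks like this. The system procedure also notifies the Agent so it refreshes its cached copy of the job, which a raw UPDATE against sysjobs does not.

```python
# Sketch: disable an Agent job via the supported procedure instead of a direct
# UPDATE on msdb.dbo.sysjobs. Connection string and job name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Trusted_Connection=yes;",
    autocommit=True,
)

# Supported: sp_update_job changes the metadata *and* notifies SQL Server Agent,
# so the Agent's in-memory copy of the job stays in sync.
conn.execute(
    "EXEC msdb.dbo.sp_update_job @job_name = ?, @enabled = ?",
    "NightlyETL", 0,
)

# What the meddler did instead (don't do this): the table changes, but the
# Agent can keep acting on its cached definition of the job.
# conn.execute("UPDATE msdb.dbo.sysjobs SET enabled = 0 WHERE name = ?", "NightlyETL")
```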
One of the things I think is important in modeling your particular entity is including a primary key (PK). In my DevOps talk I stress this, as I’d rather most attendees come away thinking a PK is important as their first takeaway from the session. There are exceptions, but they are rare, and I would prefer that most tables just have some PK included from the beginning.
A PK ought to be stable as well, and there are plenty of written words about how to pick the PK for your particular problem domain. Often I have received the advice that natural keys are preferred over surrogate keys, and it is worth the effort to try and identify a suitable column (or set of columns) that will guarantee uniqueness. I think that’s good advice, and it’s also advice I tend to ignore.
Read on for Steve’s reasoning. I tend to use surrogate keys out of habit, though I do prefer to put unique key constraints on natural keys to help me reason through data models.
A Power BI Dataset is a series of Power Query queries that have been shaped into a DAX model. Each dataset can combine different files, database tables, and online services all into one tabular model. In our cookie analogy, these are all different “ingredients”.
Unlike SSRS, a dataset in Power BI does not represent a single table or query of data. A dataset should be considered more like a “flavor” of data used to accomplish a specific type of reporting: financial, operational, HR, etc. So in our analogy, the dataset is the “raw dough”.
So in Power Query, you are going to have a set of queries, each of which combines a data source with a (usually linear) set of transformations.
The pie chart cookie is the first one I would have eaten, if only to eliminate it.