
Author: Kevin Feasel

Connecting to Jira with PowerShell

Kevin Marquette has a new PowerShell module:

JiraPS is a wonderful module that is great at a lot of things. There are several great people who have put a lot of time into it for it to have such good feature coverage. It was designed to be approachable and do a lot of validation for you. JiraPS does not have a good user story around bulk operations with its issue commands, and that's what I need from it the most.

It also has a large user base with hundreds of thousands of downloads and is used in a lot of organizations. Every change made to that module at this point needs to pay very close attention to backwards compatibility.

Now there are two modules to choose from, depending on your use case.

Comments closed

Bring .NET Support to Spark

I have a request that you vote up a Spark issue:

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.
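For a sense of what that would open up, here is a minimal PySpark sketch of the kind of DataFrame and Spark SQL work the ticket would bring to C# and F# developers; the file path and column names are made up purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a SparkSession -- the entry point the ticket would expose to .NET.
spark = SparkSession.builder.appName("dotnet-support-example").getOrCreate()

# Load a DataFrame; the path and schema here are illustrative only.
sales = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)

# Typical DataFrame operations: filter, group, aggregate.
summary = (sales
           .filter(F.col("amount") > 0)
           .groupBy("region")
           .agg(F.sum("amount").alias("total_amount")))
summary.show()

# The same data is reachable through Spark SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region").show()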

You can click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.

Comments closed

Against Surrogate Keys on Junction Tables

Lukas Eder explains the costs of surrogate keys on tables intended to join multiple tables together:

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Lukas makes a good argument here.
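Lukas's examples are in SQL against the Sakila schema. As a rough illustration of the same idea in Python, here is a hedged SQLAlchemy sketch of a junction table keyed solely by its two foreign keys, with no surrogate ID column; the table and column names loosely follow the Sakila example and are not taken from Lukas's post.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Junction table: the compound (film_id, actor_id) key is the primary key.
# No FILM_ACTOR_ID surrogate column is added.
film_actor = Table(
    "film_actor",
    Base.metadata,
    Column("film_id", ForeignKey("film.film_id"), primary_key=True),
    Column("actor_id", ForeignKey("actor.actor_id"), primary_key=True),
)

class Film(Base):
    __tablename__ = "film"
    film_id = Column(Integer, primary_key=True)
    title = Column(String(255), nullable=False)
    # Many-to-many relationship routed through the junction table above.
    actors = relationship("Actor", secondary=film_actor, backref="films")

class Actor(Base):
    __tablename__ = "actor"
    actor_id = Column(Integer, primary_key=True)
    first_name = Column(String(45), nullable=False)
    last_name = Column(String(45), nullable=False)
```

With no surrogate column, the primary key itself covers the common lookup path (all actors for a film), which is the kind of performance angle Lukas explores.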

Comments closed

Date Dimensions and Large Power BI Files

Reza Rad explains why your Power BI data size might be abnormally large:

If you have a large *.pbix file, you can investigate which columns and tables are causing the highest storage consumption using Power BI Helper. You can download Power BI Helper for free from here. I opened the file above in Power BI Helper, and in the Modeling Advise tab, this is what I see:

As you can see in the above output, the Date field in the Date table is the biggest column in this dataset, taking 150MB of runtime memory! That's despite having only three distinct values in the column! Seems a bit strange, doesn't it? Let's dig into the reason in more depth.

Read on for Reza’s explanation and what you can do to fix it.

Comments closed

Pacemaker Changes Affecting SQL on Linux

Randolph West has an important message if you’re running SQL Server on Linux:

Heads up for SQL Server on Linux folks using availability groups and Pacemaker. Pacemaker 1.1.18 has been out for a while now, but it’s worth mentioning that there was a behaviour change in how it fails over a cluster. While the new behaviour is considered “correct”, it may affect you if you’ve configured availability groups on a previous version (specifically 1.1.16).

Click through for more details and what you can do about this.

Comments closed

Persisting Databases with Docker Compose

Rob Sewell tries out docker-compose to persist databases using named volumes:

With all things containers, I refer to my good friend Andrew Pruski. Known as dbafromthecold on Twitter, he blogs at https://dbafromthecold.com

I was reading his latest blog post Using docker named volumes to persist databases in SQL Server and decided to give it a try.

His instructions worked perfectly and I thought I would try them using a docker-compose file as I like the ease of spinning up containers with them.

Read on for Rob’s travails, followed by great success. And never go into something named “the spooky basement”; that’s just good life advice.
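For the general shape of the approach, here is a minimal docker-compose sketch along the lines of what Andrew and Rob describe: a named volume mounted over SQL Server's default data directory so the databases survive the container. The image tag, SA password, and port mapping are placeholders, not values taken from either post.

```yaml
version: "3.7"

services:
  sql:
    image: mcr.microsoft.com/mssql/server:2019-latest   # placeholder tag
    environment:
      ACCEPT_EULA: "Y"
      SA_PASSWORD: "Testing1122@@"                      # placeholder password
    ports:
      - "15789:1433"                                    # host:container, placeholder
    volumes:
      - sqldata:/var/opt/mssql                          # named volume persists system and user databases

volumes:
  sqldata:
```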

Comments closed

Apache Druid Concepts

Jatin Demla takes us through some of the key concepts behind Apache Druid:

Apache Druid is a distributed, high-performance columnar store for real-time analytics on large datasets. Druid's core design combines ideas from OLAP analytics, time series databases, and search systems to create a single operational analytics system. Druid is most suitable for data with high-cardinality columns or for queries with heavy aggregation and grouping.

Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.
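To make the aggregation and group-by point concrete, here is a hedged Python sketch that sends a grouping query to Druid's SQL-over-HTTP endpoint. The host, port, datasource, and column names are assumptions (they follow the Druid quickstart's wikipedia example), not anything from Jatin's post.

```python
import requests

# Druid exposes a SQL API over HTTP, typically on the router or broker.
# Host, port, datasource, and column names below are illustrative.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

query = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
response.raise_for_status()

# The default result format is a JSON array of row objects.
for row in response.json():
    print(row["channel"], row["edits"])
```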

Comments closed

Residual Analysis with R

Abhijit Telang shares a few techniques for doing post-regression residual analysis using R:

Naturally, I would expect my model to be unbiased, at least in intention, and hence any leftovers on either side of the regression line that did not make it on the line are expected to be random, i.e. without any particular pattern.

That is, I expect my residual error distributions to follow a bland, normal distribution.

In R, you can do this elegantly with just two lines of code:
1. Plot a histogram of the residuals.
2. Add a quantile-quantile plot with a reference line that passes through the first and third quartiles.
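Abhijit does this in R; as a rough Python equivalent using statsmodels and matplotlib with made-up data, the same two plots look something like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Made-up data: fit a simple linear regression and pull out the residuals.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + rng.normal(0, 1.5, size=200)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# 1. Histogram of residuals: should look roughly bell-shaped if the errors are normal.
ax1.hist(residuals, bins=20)
ax1.set_title("Histogram of residuals")

# 2. Q-Q plot with a reference line through the first and third quartiles
#    (the equivalent of R's qqnorm() followed by qqline()).
sm.qqplot(residuals, line="q", ax=ax2)
ax2.set_title("Normal Q-Q plot")

plt.tight_layout()
plt.show()
```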

There are several more techniques in here to analyze residuals, so check it out.

Comments closed

Generating TPC-DS Data Sets with HDInsight

Chris Koester shows how you can generate artificial data sets in the TPC-DS format using HDInsight:

This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. The intended purpose of TPC data is for benchmarking purposes, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.

The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to set up, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.

Most of the instructions should carry over to on-prem or other non-HDInsight Hadoop clusters, though you will need some changes to account for the differences from HDInsight.
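Once the generator finishes, a quick sanity check against one of the generated tables is worthwhile. This is a hedged PyHive sketch: the host, port, and database name are placeholders, and HDInsight in particular fronts HiveServer2 over HTTPS, so connection details there will differ from a plain on-prem cluster.

```python
from pyhive import hive

# Placeholder connection details; adjust for your cluster's HiveServer2 setup.
conn = hive.connect(host="headnode0", port=10000, database="tpcds")
cursor = conn.cursor()

# store_sales is the largest fact table in the TPC-DS schema.
cursor.execute("SELECT COUNT(*) FROM store_sales")
print("store_sales rows:", cursor.fetchone()[0])

cursor.close()
conn.close()
```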

Comments closed

Row Versioning and 14 Bytes

Kendra Little explains why enabling row versioning adds 14 bytes per row:

I love it when someone sends me a repro script, but in this case I didn’t need to run it. The first thing I did was to look at the two numbers given for row size, and to subtract the smaller one from the larger one: 724 – 710 = 14 bytes of difference.

That bit of information alone gave me an immediate guess of what was going on.

Click through for the solution as well as a more detailed explanation of one of the trickier scenarios.

Comments closed