Press "Enter" to skip to content

Curated SQL Posts

HDFS Erasure Coding

Ayush Tiwari explains how HDFS Erasure Coding lets us keep the same data durability while improving storage efficiency:

In this particular example, when you look at the codeword, actual data cells are 6(blue cells) & 3(red cells) are the parity cells which are simply obtained by multiplied our data cells to generation matrix.
Storage failure can be recovered by the multiplying inverse of generator matrix with the extended codewords as long as ‘k’ out of ‘k+m’ cells are available.
Therefore, here Data Durability is 3 as it can handle 2 simultaneous failure, Storage Efficiency is 67% (as we are using only 1 extra block i.e. 6/9) & only we need to store half number of cells as compared to original number of cells, we can conclude that we also have only 50% overhead in it.

It’s a good explanation of one of the biggest improvements to HDFS over the past several years.

Comments closed

Ignoring Case In Table.Distinct With Power Query

Cedric Charlier shows how to ignore case when running Table.Distinct in Power Query:

Note that at the top of the table, I’ve twice the value ADO.NET, once with uppercase and once lowercase for the “Net” part. If I try to apply a Table.Distinct, the two values will be considered as distinct and one of them won’t be removed.

Will it be an issue? If this your key and it’s part of a model where this key is part of one side of a one-to-many then it will be an issue.

Read on for the solution.

Comments closed

Using Cosmos DB For Graph Data

Jose Mendes has an introduction to the Cosmos DB graph engine:

Gremlin is the graph traversal language of Apache TinkerPop, an open source Graph Computing Framework. Gremlin allows the users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here). There are 4 fundamental steps:

· transform: transform the objects in the stream

· filter: remove objects from the stream

· sideEffect: pass the object, but yield some side effect

· branch: decide which step to take

Click through for a quick example showing how to create and populate a graph.

Comments closed

Will It Bit?

Louis Davidson wants to see what he can cast to a bit type:

There are no other textual/alpha string values that will cast to a bit value, but the numeric values that will cast to a bit are voluminous (even some that are in string format). Consider the following eight statements:

SELECT CAST(100 AS bit);
SELECT CAST(-100 AS bit);
SELECT CAST(99999999999999999999999999999999999999 AS bit);
SELECT CAST(-99999999999999999999999999999999999999 AS bit);
SELECT CAST(88.999999 AS bit);
SELECT CAST('1' AS bit);
SELECT CAST('2' AS bit);
SELECT CAST('999999' AS bit);

Danged if they didn’t all work, and all return 1.

Check out what else Louis tries to cast to a bit type.

Comments closed

All About Power BI Licensing

Reza Rad has a guide for the perplexed when it comes to licensing Power BI:

The pricing table above may scare you off and you may immediately think of not going through the embedded path. However, I need to let you know that there are some scenarios which Power BI Embedded can be a much more cost-effective option than Pro. Here is an example:

Assume that you have 100 users for your Power BI solution. And your users are not connecting all at the same time to use Power BI reports. You may have the maximum of 300 page renders per hour for them if you use embedded. In such case, embedded for that scenario would cost you about $700 USD per month, where the Power BI Pro for 100 users would be $1000 USD per month. This means saving of $3,600 USD per year. This is an example scenario that Power BI Embedded can be more cost-effective than Pro.

Give this a careful reading if you’re looking to implement Power BI in your environment.

Comments closed

Working With forcats

S. Richter-Walsh demonstrates what the forcats R package can do:

Synonymous factor levels

Sometimes a categorical variable may have two or more factor levels that refer to the same group. There may be subtle differences in syntax such as upper case leading letter versus lower case leading letter (GroupA vs. groupA), for example. In this situation, one can use forcats::fct_collapse() to collapse the synonymous levels into one. In our test data, let’s assume that Web and Online refer to the same sales channel and we want to combine both into a factor level called Online….

df$sales <- fct_collapse(df$sales, Online = c("Online", "Web"))

I don’t use forcats that often, but when I do, I definitely appreciate it being here.  H/T R-Bloggers

Comments closed

Tips For Processing Large Data Sets With Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csvallows you to load massive file as small chunks in Pandas. We decide to take 10% of the total length for the chunksize which corresponds to 40 Million rows.
Be careful it is not necessarily interesting to take a small value. The time between each iteration can be too long with a small chaunksize. In order to find the best trade-off “Memory usage – Time” you can try different chunksize and select the best which will consume the lesser memory and which will be the faster.

Click through for more tips.

Comments closed

Plotting In R Using ggplot2

The folks at Sharp Sight Labs have another nice demo of ggplot2:

You’ve heard me say it a thousand times: to master data science, you need to practice.

You need to “practice small” by practicing individual techniques and functions. But you also need to “practice big” by working on larger projects.

To get some practice, my recommendation is to find reasonably sized datasets online and plot them.

Wikipedia is a nearly-endless source of good datasets. The great thing about Wikipedia is that many of the datasets are small and well contained. They are also fairly clean, with just enough messiness to make them a bit of a challenge.

As a quick example, this week, we’ll plot some economic data.

The code is deceptively easy considering the scope of the problem.

Comments closed

More Fun With Row Goals

Joe Obbish has a detailed investigation into performance of a simple-enough query:

Something looks very wrong here. The loop join plan has a significantly lower cost than the hash join plan! In fact, the loop join plan has a total cost of 0.0167621 optimizer units. Why would disabling row goals for such a plan cause a decrease in total query cost?

I uploaded the estimated plans here for those who wish to examine them without going through the trouble of creating tables.

It’s a long but worthwhile read.

Comments closed

Null Reference Exception In SSMS 17.5

David Fowler notes a bug in SQL Server Management Studio 17.5 around availability group failover:

After a spot of head scratching and thinking that there was something wrong with my AG setup, it turns out that there’s a bug in SSMS.  I was running SSMS 17.5 although this may well also affect earlier versions.

Looking at the release notes for SSMS 17.6, one of the bug fixes that this version addresses is…

  • Fixed an issue when the primary is down and manually failover to secondary, a NullReferenceException will be thrown.

David notes that upgrading fixed his issue; read on for more.

Comments closed