Press "Enter" to skip to content

Author: Kevin Feasel

Generating Random Numbers with R

The folks at Data Sharkie walk us through random number generation in R:

Why is random number generation important, and where is it used?

Random number generation has applications in various fields such as statistical sampling, simulation, test design, and so on. Generally, when a data scientist needs a set of random numbers, they will have in mind…

The R programming language lets users generate randomly distributed numbers with a set of built-in functions: runif(), rnorm(), and rbinom().

Read on to generate random numbers across two separate distributions.
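
If you just want to see those three built-ins in action, here is a minimal R sketch (the distribution parameters below are arbitrary examples, not values from the article):

set.seed(42)                          # make the draws reproducible
runif(5, min = 0, max = 10)           # 5 uniform draws between 0 and 10
rnorm(5, mean = 2.5, sd = 1)          # 5 draws from a normal distribution (mean 2.5, sd 1)
rbinom(5, size = 10, prob = 0.3)      # 5 binomial draws: 10 trials, success probability 0.3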

Comments closed

Combining User-Defined Types and Temp Tables

Andy Levy tries to make cats and dogs live together:


This tripped me up a few weeks ago, but once I stopped and thought about it for a moment, it made total sense. I was trying to copy some data into a temp table and got an error I’d never encountered before.

Column, parameter, or variable #1: Cannot find data type MyStringType.

What’s that all about? Let’s find out.

I don’t think it spoils things to say that Andy’s story is a tragedy and not a comedy. But in fairness, the number of shops using user-defined types (as opposed to user-defined table types) is probably not enormous.
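
If you want to reproduce the error yourself, a minimal T-SQL sketch looks roughly like this (dbo.MyStringType and the table names are hypothetical stand-ins, not Andy's code):

-- In the user database: create an alias type and use it happily in a regular table
CREATE TYPE dbo.MyStringType FROM varchar(50) NOT NULL;
CREATE TABLE dbo.Widgets (WidgetName dbo.MyStringType);

-- This fails with "Cannot find data type MyStringType": the temp table is created
-- in tempdb, and alias types are scoped to the database that defines them
CREATE TABLE #WidgetStaging (WidgetName dbo.MyStringType);

-- Workarounds: fall back to the underlying base type, or use a table variable,
-- which resolves the type in the current database
CREATE TABLE #WidgetStagingFixed (WidgetName varchar(50));
DECLARE @WidgetStaging TABLE (WidgetName dbo.MyStringType);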

Comments closed

Creating a Time Dimension in Power BI

Reza Rad walks us through creating a time dimension in Power BI:

I have explained the Date dimension a lot previously and mentioned why it is needed. The Date dimension gives you the ability to slice and dice your data by different date attributes, such as year, quarter, month, day, fiscal columns, etc. The Time dimension, on the other hand, gives you the ability to slice and dice data at the level of hours, minutes, seconds, and buckets related to those, such as every 30 minutes, or every 15 minutes.

The Time table SHOULD NOT be combined with the Date table; the main reason is the huge size of the combined result. Let’s say your Date table, which includes one record per day, has 10 years of data in it, which means 3,650 rows. Now if you have a Time table with a row for every second, that ends up with 24*60*60=86,400 rows just for the Time table. If you combine the Date and Time tables, you will have 3,650*86,400=315,360,000 rows. 315 million rows in a table are not good for a dimension table. Even if you store one record per minute in your Time table, you would still end up with over 5 million rows.

So don’t combine the Date and Time tables. These should be two different tables, and both can have a relationship to the fact table.

With that in mind, click through to see how to create the table.
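
As a taste of the approach, one common pattern for a minute-level Time table is a DAX calculated table along these lines (a generic sketch, not Reza's exact code):

Time =
ADDCOLUMNS (
    GENERATESERIES ( 0, 1439, 1 ),                         -- one row per minute of the day
    "Time", TIME ( INT ( [Value] / 60 ), MOD ( [Value], 60 ), 0 ),
    "Hour", INT ( [Value] / 60 ),
    "Minute", MOD ( [Value], 60 ),
    "30 Min Bucket", INT ( [Value] / 30 ) * 30             -- 0, 30, 60, ... minutes from midnight
)

At one row per minute that is 1,440 rows, versus the 315 million rows of a combined date-time table.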

Comments closed

Execution Plan Training, in Video Form

Hugo Kornelis makes an announcement:

As those who have been to my full-day precon on execution plans know, I believe that learning to understand execution plans does not start with dozens of examples. It starts with an explanation of the basics, followed by an overview of operators. Just like learning Russian doesn’t start with reading Tolstoy’s Война и мир (War and Peace), but with learning the grammar rules and the vocabulary.

Once you know the grammar of a language, and enough of its vocabulary, you can then pick up any book. And the more you do that, the easier it becomes. Eventually, one day, you will be able to read Война и мир in its original language.

And once you know the basics of reading execution plans, and are familiar with most of the operators, you will be able to tackle any execution plan you find on your servers, no matter how complex.

And, at least for now, this is free. So check out what Hugo has already and pass along a “thank you” if you like what you see there.

Comments closed

Auto-Recovery with Power BI

Prathy Kamasani shows us how to recover lost Power BI desktop reports:

A quick post: how many times in Power BI Desktop have you clicked on “No, remove the files.” and then said OOPS? Well, I did it plenty of times before discovering this trick.

In short, you can find those removed files under the Temp folder, like many other Windows application files. Usually, the location will be somewhere like this: C:\Users\prathy\AppData\Local\Microsoft\Power BI Desktop\TempSaves. This location depends upon which version of Power BI Desktop you have. Beware: these files will be removed whenever you clear your Temp directory.

Auto-save and auto-recovery are marvelous things.

Comments closed

Horizontal Dumbbell Dot Plots in Excel

Stephanie Evergreen walks us through an interesting technique for creating dumbbell-style dot plots in Excel:

Ok, babes, prepare to be amazed. It used to be that making this viz was pretty tedious, but I’ve recently refined a totally new hack (thanks to a lollipop chart example provided by Sevinc Rende, one of my mentees) that makes this soooooooo much easier. It used to be ninja level 9. Now it is ninja level 5, if that.

We will create a dumbbell dot plot out of a stacked bar, where the first stack is composed of our first set of dot values and the second stack is composed of *the difference* between our first and second values (so that it would end at our second values on the x-axis scale). So let’s calculate the difference between the 2020 and 2010 scores.

Read on to see how.
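
For instance (with made-up numbers), if a category scored 40 in 2010 and 65 in 2020, the first stack segment is 40 and the second is 65 - 40 = 25, so the stacked bar ends exactly at 65 and the two segment boundaries sit right where the dots belong.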

Comments closed

Visualizing a Single Variable in R

Michaelino Mervisiano takes us through the types of visuals we can create to understand a single variable in R:

How do you create a histogram in R? And what information can we get from a histogram?
A histogram shows a frequency distribution. It is a great graph for showing the mode, the spread, and the symmetry (skewness) of your data. Here is a histogram of 1,000 random points drawn from a normal distribution with a mean of 2.5.

Of course I don’t like option number 4 and would replace it with something else (column/bar charts, Cleveland dot plots, or stacked column/bar depending on what you’re trying to observe). But this is a good way of thinking about how you can visualize a variable.
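
To draw a histogram along those lines, a minimal R sketch (not Michaelino's exact code) would be:

set.seed(123)
x <- rnorm(1000, mean = 2.5)          # 1,000 random points from a normal distribution with mean 2.5
hist(x, breaks = 30,
     main = "Histogram of x", xlab = "x")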

Comments closed

Accessing Blob Storage from Azure Databricks

Gauri Mahajan shows how we can read data in Azure Blob Storage from Azure Databricks:

Since our base set-up, comprising Azure Blob Storage (with a .csv file) and the Azure Databricks service (with a Scala notebook), is in place, let’s talk about the structure of this article. We will demonstrate the following in this article:

1. We will first mount the Blob Storage in Azure Databricks using the Apache Spark Scala API. In simple words, we will read a CSV file from Blob Storage into Databricks.
2. We will do some quick transformations on the data and move this processed data to a temporary SQL view in Azure Databricks. We will also see how we can use multiple languages in the same Databricks notebook.
3. Finally, we will write the transformed data back to the Azure Blob Storage container using the Scala API.

It’s just a few lines of code. One of the best things Microsoft and the Databricks team did for Azure Databricks was to ensure that it felt like a first-party offering—everything feels a little more integrated than Databricks for AWS.
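
For a rough idea of what those three steps look like in a Scala notebook, here is a hedged sketch (the container, storage account, access key, and paths are placeholders, not values from the article):

// 1. Mount the Blob Storage container (assumes a Databricks notebook, where dbutils and spark are available)
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net/",
  mountPoint = "/mnt/blobdata",
  extraConfigs = Map("fs.azure.account.key.<storage-account>.blob.core.windows.net" -> "<access-key>")
)

// 2. Read the CSV, apply a quick transformation, and expose it as a temporary SQL view
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("/mnt/blobdata/input.csv")
val transformed = raw.na.drop()                      // e.g. drop rows containing nulls
transformed.createOrReplaceTempView("transformed")   // now queryable from a %sql cell in the same notebook

// 3. Write the transformed data back to the mounted container
transformed.write.mode("overwrite").option("header", "true").csv("/mnt/blobdata/output/")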

Comments closed

Obtaining Accurate Totals in DAX

Alberto Ferrari explains a nuance of summation in DAX:

In simple DAX measures, the total of a report is the sum of its individual rows. For more sophisticated measures, the total might seem wrong because the formula does not include an aggregation over the rows that are visible in the report. For example, if the total of a measure must be the sum of the values displayed in the rows of a report, we consider the expected result a “visual total”, which is a total that corresponds to the visual aggregation of its values sliced by different rows in the report.

Click through for a straightforward demonstration.
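
The usual pattern for forcing a visual total, shown here as a generic sketch rather than Alberto's example, is to iterate over the values visible on the report rows and sum the base measure per value:

Sales Amount (Visual Total) =
SUMX (
    VALUES ( 'Product'[Product Name] ),   -- one row per product shown in the report
    [Sales Amount]                        -- evaluate the base measure for each, then add them up
)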

Comments closed

Understanding Point-In-Time Recovery with SQL Server

Eduardo Pivaral walks us through what it takes to get point-in-time recovery of data in SQL Server:

Nowadays, data is a precious asset for companies. If you are a database administrator (by decision or by mistake), or simply you are the “IT guy,” you have the mission of guaranteeing that all the data is backed up and accessible for recovery.

Trust me, even when you think you have the most reliable hardware on the planet, or you have multiple database replicas around the globe, anything can happen (a user deleting an entire schema by mistake, an application updating the wrong records, some process crashing; a lot of things can happen).
So trust me and don’t question me: just back up all your databases regularly.

During my time as a DBA, I think the most frequent reason for needing point-in-time backups was “We goofed up at 2:20 PM and need to get the database back to that state,” where goof-ups typically involved mass updates or deletes of data.
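
For a sense of what that 2:20 PM scenario looks like in practice, here is a hedged T-SQL sketch (the database name, file paths, and STOPAT time are placeholders), assuming the database is in the full recovery model with an unbroken log backup chain:

-- Capture the tail of the log so transactions since the last log backup are not lost
BACKUP LOG SalesDB TO DISK = N'C:\Backups\SalesDB_tail.trn' WITH NORECOVERY;

-- Restore the most recent full backup without recovering the database
RESTORE DATABASE SalesDB FROM DISK = N'C:\Backups\SalesDB_full.bak' WITH NORECOVERY, REPLACE;

-- Restore the log chain, stopping just before the mistake at 2:20 PM
RESTORE LOG SalesDB FROM DISK = N'C:\Backups\SalesDB_log1.trn' WITH NORECOVERY;
RESTORE LOG SalesDB FROM DISK = N'C:\Backups\SalesDB_tail.trn'
    WITH STOPAT = N'2024-06-01 14:19:59', RECOVERY;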

Comments closed