
Month: February 2020

Finding the Right Disk and Data Node Sizes in HDFS

Lokesh Jain has some advice when it comes to disk and data node size:

There are two factors to keep in mind when choosing node capacity. These will be discussed in detail in the next sections.

1. Large Disks – total node capacity being the same, using more disks is better as it yields higher aggregate IO bandwidth.
2. Dense Nodes – as nodes get denser, recovery after node failure takes longer.

These factors are not HDFS-specific and will impact any distributed storage service that replicates data for redundancy and serves live workloads.
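
As a rough illustration of both factors, here is a minimal back-of-the-envelope sketch in Python. Every number in it (per-disk throughput, node capacity, cluster re-replication rate) is a hypothetical placeholder rather than a figure from the linked post, and the model ignores real-world effects such as throttling and uneven block placement.

```python
def aggregate_disk_bandwidth_mb_s(disk_count, per_disk_mb_s):
    """Total sequential IO bandwidth if every disk can be driven in parallel."""
    return disk_count * per_disk_mb_s


def recovery_hours(node_capacity_tb, cluster_rereplication_mb_s):
    """Hours needed to re-replicate one failed node's data.

    Assumes the node was full and that the rest of the cluster can copy
    blocks at a steady combined rate (a simplification).
    """
    total_mb = node_capacity_tb * 1_000_000  # decimal TB to MB
    return total_mb / cluster_rereplication_mb_s / 3600


# Same 48 TB of node capacity, split two ways (hypothetical 150 MB/s per disk).
few_large_disks = aggregate_disk_bandwidth_mb_s(6, 150)    # 6 x 8 TB disks
many_small_disks = aggregate_disk_bandwidth_mb_s(12, 150)  # 12 x 4 TB disks

# Denser nodes take proportionally longer to recover at the same cluster rate.
sparse_node_hours = recovery_hours(50, 10_000)
dense_node_hours = recovery_hours(100, 10_000)
```

With these placeholder inputs, doubling the disk count doubles the aggregate bandwidth, and doubling node density doubles the recovery window.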

Click through for specific advice on maximum disk and node sizes.

Comments closed

Pulling R Packages from Fedora

Inaki Ucar has an interesting project:

Bringing R packages to Fedora (in fact, to any distro) is an Herculean task, especially considering the rate at which CRAN grows nowadays. So I am happy to announce the cran2copr project, which is an attempt to maintain binary RPM repos for most of CRAN (~15k packages as of Feb. 2020) in an automated way using Fedora Copr.

Click through for installation instructions if you’re using an RPM-based Linux distribution like Fedora or CentOS. H/T R-Bloggers.

Comments closed

Scripting and Deploying SQL Agent Jobs

Alex Yates shows how you can incorporate SQL Agent jobs in your CI/CD process:

Basically, we need to put all the SQL Agent Job .sql scripts into a git repo. Then we need a PowerShell script that executes each .sql script against the necessary target databases. If you use SSDT, you might prefer to use a post deployment script to do this. That bit should be reasonably straightforward. I’ll leave that as a task for the user since I’m short on time.

You probably want to put some thought into whether your agent jobs are scoped to a particular database, general server admin for a specific server, or whether you want them to be standardised across many servers, since this may affect where you choose to put your jobs in source control and on what schedule you want to deploy them.

It may also make sense to set up MSX if you have a central server. That would make Agent job deployment easier and you can still script out which sets of servers get which jobs.
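
The deployment loop described in the quote can be sketched roughly as follows. This is written in Python rather than the PowerShell the author suggests, and it is only a sketch under assumptions: the repo layout, the server names, the choice of `msdb` as the target database, and the use of Windows authentication are all hypothetical. It shells out to the `sqlcmd` command-line tool, and it defaults to a dry run that only builds the commands.

```python
import subprocess
from pathlib import Path


def find_job_scripts(repo_dir):
    """Return every .sql file under repo_dir, sorted for a repeatable order."""
    return sorted(Path(repo_dir).rglob("*.sql"))


def build_sqlcmd_args(server, script, database="msdb"):
    """Build a sqlcmd invocation for one script.

    -S server, -d database, -E trusted connection,
    -b stop with an error code on failure, -i input file.
    Targeting msdb is an assumption; adjust to your job scripts.
    """
    return ["sqlcmd", "-S", server, "-d", database, "-E", "-b", "-i", str(script)]


def deploy_jobs(repo_dir, servers, dry_run=True):
    """Run each job script against each server; return the commands used."""
    commands = []
    for server in servers:
        for script in find_job_scripts(repo_dir):
            args = build_sqlcmd_args(server, script)
            commands.append(args)
            if not dry_run:
                # check=True raises if sqlcmd exits with a non-zero code
                subprocess.run(args, check=True)
    return commands
```

Calling `deploy_jobs("agent-jobs", ["SERVER1", "SERVER2"])` with hypothetical names returns the command lists without executing anything; passing `dry_run=False` runs them. Keeping one folder per server group is one way to control which servers get which jobs.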

Comments closed

Tenant Usage Monitoring with Power BI

Jeff Pries shows us the culmination of several blog posts’ worth of work:

Most of those posts (listed at the bottom of this page) are fairly long and technical, as initially getting started using a program to read Power BI data via the API can be a bit much.

This post is going to be a bit different. Short and sweet. The payoff for all that hard work authenticating to Power BI, requesting data, downloading that data, and storing it in an easy to use SQL table.

With all of the hard work out of the way, it’s time to build a Power BI report to explore that great Activity Log usage data.

This is the payoff and it’s quite useful.

Comments closed

Benford’s Law in Power BI

Imke Feldmann shows how you can build up a Benford distribution in DAX:

The green columns show how often each number should be the first digit in numbers that should follow the Benford-distribution. In black you’ll see the actual distribution of first digits within my table. Lastly, the red line shows the percentual absolute deviations between actual and Benford values.

In this example, there is a relatively high occurrence of numbers starting with 4 and 5. So this could be a sign of fraudulent manipulation.

In the example, eyeballing it says things look pretty good. It’s interesting to see just how many things fit a Benford distribution, including populations, budgets (when you have enough line items), expenses, etc. Not everything does, however—high and low temperatures tend not to, either in Fahrenheit or Celsius.
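
For readers who want to see the arithmetic behind the chart, here is a minimal sketch in Python rather than the DAX used in the linked post. The expected share of leading digit d under Benford's law is log10(1 + 1/d); the function names and the input list are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Leading non-zero digit of a non-zero number, ignoring its sign."""
    # Scientific notation puts the leading digit first, e.g. 4.2e-04.
    return int(f"{abs(value):.15e}"[0])


def first_digit_shares(values):
    """Observed share of each leading digit, skipping zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def absolute_deviations(values):
    """Absolute gap between observed and Benford shares, per digit."""
    expected = benford_expected()
    actual = first_digit_shares(values)
    return {d: abs(actual[d] - expected[d]) for d in range(1, 10)}
```

Passing a column of amounts to `absolute_deviations` gives one number per digit; unusually large gaps are the ones worth a closer look, though a gap alone does not prove anything.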

Comments closed

The Histogram Output with Extended Events

Grant Fritchey talks us through the histogram output in Extended Events:

The histogram target behaves similarly to the event_counter target. The event_counter target counts the number of times that an event occurs. However, the histogram target lets you pick a grouping mechanism for the histogram. You can use either an action, or an event field.

For demonstration purposes, what I want to know is, per object in the database, how many times are the statistics automatically updated?

Read on to see the test, including event setup, data-building queries, and usage of the histogram itself.

Comments closed

Comparing Slicers and Filters in Power BI

Teo Lachev has a nice comparison of slicers versus filters in Power BI:

Besides the built-in cross-filtering and cross-highlighting among visuals, Power BI supports two explicit filtering options: slicers and filters. Which one to use? Traditionally, you would use a slicer when you want the user to easily see what’s filtered on the report page. But with the introduction of the new filter pane and slicer enhancements, the choice becomes more difficult. Let’s compare the two options:

Click through for a table of comparisons as well as some advice.

Comments closed

Apache Flink 1.9.2 Released

Hequn Cheng announces Apache Flink 1.9.2:

The Apache Flink community released the second bugfix version of the Apache Flink 1.9 series.

This release includes 117 fixes and minor improvements for Flink 1.9.1. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.9.2.

Read on to see all of the things they’ve fixed and improved.

Comments closed

Calculating Distances in R

Chris Brown gives us three ways to calculate distance in R:

Calculating a distance on a map sounds straightforward, but it can be confusing how many different ways there are to do this in R.

This complexity arises because there are different ways of defining ‘distance’ on the Earth’s surface.

The Earth is spherical. So do you want to calculate distances around the sphere (‘great circle distances’) or distances on a map (‘Euclidean distances’)?

Then there are barriers. For example, for distances in the ocean, we often want to know the nearest distance around islands.

Then there is the added complexity of the different spatial data types. Here we will just look at points, but these same concepts apply to other data types, like shapes.
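
To make the first two definitions concrete, here is a minimal sketch in Python rather than the R used in the linked post. It uses the haversine formula for great-circle distance and plain Pythagoras for Euclidean distance on projected coordinates. The mean Earth radius of 6371 km and the function names are assumptions of this sketch, and it does not cover the third case of routing around barriers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; treats the Earth as a perfect sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def euclidean(x1, y1, x2, y2):
    """Straight-line distance between two points on a flat, projected map.

    The result is in whatever units the projection uses (often metres).
    """
    return math.hypot(x2 - x1, y2 - y1)
```

The two functions take different inputs on purpose: the great-circle version expects latitude and longitude, while the Euclidean version only makes sense after the points have been projected onto a flat coordinate system.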

Read on to learn these three separate techniques. H/T R-Bloggers.

Comments closed

Recovering Lost Linked Servers

Taryn Pratt had a post-upgrade problem:

Recently, I kicked off a project to start moving us to SQL Server 2019. During my initial review of our servers, I found quite a few (9 total) that were still running on Windows Server 2012 R2. This meant that I would need to upgrade the operating system and move us to SQL Server 2019. Having completed plenty of SQL Server upgrades, as well as operating system upgrades, I couldn’t possibly make a mistake, right? Wrong…I completely forgot to script out the linked servers on the server I upgraded this week. I screwed up and decided to write about how I went about fixing it.

Click through to understand the problem and solution.

Comments closed