Project-Oriented R Development

Jenny Bryan explains how building projects in R can reduce the likelihood that someone will come in and set your computer on fire:

I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.

Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).

This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.

I admit that I’m just now getting into using projects regularly for my one-off stuff.  This is very good advice.  H/T David Smith

What Is A Dashboard

I’ve started a new series on data visualization:

There are a few things which make dashboards useful:

  • Ideally, the dashboard is a “single pane of glass.”  By that, I mean that all relevant indicators are visible on the screen at the same time.  With my car, it’s close but no cigar:  I can see one of miles traveled, average fuel mileage, or current fuel mileage at a time.  If I want to see a different item, I need to hit a button on the steering wheel to scroll through those options.  By contrast, the TV show dashboard has everything on a single screen with no scrolling or switching required.

  • Key Performance Indicators (KPIs) are readily apparent.  For the TV show dashboard, we have a couple key metrics on display:  episode rating and number of votes as sourced from IMDB at the time I pulled those numbers.

  • Relevant KPIs are bunched together in a logical fashion.  On the top half of the dashboard, we see two visuals relating to average rating by show.  The bottom half show rating & user vote counts for the three highest-rated shows.

  • Layouts are consistent between dashboard elements and between related dashboards.  On the TV show dashboard, bars and columns use a single, consistent color.  Also, shows have thematic colors:  Daredevil in red, Jessica Jones blue, Punisher black, etc.  If I had a second dashboard for season two, I would want to use the same theme.

Read on for more details about what a dashboard is and some of the sundry forms of dashboards.

DBA Morning Health Checks

Patrick Keisler has a process which walks through some morning health checks for SQL Server instances:

A few weeks ago, a customer asked me to help develop a way to improve their morning routine of checking the health of each SQL Server. This daily task would typically take about an hour for the DBA to complete. The solution I provided him, reduced that task to under one minute.

The DBA needed me to answer these questions for each SQL Server:

1.       What is the uptime of each SQL Server?

2.       What is the status of each database?

3.       What is the status of each Availability Group?

4.       What is the backup status of each database?

5.       What is the available disk space?

6.       Are there any SQL Agent failed jobs in the last 24 hours?

7.       What errors appeared in the SQL errorlog in the last 24 hours?

In addition, the customer asked to display the results using the typical stop light colors (red, yellow, and green) to represent the condition of each server.

Click through for more details, as well as a link to Patrick’s GitHub repo which hosts the script.

Finding Open Transactions

Andy Mallon has a stored procedure to find open transactions:

I’m fussy. Paul’s example script is awesome, but I want to tweak it to meet my needs. I also want to wrap it into a stored procedure so that I can put it in my DBA database, and have it installed on every server I manage. In addition to the columns Paul pulls back, I want to find out:

  • More session details: host name, application name, etc
  • More transaction info: Transaction state, duration, etc
  • Filter out very short transactions: I want to filter out the noise for regular activity

There’s also a transaction state that I exclude. Transaction State 3 means “The transaction has been initialized but has not generated any log records.” A transaction in this state won’t affect log reuse. However, it could affect the size of the version store in TempDb if you’re using Read Committed Snapshot Isolation. I haven’t run into this as a problem, so I filter it out.

I used dbo.Check_Blocking as a base to create another check script to check open transactions

Click through to see what Andy’s procedure offers and also for a link to get the script.

Smart Transaction Log Backups

Tracy Boggiano has started a new series on smart transaction log backups.  Part one involves taking smarter transaction log backups in SQL Server 2017:

SQL Server 2017 introduced two fields to help with taking smart backups.  One was for taking smarter log backups, for this have DMV sys.dm_db_log_stats that have two fields log_backup_time and log_since_last_backup_mb.  With the combination of these two fields, we can put some logic in the jobs that I use for Ola’s scripts that use my config tables from my Github repository.  To support this change we will be added three new fields to the DatabaseBackupConfig table:

  • SmartBackup

  • LogBackupTimeThresholdMin

  • LogBackupSizeThresholdMB

Click through for scripts.

Row Counts From Statistics In Azure DW

Derik Hammer has a script to estimate row counts in an Azure SQL Data Warehouse table:

Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.

Below is a query for finding row counts of tables in SQL Data Warehouse which accounts for the differences in architecture between my earlier script, written for SQL Server, and SQL Data Warehouse.

Click through for the script.

Embedding Base-64 Encoded Images Into Power BI

Jason Thomas shows how to embed an image into Power BI without using an image URL:

And I completely understood his concerns as I had the same issue with some of the public facing reports that I made, for eg., the US Election report that I had made 1 year back. The images for the candidates were sourced from Wikipedia and certain candidates like George Bush, Donald Trump, etc. are not displayed, because the image URLs are no longer valid.

This is where you can use my workaround to embed the images within the report by converting the images into Base64.

It’s an interesting approach when you need to solve this problem.

DAX’s EARLIER And EARLIEST Functions

Matt Allington contrasts the EARLIER function with the EARLIEST function in DAX:

The EARLIER function by default refers to the row context that immediately preceded the current row context being considered.  In the examples used here there have only been 2 row contexts, the outer and the inner.  Given there are only 2, when using EARLIER within the inner row context it is always referring to the outer row context.  In this case, EARLIER and EARLIEST refer to the same outer row context.  If there are more than 2 row contexts, the EARLIER function accesses the previous row context by default (starting from the inner and moving out) and the EARLIEST function refers to the outermost row context. But did you know that EARLIER has a second optional parameter?

Read the whole thing.

Categories

January 2018
MTWTFSS
« Dec  
1234567
891011121314
15161718192021
22232425262728
293031