Avoding Direct View() Calls In R

Kevin Feasel

2018-03-06

R

John Mount notes that you should not assume that the View() function in R will work:

R tip: get out of the habit of calling View() directly.

View() only works correctly in interactive environments, not currently in RMarkdown contexts. It is better to call something else that safely dispatches to View(), or to something else depending if you are in an interactive or non-interactive session.

Click through for a script which is safe to run whether you’re in R Studio or using knitr to build a document.

Recommendations For Running Kafka On AWS

Kevin Feasel

2018-03-06

Hadoop

Prasad Alle has some recommendations if you decide to run Apache Kafka on AWS:

The network plays a very important role in a distributed system like Kafka. A fast and reliable network ensures that nodes can communicate with each other easily. The available network throughput controls the maximum amount of traffic that Kafka can handle. Network throughput, combined with disk storage, is often the governing factor for cluster sizing.

If you expect your cluster to receive high read/write traffic, select an instance type that offers 10-Gb/s performance.

In addition, choose an option that keeps interbroker network traffic on the private subnet, because this approach allows clients to connect to the brokers. Communication between brokers and clients uses the same network interface and port. For more details, see the documentation about IP addressing for EC2 instances.

If you are deploying in more than one AWS Region, you can connect the two VPCs in the two AWS Regions using cross-region VPC peering. However, be aware of the networking costs associated with cross-AZ deployments.

There’s some good advice here, as well as acknowledgement of various tradeoffs involved in architecting a solution.

Module Signing For Database Rights

Solomon Rutzky shows how to use module signing to grant granular permissions to users:

Scenario: We want to allow one or more Users and/or Database Roles to be able to truncate certain Tables, but not all Tables. We certainly do not want to allow anyone the ability to make structural changes to the Table.

Also, it is likely that, over time, at least one more Tables will be added that the User(s) and/or Role(s) should be able to truncate, and less likely, though not impossible, that one or more tables that they should be able to truncate now might be removed.

Truncation is a great example of the kind of right you’d want behind a signed stored procedure, as the level of right necessary to truncate a table is absurd:  practically full control of the table.  Module signing is something that I wish more DBAs knew and implemented.

Executing R Scripts In SSRS

Tomaz Kastrun shows how to include R scripts (and visuals) in SQL Server Reporting Services:

Using the privileges of R language to enrich your data, your statistical analysis or visualization is a simple task to get more out of your reports.

The best practice to embed R code into SSRS report is to create stored procedure and output the results to report. To demonstrate this, we will create two reports; one that will take two input parameters and the second one to demonstrate the usage of R visualization.

It’s nice to be able to use R to create nice visuals and then import them in your SSRS report, and Tomaz shows how.

Explaning RANKX In DAX

Philip Seamark explains how the RANKX function works:

One of the first traps to encounter when using this function is the function can be used in calculations for calculated columns as well as calculated measures.   The RANKX function will still do what it is asked.  The trick is how you use the function in each scenario – and more importantly, what filters are going to be implicitly applied to the data the RANKX function actually uses.

This is a helpful post for explaining the ranking function.

All Execution Plans Are Estimates

Grant Fritchey drops a bomb on us:

All these resources, yet, for any given query, all the plans will be identical (assuming no recompile at work). Why? Because they’re all the same plan. Each and every one of them is an estimated plan. Only an estimated plan. This is why the estimated costs stay the same between an estimated and actual plan, this despite any disparity between estimated and actual row counts.

I’ve blogged about this before, but it’s worth mentioning again. There are a only a few minor differences between an estimated plan and an actual plan. It’s all about the data set. What’s going on is that an actual plan can capture query metrics, which are then appended to the estimated plan. At no point is any different plan generated during this process. It’s just a plan, an estimated plan, or, it’s a plan plus query metrics.

Read the whole thing.

Getting A Random Row

Kevin Feasel

2018-03-06

T-SQL

Brent Ozar shares four methods for getting a random row from a table:

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Click through for the other three methods.  The really tricky part is when you want to get a random sample from the table, as TABLESAMPLE is an awful choice for that.

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031