R With Power BI

Kevin Feasel

2016-06-10

Power BI, R

Anindita Basak shows how to integrate R graphics into Power BI:

If you already have R installed on the same system as PowerBI, you just need to paste the R scripts in the code pen. Otherwise you need to install R in the system where you are using the PowerBI desktop like this:

This step-by-step guide features a lot of images and should be pretty easy for a new user.

Skirting Around Dynamic Data Masking

Phil Factor gives a trivial method of subverting dynamic data masking:

Dynamic data masking is a great product and solves some niche problems that come if you need to do certain testing with live data at the application level. You should, however, beware of using it as a database-level security device.

I haven’t yet used it in testing because I don’t have the problem that it solves.

The problem that it solves is for people doing testing, especially user-acceptance testing, of an application using live data. It is good at masking data where the user is unable to make direct access to the database to execute queries.

Phil has code to get around credit card numbers, and I will say that he’s not the first person I’ve seen do this.  Dynamic Data Masking is not a general-purpose security solution.

DBCC Help

Kevin Feasel

2016-06-10

DBCC

Kenneth Fisher describes DBCC Help:

Many years ago during an interview I was asked to “Name the top 7 DBCC commands that you use.” I think I was able to name 3. Ever since then I’ve paid particular attention to DBCC commands in preparation of the day when I hear that question again. Well not to long ago I was watching Erin Stellato’s (b/t) PluralSight course “SQL Server: Understanding and Using DBCC Commands”. Note: It’s a great course and I highly recommend it. In it she goes over quite a few DBCC commands, some documented, some not. In noted two in particular.

The first one I noted gives you a list of the size and free space in the logs of all databases on the instance. Highly useful. I’ve used it repeatedly recently. Unfortunately my memory is not always the best and I can never quite remember the exact name. Which brings us to the other command.

I completely agree that Erin’s Pluralsight course is worth watching.

Aggregating Clickstream Data

Kevin Feasel

2016-06-10

Spark

Ofer Habushi solves a clickstream aggregation problem using Spark:

At this point, an interesting question came up for us: How can we keep the data partitioned and sorted? 

That’s a challenge. When we sort the entire data set, we shuffle in order to get sorted RDDs and create new partitions, which are different than the partitions we got from Step 1. And what if we do the opposite?

Sort first by creation time and then partition the data? We’ll encounter the same problem. The re-partitioning will cause a shuffle and we’ll lose the sort. How can we avoid that?

Partition→sort = losing the original partitioning

Sort→partition = losing the original sort

There’s a solution for that in Spark. In order to partition and sort in Spark, you can use repartitionAndSortWithinPartitions. 

This is an interesting solution to an ever-more-common problem.

Actual Rows Read

Rob Farley goes into detail on the Actual Rows Read property in execution plans:

The warning says “Operation caused residual IO. The actual number of rows read was 2,130, but the number of rows returned was 2.” Sure enough, further up we see “Actual Rows Read” saying 2,130, and Actual Rows at 2.

Whoa! To find those rows, we had to look through 2,130?

You see, the way that the Seek runs is to start by thinking about the Seek Predicate. That’s the one that leverages the index nicely, and which actually causes the operation to be a Seek. Without a Seek Predicate, the operation becomes a Scan. Now, if this Seek Predicate is guaranteed to be at most one row (such as when it has an equality operator on a unique index), then we have a Singleton seek. Otherwise, we have a Range Scan, and this range can have a Prefix, a Start, and an End (but not necessarily both a Start and an End). This defines the rows in the table that we’re interested in for the Seek.

But ‘interested in’ doesn’t necessarily mean ‘returned’, because we might have more work to do. That work is described in the other Predicate, which is often known as the Residual Predicate.

Definitely worth a read.

Think About Recovery Mode

Kendra Little answers a user question about a co-worker who puts all prod databases into Simple recovery mode:

What if Your Coworker Regularly Goes Off the Ranch and Doesn’t Use Change Control?

Don’t cover for them.

Ask them about it first to make sure there wasn’t an emergency change request you’re unaware of for the change, but be honest about what happened when you’re asked.

In other words, treat them as an equal and a grown-up.

Sometimes in this situation, people sugar coat things or cover for the person who makes mistakes. You need to treat them as an adult though.

If you made mistakes, you would own up to what happened and work to not do it again, right? It’s just about respectfully allowing others to own their actions.

Kendra’s answer is a good mixture of technical explanation and not being socially inept.  It’s important to follow that two-pronged approach.

HIBPwned

Kevin Feasel

2016-06-10

R, Security

Steph Locke has a new CRAN package out:

HIBPwned is a feature complete R package that allows you to use every (currently) available endpoint of the API. It’s vectorised so no need to loop through email addresses, and it requires no fiddling with authentication or keys.

You can use HIBPwned to do things like:

  1. Set up your own notification system for account breaches of myriad email addresses & user names that you have

  2. Check for compromised company email accounts from within your company Active Directory

  3. Analyse past data breaches and produce charts like Dave McCandless’ Breach chart

The regular service is extremely useful and Steph’s wrapper looks like it’s worth checking out.

Analysis Of Fantasy Sports Using Spark

Kevin Feasel

2016-06-10

Spark

Jordan Voiz knows how to get to my heart:

Although the data involved is not large in volume, the types of data processing, data analytics, and machine-learning techniques used in this area are common to many Apache Hadoop use cases. So, fantasy sports analytics provides a good (and fun) use case for exploring the Hadoop ecosystem.

Apache Spark is a natural fit in this environment. As a data processing platform with embedded SQL and machine-learning capabilities, Spark gives programmatic access to data while still providing an easy SQL access point and simple APIs to churn through the data. Users can write code in Python, Java, or Scala, and then use Apache Hive, Apache Impala (incubating), or even Cloudera Search (Apache Solr) for exploratory analysis.

Baseball was my introduction to statistics, and I think that fantasy sports is a great way of driving interest in stats and machine learning.  I’m looking forward to the other two parts of this series.

Don’t Ignore .NET Framework 3.5 Just Yet

Shaun J. Stuart notes that database mail apparently still requires .NET Framework 3.5:

However, I did run into one issue later on regarding database mail. I was able to configure it correctly and was even able to send a test email through SSMS without any errors. Unfortunately, the email was never delivered. Additionally, the database mail logs did not show database mail starting or attempting to send the message. I checked with my Exchange administrator and he said he never saw the SQL Server connecting to the mail server.

A check of the database mail table in msdb (sysmail_allitems) showed my messages sitting there with a status of Unsent.

After troubleshooting for a couple of hours and getting nowhere, we installed in the .Net Framework 3.5 and suddenly database mail started working.

This, hopefully, is a bug.  But not getting e-mail alerts you’re expecting to receive can be a scary scenario.

Generating Bulk Import Files

Cody Konior shows us how to create a SQL Server bulk insert format file using Powershell:

Don’t forget there are a few minor security considerations:

  • Your login needs Administer Bulk Operations permission.
  • Your AD account needs access to the file (and possibly delegation enabled for remote shares)
  • Or if using an SQL login the database engine service account needs access to the file.

Anyway now that we have the data in table format without worrying about ordering or duplicate column names, we can much more easily manipulate it and store it into the database.

I remember creating a couple of these by hand, and that was no fun.  I never created enough to get the hang of the syntax or to want to automate the process, but at least I know where to look if I ever have to do this again.

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930