Intro To Data Analytics

Stacia Varga has a follow-up to her Gentle Introduction to Data Analytics:

Q: How is data analytics on SAS different from R support for SQL2016?

A: I’ve not used SAS (although I did go to their campus to teach MDX to their engineers once upon a time… beautiful campus!), so I can’t say with any specificity. SAS also has multiple products so it would be difficult to describe differences. In general – and please understand there may be something new in SAS that I don’t know about – I think the difference is that R in SQL 2016 allows you to run functions on data inside the database, thereby leveraging database resources and also server resources for parallelization. I’d love to get input from readers to expand on this topic.

I can confirm that SAS has a beautiful campus.

Logging WhoIsActive Output

Tara Kizer has a primer on storing WhoIsActive outputs for subsequent analysis:

Create a new job and plop the below code into the job step, modifying the first 3 variables as needed. The code will create the logging table if it doesn’t exist, the clustered index if it doesn’t exist, log current activity and purge older data based on the @retention variable.

How often should you collect activity? I think collecting sp_WhoIsActive data every 30-60 seconds is a good balance between logging enough activity to troubleshoot production problems and the storage needed to keep the data in a very busy environment.

I like having something like this in place because often times, when you need these results, it’s already too late.

Enterprise R Security

Kevin Feasel

2016-07-27

R, Security

Ramkumar Chandrasekeran discusses DeployR, an enterprise security model for R:

DeployR Enterprise is designed to deliver analytics solutions at scale to whomever needs it: inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions: Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication, can enforce access privileges already defined by your IT department for existing enterprise users and also have the capability to safely support anonymous users when needed.

There’s nothing groundbreaking here:  it’s TLS (to encrypt network transmissions) and LDAPS (to control authentication and authorization).  That there’s nothing groundbreaking is a good thing—that means companies will have most of the infrastructure in place to support this.

SQL Server 2016 CU1 Bugfixes

Andrew Pruski notes an important bugfix in SQL Server 2016 CU1:

SQL Server 2016 CU1 has been released and one thing I noticed was: –

FIX: Canceling a backup task crashes SQL Server 2014 or 2016

That’s pretty nasty, when I originally clicked on the link I was expecting to see detailed a pretty precise set of circumstances in which that bug can occur but no no, apparently not. Cancelling any backup task can lead to this happening.

Andrew then argues in favor of waiting for SPs before deploying new versions of software, having been burned on it in the past.  I don’t agree with that philosophy; regardless, I recommend reading his post.

Traveling Salesmen

Jesse Clark explains the traveling salesman problem:

One of the canonical questions in operations is the traveling salesman problem (TSP). In its simplest form, we have a busy salesperson who must visit a set number of locations once. Time is money, so the salesperson wants to choose a route that minimizes the total distance traveled. It is not so hard to imagine these path optimization problems occurring within warehouses where people (‘pickers’) need to navigate aisles and fill orders as they go.

The Traveling Salesman Problem is a computer science classic and acts as a classic graph optimization problem.  Check this post out for more details.

Netflix Billing

Subir Parulekar and Rahul Pilani describe how they moved their billing data out of a data center and into AWS:

Now the only (and most important) thing remaining in the Data Center was the Oracle database. The dataset that remained in Oracle was highly relational and we did not feel it to be a good idea to model it to a NoSQL-esque paradigm. It was not possible to structure this data as a single column family as we had done with the customer-facing subscription data. So we evaluated Oracle and Aurora RDS as possible options. Licensing costs for Oracle as a Cloud database and Aurora still being in Beta didn’t help make the case for either of them.
 
While the Billing team was busy in the first two acts, our Cloud Database Engineering team was working on creating the infrastructure to migrate billing data to MySQL instances on EC2. By the time we started Act III, the database infrastructure pieces were ready, thanks to their help. We had to convert our batch application code base to be MySQL-compliant since some of the applications used plain jdbc without any ORM. We also got rid of a lot of the legacy pl-sql code and rewrote that logic in the application, stripping off dead code when possible.
Our database architecture now consists of a MySQL master database deployed on EC2 instances in one of the AWS regions. We have a Disaster Recovery DB that gets replicated from the master and will be promoted to master if the master goes down. And we have slaves in the other AWS regions for read only access to applications.

Read the whole thing.  Their architectural requirements probably won’t be yours (unless you’re working at a company at the scale of Netflix), but it’s quite interesting seeing how they solve their problems.

Built-In Query Monitoring Tools

Grant Fritchey describes a couple built-in options for monitoring query performance:

It’s not enough to know that you have a slow query or queries. You need to know exactly how slow they are. You must measure. You need to know how long they take to run and you need to know how many resources are used while they run. You need to know these numbers in order to be able to determine if, after you do something to try to help the query, you’ll know whether or not you’ve improved performance. To measure the performance of queries, you have a number of choices. Each choice has positives and negatives associated with them. I’m going to run through my preferred mechanisms for measuring query performance and outline why. I’ll also list some of the other mechanisms you have available and tell you why I don’t like them. Let’s get started.

This is an intro-level blog post, so Grant doesn’t go into much detail, but he does provide some good links for getting started.

Upgrades And Regressions

Kendra Little explains when upgrades can cause performance to suffer:

The cluster’s servers and SQL Server configurations were built to be as close to identical as possible to the previous instance (memory, cores, disk, maxdop, CTP, etc).

After the migration, I noticed that CPU utilization jumped from the normal 25% to a consistent 75%.

I did several other migrations with similar server loads with no issues, so I’m a bit puzzled as to what might be going on here. Could the upgrade from SQL Server 2008 R2 to SQL Server 2012 simply be exposing bad queries that 2008 was handling differently?

Kendra goes through a number of reasons, building a troubleshooting guide in the process.  This is a great read.

Hive 2.1 Benchmarks

Kevin Feasel

2016-07-26

Hadoop

Nita Dembla and Gopal Vijayaraghavan compare Hive 2.1 versus Hive 1:

To measure the improvement LLAP brings we ran 15 queries that were taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and data generation tools. The queries there are adapted to Hive SQL but are otherwise not modified from the standard TPC-DS queries using any of the tricks that some big data vendors routinely use to show better performance for their tools. This blog only covers 15 queries but a more comprehensive performance test is underway.

The full test environment is explored below but at a high level the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.

Hive 2.1 looks like a big step forward for Hadoop performance.

Mapping Logins

Constantine Kokkinos shows a built-in stored procedure which shows login mappings to database users:

A great little built in stored procedure to map server level logins to database users named sp_msloginmappings .

It builds one result set per login, so you might be getting a lot of result sets back.

Categories

December 2018
MTWTFSS
« Nov  
 12
3456789
10111213141516
17181920212223
24252627282930
31