Press "Enter" to skip to content

Curated SQL Posts

Using R In Azure Data Lake Analytics

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There’s a 500MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.
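To make the pattern concrete, here is a minimal, hypothetical U-SQL sketch of what the tutorial describes: the rowset arrives in the R script as a data frame named inputFromUSQL, and whatever you assign to outputToUSQL is returned to U-SQL. The file paths, column names, and the aggregation below are placeholders, not anything from the tutorial itself.

```usql
REFERENCE ASSEMBLY [ExtR];   // the R extensions assembly

DECLARE @myRScript string = @"
# Input rowset arrives as a data frame named inputFromUSQL;
# whatever is assigned to outputToUSQL is returned to U-SQL.
outputToUSQL <- data.frame(
  Par       = inputFromUSQL$Par[1],
  MeanValue = mean(inputFromUSQL$Value))
";

@input =
    EXTRACT Par int, Value double
    FROM "/demo/input.csv"          // hypothetical Data Lake path
    USING Extractors.Csv();

@result =
    REDUCE @input ON Par            // the heavy data munging stays in U-SQL
    PRODUCE Par int, MeanValue double
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");

OUTPUT @result
TO "/demo/output.csv"
USING Outputters.Csv();
```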


Linear Regression With Deducer

Sunil Kappal demonstrates how to use Deducer, a GUI for R, to perform a simple linear regression:

Selecting the variables in the Deducer GUI:

  • Outcome variable: Y, or the dependent variable, should be put on this list

  • As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor levels is correct.

  • As factor: Categorical independent variables (language, ethnicity, etc.).

  • Weights: This option allows the users to apply sampling weights to the regression model.

  • Subset: Helps to define if the analysis needs to be done within a subset of the whole dataset.

Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.


Testing Azure Data Lake Store Performance

Zhen Zeng and Govind Kamat stress test Azure Data Lake Store:

Now that we know the read and write throughput characteristics of a single Data Node, we would like to see how per-node performance scales when the number of Data Nodes in a cluster is increased.

The tool we use for scale testing is the Tera* suite that comes packaged with Hadoop.  This is a benchmark that combines performance testing of the HDFS and MapReduce layers of a Hadoop cluster.  The suite comprises three tools that are typically executed in sequence:

  • TeraGen, the tool that generates the input data.  We use it to test the write performance of HDFS and ADLS.

  • TeraSort, which sorts the input data in a distributed fashion.  This test is CPU-bound and we don’t really use it to characterize the I/O performance of HDFS and ADLS, but it is included for completeness.

  • TeraValidate, the test that reads and validates the sorted data from the previous stage.  We use it to test the read performance of HDFS and ADLS.

It’s an interesting look at how well ADLS scales.  In general, my reading of this is fairly positive for Azure Data Lake Store.


New Management Studio Features

Wayne Sheffield looks at two new features in SQL Server Management Studio 17.3:

For this simple test, it worked pretty well, and it should work well for most of the requirements that you have. Time will tell how reliably this new feature does work.

The Import Flat File feature is available when connecting to SQL Server version 2005 or higher. I haven’t tried this on a lower version, but I don’t see any reason why it wouldn’t work there either. You can read more about this feature in Microsoft’s documentation.

You can definitely break the Import Flat File feature, but I appreciate it being smoother than the SSIS-based wizard of yore.  Wayne also shares his thoughts on the Extended Events Profiler.


Azure Database-Level Firewall Rules And Geo-Replication

Arun Sirpal explains that you don’t need to create database-level firewall rules in Azure on secondary databases when using Active Geo-Replication:

The main purpose of this post today is to discuss this point – If you have an Azure SQL Database involved in Active Geo-Replication and opt to use database-level firewall rules, do you need to create the rules in both the primary and secondary database?

I thought so, but I was wrong. I connect to my primary database and run the following (obfuscated) code.

Read on for Arun’s demonstration.
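If you want to try this before reading his post, here’s a minimal sketch of the idea; the rule name and IP range below are placeholders. Because database-level rules live in the user database itself, geo-replication carries them over, so the same query against the secondary should return the rule you created on the primary.

```sql
-- Run against the PRIMARY database.
EXECUTE sp_set_database_firewall_rule
    @name             = N'ClientAppRule',   -- placeholder name
    @start_ip_address = '203.0.113.10',     -- placeholder range
    @end_ip_address   = '203.0.113.10';

-- Run against both the primary and the secondary to compare.
SELECT name, start_ip_address, end_ip_address
FROM sys.database_firewall_rules;
```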


Converting DATETIME2 To VARBINARY

Randolph West unravels a mystery around byte lengths:

Quite a lot to take in. Let’s break this down.

DATETIME2 is a data type that was introduced in SQL Server 2008. It uses up to 8 bytes to store a date and time: 3 bytes for the date component, and up to 5 bytes for the time component.

The point here is that it uses 8 bytes in total. That’s it. No more.

Jemma noted that when converting the DATETIME2 data type to BINARY, it suddenly became longer by exactly one byte, which seems strange.

Read on for the solution.
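You can reproduce the oddity Jemma noticed with a couple of DATALENGTH calls; a minimal sketch, assuming a DATETIME2(7) value:

```sql
DECLARE @d DATETIME2(7) = SYSDATETIME();

SELECT DATALENGTH(@d)                          AS StorageBytes,  -- 8 bytes as stored
       CAST(@d AS VARBINARY(10))               AS BinaryValue,
       DATALENGTH(CAST(@d AS VARBINARY(10)))   AS BinaryBytes;   -- 9 bytes after conversion
```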


System Objects And Helper Functions

Kenneth Fisher compares and contrasts querying system tables versus using built-in helper functions which query the system tables:

So which should you use? Well, like most things in the database world, it depends. Do you need to cross databases? Do you need to use NOLOCK for a diagnostic query? Do you need additional information available in one of the views that isn’t available in one of the helper functions?

Click through for the comparison.  I agree with his end result, that you should at least know both methods, even if you lean toward one.
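As a quick illustration of the trade-off (the object and schema names below are made up): the helper function is terse, while the catalog view can take a locking hint for diagnostic work and exposes extra columns.

```sql
-- Helper function: short and readable.
SELECT OBJECT_ID(N'dbo.Orders') AS object_id;   -- dbo.Orders is a hypothetical table

-- Catalog view: the same lookup, but you can add NOLOCK for a diagnostic query
-- and pull additional columns such as type_desc and create_date.
SELECT o.object_id, o.type_desc, o.create_date
FROM sys.objects AS o WITH (NOLOCK)
WHERE o.name = N'Orders'
  AND SCHEMA_NAME(o.schema_id) = N'dbo';
```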


Predicting Advertising Budgets With Kafka Streams

Boyang Chen explains how Pinterest uses Kafka Streams to reduce advertising overdelivery:

Overdelivery occurs when free ads are shown for out-of-budget advertisers. This reduces opportunities for advertisers with available budget to have their products and services discovered by potential customers.

Overdelivery is a difficult problem to solve for two reasons:

  1. Real-time spend data: Information about ad impressions needs to be fed back into the system within seconds in order to shut down out-of-budget campaigns.

  2. Predictive spend: Fast, historical spend data isn’t enough. The system needs to be able to predict spend that might occur in the future and slow down campaigns close to reaching their budget. That’s because an inserted ad could remain available to be acted on by a user. This makes the spend information difficult to accurately measure in a short timeframe. Such a natural delay is inevitable, and the only thing we can be sure of is the ad insertion event.

This is a very interesting architectural overview.


R And Python: Two Growing Languages

David Smith notes that as fast as Python is growing, R is as well:

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R developers should switch course and learn Python. My answer to both questions is no.

First, while Python is an excellent general-purpose data science tool, for applications where comparative inference and robust predictions are the main goal, R will continue to be the prime repository of validated statistical functions and cutting-edge research for a long time to come. Secondly, R and Python are both top-10 programming languages, and while Python has a larger userbase, R and Python are both growing rapidly — and at similar rates.

I had a discussion about this last night.  I like the language diversity:  R is more statistician-oriented, whereas Python is more developer-oriented.  They both can solve the same set of problems, but there are certainly cases where one beats the other.  I think Python will end up being the more popular language for data science because of the number of application developers moving into the space, but for the data analysts and academicians moving to this field, R will likely remain the more interesting language.


Polybase And HDInsight

I have a post up on trying to integrate Polybase with HDInsight:

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  ports are not available.  If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.
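For reference, the Polybase side of this is just a few DDL statements; the sketch below is hypothetical (cluster host names, ports, paths, and columns are placeholders) and assumes the networking issues above have already been sorted out.

```sql
CREATE EXTERNAL DATA SOURCE HDICluster WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://namenode-host:8020',                  -- placeholder host; port varies by distribution
    RESOURCE_MANAGER_LOCATION = 'resourcemanager-host:8050'  -- only needed for MapReduce pushdown
);

CREATE EXTERNAL FILE FORMAT CsvFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.SampleHdfsTable
(
    Id INT,
    Payload NVARCHAR(100)
)
WITH
(
    DATA_SOURCE = HDICluster,
    LOCATION = '/data/sample/',
    FILE_FORMAT = CsvFormat
);
```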
