Convenience Functions In wrapr

Kevin Feasel



John Mount walks us through some of the language conveniences available in the wrapr library:

wrapr supplies additional q*() methods.

  • qae() “quote assignment expression” where both sides of assignments is taken as un-evaluated. I.e.: qae(x = 5+1) yields c(‘x’ = ‘5 + 1’) regardless if x is bound or unbound in the environment. This is a bit of a complement to := which looks-up bindings/references (i.e.: x = "z"; x := 5+1 returns c(‘z’ = ‘6’)).

  • qe() “quote expressions” for quoting complex expressions. Similar to quote(), except it returns a list of strings (not a language object). The qe()method is not as interesting to end-users as the other methods mentioned, it is designed to help in implementation of methods that take a non-assignment expression or list of expressions such as rquery::select_rows_nse().

Read the whole thing.  := probably gives the most obvious immediate benefit but the whole set seems useful.

Installing Spark On Windows

Nigel Meakins is starting a new series on Spark and his first post involves installing Spark on Windows:

WinUtils provides a number of HDFS-emulating utilities that allow us to run Spark as though it were talking to an HDFS storage system (at least to a certain degree). Without this you will get all manner of file system-related issues wit Spark and won’t get off the launchpad.

Within the WinUtils archive you may have a number of Hortonworks Data Platform versioned folders. For the version of Spark I’m using, being 2.2.1, I have chosen hadoop-2,7,1\bin for my files. Unzip and copy the contents of the bin directory to a directory of your choice. It must however be called ‘bin’ in order to be located by the calling programs. I actually placed mine in the C:\Spark\bin directory together with the other executables that Spark uses but this is not essential.

Once done, you will need to set the following environment variable:

HADOOP_HOME = <your winutils ‘bin’ parent directory>

Note we don’t include the \bin, so for my example this is C:\Spark.

I have a post on installing Spark on Windows that might help if you get stuck on the WinUtils part.

The Benefits And Risks Of Sharing In Power BI

Steve Hughes explains the implications of sharing a Power BI report:

The primary reason to use share is to distribute content outside the context of a Power BI App. Power BI Apps should be your first mechanism for sharing content within your organization. It requires more thought and planning which is typically a good idea with your companies data. However, there are times when sharing makes sense. With the ability to share reports, you can limit sharing to specific areas. Also, you may want to create a “one-off” report for use in decision making but not something to be deployed in the long term.

Sharing is very different from deploying Apps. App deployment is not that difficult to do, but prevents sharing and is much easier to manage access.

Read on for Steve’s thoughts, including his hesitancy toward sharing en masse.

Data Migration And Visualization With Data Factory And Data Lake

Matt Basile has a video which shows him taking raw data in S3, moving it to Azure Data Lake Storage using Azure Data Factory, and then visualizing it with Power BI:

While this seems like a lot of parts just to copy a few files, it’s important to note I only scratched the surface of what ADF can do.  Think of ADF as an airline company that manages and enables cargo (data) movement between cities (data sources).  A pipeline represents the overall goal of moving certain cargo from one city to another. The linked service is the airport, which provides a landing point and access control for the cities. The dataset is the list of cargo to move, the activity is the flight itself, while the integration runtime is the airport infrastructure that makes the cargo movement possible.  A single pipeline requires all these objects to run successfully; however, many pipelines can use these same objects to complete different tasks.  Once you’ve created these data factory objects, it is straightforward to layer on additional functionality or more pipelines. ADF also has visual tools that make building these objects a breeze – to build my pipeline, all I had to do was click on “Copy data” in the visual tools start menu and follow the steps provided.

Matt has a video demonstrating the process as well.

Quick Date Formatting With CONVERT

Dave Mason lists the common date formats available with the CONVERT fucntion:

Displaying dates and times with different formats in TSQL is a task I run into quite a bit. I used to visit this page so many times, I’m surprised it doesn’t have a “Welcome back, Dave!” banner on it at the top.  After umpteen million times, I decided it was time to be more efficient. I created this query that’s come in handy numerous times. I considered dumping it into a view, but I’ve found it’s nice to copy/paste the CONVERT statement (directly from a script) and replace CURRENT_TIMESTAMP with whatever column I want to have formatted.

Click through for the script and sample output.

Using The Cosmos DB Emulator

Cedric Charlier shows off how to use the Azure Cosmos DB emulator to create and query a graph database:

When developing on Azure Cosmos DB, Microsoft’s globally distributed, horizontally partitioned, multi-model database service, it’s useful to use the local emulator. At the moment the Web Interface of the data explorer is oriented for the SQL API and it sounds impossible to create a graph query on the emulator. This is false, the UI is not aware of these features but the engine supporting the Graph API.

You can access the emulator engine by the means of the .Net librairies. Create a new console project in Visual Studio and add the package Microsoft.Azure.Graphs. This package is in preview, you’ll need to specify to the package manager that you’re accepting the previews.

Read on to learn more.

Inlining Scalar Functions

Brent Ozar covers a Microsoft paper which looks exciting:

Froid replaces the scalar UDF operator in the calling query with the newly constructed relational expression as a scalar sub-query.

That one statement comes with a lot of interesting gotchas that they discuss throughout the paper:

  • You won’t see the scalar function in the plan (just like we don’t currently see single-statement inline table-valued functions by name in the plan – we just see their effects, kinda like views)

  • The function and the underlying tables might have different permissions (you could have permissions on the function but not the tables, or vice versa, which makes compilation & execution a little trickier)

  • Code that doesn’t get used can just get removed outright (just like SQL Server can do join elimination)

  • Costs and row estimates are now useful inside the plan

To the extent that this works (and I hope it does), it helps fulfill the promise of SQL Server 2000, with encapsulation of code.  Today, one of the easiest big performance gains I can give is to strip something out of a user-defined function and inline it.

Cross-Server Partition Elimination

Derik Hammer shows an example of “old-style” partitioning across servers:

SQL Server has a feature for partitioning tables and indexes. Partitioning can be implemented at many levels, however. You can create multiple tables for one logical data set, you can split the set into multiple databases, and you can even split it among different servers. Sharding is another term. It refers to partitioning data to horizontally scale out compute resources and storage.

There are different methods of handling sharding. Each of them need a central point of connection to handle querying the data on the other shards. This is typically called the control node. The method I am to discuss today is one where linked servers are used to connect the various shards.

This is useful for something like offloading old invoices which you rarely need to a separate server.  Derik also shows that the optimizer can, if it knows your partitioning rules, avoid unnecessary cross-server hits.


January 2018
« Dec Feb »