
Month: August 2018

Faster User-Defined Functions In SparkR

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R on a SparkDataFrame:

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect() are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck, which is row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from vectorization is more pronounced for larger datasets.
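For a sense of how these APIs look in practice, here is a minimal SparkR sketch of dapply() and gapply(); the data frame, the doubling function, and the grouping column are my own illustration rather than anything from the Databricks post:

library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply(): run an R function on each partition; the output schema must be supplied
doubled <- dapply(df,
                  function(pdf) {        # pdf is a plain R data.frame holding one partition
                    pdf$mpg <- pdf$mpg * 2
                    pdf
                  },
                  schema(df))            # same columns as the input in this case
head(doubled)

# gapply(): run an R function per group (the grouping key plus that key's rows)
avgMpg <- gapply(df, "cyl",
                 function(key, pdf) {
                   data.frame(cyl = key[[1]], avg_mpg = mean(pdf$mpg))
                 },
                 structType(structField("cyl", "double"),
                            structField("avg_mpg", "double")))
head(avgMpg)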

It looks like they get some pretty serious gains from this change.

Comments closed

Subsetting Matrices In R

Dave Mason continues his look at matrices in R:

We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty.

> #Points scored by Kendrick Perkins.
> points_scored_by_quarter[1,]
1st 2nd 3rd 4th 
  2   2   6   0 
> points_scored_by_quarter["Perkins",]
1st 2nd 3rd 4th 
  2   2   6   0 

Conversely, we can extract a column from a matrix. Specify the column within the square brackets [ ] and omit the row. The result is a vector, hence the pivot effect: the row names are displayed in the output (not the column name).
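To round out the column case, here is a small self-contained sketch; the second player and his numbers are made up for illustration:

points_scored_by_quarter <- matrix(c(2, 2, 6, 0,
                                     7, 5, 9, 4),
                                   nrow = 2, byrow = TRUE,
                                   dimnames = list(c("Perkins", "Allen"),
                                                   c("1st", "2nd", "3rd", "4th")))

points_scored_by_quarter[, "1st"]   # extract a column by name
points_scored_by_quarter[, 1]       # or by position; both return a named vector
# Perkins   Allen 
#       2       7 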

Dave points out that working with matrices is basically an extension of working with vectors.

Comments closed

Database Ownership Chaining On Azure SQL Managed Instances

Jovan Popovic shows that you can enable database ownership chaining on Azure SQL Managed Instances:

If you have the same owner on several objects in several databases, and you have some stored procedure that accesses these objects, you don’t need to GRANT access permission on every object that the procedure needs to access. If the procedure and the objects have the same owner, you can just GRANT permission on the procedure and the Database Engine will allow the procedure to access all other objects that share the same owner.

In this example, I will create two databases that have the same owner, plus a login that will be used to access the data. One database will have a table and the other database will have a stored procedure that reads data from the table in the first database. The login will be granted permission to execute the stored procedure, but not to read data from the table:
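Jovan's post has the full T-SQL; as a rough sketch of the same setup driven from R via DBI and odbc (the instance name, credentials, and all object names are placeholders), it looks something like this:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver = "ODBC Driver 17 for SQL Server",
                 Server = "my-instance.database.windows.net",   # placeholder
                 UID    = "sqladmin",
                 PWD    = Sys.getenv("MI_PASSWORD"))

# Two databases created by the same login (so they share an owner),
# chaining enabled on both, and a login that will only run the procedure
for (stmt in c("CREATE DATABASE TableDb;",
               "CREATE DATABASE ProcDb;",
               "ALTER DATABASE TableDb SET DB_CHAINING ON;",
               "ALTER DATABASE ProcDb SET DB_CHAINING ON;",
               "CREATE LOGIN ReportLogin WITH PASSWORD = 'Str0ngPassw0rd!123';"))
  dbExecute(con, stmt)

# A table in one database
dbExecute(con, "USE TableDb; CREATE TABLE dbo.Sales (Id int, Amount money);")
dbExecute(con, "USE TableDb; INSERT dbo.Sales VALUES (1, 10.00);")

# The login needs a user in both databases, but gets no table permissions here
dbExecute(con, "USE TableDb; CREATE USER ReportUser FOR LOGIN ReportLogin;")

# A procedure in the other database that reads across the database boundary;
# CREATE PROCEDURE must start its batch, so connect straight to ProcDb
conProc <- dbConnect(odbc::odbc(),
                     Driver = "ODBC Driver 17 for SQL Server",
                     Server = "my-instance.database.windows.net",
                     UID = "sqladmin", PWD = Sys.getenv("MI_PASSWORD"),
                     Database = "ProcDb")
dbExecute(conProc,
          "CREATE PROCEDURE dbo.GetSales AS SELECT Id, Amount FROM TableDb.dbo.Sales;")

# EXECUTE on the procedure only, with no SELECT permission on the table itself
dbExecute(conProc, "CREATE USER ReportUser FOR LOGIN ReportLogin;")
dbExecute(conProc, "GRANT EXECUTE ON dbo.GetSales TO ReportUser;")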

“Can” and “should” here probably have different answers.  Far better to set up certificates for granting rights.

Comments closed

A Minimalist Guide To Using SQL Server On Linux

Mark Litwintschik has a quick guide to installing and using SQL Server 2017 on Ubuntu:

SQL Server is Microsoft’s enterprise relational database offering. It was first released in 1989 and has seen support on various Windows and OS/2 platforms since its release. In October 2017, Microsoft released SQL Server 2017 for Linux. To date, Ubuntu 16, Red Hat Enterprise Linux 7.3 and 7.4, as well as SUSE Enterprise Linux Server v12 are supported.

Though the Linux distribution is missing features found in the Windows offering, the result is a very useful and feature-rich database that fits in well in a UNIX environment.

In this post I’ll walk through setting up SQL Server 2017, performing basic data import and export tasks as well as building reports via Jupyter Notebook and automating tasks using Apache Airflow.

Mark calls this a minimalist guide, but he does cover a lot of the basics.

Comments closed

Mining The Plan Cache, Query Store, And More

Erin Stellato shows the benefit of digging through the plan cache, Query Store, and third-party performance tool databases (using SentryOne’s SQL Sentry as an example):

As much as I love all this extra data, it’s important to note that some information is more relevant for an actual execution plan, versus an estimated one (e.g. tempdb spill information). Some days we can capture and use the actual plan for troubleshooting, other times we have to use the estimated plan. Very often we get that estimated plan – the plan that has been used for problematic executions potentially – from SQL Server’s plan cache. And pulling individual plans is appropriate when tuning a specific query or set of queries. But what about when you want ideas on where to focus your tuning efforts in terms of patterns?

The SQL Server plan cache is a prodigious source of information when it comes to performance tuning, and I don’t simply mean troubleshooting and trying to understand what’s been running in a system. In this case, I’m talking about mining information from the plans themselves, which are found in sys.dm_exec_query_plan, stored as XML in the query_plan column.

When you combine this data with information from sys.dm_exec_sql_text (so you can easily view the text of the query) and sys.dm_exec_query_stats (execution statistics), you can suddenly start to look for not just those queries that are the heavy hitters or execute most frequently, but those plans that contain a particular join type, or index scan, or those that have the highest cost. This is commonly referred to as mining the plan cache, and there are several blog posts that talk about how to do this. My colleague, Jonathan Kehayias, says he hates to write XML yet he has several posts with queries for mining the plan cache:
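To give a flavor of that combination, here is a hedged sketch run from R via DBI and odbc (the connection details are placeholders) that pulls the top CPU consumers whose cached plans contain an index scan:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "master",
                 Trusted_Connection = "Yes")

scans <- dbGetQuery(con, "
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       st.text              AS query_text,
       qs.execution_count,
       qs.total_worker_time,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qp.query_plan.exist('//RelOp[@PhysicalOp=\"Index Scan\"]') = 1
ORDER BY qs.total_worker_time DESC;")

head(scans)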

It’s a good article with a lot of useful information.

Comments closed

Creating Indexed Views

Eduardo Pivaral shows how to create a fairly simple indexed view:

Views help our query writing by saving us from writing the same statements and/or aggregations over and over again, but they have a drawback: views just store our query definition, so performance is not improved by using them.

Since SQL Server 2008, we have had the option to create an index over a view. Of course, there are some limitations, but if your view can work within them, the performance improvement could be great!

I will show you how to create a simple index over a view.
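Eduardo has the full walkthrough; as a minimal self-contained sketch run from R via DBI and odbc (the server, table, and view names are all made up), the pattern is:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "TestDb",
                 Trusted_Connection = "Yes")

# A small base table to aggregate over
dbExecute(con, "CREATE TABLE dbo.Sales
                (SaleId int NOT NULL PRIMARY KEY,
                 CustomerId int NOT NULL,
                 Amount decimal(10,2) NOT NULL);")

# The view must be schema-bound, use two-part names, and include COUNT_BIG(*)
# because it contains a GROUP BY
dbExecute(con, "CREATE VIEW dbo.vw_SalesByCustomer
                WITH SCHEMABINDING
                AS
                SELECT CustomerId,
                       SUM(Amount)  AS TotalAmount,
                       COUNT_BIG(*) AS RowCnt
                FROM dbo.Sales
                GROUP BY CustomerId;")

# The unique clustered index is what materializes the view
dbExecute(con, "CREATE UNIQUE CLUSTERED INDEX CIX_vw_SalesByCustomer
                ON dbo.vw_SalesByCustomer (CustomerId);")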

I’ve used indexed views in the past, but they’re a much less common tool in the belt these days.

Comments closed

Finding The Last Database Restore Time

Lori Brown shows how you can see the last time a particular database was restored:

I have a client that uses a lot of disconnected log shipping on a few servers.  They do this because they want to have a copy of their database that is actually hosted by software vendors for reporting purposes.  So, disconnected log shipping it is!  We pull down files that have been FTP’d to the corporate FTP site and apply those logs locally and leave the databases in standby so that they can be queried.

(If you want to review how to set up disconnected log shipping complete with FTP download scripts, all SQL jobs and log shipping monitoring, then hop on over to https://www.sqlrx.com/setting-up-disconnected-log-shipping/ to check it out!)

Of course every so often there is a glitch and one or all databases can get out of synch pretty easily.  When we receive a notification that tlogs have not been restored in a while, the hunt is on to see what happened and get it corrected.  Since I have disconnected log shipping set up, I can’t get a log shipping report from SSMS that will tell me errors and what was the last log applied.  So, I made a query that will give me the most recent file that was restored along with the file name, date, database, etc.  I can pull back all databases or by uncommenting a line can filter by a single or multiple databases.
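Lori's script is the one to grab; as a rough idea of where that information lives, a sketch along these lines (run from R via DBI and odbc, with connection details as placeholders) joins msdb's restore history to the backup files that were applied:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "msdb",
                 Trusted_Connection = "Yes")

restores <- dbGetQuery(con, "
SELECT rh.destination_database_name,
       rh.restore_date,
       rh.restore_type,                      -- D = database, L = log
       bmf.physical_device_name AS backup_file
FROM msdb.dbo.restorehistory AS rh
JOIN msdb.dbo.backupset AS bs
    ON bs.backup_set_id = rh.backup_set_id
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
-- WHERE rh.destination_database_name IN (N'MyDatabase')  -- optional filter
ORDER BY rh.restore_date DESC;")

head(restores)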

Lori also includes a helpful script.

Comments closed

What Happens With Data Compression + Backup Compression

Jess Pomfret tests what happens when you enable backup compression for databases with already-compressed tables in SQL Server:

What happens if I use data compression and backup compression, do I get double compression?

This is a great question, and without diving too deeply into how backup compression works I’m going to do a simple experiment on the WideWorldImporters database.  I’ve restored this database to my local SQL Server 2016 instance and I’m simply going to back it up several times under different conditions.

After restoring the database it’s about 3GB in size, so our testing will be on a reasonably small database.  It would be interesting to see how the results change as the database size increases, perhaps a future blog post.
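Without spoiling Jess's results, the shape of the test is roughly the following; this is a sketch run from R via DBI and odbc, with the backup paths and connection details as placeholders:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost", Database = "master",
                 Trusted_Connection = "Yes")

# One backup without backup compression, one with it
dbExecute(con, "BACKUP DATABASE WideWorldImporters
                TO DISK = N'C:\\Backups\\WWI_NoCompression.bak'
                WITH INIT, NO_COMPRESSION;")
dbExecute(con, "BACKUP DATABASE WideWorldImporters
                TO DISK = N'C:\\Backups\\WWI_Compression.bak'
                WITH INIT, COMPRESSION;")

# msdb records both the raw and the compressed size of each backup
dbGetQuery(con, "
SELECT TOP (2)
       backup_finish_date,
       backup_size / 1048576.0            AS backup_size_mb,
       compressed_backup_size / 1048576.0 AS compressed_size_mb
FROM msdb.dbo.backupset
WHERE database_name = N'WideWorldImporters'
ORDER BY backup_finish_date DESC;")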

Click through for the answer.

Comments closed

No Curation Today

I’m somewhere deep in South Dakota at this point, so no curation today.  We will be back on the regular schedule Monday.

South Dakota at dusk.
Comments closed

Last-Click Attribution With Databricks Delta

Caryl Yuhas and Denny Lee give us an example of building a last-click digital marketing attribution model with Databricks Delta:

The first thing we will need to do is to establish the impression and conversion data streams.   The impression data stream provides us a real-time view of the attributes associated with those customers who were served the digital ad (impression) while the conversion stream denotes customers who have performed an action (e.g. click the ad, purchased an item, etc.) based on that ad.

With Structured Streaming in Databricks, you can quickly plug into the stream, as Databricks supports direct connectivity to Kafka (Apache Kafka, Apache Kafka on AWS, Apache Kafka on HDInsight) and Kinesis, as noted in the following code snippet (this is for impressions; repeat this step for conversions).
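The post's snippets are in Scala and Python; a rough SparkR equivalent might look like the following, where the broker address, topic name, and paths are placeholders and the delta sink assumes a Databricks Runtime with Delta available:

library(SparkR)
sparkR.session()

# Plug into the impressions topic; repeat with the conversions topic
impressions <- read.stream("kafka",
                           kafka.bootstrap.servers = "kafka-broker:9092",
                           subscribe = "impressions")

# Kafka delivers key/value as binary, so cast the payload to string for parsing
impressions <- selectExpr(impressions, "CAST(value AS STRING) AS raw", "timestamp")

# Persist the stream as a Delta table for the attribution logic downstream
query <- write.stream(impressions, "delta",
                      path = "/delta/impressions",
                      checkpointLocation = "/delta/_checkpoints/impressions",
                      outputMode = "append")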

This is definitely an interesting approach to the problem.  Check it out.

Comments closed