Sample Spark-Submit Config Settings

Leela Prasad shares a few sample configuration settings for Spark-Submit jobs:

Before going further let’s discuss on the below parameters which I have given for a Job.

Click through to see what these do and why Leela chose these settings. The Spark documentation has the full list of settings but it’s good to hear explanations from practitioners.

Datasets In Spark

Ayush Hooda explains the differences between DataFrames and Datasets in Apache Spark:

The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on) similar to an RDD. The benefits are that, unlike RDDs, these transformations are now applied on a structured and strongly typed distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.

Read on for more details and a few examples of how to operate DataFrames and Datasets.

Restoring Databases From Azure

John Morehouse shows how we can restore a database from Azure Blob Storage:

So how do you restore from Azure storage? You do so from an URL.  Let’s take a look!

When you backup a database to Azure, there are two types of blobs that can be utilized, namely page and block blobs.   Due to price and flexibly, it is recommended to use block blobs.  However, depending on which type you used to perform the backup will dictate how the restores are performed.  Both methods require the use a credential, so that information will need to be known before being able to restore from Azure.

Click through for examples using both page blobs and block blobs.

Azure Data Factory Data Flows

Marlon Ribunal shows how we can perform some amount of data transformation in an Azure Data Factory V2 data flow:

Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premise, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.

You can scale out your SSIS implementation in Azure. In fact, there are two (2) options to do this: SSIS On-Premise using the SSIS runtime hosted by SQL Server or On Azure using the Azure-SSIS Integration Runtime.

Azure Data Factory is not quite an ETL tool as SSIS is. There is that transformation gap that needs to be filled for ADF to become a true On-Cloud ETL Tool. The second iteration of ADF in V2 is closing the transformation gap with the introduction of Data Flow.

Despite it not being nearly as complete as SSIS, there are useful data transformations available in Azure Data Factory, as Marlon shows.

Preventing SQL Server Startup With A Simple INI File

Solomon Rutzky is a month early with this:

In the event shown directly above, towards the bottom, in the final “<Data>” element that starts with “\\?\C:\ProgramData...“, that entry does point to a folder containing a Report.wer file. It is a plain text containing a bunch of error dump info, but nothing that would seem to indicate where to even start looking to fix this. And, nothing useful for searching on, at least not as far as my searching around revealed.

There you have it: a nearly untraceable way to prevent SQL Server from starting.

Read on to see how what Solomon did.

SET STATISTICS IO And Automated Statistics Updates

Niko Neugebauer shows us something new in SQL Server 2019:

There has never been such information before!
We are just writing into it!
Why do we have those wonderful 1351498 logical reads ?
Are they actually writes ? And if they would be, would not it be correct to display them as physical accesses ?

The answer is rather simple and actually should have been expected.
We are inserting a big amount of data into an empty table with a Primary Key, which triggers a creation/update of the statistics and those are the reads of the statistics scan operation. 

I hadn’t noticed that, but it is quite interesting.

SQL Server 2019 CTP 2.3 Released

The SQL Server team announces SQL Server 2019 CTP 2.3:

At the SQL bits keynote today, we announced the release of SQL Server 2019 community technology preview 2.3, the fourth in a monthly cadency of preview releases. Previewed in September 2018, SQL Server 2019 is the first release of SQL Server to closely integrate Apache Spark and HDFS with SQL Server in a unified data platform.

There’s not a giant list but there are some interesting items on it. Click through for the full list.

Fixing Orphaned Users In All Databases

Max Vernon shows how we can find orphaned users across all databases in an instance and fix them using dynamic SQL:

The next piece of code helps fix orphaned users by reconnecting them to logins that have precisely the same name, but a differing SID. This code is a variant of the above code that dynamically creates ALTER USER statements. A statement is created for each orphaned user where there is a match-by-name in the list of server logins. Once the list of dynamically created ALTER USERstatements are compiled, the commands to fix orphaned users are automatically executed.

Click through for the scripts.


March 2019
« Feb