Press "Enter" to skip to content

Month: March 2019

Datasets In Spark

Ayush Hooda explains the differences between DataFrames and Datasets in Apache Spark:

The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on) similar to an RDD. The benefits are that, unlike RDDs, these transformations are now applied on a structured and strongly typed distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.

Read on for more details and a few examples of how to work with DataFrames and Datasets.

Restoring Databases From Azure

John Morehouse shows how we can restore a database from Azure Blob Storage:

So how do you restore from Azure storage? You do so from a URL. Let’s take a look!

When you back up a database to Azure, there are two types of blobs that can be utilized, namely page and block blobs. Due to price and flexibility, it is recommended to use block blobs. However, the type of blob you used to perform the backup will dictate how the restore is performed. Both methods require the use of a credential, so that information will need to be known before you can restore from Azure.

Click through for examples using both page blobs and block blobs.
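For reference, here is a minimal sketch of what each restore can look like; the storage account, container, credential names, file paths, and SAS token below are hypothetical placeholders:

-- Block blob backups: the credential is named after the container URL
-- and holds a Shared Access Signature
CREATE CREDENTIAL [https://mystorageacct.blob.core.windows.net/backups]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = 'sv=2018-03-28&sr=c&sig=<sas-token>';   -- SAS token, without the leading ?

RESTORE DATABASE [MyDatabase]
FROM URL = 'https://mystorageacct.blob.core.windows.net/backups/MyDatabase.bak'
WITH STATS = 10;

-- Page blob backups: the credential holds the storage account name and access key,
-- and the restore references it explicitly
CREATE CREDENTIAL [AzureBackupCred]
    WITH IDENTITY = 'mystorageacct',
    SECRET = '<storage account access key>';

RESTORE DATABASE [MyDatabase]
FROM URL = 'https://mystorageacct.blob.core.windows.net/backups/MyDatabase.bak'
WITH CREDENTIAL = 'AzureBackupCred', STATS = 10;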

Azure Data Factory Data Flows

Marlon Ribunal shows how we can perform some amount of data transformation in an Azure Data Factory V2 data flow:

Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premises, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.

You can scale out your SSIS implementation in Azure. In fact, there are two options to do this: SSIS on-premises using the SSIS runtime hosted by SQL Server, or on Azure using the Azure-SSIS Integration Runtime.

Azure Data Factory is not quite the ETL tool that SSIS is. There is a transformation gap that needs to be filled for ADF to become a true on-cloud ETL tool. The second iteration of ADF, V2, is closing that transformation gap with the introduction of Data Flow.

Despite not being nearly as complete as SSIS, Azure Data Factory does offer some useful data transformations, as Marlon shows.

Preventing SQL Server Startup With A Simple INI File

Solomon Rutzky is a month early with this:

In the event shown directly above, towards the bottom, in the final “<Data>” element that starts with “\\?\C:\ProgramData...”, that entry does point to a folder containing a Report.wer file. It is a plain text file containing a bunch of error dump info, but nothing that would seem to indicate where to even start looking to fix this. And nothing useful for searching on, at least not as far as my searching around revealed.

Conclusion
There you have it: a nearly untraceable way to prevent SQL Server from starting.

Read on to see what Solomon did.

SET STATISTICS IO And Automated Statistics Updates

Niko Neugebauer shows us something new in SQL Server 2019:

There has never been such information before!
We are just writing into it!
Why do we have those wonderful 1351498 logical reads?
Are they actually writes? And if they were, would it not be correct to display them as physical accesses?

The answer is rather simple and actually should have been expected.
We are inserting a large amount of data into an empty table with a Primary Key, which triggers a creation/update of the statistics, and those are the reads of the statistics scan operation.

I hadn’t noticed that, but it is quite interesting.
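If you want to try it yourself, here is a minimal sketch of the sort of repro Niko describes, run on SQL Server 2019 (the table name and row count are my own invention):

SET STATISTICS IO ON;

-- An empty table with a primary key, so there is a statistics object to build
CREATE TABLE dbo.StatsIODemo
(
    Id      INT       NOT NULL CONSTRAINT PK_StatsIODemo PRIMARY KEY,
    Payload CHAR(200) NOT NULL
);

-- A large insert into the empty table; the Messages tab now reports logical reads
-- against dbo.StatsIODemo that come from the statistics scan, not from the write itself
INSERT INTO dbo.StatsIODemo (Id, Payload)
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       REPLICATE('x', 200)
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;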

SQL Server 2019 CTP 2.3 Released

The SQL Server team announces SQL Server 2019 CTP 2.3:

At the SQLBits keynote today, we announced the release of SQL Server 2019 community technology preview 2.3, the fourth in a monthly cadence of preview releases. Previewed in September 2018, SQL Server 2019 is the first release of SQL Server to closely integrate Apache Spark and HDFS with SQL Server in a unified data platform.

There’s not a giant list, but there are some interesting items on it. Click through for the full list.

Fixing Orphaned Users In All Databases

Max Vernon shows how we can find orphaned users across all databases in an instance and fix them using dynamic SQL:

The next piece of code helps fix orphaned users by reconnecting them to logins that have precisely the same name, but a differing SID. This code is a variant of the above code that dynamically creates ALTER USER statements. A statement is created for each orphaned user where there is a match-by-name in the list of server logins. Once the list of dynamically created ALTER USER statements is compiled, the commands to fix orphaned users are automatically executed.

Click through for the scripts.
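As a rough sketch of the fix-by-name idea (not Max’s actual script, which walks every database via dynamic SQL), generating the repair commands for the current database could look something like this:

-- Build ALTER USER ... WITH LOGIN commands for users whose name matches a
-- server login but whose SID does not (orphaned, but fixable by name)
SELECT 'ALTER USER ' + QUOTENAME(dp.name)
     + ' WITH LOGIN = ' + QUOTENAME(sp.name) + ';' AS FixCommand
FROM sys.database_principals AS dp
JOIN sys.server_principals AS sp
    ON sp.name = dp.name
WHERE dp.type IN ('S', 'U')        -- SQL and Windows users
  AND sp.type IN ('S', 'U')        -- match only against logins, not server roles
  AND dp.principal_id > 4          -- skip dbo, guest, sys, INFORMATION_SCHEMA
  AND dp.sid <> sp.sid;            -- name matches, SID does not
-- Review the generated commands before running them; Max's version executes
-- them automatically.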

Creating a UDF in KSQL

Saeed Barghi shows us how we can create user-defined functions in Kafka Streams’ KSQL language:

The UDF I want to implement here is DATEADD. If you’re familiar with SQL, you have definitely used it: it takes in a date, adds or subtracts a specified number to or from a specific part of the datetime, and spits out a new datetime.

To implement a User Defined Function (UDF or UDAF), you would need to code your function in Java and then import the jar file into your KSQL server. You can read about the full process here; I point out a couple of things that I believe you should pay attention to:

I think I prefer Spark’s method for UDFs in Spark SQL: create it in Scala and register the function.
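For reference, a quick illustration of the T-SQL DATEADD behavior that such a UDF mimics:

-- DATEADD(datepart, number, date); a negative number subtracts
SELECT DATEADD(DAY, 30, '2019-03-01')    AS ThirtyDaysLater,
       DATEADD(MONTH, -1, SYSDATETIME()) AS OneMonthAgo;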

Kafka Connect Neo4j Sink

Michael Hunger and Andrea Santurbano announce a new sink for Kafka Connect which handles writing to Neo4j:

We’ve been using the work we did for the Kafka sink – Neo4j extension and have made it available via remote connections over our binary Bolt protocol. So you can stream your events from Apache Kafka® directly into Neo4j to create, update and enrich your graph structures. Then it is really up to you what you want to do with the event data.

The events can come from frontend systems, API notifications, other databases or streaming systems like Apache Spark™ and Apache Flink®.

For details on the format and internals, please see our previous article or the documentation for the Neo4j sink.

Read on for details and demos.

Remote Code Execution with CallBimlScriptContent

Bill Fellows describes the ability to execute code in Biml from a remote source with the CallBimlScriptContent function:

CallBimlScriptContent was introduced with the migration from Mist to BimlStudio. Why is this cool? You do not have to use files sitting on your computer as the source for your Biml. As long as you can reconstitute the Biml contents into a string, you can store your scripts wherever you’d like. If you want them in a database, that’s great. Store them in the cloud? Knock yourself out.

As a consultant, the latter is rather compelling. Maybe I’m only licensing my clients to use accelerators during our engagement. If I leave files on the file system after I roll off, or they image my computer and accidentally collect them, I am David fighting Goliath. CallBimlScriptContent is a means to protect myself and my IP. Let’s look at a trivial example. I set a C# string with an empty Package tag (hooray for doubling up my double quotes). Within my Packages collection, I invoke CallBimlScriptContent passing in my Biml content.

Bill’s use case was one I hadn’t thought about, but it does make sense.
