Unit Testing Of Spark Streaming

Felipe Fernandez shows how to unit test Spark Streaming:

Controlling the lifecycle of Spark can be cumbersome and tedious. Fortunately, Spark Testing Baseproject offers us Scala Traits that handle those low-level details for us. Streaming has an extra bit of complexity as we need to produce data for ingestion in a timely way. At the same time, Spark internal clock needs to tick in a controlled way if we want to test timed operations as sliding windows.

This is part one of a series.  I’m interesting in seeing where this goes.

Securing Spark Shuffle

Cheng Xu uses Apache Commons Crypto to secure data when Spark shuffles off to disk:

The basic steps can be described as follows:

  1. When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.

  2. When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using CryptoOutputStream from Crypto.

  3. CryptoOutputStream will encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.

  4. For the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.

Once you have things optimized, the performance hit is surprisingly small.

AutoRestart SSAS Extended Events

Bill Anton looks at the AutoRestart option on Extended Events for Analysis Services:

So how do we handle the scenario where the server is rebooted?

  • Option 1: always remember to restart the trace after server reboots
  • Option 2: create a SQL Agent job to poll for the SSAS service status and start the xEvent trace if its not already running
  • Option 3: write a custom .NET watchdog service to poll for the SSAS service status and start the xEvents trace if its not already running

Those are the options I’ve used or seen used in the past… and to be sure, all of them have their drawbacks in reliability and/or complexity.

…which is why I was so excited when it was brought to my attention that there is an “AutoRestart” option for SSAS xEvents!

Do read the whole thing.

Microsoft R Server On Spark

Kevin Feasel

2016-08-10

Cloud, R, Spark

Max Kaznady, et al, discuss using Microsoft R Server on Spark to perform rapid prototyping against the NYC Taxi dataset:

Once the cluster is created, you can connect to the edge node where MRS is already pre-installed by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials which you supplied during the cluster creation process. In order to do this in MobaXterm, you can go to Sessions, then New Sessions and then SSH.

The default installation of HDI Spark on Linux cluster does not come with RStudio Server installed on the edge node. RStudio Server is a popular open source integrated development environment (IDE) available for R that provides a browser-based IDE for use by remote clients. This tool allows you to benefit from all the power of R, Spark and Microsoft HDInsight cluster through your browser. In order to install RStudio you can follow the steps detailed in the guide, which reduces to running a script on the edge node.

If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.

Azure SQL Threat Detection

Ron Matchoro discusses use cases for Azure SQL Threat Detection:

Thanks to SQL Threat Detection, we were able to detect and fix code vulnerabilities to SQL injection attacks and prevent potential threats to our database. I was extremely impressed how simple it was to enable threat detection policy using the Azure portal, which required no modifications to our SQL client applications. A while after enabling SQL Threat Detection, we received an email notification about ‘An application error that may indicate a vulnerability to SQL injection attacks’.  The notification provided details of the suspicious activity and recommended concrete actions to further investigate and remediate the threat.  The alert helped me to track down the source my error and pointed me to the Microsoft documentation that thoroughly explained how to fix my code.  As the head of IT for an information technology and services company, I now guide my team to turn on SQL Auditing and Threat Detection on all our projects, because it gives us another layer of protection and is like having a free security expert on our team.”

Anything which helps kill SQL injection for good makes me happy.

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I know think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM.  I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.

Considerations For Azure SQL Database

Kevin Feasel

2016-08-10

Cloud

Grant Fritchey discusses whether new database administrators might want to start with Azure SQL Database rather than on-premises SQL Server:

Since you are right at the start of your career, you may as well plan on maximizing the life of the knowledge and skills you’re building. By this, I mean spend your time learning the newest and most advanced software rather than the old approach. Is there still work for people who only know SQL Server 2000? Sure. However, if you’re looking at the future, I strongly advocate for going with online, cloud-based systems. This is because, more and more, you’re going to be working with online, connected, applications. If the app is in the cloud, so should the data be. Azure and the technologies within it are absolutely the cutting edge today. Spending your limited learning time on this technology is an investment in your future.

This answer is a tougher call for me.  Looking at new database developers (or development DBAs or database engineers or whatever…), I think the case is pretty solid:  there’s so much skill overlap that it’s relatively easy to move from Azure SQL Database to on-prem.  With production DBAs, the story’s a little different:  as Grant mentions, this is a Platform as a Service technology, and so the management interface is going to be different.  There are quite a few commonalities (common DMVs, some common functionality), but Grant gives a good example of something which is quite different between the PaaS offering and the on-prem offering:  database backup and restoration.  I think the amount of skills transfer is lower, and so the question becomes whether the marginal value of learning PaaS before IaaS/on-prem is high enough.  Given my (likely biased) discussions of Azure SQL Database implementations at companies, I’d stick with learning on-prem first because you’re much more likely to find a company with an on-prem SQL Server installation than an Azure SQL Database.

Filtering Data

Slava Murygin shows various ways to filter data, in particular data from fn_dblog():

Would say we need to extract an information associated with an “UPDATE” for LSNs started at “0000004f:00000087:0001”. You can just specify Starting and Ending LSNs as “fn_dblog” parameters:

[…]

That portion of code would return you ONLY Log records between LSNs “0000004f:00000087:0001″ and “0000004f:00000088:0001″.

Slava’s post uses fn_dblog() as an example but the techniques are applicable across the board, and in practice sum up to “get the fewest number of rows and fewest number of columns you need to solve the problem at hand.”

Identifying Object Names Using fn_dblog()

Frank Gill digs into the transaction log to find object names used in CREATE and ALTER statements:

This is the object id of the view that was created.  So, Jes’s question was answered.  But this led me to one of my other favorite SQL Server topics: string manipulation.  The following script will identify all transactions for a particular Transaction Name and return the object name affected.  The comments provide additional information about the functionality.

Click through to check out Frank’s script.

Joins Versus NOT IN Clause

Kevin Hill explains a potential performance difference between using NOT IN and using a left join:

Basic stuff, right?   Both will return 951 records (books) that I do not own.  And, very quickly…because the tables are tiny.   Sub-1 second is fast.

The issue here is HOW the rows are compared.

English version now, techy stuff later:

In the first query, this is equivalent to you standing at the bookstore and calling home to have someone check to see if the book in your hand is already in your collection.  EVERY time.  One by one.

In the second, you got really smart and brought a list with you, which you are comparing to the books on the shelf at the store.   You’ve got both “lists” in one place, so it is far more efficient.

Even in the case with a few hundred records, you can see why there’d be a performance difference.

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031