Debugging Spark Code

Kevin Feasel



Vida Ha has an article on troubleshooting when writing code using the Spark APIs:

When working with large datasets, you will have bad input that is malformed or not as you would expect it. I recommend being proactive about deciding for your use case, whether you can drop any bad input, or you want to try fixing and recovering, or otherwise investigating why your input data is bad.

A filter command is a great way to get only your good input points or your bad input data (If you want to look into that more and debug). If you want to fix your input data or to drop it if you cannot, then using a flatMap() operation is a great way to accomplish that.

This is a good set of tips.
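To make the filter-versus-flatMap distinction concrete, here is a minimal sketch in plain Python rather than Spark itself. The record format (`name,age`), the `yrs` repair rule, and the helper names `is_valid`, `parse_or_drop`, and `flat_map` are all assumptions for illustration and do not come from the article. In PySpark the analogous calls would be `rdd.filter(...)` and `rdd.flatMap(...)`.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, int]


def is_valid(line: str) -> bool:
    """Hypothetical check: a good line is 'name,age' with a whole-number age."""
    parts = [p.strip() for p in line.split(",")]
    return len(parts) == 2 and parts[0] != "" and parts[1].isdecimal()


def parse_or_drop(line: str) -> List[Record]:
    """Return [record] when the line parses or can be repaired, [] to drop it."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or parts[0] == "":
        return []
    name, age = parts
    # Example repair (an assumption): tolerate a trailing "yrs" on the age.
    if age.endswith("yrs"):
        age = age[:-3].strip()
    return [(name, int(age))] if age.isdecimal() else []


def flat_map(func: Callable[[str], Iterable[Record]],
             items: Iterable[str]) -> List[Record]:
    """Plain-Python stand-in for flatMap: apply func, then flatten one level."""
    return [out for item in items for out in func(item)]


lines = ["alice,34", "bob,41yrs", "garbage", "carol,", ",22"]

# filter-style: split the input into good rows and bad rows to inspect later.
good = [line for line in lines if is_valid(line)]
bad = [line for line in lines if not is_valid(line)]

# flatMap-style: fix what can be fixed, silently drop the rest.
recovered = flat_map(parse_or_drop, lines)
```

The point of returning a list from `parse_or_drop` is that zero, one, or many outputs per input all flatten the same way, which is why flatMap suits "fix it or drop it" better than a plain map.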

Cloudera, Polybase, And Active Directory

Ajay Jagannathan shows how to integrate a SQL Server instance running Polybase with a Cloudera Hadoop cluster, using Active Directory for all accounts:

For all usernames and principals, we will use the suffixes like Cluster14 for name-scalability.

  1. Active Directory setup:
  • Create a new Organizational Unit for Hadoop users in AD say (OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM).
  • Create a hdfs superuser : [email protected]
  • Cloudera Manager requires an Account Manager user that has privileges to create other accounts in Active Directory. You can use the Active Directory Delegate Control wizard to grant this user permission to create other users by checking the option to “Create, delete and manage user accounts”. Create a user [email protected] in OU=Hadoop, OU=CORP, DC=CONTOSO, DC=COM as an Account Manager.
  2. Install OpenLDAP utilities (openldap-clients on RHEL/Centos) on the host of Cloudera Manager server. Install Kerberos client (krb5-workstation on RHEL/Centos) on all hosts of the cluster. This step requires internet connection in Hadoop server. If there is no internet connection in the server, you can download the rpm and install.

This is absolutely worth the read.

Automated Emails

Allison Tharp shows how to send automated e-mails with PowerShell:

The update has two parts: how I feel about my work and how I feel about my department.  For each of these, I wrote a few ‘beginning’ sentences and a few ‘ending’ sentences.  The script picks a random beginning and ending sentence for each category (work and department), color codes it, and sends the email to my personal and my work emails.

I love the randomization.
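The original script is PowerShell; the following is only a rough Python sketch of the same idea, not a port of it. The sentence pools, colors, addresses, and the function names `build_update` and `build_message` are hypothetical. The sketch builds the message but does not send it.

```python
import random
from email.message import EmailMessage
from typing import List

# Hypothetical sentence pools; the post's real sentences are not reproduced.
BEGINNINGS = {
    "work": ["Today my work felt", "This week my projects were"],
    "department": ["My department seems", "The team is"],
}
ENDINGS = {
    "work": ["productive.", "a bit of a slog."],
    "department": ["in good shape.", "stretched thin."],
}
COLORS = {"work": "green", "department": "blue"}  # assumed color coding


def build_update(rng: random.Random) -> str:
    """Pick one random beginning and ending per category, color coded."""
    paragraphs = []
    for category, color in COLORS.items():
        sentence = (f"{rng.choice(BEGINNINGS[category])} "
                    f"{rng.choice(ENDINGS[category])}")
        paragraphs.append(f'<p style="color:{color}">{sentence}</p>')
    return "\n".join(paragraphs)


def build_message(html_body: str, sender: str,
                  recipients: List[str]) -> EmailMessage:
    """Wrap the HTML body in a message addressed to every recipient."""
    msg = EmailMessage()
    msg["Subject"] = "Daily update"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("This update is formatted as HTML.")  # plain-text fallback
    msg.add_alternative(html_body, subtype="html")
    return msg


message = build_message(
    build_update(random.Random()),
    "me@example.com",
    ["personal@example.com", "work@example.com"],
)
# Sending is left out on purpose; with a reachable SMTP server it would be
# something like smtplib.SMTP(host).send_message(message).
```

Passing a `random.Random` instance in, rather than calling the module-level functions, makes it easy to seed the generator when you want repeatable output.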

Deployment Contributors

Richie Lee discusses an alternative to pre-model scripts:

According to the blurb, deployment contributors can perform custom actions when deploying a SQL script. And one such use of deployment contributors would be to alter index builds to be an online operation. Microsoft also have a Github DACExtensions repo, and this is very useful because, and in the interests of full disclosure, I have never written a deployment contributor myself. This is partly because the repo has some very good examples, including the online index issue (this post nicely covers how to make use of deployment contributors.) I know those that have and have explained how they work very well. But I think there are a few challenges w/r/t deployment contributors:

  • No one has ever heard of them

  • You have to use C#

  • They’re not entirely straightforward.

This is a good discussion of deployment contributors, including why we don’t see them more frequently.

Force-Directed Graphs

Devin Knight’s series on Power BI custom visuals continues with the force-directed graph:

Key Takeaways

  • Shows relationships between different entities in your data.

  • The width of the line that separates each entity represents the strength of the relationship.

Click through for the discussion and a video.  I’m not too sure that I’d use this in a real dashboard, but it’s available.

Subqueries And Performance

Grant Fritchey busts a myth:

I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.

Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.

There are times not to use subqueries, but this post is absolutely correct:  understand the reasons why things may or may not perform well, and don’t be afraid to try things out.
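In the spirit of trying things out, here is a small sketch that runs the same question as a subquery and as a join, then prints each plan. It uses Python's built-in SQLite module purely so it runs anywhere; the tables and data are made up, and SQLite's optimizer is not SQL Server's, so this shows the method of comparison rather than anything about the plans in the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 3, 12.5);
""")

# Customers with at least one order, written with a subquery...
subquery_sql = """
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders)
    ORDER BY name
"""
# ...and written with a join.
join_sql = """
    SELECT DISTINCT c.name FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

# Look at the plans instead of assuming one form is slower.
for label, sql in (("subquery", subquery_sql), ("join", join_sql)):
    print(label)
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", row[-1])  # last column holds the plan step description
```

Both forms return the same rows here; whether one outperforms the other depends on the engine, the indexes, and the data, which is exactly why it pays to check the plan.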


Kenneth Fisher discusses synchronous versus asynchronous in programming terms:

Synchronous – Code that runs one line at a time. Each line of code is completed before the next one starts. If an external call is made then it is completed before the next line of code runs.

Asynchronous – Code that is launched and runs separately from the initial code. If a SQL job is launched from inside a batch of code (using sp_start_job for example) then the job is running in parallel (at the same time as) to the remainder of the batch of code.

Understanding which operations are synchronous versus asynchronous, and which operations are blocking versus non-blocking versus semi-blocking, will do wonders for improving application performance.
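As a rough illustration of the two definitions, here is a sketch using a Python thread in place of a SQL Agent job. The `job` function and the `gate` event are assumptions for the demo; the gate stands in for long-running work so that the ordering is deterministic.

```python
import threading
from typing import List, Optional

log: List[str] = []


def job(name: str, gate: Optional[threading.Event] = None) -> None:
    """Pretend unit of work; waits on the gate to simulate a long-running job."""
    if gate is not None:
        gate.wait()
    log.append(f"{name} finished")


# Synchronous: the call completes before the next line runs.
job("sync call")
log.append("line after sync call")

# Asynchronous: the job is launched and the batch carries on without it,
# much as a batch continues after sp_start_job returns.
gate = threading.Event()
worker = threading.Thread(target=job, args=("async job", gate))
worker.start()
log.append("line after async launch")  # runs while the job is still pending

gate.set()     # let the simulated long-running job complete
worker.join()  # wait here only because the demo wants the final log

for entry in log:
    print(entry)
```

Note where the waiting happens: in the synchronous case it is implicit in the call itself, while in the asynchronous case the caller has to choose if and when to wait (`worker.join()` here).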


October 2016