Handling Missing Data In Spark

Igor Sorokin explains how to implement DataFrameNaFunctions:

Unfortunately, C&P comes in to play, therefore, if at some point in time a default value for ‘trackLength’ is also required, you may end up changing both of these methods. Another disadvantage is that if another similar method, which requires the same default values, is added, code duplication is unavoidable.

A possible solution, which helps to reduce boilerplate, is DataFrameNaFunctions, which is intended to be used for handling missing data: replacing specific values, dropping ‘null’ and ‘NaN’, and setting default values

Read on for an example.

Troubleshooting Ambari Server

Jay SenSharma has an interesting article on troubleshooting Ambari Server:

When we notice that the ambari server is responding slow then we should look first the following details first:

1). The number of hosts added to the ambari cluster. So that accordingly we can tune the ambari agent thread pools.

2). The number of concurrent users (or the view users) who access the ambari server at a time. Sothat accordingly we can tune the ambari thread pools.

3). The age of the ambari cluster. If the ambari server is too old then the possibility is that some of the operational logs and the alert histories will be consuming a large amount of the Database which might be causing ambari DB queries to respond slow.

4). The Ambari Database health and it’s geographic location from the ambari server, to isolate if there are any network delays.

5). Ambari server memory related tuning parameters to see if the ambari heap is set correctly.

6). For ambari UI slowness we should check the network proxy issues to see if there are any network proxies added between client the ambari server machine Or the network slowness.

7). If the ambari users are synced with the AD or external LDAP and if the communication between server and the AD/LDAP is good.

8). Also the resource availability on the ambari host like the available free memory and if any other service/component running on ambari server is consuming more Memory/CPU/IO.

There is a lot of detail here, including quite a few checks to run.

Parallelism In R

Florian Prive shows off a few methods for parallelizing code in R:

Parallelize with foreach

You need to do at least two things:

  • replace %do% by %dopar%. Basically, always use %dopar% because you can use registerDoSEQ() is you really want to run the foreach sequentially.

  • register a parallel backend using one of the packages that begin with do (such as doParalleldoMCdoMPI and more). I will list only the two main parallel backends because there are too many of them.

Check it out.  Florian spends a lot of time with foreach and doParallel, a little bit of time with flock, and mentions Microsoft R Open.  H/T R-Bloggers

Dealing With Trace Flags In The Registry

Wayne Sheffield shows us how to configure SQL Server trace flags within T-SQL using registry access commands:

In a recent post, I introduced you to how to work with the registry directly from within SQL Server. Continuing this theme, this post provides an example situation where you would do so.

In this example, we will want to configure SQL Server to enable a few trace flags (TF) when SQL Server starts. Specifically, let’s set trace flags 1117 and 1118 so that they are enabled when SQL Server starts up, and enable them now. Additionally, we have trace flags 1204 and 1222 now enabled on some servers, and we want to disable those (since we have the deadlocks being captured in the system health XE, we don’t need them in the error log also). We also don’t want to force a restart of the SQL Server services.

I’ve always felt a little icky about writing to the registry from SQL Server, but Wayne shows how to do it right.

Multi-Structured Data In U-SQL

Melissa Coates shows us how to use U-SQL to normalize JSON files in which different rows may have differing structures:

Handling the varying formats in U-SQL involves a few steps if it’s the first time you’ve done this:

  1. Upload custom JSON assemblies  [one time setup]

  2. Create a database   [one time setup]

  3. Register custom JSON assemblies   [one time setup]

  4. Upload JSON file to Azure Data Lake Store [manual step as an example–usually automated]

  5. Run U-SQL script to “standardize” the JSON file(s) into a consistent CSV column/row format

Melissa then shows us how to do this step-by-step.

Azure SQL Data Warehouse Patterns

Murshed Zaman shows us a couple of patterns and anti-patterns for Azure SQL Data Warehouse:

Azure SQL DW is a Massively Parallel Processing (MPP) data warehousing service. It is a service because Microsoft maintains the infrastructure and software patching to make sure it’s always on up to date hardware and software on Azure. The service makes it easy for a customer to start loading their tables on day one and start running queries quickly and allows scaling of compute nodes when needed.

In an MPP database, table data is distributed among many servers (known as compute or slave nodes), and in many MPP systems shared-nothing storage subsystems are attached to those servers. Queries come through a head (or master) node where the location metadata for all the tables/data blocks resides. This head node knows how to deconstruct the query into smaller queries, introduce various data movement operations as needed, and pass smaller queries on to the compute nodes for parallel execution. Data movement is needed to align the data by the join keys from the original query. The topic of data movement in an MPP system is a whole another blog topic by itself, that we will tackle in a different blog. Besides Azure SQL DW, some other examples of a MPP data warehouses are Hadoop (Hive and Spark), Teradata, Amazon RedShift, Vertica, etc.

The opposite of MPP is SMP (Symmetric Multiprocessing) which basically means the traditional one server systems. Until the invention of MPP we had SMP systems. In database world the examples are traditional SQL Server, Oracle, MySQL etc. These SMP databases can also be used for both OLTP and OLAP purposes.

Murshed spends the majority of this blog post covering things you should not do, which is probably for the best.

Drillthrough In Power BI Reports

Dustin Ryan shows off drillthrough actions in Power BI:

At some point this weekend, the Microsoft Power BI folks turned on the drillthrough feature in the Power BI service. This is the same drillthrough feature that demonstrated during day 1 of the Microsoft Data Insights Summit, which you can read about here. So I thought it’d be good to quickly walk through how you can set up the drill through action.

First, navigate to a report that you’ve already published to the Power BI service. This report will become the target of your drill-through action. Start editing the report and you’ll notice that on the right side there is a Drillthrough Filters section.

Read on for more.  At the moment, this is only available within the Power BI service, not in Desktop.

Automating Table Creation Within Biml

Bill Fellows shows off some C# skills that he uses to build table definitions inside Biml:

Good grief, that’s a lot of code, how do I use it? The basic usage would be something like

 <Tables> <#= GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], "SELECT 100 AS demo", "dbo", "DemoQuery").GetBiml() #> </Tables>

The call to GetAstTableNodeFromQuery return an AstTableNode which is great, but what we really want is the Biml behind it so we chain a call to .GetBiml() onto the end.

What would make that better though is to make it a little more dynamic. Let’s improve the code to create tables based on a pairs of names and queries. I’m going to use a Dictionary called namedQueries to hold the names and queries and then enumerate through them, calling our GetAstTableNodeFromQuery for each entry.

Read on for the underlying code.  Bill also promises to refactor this code and make it more versatile.

Impact Bubble Chart

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Impact Bubble Chart Custom Visual.  The Impact Bubble Chart is an animated bubble chart that allows you to compare two entities against each other.

It looks like there’s an issue with the video, but Devin does have links to the custom visual and a sample data set, so you can follow along without the video.


September 2017
« Aug