
October 9, 2019

Differences in Spark RDDs and DataSets

Brad Llewellyn looks at some of the differences between RDDs and DataSets in Spark:

We see that there are some differences between filtering RDDs, Data Frames, and Datasets. The first major difference is the same one we keep seeing: RDDs reference by indices instead of column names. There’s also an interesting difference between using 2 =’s and 3 =’s for equality operators. Simply put, “==” tries to directly equate two objects, whereas “===” tries to dynamically define what “equality” means. In the case of filter(), it’s typically used to determine whether the value in one column (income, in our case) is equal to the value of another column (the string literal “<=50K”, in our case). In other words, if you want to compare values in one column to values in another column, “===” is the way to go.

Interestingly, there was another difference caused by the way we imported our data.  Since we custom-built our RDD parsing algorithm to use <COMMA><SPACE> as the delimiter, we don’t need to trim our RDD values.  However, we used the built-in sqlContext.read.csv() function for the Data Frame and Dataset, which doesn’t trim by default.  So, we used the ltrim() function to remove the leading whitespace.  This function can be imported from the org.apache.spark.sql.functions library.

Read on for more, including quite a few code samples.
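
As a rough illustration of that filtering point, here is a minimal PySpark sketch; the file path and column name are assumptions standing in for the post's income data, and the post's own examples are in Scala, where === fills the role that the Column == operator plays in Python.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, ltrim

spark = SparkSession.builder.appName("filter-demo").getOrCreate()

# Hypothetical input file and column name standing in for the post's example.
df = spark.read.csv("adult.csv", header=True, inferSchema=True)

# read.csv() does not trim values after a "<COMMA><SPACE>" delimiter,
# so strip the leading space before comparing against the string literal.
low_income = df.filter(ltrim(col("income")) == "<=50K")
low_income.show()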


The Flexible Data Lake

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.

Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.

Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.

Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining, and operating the environment, which can grow quite complicated as more requirements come into play.

The bottom line here is that there’s no magic in Hadoop. Like any other technology, you can typically achieve one or at best two of these facets, but in the absence of an unlimited budget, you typically need to sacrifice in some way.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.


Debugging with PerfView

Erik Darling takes us through PerfView:

If you seriously want to debug, you’ve gotta use something like WinDbg. It’s tough, but powerful.

It’s also… Where do you even start? The most common thing to do is set a Break Point.

Figuring out where you wanna set it can be challenging.
– Sometimes you can capture a thread and step through the code.
– Sometimes you can find the call stack you want by searching through symbols.

The problem with these is that stepping through code that runs in milliseconds has taken me over an hour at times, and call stacks often have cryptic names.

This is where tools like PerfView can help.

Erik does an excellent job going through PerfView and introducing people to debugging SQL Server.


Using Azure Kubernetes Services for Big Data Clusters

Mohammad Darab explains why it’s a good idea to use Azure Kubernetes Service when building out a Big Data Cluster:

According to the Microsoft documentation, there are three ways to deploy a Big Data Cluster:

1. Minikube
2. Kubeadm
3. AKS

I’ll go into each and list the pros and cons.

Of course, if you have a great Kubernetes admin, on-prem is certainly a viable option, but AKS is definitely easier to get started with.


Searching for Column-Level Metadata

Aaron Bertrand takes us through a few DMVs and catalog views which will help us find SQL Server metadata:

Let’s say we are interested in each column of each table: the ordinal position, the name, the data type, and whether it is nullable. In all currently supported versions of SQL Server (2012 and up), there is a handy function that allows you to get all the columns for a T-SQL query, sys.dm_exec_describe_first_result_set. This function is superior to querying the information from sys.columns, mainly because you don’t have to join to sys.types or embed complicated logic to deal with translation; for example, CASE expressions that change -1 to max, eliminating types that have synonyms, or cutting nvarchar lengths in half – but only if they aren’t max. (You can see the type of query you end up with in an earlier tip, “SQL Server Data Type Consistency.”)

This use of sys.dm_exec_describe_first_result_set() is new to me.
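
As a quick sketch of what a call looks like, here's a hedged Python example; the connection string and the sample query being described are placeholders, while the function and the result columns selected are standard.

import pyodbc

# Placeholder connection string; adjust server, database, and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=AdventureWorks;Trusted_Connection=yes;"
)

# Describe the columns a query would return: ordinal position, name,
# friendly type name (max and lengths already resolved), and nullability.
sql = """
SELECT column_ordinal, name, system_type_name, is_nullable
FROM sys.dm_exec_describe_first_result_set(N'SELECT * FROM Person.Person', NULL, 0);
"""

for row in conn.cursor().execute(sql):
    print(row.column_ordinal, row.name, row.system_type_name, row.is_nullable)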


Sorting by Month in Power BI

Cecilia Brusatori shows how you can sort properly by month using Power BI:

Are your visualizations sorting the month field alphabetically? Don’t worry, I’ve been there when I started with Power BI.

If your model has a Dimension Date Table, first of all, congrats: you’re on the right path as far as the data model is concerned. Second, you want to use the Month Name from that table, but you’ve noticed that the visualizations don’t sort this column the way we all want, from Jan to Dec.

Click through to see how.
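
The short version is that a text month name needs a numeric sort key: in Power BI, that means pointing the Sort by Column setting on the month-name field at a month-number column in the date dimension. Here's a tiny pandas sketch of the underlying idea (the column names are made up), just to show why the default alphabetical sort misbehaves.

import pandas as pd

# Toy date dimension with a month name and a numeric sort key (made-up column names).
dim_date = pd.DataFrame({
    "MonthName":   ["January", "February", "March", "April", "December"],
    "MonthNumber": [1, 2, 3, 4, 12],
})

# Sorting on the text column is alphabetical: April, December, February, ...
print(dim_date.sort_values("MonthName")["MonthName"].tolist())

# Sorting on the numeric key gives calendar order, which is what
# Power BI's Sort by Column setting accomplishes for the month-name field.
print(dim_date.sort_values("MonthNumber")["MonthName"].tolist())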


Tracking Security Changes with the Default Trace

Jack Vamvas shows how we can use the default trace with SQL Server to find security-related events:

Question: There were some recent security changes on a SQL Server login account. Is there a way to grab this information via the default trace? I haven’t got a dedicated Extended Events monitor set up, so I’m relying on the default trace. Management want to understand the details related to this security change.

Answer: It is possible to report on security changes administered on SQL Server via the default trace. In fact, it’s excellent – except if the default trace configuration is set up to roll over the files.

Read on to see how.
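
As a rough sketch of the kind of query involved, here is one way to pull Security Audit events out of the default trace from Python; the connection string is a placeholder, and filtering through sys.trace_events and sys.trace_categories is just one way to narrow things down.

import pyodbc

# Placeholder connection string; point this at the instance in question.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;"
)

# Read the default trace and keep only Security Audit events.
# Because the default trace rolls over, older events may already be gone.
sql = """
SET NOCOUNT ON;
DECLARE @path nvarchar(260);
SELECT @path = path FROM sys.traces WHERE is_default = 1;

SELECT t.StartTime, e.name AS event_name, t.LoginName, t.TargetLoginName, t.TextData
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS e ON t.EventClass = e.trace_event_id
JOIN sys.trace_categories AS c ON e.category_id = c.category_id
WHERE c.name = N'Security Audit'
ORDER BY t.StartTime;
"""

for row in conn.cursor().execute(sql):
    print(row.StartTime, row.event_name, row.LoginName, row.TargetLoginName)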


Generating Unique File Names

Slava Murygin gives us unique file names:

It is a pretty common task to generate new files with a timestamp in their names.
That gives you the ability to easily identify them, sort them, and make them pretty unique.
However, if you have a very busy process, it is possible that a duplicate name will be produced and you might lose some data.

To avoid that situation, I’ve come up with the following solution.

It is difficult to envision this solution going wrong.
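
Without spoiling Slava's exact approach, one common variation on the idea is to pair the timestamp with a random suffix so that two files produced in the same instant still get distinct names. A minimal Python sketch, with an illustrative prefix and extension:

import uuid
from datetime import datetime, timezone

def unique_file_name(prefix: str = "export", extension: str = "csv") -> str:
    # The timestamp keeps the name readable and sortable; the UUID suffix makes
    # collisions effectively impossible even under a very busy process.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_%f")
    return f"{prefix}_{stamp}_{uuid.uuid4().hex}.{extension}"

print(unique_file_name())   # e.g. export_20191009_141523_042317_<32 hex characters>.csv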
