Understanding Spark APIs

Kevin Feasel

2016-07-19

Spark

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain specific language API to manipulate your distributed data; and makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types.  What’s old is new; what’s unstructured is structured…

Pay Attention To Benefits

Andy Mallon discusses the importance of benefits:

I honestly believe that company benefits tell you more about a company than any formal interview will. The benefits that a company offers to an employee are a sign of how you’ll be treated if you work there. There is definitely a correlation between great benefits and happy employees.

Companies who respect their employees will express that respect through their benefits. (The inverse is also true–companies often have poor benefits because they do not respect their employees.)

Employment is a package deal, so if you’re not paying attention to benefits, you’re missing part of your bottom line.

Generating Change Scripts

Steve Jones shows off a rather valuable feature in SQL Server Management Studio:

One of the really basic things I think everyone should understand is how to get scripts from Management Studio (SSMS) and saving them. In fact, I’ve written that everyone should use this button and really not ever execute their GUI changes. Capture the script, save that, and automate things.

However, that’s not what this post is about. This post is about how you get a script to look at changes, or better understand how SSMS might implement your changes.

The ability to script out your changes has a number of benefits, one of which is that you’ll get to learn the code you need to write to perform an action, which could make all the difference in a production-down situation.

Why Learn Scala?

Kevin Feasel

2016-07-19

Spark

Kevin Jacobs explains why Scala is a useful language:

Is it a functional programming language? Is it an object-oriented programming language? The answer to both questions is yes! Scala is a object-functional programming language. The good old well-known stuff is all in Scala. You can build complex applications by the means of objects and classes. On the other hand, Scala tries to teach programmers a paradigm called functional programming. In functional programming, a computation is treated as the evaluation of a mathematical function. So in that sense, everything in Scala is an evaluation. You might wonder why you would ever need functional programming if you are used to object-oriented programming. Well, the case is that in imperative programming you are changing the state over and over again. This is not allowed in functional programming. The changing of the state causes side-effects and makes your application less transparent. A imperative application is therefore often hard to debug while a functional program is easy to debug since it does not change the state. A concrete example is given below

Scala is to Java as F# is to C#.

Common Table Expressions Aren’t Tables

Grant Fritchey shows that CTEs are not tables; they’re expressions:

The Common Table Expression (CTE) is a great tool in T-SQL. The CTE provides a mechanism to define a query that can be easily reused over and over within another query. The CTE also provides a mechanism for recursion which, though a little dangerous and overused, is extremely handy for certain types of queries. However, the CTE has a very unfortunate name. Over and over I’ve had to walk people back from the “Table” in Common Table Expression. The CTE is just a query. It’s not a table. It’s not providing a temporary storage space like a table variable or a temporary table. It’s just a query. Think of it more like a temporary view, which is also just a query.

Read the whole thing.

Looping Over Flat Files

Kevin Feasel

2016-07-19

Biml

Ben Weissman uses Biml to loop over flat files with matching schemas:

To get that done in SSIS, we need:

– A variable to store the current file name and will be used as the connection string for our flat file connection
– A for each loop that will loop through our directory, pass each file name in there to our variable and then run a data flow task
– A derived column in our data flow task that will store our variable so we can import it into our SQL table
– Since we want this to be “clean”, we will also run a truncate before the for each loop

So let’s look at the code for that

It’s Biml, so there isn’t much code.  Check it out.

CTEs And Semi-Colons

Jason Brimhall discusses common table expressions:

Wait! Hold on two seconds there! Surely the semi-colon is an absolute requirement because we see it everywhere that it is a mandatory requirement.

The reality is that the semi-colon requirement is not really entirely accurate. If the CTE happens to be in the same batch, then the previous statement in the batch must be terminated by the semi-colon.

This post went down an unexpected path, and ended up being rather interesting.  Read the whole thing.

Stretch Database

SQL Padre checks out Stretch database:

As you can see, we now have a new path in our query plan with an operator called “Remote Query”.  Basically the local server queries the remote query then using the local Primary key Concatenates them back together to produce the desired result. So can we update the data?

Nope, sure can’t.  Once the data lives in Azure, the data is READ ONLY.

Check it out.  He’s a bit more sanguine about Stretch than I am, so maybe it will fit your use cases.

Dot Plots

Devin Knight continues his custom visualization series:

In this module you will learn how to use the Dot Plot Power BI Custom Visual.  The Dot Plot is often used when visualizing a distribution of values or a count of an occurrence across different categorical data you may have.  Watch this module to learn more!

This particular visualization seems a bit distracting for my tastes, but check out Devin’s video.

Descriptive Statistics With SQL Server And R

Kevin Feasel

2016-07-19

R

Mala Mahadevan digs into descriptive statistics:

With R integration into SQL Server 2016 we can pull an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset – a single table with 915 rows, with a SQL Server 2016 installation and R Studio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues – because R is very memory intensive and single threaded.

My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table – age. For further posts I will be using the other fields such as country and gender.

Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode.  She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.

Categories

July 2016
MTWTFSS
« Jun Aug »
 123
45678910
11121314151617
18192021222324
25262728293031