Kevin Feasel – Page 747

Apache Spark Performance Tuning

Published 2020-12-29 by Kevin Feasel

Tomaz Kastrun provides a few hints when performance tuning Apache Spark code:

DataFrame versus Datasets versus SQL versus RDD is another choice, yet it is fairly easy. DataFrames, Datasets and SQL objects are all equal in performance and stability (at least from Spark 2.3 and above), meaning that if you are using DataFrames in any language, performance will be the same. Again, when writing custom objects or functions (UDF), there will be some performance degradation with both R and Python, so switching to Scala or Java might be an optimisation.

Read on for the details. My version is “When performance matters the most, be willing to switch to Scala.” It’s not always correct, but is rarely outright bad advice.

Naive Bayes and Continuous Predictor Variables

Published 2020-12-29 by Kevin Feasel

Akhila takes us through the intuition of how Naive Bayes works:

Usually we use the e1071 package to build a Naive Bayes classifier in R. And then using this classifier, we make some predictions on the training data.
So probability for these predictions can be directly calculated based on frequency of occurrences if the features are categorical.
But what if there are features with continuous values? What is the Naive Bayes classifier actually doing behind the scenes to predict the probabilities of continuous data?

Click through for the answer. Also, Naive Bayes isn’t Bayesian, but that’s not important.
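
As a rough illustration of the question in the excerpt: a common approach for continuous features (and the one e1071's naiveBayes documents for metric predictors) is to assume each feature is normally distributed within each class and to use the Gaussian density in place of a frequency count. The sketch below is a minimal single-feature version in Python; the function names and the toy height data are hypothetical and not taken from the linked post.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def fit(samples):
    """samples maps class label -> list of values of one continuous feature.

    Returns class label -> (prior, mean, standard deviation).
    """
    total = sum(len(values) for values in samples.values())
    return {
        label: (len(values) / total, mean(values), stdev(values))
        for label, values in samples.items()
    }


def posterior(model, x):
    """Class probabilities for x: prior times Gaussian likelihood, normalized to sum to 1."""
    scores = {
        label: prior * gaussian_pdf(x, mu, sigma)
        for label, (prior, mu, sigma) in model.items()
    }
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical toy data: one continuous feature (height in cm) for two classes.
heights = {"short": [150.0, 155.0, 160.0], "tall": [175.0, 180.0, 185.0]}
model = fit(heights)
probs = posterior(model, 158.0)
```

With more than one feature, the "naive" independence assumption means the per-feature densities are simply multiplied together for each class before normalizing.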

The Intuition Behind Averaging

Published 2020-12-29 by Kevin Feasel

The Stats Guy takes a look at averages:

In this diagram, there are a bunch of numbers and a single question mark. Behind the question mark is also a number. The known numbers are the same as in our friend v above.
Our task is as follows:
– Make a guess on what that mystery number could be. And,
– If we can’t get it right, then reduce, as much as possible, the error we incur on our guess.

This is a well-written explanation of an important concept. H/T R-Bloggers
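
One way to make the guessing game in the excerpt concrete, assuming the error is measured as the sum of squared differences (the linked post may use a different measure): among all possible guesses, the arithmetic mean gives the smallest total squared error against the known numbers. The values in `v` below are hypothetical stand-ins, since the original vector is not shown here.

```python
from statistics import mean


def sum_squared_error(values, guess):
    """Total squared error incurred if every value were predicted as `guess`."""
    return sum((value - guess) ** 2 for value in values)


# Hypothetical stand-in for the known numbers.
v = [2.0, 4.0, 4.0, 5.0, 10.0]

best = mean(v)

# Compare the mean against a few nearby guesses.
candidates = [best + delta for delta in (-1.0, -0.5, 0.0, 0.5, 1.0)]
errors = {guess: sum_squared_error(v, guess) for guess in candidates}
best_candidate = min(errors, key=errors.get)
```

Under a different error measure the best guess changes; for example, the sum of absolute differences is minimized by the median rather than the mean.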

Working with Excel in PowerShell

Published 2020-12-29 by Kevin Feasel

Mikey Bronowski has a festive post:

This blog post is part of the Festive Tech Calendar.
If you want to practice the whole thing I have prepared an interactive notebook for you that could be opened with Azure Data Studio for example (link to the notebook). For more things about the PowerShell module check this post out.
I would like to invite you to the world of magic!

Click through for an image-rich and extremely detailed post.

Power BI Composite Model V2 Demo

Published 2020-12-29 by Kevin Feasel

Wolfgang Strasser gives us a walkthrough of DirectQuery for Power BI datasets:

With the December 2020 release of Power BI Desktop, this approach changed. You are now able to change a live connection to a Power BI dataset (or an Azure Analysis Services connection) to DirectQuery mode. This allows us to enhance the remote model with new columns, tables and additional datasources, and to create relationships between the datasources.
Let’s dive deeper into this and look at the story together with a sample.

I’ve seen and linked to several posts talking about the idea, but Wolfgang has a demo going, which makes it easier to follow.

DirectQuery for Power BI Datasets

Published 2020-12-29 by Kevin Feasel

James Serra takes us through a new Power BI feature:

Announced last week is a major new feature for Power BI: you can now use DirectQuery to connect to Azure Analysis Services or Power BI Datasets and combine it with other DirectQuery datasets and/or imported datasets. This is a HUGE improvement that has the Power BI community buzzing! Think of it as the next generation of composite models. Note this requires the December version of Power BI Desktop, and you must go to Options -> Preview features and select “DirectQuery for Power BI datasets and Analysis Services”.

Read on for more details.

Inlining KQL in Power Query

Published 2020-12-29 by Kevin Feasel

Chris Webb shows you how you can include KQL query fragments in Power Query:

If the title wasn’t enough to warn you, this post is only going to be of interest to M ultra-geeks and people using Power BI with Azure Data Explorer – and I know there aren’t many people in either group. However, I thought the feature I’m going to show you in this post is so cool I couldn’t resist blogging about it.

Limited in its utility, but still quite interesting.

Another Batch of ETL Antipatterns

Published 2020-12-28 by Kevin Feasel

Tim Mitchell wraps up a series on ETL antipatterns with three posts. The first one is about not testing the ETL process:

Building ETL processes is quite easy. Building ETL processes that deliver accurate results as quickly as possible is substantially more difficult. Modern ETL tools (including my personal favorite, SQL Server Integration Services) make it deceptively easy to create a simple load process. That’s a good thing, because an easy-to-understand front end shortens the timeline of going from zero to first results.
The challenge with such a low bar to entry is that some folks will stop refining the process when the load process is successful.

The second post looks at processes which don’t scale:

With very few exceptions, data volume will increase over time. Even when using an incremental load pattern, the most common trend is for the net data (new + changed) to increase with time. Even with steady, linear changes, it’s possible to outgrow the ETL design or system resources. With significant data explosion – commonly occurring in corporate acquisitions, data conversions, or rapid company growth – the ETL needs can quickly outrun the capacity.
Refactoring ETL for significant data growth isn’t always as simple as throwing more resources at the problem. Building ETL for proper scaling requires not just hefty hardware or service tiers; it requires good underlying data movement and transformation patterns that allow for larger volumes of data.
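
The incremental load pattern mentioned in the excerpt can be sketched roughly as follows. This is a minimal in-memory illustration, assuming each source row carries a `modified` timestamp and that the target is keyed by `id`; the function name, field names, and sample rows are all hypothetical and not from Tim's posts.

```python
from datetime import datetime


def incremental_load(source_rows, target, last_watermark):
    """Upsert only the rows modified after last_watermark into target.

    Returns (new_watermark, rows_loaded) so the next run can pick up where this one stopped.
    """
    new_watermark = last_watermark
    rows_loaded = 0
    for row in source_rows:
        if row["modified"] > last_watermark:
            target[row["id"]] = row  # insert a new row or overwrite a changed one
            new_watermark = max(new_watermark, row["modified"])
            rows_loaded += 1
    return new_watermark, rows_loaded


# Hypothetical source data.
source = [
    {"id": 1, "modified": datetime(2020, 12, 1), "amount": 10},
    {"id": 2, "modified": datetime(2020, 12, 15), "amount": 20},
]
target = {}

# First run: nothing has been loaded yet, so every row qualifies.
watermark, first_count = incremental_load(source, target, datetime.min)

# One row changes at the source; the second run touches only that row.
source[0] = {"id": 1, "modified": datetime(2020, 12, 28), "amount": 15}
watermark, second_count = incremental_load(source, target, watermark)
```

Even with this pattern, the excerpt's point still holds: the volume of new and changed rows per run tends to grow, so the watermark limits the work but does not remove the need to plan for scale.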

The final post implores us to think of the documentation:

Documentation is an asset that is both loathed and loved. Creating technical and business documentation is often looked upon as a tedious chore, something that really ought to be done for every project but is often an easy candidate to push until later (or skip entirely).
On the other hand, good documentation – particularly around data movement and ETL processes – is as valuable as the processes it describes. A clear and up-to-date document describing the what, when, where, and why of an ETL workflow adds transparency and makes the process much easier to understand for those who support it.

This has been an enjoyable series from Tim, so if you haven’t already, do check it out.

Using PowerShell to Automate Azure Databricks Processes

Published 2020-12-28 by Kevin Feasel

Tomaz Kastrun continues a series on Databricks:

Yesterday we looked into bringing the capabilities of Databricks closer to your client machine, making coding, data wrangling and data science a little bit more convenient.
Today we will look into deploying a Databricks workspace using PowerShell.

By the way, if PowerShell automation of Databricks tasks is of interest to you, also check out Gerhard Brueckl’s extension module for much more along those lines.

Also, I give Tomaz a lot of credit: most Advent calendars stop at 24 days but Tomaz laughs off such limitations.

String Concatenation with STRING_AGG()

Published 2020-12-28 by Kevin Feasel

Jack Vamvas takes us through a fairly recent quality of life improvement:

Question: Below are a table and the expected result. What is the query to achieve this result?

Table: Test

ID       LOT
7065161  4
7065212  1
7065212  4
7065203  1
7065203  2
7065203  3

Expected result of query

ID       LOT
7065161  4
7065212  1_4
7065203  1_2_3

Click through to learn the easiest way to do this as of SQL Server 2017.
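
To show what the aggregation is doing, here is a small Python sketch that mimics the grouping-and-joining behavior on the sample rows above. It is only an analogue: in SQL Server 2017 and later, the built-in `STRING_AGG` function with a `'_'` separator and a `GROUP BY` on ID does this work, and the linked post has the actual query. The helper name `string_agg` below is hypothetical.

```python
def string_agg(rows, separator="_"):
    """Group (key, value) pairs by key and join each group's sorted values with separator.

    Keys are returned in order of first appearance.
    """
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return [
        (key, separator.join(str(value) for value in sorted(values)))
        for key, values in groups.items()
    ]


# The rows from the Test table in the question.
rows = [
    (7065161, 4),
    (7065212, 1),
    (7065212, 4),
    (7065203, 1),
    (7065203, 2),
    (7065203, 3),
]

result = string_agg(rows)
```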

Author: Kevin Feasel