
Author: Kevin Feasel

PowerShell Dot-Sourcing

Shane O’Neill takes us through some of the intricacies of dot-sourcing in PowerShell:

The dot used to represent the location is, as I’ve said before, for the current location. Our ConvertTo-Message02 script changed its location as part of the script.

When we used the “dot source dot location” method, we weren’t using where our function is as a frame of reference to import the other functions. We were using whatever directory we were currently in.

Interesting reading.
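To make the distinction concrete, here’s a minimal sketch (the script and folder names are hypothetical, not Shane’s):

```powershell
# Fragile: "." resolves against the *current* location, which
# Set-Location can change before this line ever runs.
. .\Functions\ConvertTo-Message01.ps1

# Robust: $PSScriptRoot always points at the folder containing the
# running script, no matter where the session has navigated.
. (Join-Path $PSScriptRoot 'Functions\ConvertTo-Message01.ps1')
```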


Local Randomness and R

Evgeni Chasnovski has a problem around generating random data:

Let’s say we have a deterministic (non-random) problem for which one of the solutions involves randomness. One very common example of such a problem is function minimization on a certain interval: it can be solved non-randomly (as in most methods of optim()), or randomly (the simplest approach being to generate a random set of points on the interval and to choose the one with the lowest function value).

What is a “clean” way of writing a function to solve the problem? The issue with direct usage of randomness inside a function is that it affects the state of outer random number generation.

Click through for a solution which uses random numbers but doesn’t change the outside world’s random number generation after it’s done.
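For a flavor of the technique, here’s a minimal sketch (not necessarily Evgeni’s exact implementation): capture the caller’s .Random.seed, do the random work, and restore the state on exit.

```r
# Minimize f on [lower, upper] with random search, leaving the
# global RNG state exactly as we found it.
find_min_random <- function(f, lower, upper, n = 1000) {
  has_seed <- exists(".Random.seed", envir = globalenv())
  if (has_seed) old_seed <- get(".Random.seed", envir = globalenv())
  on.exit(
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(list = ".Random.seed", envir = globalenv())
    }
  )
  x <- runif(n, min = lower, max = upper)  # random candidate points
  x[which.min(f(x))]                       # best candidate found
}

find_min_random(function(x) (x - 2)^2, 0, 5)  # approximately 2
```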


The Databricks File System

Brad Llewellyn takes us through the Azure Databricks File System:

Today, we’re going to talk about the Databricks File System (DBFS) in Azure Databricks. If you haven’t read the previous posts in this series, Introduction, Cluster Creation, and Notebooks, they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, DBFS.

As we mentioned in the previous post, there are three major concepts for us to understand about Azure Databricks: Clusters, Code, and Data. For this post, we’re going to talk about the storage layer underneath Azure Databricks, DBFS. Since Azure Databricks manages Spark clusters, it requires an underlying Hadoop Distributed File System (HDFS). This is exactly what DBFS is. Basically, HDFS is the low-cost, fault-tolerant, distributed file system that makes the entire Hadoop ecosystem work. We may dig deeper into HDFS in a later post. For now, you can read more about HDFS here and here.

Click through for more detail on DBFS.
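If you want a quick taste of what DBFS looks like from a notebook, something like this works (the paths are made up for illustration; dbutils and spark are globals Databricks provides in every notebook):

```python
# List the contents of a DBFS folder.
display(dbutils.fs.ls("/FileStore/tables"))

# DBFS paths plug into Spark like any other file system path.
df = spark.read.csv("dbfs:/FileStore/tables/sample.csv", header=True)
display(df)
```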


Logical Equivalence and Query Tuning

Erik Darling didn’t warn us that there would be math:

Often when query tuning, I’ll try a change that I think makes sense, only to have it backfire.

It’s not that the query got slower, it’s that the results that came back were ~~wrong~~ different.

Now, this can totally happen because of a bug in previously used logic, but that’s somewhat rare.

And ~~wrong~~ different results make testers nervous. Especially in production.

This is where knowledge of abstract math and logic (like De Morgan’s Laws, both of which I’ve used to tune queries in the past because I’m a nerd) can pay off.
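As a refresher, De Morgan’s Laws say that NOT (A AND B) is equivalent to (NOT A) OR (NOT B), and NOT (A OR B) is equivalent to (NOT A) AND (NOT B). A hypothetical T-SQL illustration (table and columns invented for the example, not taken from Erik’s post):

```sql
-- These two queries are logically equivalent by De Morgan's Laws,
-- but the second form can give the optimizer better options.
SELECT OrderID
FROM dbo.Orders
WHERE NOT (Status = 'Shipped' AND Region = 'EU');

SELECT OrderID
FROM dbo.Orders
WHERE Status <> 'Shipped'
   OR Region <> 'EU';
```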


Pitfalls with Window Functions

Itzik Ben-Gan takes us through two issues you might run into when using window functions:

There are two common pitfalls involving window functions, both of which are the result of counterintuitive implicit defaults that are imposed by the SQL standard. One pitfall has to do with calculations of running totals where you get a window frame with the implicit RANGE option. Another pitfall is somewhat related, but has more severe consequences, involving an implicit frame definition for the FIRST_VALUE and LAST_VALUE functions.

There’s a lot going on in these two examples, so read on.
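The usual fix for both pitfalls is to spell the frame out rather than rely on the defaults. A sketch in T-SQL (table and column names are placeholders):

```sql
SELECT custid, orderdate, val,
       -- Running total: say ROWS explicitly. The implicit default is
       -- RANGE UNBOUNDED PRECEDING, which can spool to disk and
       -- behaves unexpectedly when the ORDER BY column has ties.
       SUM(val) OVER (PARTITION BY custid ORDER BY orderdate
                      ROWS UNBOUNDED PRECEDING) AS running_total,
       -- LAST_VALUE: the default frame ends at the current row, so
       -- without an explicit frame you just get the current row's
       -- value back instead of the partition's last value.
       LAST_VALUE(val) OVER (PARTITION BY custid ORDER BY orderdate
                             ROWS BETWEEN CURRENT ROW
                                      AND UNBOUNDED FOLLOWING) AS last_val
FROM dbo.OrderValues;
```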


So Your tempdb is Corrupt

Steve Stedman walks us through how to fix corruption in tempdb:

The fact that you know you have corruption in TempDB is good news; it shows that you are running CheckDB against TempDB, which many people overlook.

The corrupt page in TempDB may cause some issues if it is not cleared up.

Since the TempDB files are recreated if they don’t exist when SQL Server restarts, here is what I would recommend.

Read on for Steve’s advice.
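The short version (a sketch; read Steve’s post for the full reasoning): confirm the corruption, then let a service restart rebuild tempdb, since the usual repair path isn’t available there.

```sql
-- Confirm the corruption; tempdb deserves regular checks too.
DBCC CHECKDB ('tempdb') WITH NO_INFOMSGS;

-- tempdb can't be put into single-user mode, so DBCC repair options
-- are out. Because its files are rebuilt at startup, restarting the
-- SQL Server service clears the corrupt pages.
```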


SQL Server Execution Plan Operators

Bert Wagner takes us through some of the more common execution plan operators:

Spools come in a variety of types, but most of them can be summarized as operators that store an intermediary result table in tempdb.

SQL Server often uses spools to process complex queries, transforming the data into a tempdb worktable to allow further data operations. The downside to this however is the need to write the data to disk in tempdb.

When I see a spool, I often first try to think if there is a way to rewrite the query to avoid the spool in the first place.

Spools are my bête noire. I love what Erik Darling calls them: SQL Server passive-aggressively telling you that you need an index without actually saying that you need an index.

At the end of the post, Bert calls out Hugo Kornelis’s operator list. Definitely check that out too, as Hugo is putting together a masterpiece in the original meaning of the term.
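If you’d like to see a spool on demand, a recursive CTE is a reliable way to produce one, since the engine needs a worktable to feed each level of recursion back into the next:

```sql
-- The execution plan for this query includes an Index Spool holding
-- the intermediate results of the recursion.
WITH Numbers AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 100
)
SELECT n
FROM Numbers
OPTION (MAXRECURSION 100);
```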


An Apache Flume Overview

Daniel Berman takes us through an overview of Apache Flume:

Apache Flume was developed by Cloudera to provide a way to quickly and reliably stream large volumes of log files generated by web servers into Hadoop. There, applications can perform further analysis on data in a distributed environment. Initially, Apache Flume was developed to handle only log data. Later, it was equipped to handle event data as well.

Click through to get a code-free, high-level understanding of Flume and where it can work for you.


ggforce Updates

Thomas Lin Pedersen has some ggforce updates for us:

Now, the above plot has some obvious shortcomings. The diagonal is pretty useless for starters, and these panels are often used to plot the distributions of the individual variables. Using e.g. geom_density() won’t work, as it always starts at 0, thus messing with the y-scale of each row. ggforce provides two new geoms tailored for the diagonal: geom_autodensity() and geom_autohistogram(), which automatically position themselves inside the panel without affecting the y-scale. We’d still need to have this geom only in the diagonal, but facet_matrix() provides exactly this sort of control.

There are some interesting improvements in here.
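The pattern from the post looks roughly like this (a sketch against the mpg dataset; .panel_x and .panel_y are facet_matrix()’s placeholders for each panel’s variable pair, and layer.diag restricts a layer to the diagonal):

```r
library(ggplot2)
library(ggforce)

# Scatterplot matrix with densities on the diagonal: layer.diag = 2
# confines the second layer (geom_autodensity) to the diagonal panels.
ggplot(mpg, aes(x = .panel_x, y = .panel_y)) +
  geom_point(alpha = 0.2, shape = 16, size = 0.5) +
  geom_autodensity() +
  facet_matrix(vars(displ, cty, hwy), layer.diag = 2)
```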
