Curated SQL Posts

Installing Apache Airflow

Achilleus walks us through a process to install Apache Airflow on a machine:

Airflow is an amazing tool by Airbnb and is kind of a de facto standard for ETL deployments in the data engineering domain nowadays. At the same time, you can also use Airflow to schedule your ML pipeline and automate the whole ML pipeline (almost).

This is my attempt to install and set up a fairly robust Apache Airflow deployment for my needs. I am pretty sure there are better ways of doing it or enhancements to add to it. Any comments or suggestions are highly appreciated!

This is an easy-to-follow set of steps, so check it out.
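
If you would like a feel for what a finished installation runs, here is a minimal sketch of a DAG, the unit of work Airflow schedules. The DAG id and command are invented, and the import path assumes Airflow 2.x (older 1.10.x installs use airflow.operators.bash_operator instead):

```python
# A minimal sketch of an Airflow DAG; names here are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x path

with DAG(
    dag_id="hello_airflow",              # hypothetical DAG name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",          # run once per day
    catchup=False,                       # don't backfill past runs
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Airflow is up and scheduling tasks'",
    )
```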

Tuning Random Forest HyperParameters with R

Julia Silge gives us an idea of how to tune random forest hyperparameters in R:

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳

Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.

Click through for some data exploration, the initial model, and a process for tuning hyperparameters via grid search with the tune package.
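
Julia works in R with tidymodels; purely as a loose analogue of the same idea, here is a scikit-learn sketch of grid-searching random forest hyperparameters (the data and the grid values are invented):

```python
# A loose Python analogue of grid search over random forest
# hyperparameters; this is not Julia's code, just the same idea.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "max_features": ["sqrt", 0.5],   # roughly mtry in R terms
    "min_samples_leaf": [1, 5, 10],  # roughly min_n in R terms
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation per grid point
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```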

Using the Tune Package in R for Hyperparameter Optimization

Abderrahim Lyoubi-Idrissi takes us through a Bayesian approach to tune hyperparameters:

In contrast to the model parameters, which are discovered by the learning algorithm of the ML model, the so-called hyperparameters (HPs) are not learned during the modeling process, but specified prior to training.

Hyperparameter tuning is the task of finding the optimal hyperparameter(s) for a learning algorithm on a specific data set, with the end goal of improving model performance.

Abderrahim contrasts two different methods here: Grid Search and Bayesian Optimization. Definitely an interesting read if you develop data science models.
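
The post itself works in R; as a rough Python illustration of the Bayesian side of that contrast, here is a sketch using BayesSearchCV from the scikit-optimize package (the data, bounds, and iteration count are invented):

```python
# A minimal sketch of Bayesian hyperparameter search; unlike grid
# search, the optimizer proposes each new candidate based on the
# results observed so far.
from skopt import BayesSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

search = BayesSearchCV(
    RandomForestClassifier(random_state=0),
    # (low, high) bounds instead of a fixed grid of values.
    {"n_estimators": (100, 500), "min_samples_leaf": (1, 20)},
    n_iter=25,           # total candidates evaluated
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```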

Removing Old Backups from Azure Blob Storage

Niko Neugebauer has some advice on how to clean up backups which live in Azure Blob Storage:

Continuing the topic of backups to Azure Blob Storage that I kind of kicked off with the post Striping Backups to Azure Blob Storage, I want to touch on the important aspect of “keeping it clean” – that is, deleting the old backups.
On a regular Windows Server this is a rather easy task if you are using a standard maintenance solution, such as Ola Hallengren’s Maintenance Solution or any other one. You can also use the regular SSMS maintenance plans (*cough* for whatever reason, unknown to me, that you might wish to *cough*), or you can easily set up a regular Windows Scheduler job running a command-line batch, PowerShell, or whatever tool/script/language you like.

The situation is quite different with the Backup To URL functionality, which has been available for more than six years (and the good old SQL Server 2012 even got support for it in a certain Cumulative Update – SQL Server 2012 Service Pack 1 CU2, to be more precise).

Niko goes through five different methods you can use, so check it out.
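
As a rough illustration of the underlying idea (and not a substitute for Niko's five methods), here is a Python sketch using the azure-storage-blob v12 SDK to delete backups past a retention window; the connection string and container name are placeholders:

```python
# Delete backup blobs older than a retention window; a sketch only.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

RETENTION_DAYS = 30
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

CONNECTION_STRING = "..."  # placeholder: your storage account connection string
service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client("sqlbackups")  # hypothetical name

for blob in container.list_blobs():
    # last_modified is timezone-aware, so compare against UTC "now".
    if blob.name.endswith(".bak") and blob.last_modified < cutoff:
        print(f"Deleting {blob.name} (last modified {blob.last_modified})")
        container.delete_blob(blob.name)
```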

Avoid Default String Lengths

Kenneth Fisher warns us against creating strings without specifying a length:

Every now and again I see someone get lazy and declare a string (CHAR, NCHAR, VARCHAR, or NVARCHAR) without specifically declaring what the length is going to be. This can lead to some interesting problems. First of all, it’s usually going to be a length of one.

I’d take this one step further and say you should avoid creating strings without specifying a length in products like ADO.NET as well. There you won’t get the single-character length Kenneth mentions, but you do get one cached plan per distinct character length passed in, which does a great job of bloating the plan cache.
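
To make the client-side half of that concrete, here is a hedged pyodbc sketch of the same pattern: by default the driver may size a string parameter from the value it is given, so declaring the size explicitly keeps one plan for all values (the DSN, table, and column names are invented):

```python
# Binding string parameters with an explicit size; a sketch only.
import pyodbc

conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes")  # hypothetical DSN
cur = conn.cursor()

# Without an explicit size, the parameter may be bound as VARCHAR(3)
# for 'Bob', VARCHAR(5) for 'Alice', and so on: one cached plan per
# distinct length. Pinning the size up front avoids that:
cur.setinputsizes([(pyodbc.SQL_VARCHAR, 100, 0)])
cur.execute("SELECT * FROM dbo.People WHERE Name = ?", "Bob")
```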

Using T-SQL to Generate a Login Migration Script

Thomas Rushton shows us how to migrate logins from one server to another using just T-SQL:

There are three things to look at:

1. Creating Logins from Windows accounts
2. Creating logins that are authenticated by SQL Server
3. Assigning membership of the appropriate server roles

Note that we’re just creating a snapshot of what’s there – we’re not aiming to keep these things in true synchronisation between servers. Note also that this script will only create accounts that don’t exist. If there’s an account already there with the same name, the script generated will not recreate it with new options.

We get the final script as well as a nice walkthrough of each component.
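
Thomas builds the whole thing in pure T-SQL; just to sketch the shape of the first step in another language, here is a Python loop that reads sys.server_principals and emits CREATE LOGIN statements for Windows accounts (connection details are placeholders, and SQL-authenticated logins need the password hash and SID handled the way the post shows):

```python
# Generate CREATE LOGIN statements for Windows logins; a sketch only.
import pyodbc

conn = pyodbc.connect("DSN=SourceServer;Trusted_Connection=yes")  # hypothetical
rows = conn.execute("""
    SELECT name
    FROM sys.server_principals
    WHERE type IN ('U', 'G')      -- Windows users and groups
      AND name NOT LIKE 'NT %'    -- skip built-in service accounts
""").fetchall()

for (name,) in rows:
    # Only create the login if it doesn't already exist on the target.
    print(f"IF SUSER_ID(N'{name}') IS NULL CREATE LOGIN [{name}] FROM WINDOWS;")
```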

Using DAX’s GroupBy Function

Reza Rad takes us through the GroupBy function in DAX:

There are many different ways you can create aggregations in Power BI. You can do it in the source (using the database’s T-SQL language) or using a Group By operation in Power Query. You can also do it in DAX using some functions. One of the functions that can be used for grouping and aggregation is GROUPBY. This article is about how to use GROUPBY in DAX. Creating aggregations using DAX is a very useful skill because you can use it to create virtual tables in your measures and have better dynamic calculations in Power BI.

Click through for the explanation and an example.
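
The article is about DAX's GROUPBY specifically; as a loose analogue of the group-and-aggregate pattern it implements, here is a pandas sketch that builds the same kind of summary table (the data is invented):

```python
# Group-and-aggregate into a summary table, roughly what
# GROUPBY(Sales, Sales[Category], "Total", SUMX(CURRENTGROUP(), ...))
# produces in DAX.
import pandas as pd

sales = pd.DataFrame({
    "Category": ["Bikes", "Bikes", "Accessories", "Accessories"],
    "Amount":   [1200.0, 800.0, 150.0, 50.0],
})

summary = sales.groupby("Category", as_index=False)["Amount"].sum()
print(summary)
```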

Finding Below-Average Sales Per Hierarchy Level in Power BI

Soheil Bakhshi has an interesting problem to solve:

Now, the requirement is to show the above behaviour in a “Column Chart” (yes! visualising a time series with a column chart, that’s what the customer wants) and highlight the columns with values below the average amount in orange, leaving the rest in the default theme colour.

So, I need to create measures to conditionally format the column chart. I also need to add a bit of intelligence to the measures to:

– Detect which hierarchy level I am in
– Calculate the average of sales for that particular hierarchy level
– Change the colour of the columns that are below the average amount

Let’s get it done!

Read on to see how you can do exactly this.
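
Soheil's solution is DAX measures driving conditional formatting; the core logic, comparing each member's sales with the average at its own hierarchy level, looks something like this pandas sketch (the data and column names are invented):

```python
# Flag values below the average of their own hierarchy level; a sketch.
import pandas as pd

sales = pd.DataFrame({
    "Year":  [2019, 2019, 2020, 2020],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Sales": [100.0, 60.0, 120.0, 90.0],
})

# At the month-within-year level: average per year, then mark the
# columns that would be coloured orange (below that level's average).
sales["LevelAvg"] = sales.groupby("Year")["Sales"].transform("mean")
sales["BelowAvg"] = sales["Sales"] < sales["LevelAvg"]
print(sales)
```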

All About Line Graphs

Mike Cisneros takes us through one of the most useful visuals out there:

A typical line graph will have continuous data along both the vertical (y-axis) and horizontal (x-axis) dimensions. The y-axis usually shows the value of whatever variable we are measuring; the x-axis is most often used to show when we measured it, either chronologically or based on some independent variable (e.g., as we rev our old car’s engine, we measure the decibel level at different RPM). 

While some line graphs do not use continuous data on the x-axis (particularly slopegraphs and parallel coordinates diagrams, which are specialized variants of line graphs), what we absolutely can’t use on our x-axis is data that doesn’t have any meaningful relationship among the categories shown. 

Read on for a lot of good information on a workhorse visual.
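
To picture Mike's engine example, here is a minimal matplotlib sketch: a continuous independent variable (RPM) on the x-axis and the measured value (decibels) on the y-axis; the numbers are invented:

```python
# A basic line graph with a continuous x-axis; the data is made up.
import matplotlib.pyplot as plt

rpm = [1000, 2000, 3000, 4000, 5000]
decibels = [62, 68, 75, 83, 92]

plt.plot(rpm, decibels, marker="o")
plt.xlabel("Engine speed (RPM)")
plt.ylabel("Sound level (dB)")
plt.show()
```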

Schema Management for Spark Applications

Walaa Eldin Moustafa takes us through some of the things that LinkedIn has learned about schema management with Apache Spark:

At LinkedIn, the Hive Metastore is the source-of-truth catalog for all Hadoop data. The Hive Metastore is managed by Dali. Dali is a data access and processing platform that is integrated with compute engines and ETL pipelines at LinkedIn to ensure consistency and uniformity in the access and storage of data. Dali utilizes the Hive Metastore to store data formats, data locations, partition information, and table information. Among other features, Dali also manages the definition of SQL views, as well as storing and accessing those definitions from the Hive Metastore.

Read on for a good explanation of the how as well as the why.
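
As a small illustration of the role the post describes (and not LinkedIn's Dali code), here is a PySpark sketch in which the table name alone is enough, because schema, storage format, location, and partitions are all resolved from the Hive Metastore (the table name is invented):

```python
# Reading a table whose schema lives in the Hive Metastore; a sketch.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore-demo")
    .enableHiveSupport()   # use the Hive Metastore as the catalog
    .getOrCreate()
)

# Schema, format, and location come from the metastore, not from the
# reading application.
df = spark.table("tracking.page_views")   # hypothetical table
df.printSchema()
spark.sql("DESCRIBE FORMATTED tracking.page_views").show(truncate=False)
```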
