Press "Enter" to skip to content

Day: October 1, 2021

Adaptive Query Execution in Spark 3

Amarjeet Singh explains what Adaptive Query Execution is in Apache Spark:

As we all know optimization plays an important role in the success of spark SQL. Therefore, a lot of work has been done in this direction. Before spark 3.0, cost-based optimization was a major hit in which different stages related to cost (based on time efficiency and estimated CPU and I/O usage) are compared and executes the strategy which minimizes the cost. But, because of outdated statistics, it has become a sub-optimal technique. Therefore in spark 3.0, Adaptive Query Execution was introduced which aims to solve this by reoptimizing and adjusts the query plans based on runtime statistics collected during query execution. Thus re-optimization of the execution plan occurs after every stage as each stage gives the best place to do the re-optimization.

Item number 2 from the list is also available in SQL Server, giving you an idea that this is an active battleground for query processing in data platform technologies.

Comments closed

Constraint Programming with R and MiniZinc

Holger von Jouanne-Diedrich solves a classic puzzle:

The following puzzle is a well-known meme in social networks. It is said to have been invented by young Einstein and back in the days I was ambitious enough to solve it by hand (you should try too!).

Yet, even simpler is to use Constraint Programming (CP). An excellent choice for doing that is MiniZinc, a free and open-source constraint modelling language. And the best thing is that you can control it by R! If you want to see how, read on!

I’d solved it once by hand as well, but here we get to see a much easier route. Constraint-based programming is one of those things which doesn’t show up very often in the business world, but I think part of the reason is that most programming languages lack the capacity to implement constraints really well. It could also be that people are usually pretty mushy about laying out proper constraints.

Comments closed

Understanding Logistic Regression

Luis Valencia explains the idea of logistic regression:

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

However, unlike ordinary linear regression, in it’s most basic form logistic regressions target value is a binary variable instead of a continuous value.

Read on to learn more about logistic regression. The point I like to make about logistic regression is that people brand new to it say it’s regression, because hey, it has “regression” in its name! People who are more familiar with it say that’s a misnomer and it’s really a classification algorithm, not a regression algorithm. But as Luis shows, people who are very familiar with it understand that it is a regression algorithm, which just happens to have nice classification properties because in many cases, elements get pushed to the edges (0 and 1).

Comments closed

Get-FileAgeGroup

Jeffrey Hicks builds a useful function:

If there’s one task I’ve never stopped doing, it is finding files. I am constantly creating new ways to organize files and display them in a meaningful format. Naturally, PowerShell is a great tool for this task. Get-ChildItem is obviously the proper starting point. The cmdlet works fine in getting only files from a folder and I can do basic early filtering by name and extension with wildcards. But my latest task was organizing files by date, and because of the way Get-ChildItem works under-the-hood, I’m going to need to resort to late filtering with Where-Object. There’s nothing wrong with that. But if this is a task I’m likely to repeat, then a PowerShell function is on the drawing board. My goal is to create a function that will display files grouped into aging buckets such as 1-week or 6-months. Even though I’m primarily concerned with age based on the last write time, I (or you) might have a need to base aging on the creation time. Let’s code.

Click through for a great walkthrough and code.

Comments closed

Verbose Powershell Results

Robert Cain gets verbose on us:

In my previous post, Fun With the PowerShell Switch Parameter, I introduced the use of switches with PowerShell functions. We learned how they work, and how to create your own. For all functions, PowerShell creates a set of switches that are just “built in” to PowerShell. You do not have to explicitly create these yourself, you can simply use them and implement them within your own functions.

Two of the most used are -Verbose and -Debug. When used with the Write-Verbose and Write-Debug cmdlets they will display informational messages to the user of your functions. In this post, we’ll focus on the Verbose switch. The next post in the series will examine the Debug switch.

Click through for examples of how this can be useful.

Comments closed

Against Reporting Tables

Erik Darling doesn’t like reporting tables:

I’ve seen a lot of crazy ways for people to run reports on live data, but one of the worst ideas is creating a reporting table based on the set of data a user wants, and then letting them query that data.

As usual, Erik says something I want to disagree with, and then I read the post and don’t really disagree with him—or if I do, he’s already laid out the “Yes, I understand X” exception. I’ve used reporting tables to good effect, but the important thing is that they’re general-purpose and designed into the application, not specific to a single user.

Comments closed

Recommendations for SQL Server on VMware

Anthony Nocentino has some recommendations for us:

The intent of this post is a quick reference guide based on the recommendation made in “Architecting Microsoft SQL Server on VMware vSphere” April 2019 version. The target audience for this blog post is for SQL Server DBAs introducing them to the most impactful configurations and settings for running SQL Server in VMware.

For the explanations for each of these settings and how to configure the base VMware infrastructure, please read the “Architecting Microsoft SQL Server on VMware vSphere” guide and consult with your VMware administrators and experts.

Click through for Anthony’s summary.

Comments closed