Press "Enter" to skip to content

October 17, 2023

New R Package: hstats

Michael Mayer has a new package:

The current version offers:

  • H-statistics per feature, feature pair, and feature triple
  • multivariate predictions at no additional cost
  • a convenient API
  • other important tools from explainable ML:
    • performance calculations
    • permutation importance (e.g., to select features for calculating H-statistics)
    • partial dependence plots (including grouping, multivariate, multivariable)
    • individual conditional expectations (ICE)
  • case weights for all methods, which is important in, e.g., insurance applications

Click through for an example of how it works, followed by some simple benchmarking to give you an idea of how it performs compared to similar tools.
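
To give you a feel for the API before you click through, here is a minimal sketch of what usage looks like. I'm going from memory of the package README here, so treat the function names (hstats(), h2_overall(), h2_pairwise(), perm_importance(), partial_dep()) as assumptions rather than a verified interface:

    # A minimal sketch, assuming the hstats API works as in its README
    library(hstats)

    # A model with an explicit interaction, so the H-statistics have something to find
    fit <- lm(Sepal.Length ~ . + Petal.Length:Petal.Width, data = iris)

    s <- hstats(fit, X = iris[, -1])   # X = features only, response excluded
    h2_overall(s)    # interaction strength per feature
    h2_pairwise(s)   # interaction strength per feature pair

    # Permutation importance, e.g., to pick features worth testing for interactions
    perm_importance(fit, X = iris[, -1], y = iris$Sepal.Length)

    # Partial dependence for a single feature
    plot(partial_dep(fit, v = "Petal.Length", X = iris[, -1]))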


A Primer on Functional Programming

Anirban Shaw gives us the skinny:

In the ever-evolving landscape of software development, there exists a paradigm that has been gaining momentum and reshaping the way we approach coding challenges: functional programming.

In this article, we delve deep into the world of functional programming, exploring its advantages, core principles, origin, and reasons behind its growing traction.

I like this as an introduction to the topic, helping explain what functional programming languages are and why they've become much more interesting over the past 15-20 years. Anirban hits the topic of concurrency well, showing how a functional approach with immutable data makes it easy for multiple machines to work on separate parts of the problem independently and concurrently without error. I'd also add one more bit: functional programming languages tend to be more CPU-intensive than imperative languages, so in eras of strict computational scarcity, imperative languages dominated. With strides in computer processing, we are CPU-bound less often, so the trade-off of some CPU for the benefits of FP makes a lot more sense. H/T R-Bloggers.
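
That concurrency point is easy to demonstrate with a small sketch of my own in R (this comes via R-Bloggers, after all). Because the worker function below is pure and the input is never mutated, each chunk can be processed on a separate core with no locks or shared state; parallel ships with base R, and mclapply() forks on Unix-alikes (on Windows, set mc.cores = 1 or use parLapply() with a cluster instead):

    # A pure function: output depends only on its input; nothing is mutated
    summarize_chunk <- function(chunk) {
      c(n = length(chunk), mean = mean(chunk), sd = sd(chunk))
    }

    # Immutable input split into independent pieces
    x      <- rnorm(1e6)
    chunks <- split(x, cut(seq_along(x), 8, labels = FALSE))

    # Each core works on its own chunk; no shared mutable state, no locks
    library(parallel)
    results <- mclapply(chunks, summarize_chunk, mc.cores = 2)

    # Combine the independent results at the end
    do.call(rbind, results)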


Plotting Time Series Growth Rates

Steven Sanderson builds a chart:

The ts_growth_rate_vec() function is part of the healthyR.ts library, designed to work with numeric vectors or time series data. It calculates the growth rate or log-differenced growth rate of the provided data, offering valuable insights into the underlying trends and patterns.

Read on to see how this function works, as well as several examples of plotting growth rates of airline data, which exhibits both strong cycles and an overall trend.
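
The post uses ts_growth_rate_vec() itself; as a back-of-the-envelope illustration of the arithmetic involved (this is not the healthyR.ts implementation), a simple growth rate is the percent change from one period to the next, while the log-differenced version takes differences of logged values, which approximates the simple rate for small changes and sums cleanly over time:

    # A base R sketch of the two growth-rate flavors (not the healthyR.ts code)
    x <- as.numeric(AirPassengers)                 # monthly airline passengers, 1949-1960

    simple_growth <- 100 * diff(x) / head(x, -1)   # percent change per period
    log_growth    <- 100 * diff(log(x))            # log-differenced growth

    # The two track each other closely when period-over-period changes are small
    plot.ts(cbind(simple_growth, log_growth),
            main = "AirPassengers growth rates")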


DISTINCT Papers over Problems

Aaron Bertrand wants to solve the actual problem:

I’ve quietly resolved performance issues by re-writing slow queries to avoid DISTINCT. Often, the DISTINCT is there only to serve as a “join-fixer,” and I can explain what that means using an example.

I’ve seen this a lot as well, and it usually comes from people not understanding the data model or not understanding how to use subqueries (or common table expressions, the APPLY operator, etc.) to define subsets of data.


A Critique (and Defense) of Generic Programming Languages for ETL/ELT

Teo Lachev doesn’t like general programming languages for ETL and ELT operations:

Someone asked the other day for my opinion about the open-source dbt tool for ETL. I hadn't heard about it. The next thing I noticed was that Fabric Warehouse added support for it, so I got inspired to take a first look. Seems like an ELT-oriented tool. Good, I'm a big fan of the ELT pattern, whose virtues I've extolled many times here. But a Python-based tool that requires writing custom code for orchestration in a dev environment, such as Visual Studio Code? Yuck!

My reasoning is simple: complexity. Dedicated ETL/ELT tools like SQL Server Integration Services, Informatica, Azure Data Factory, Airflow, and the like are good when you fit into their primary use cases: moving data from a few data sources into a destination, perhaps with some level of transformation in between.

But here are areas off the top of my head where I’ve seen these tools not work well:

  • Wide scale. In one environment, we had to move contents from a couple thousand databases (with identical schemas) across 50-60 instances of SQL Server into a warehouse, including some facts and dimensions we needed within a minute or two. Even assuming those packages don't change frequently (not a reasonable assumption), the pains of orchestrating that would be enormous. I don't think we could have used a metadata-driven approach with foreach loops in ADF workflows, either, as that would not satisfy the time requirements. There are also resource limitations on the other side: you don't want to overwhelm the warehouse by trying to process a couple thousand clients' worth of data all at once, so you've got to stagger the work using an orchestration engine with enough smarts to limit concurrent processes.
  • Limiting copy-paste efforts and drudgery. Going back to SSIS, it sucks having to maintain dozens of packages, especially common components you need to update in each one. I got to be pretty good at Biml, but a) that has its limits, and b) that’s C# development with SSIS packages as an output, so I’m claiming that for the generic programming languages side of the argument.

Oracle OCI Labeling with Bounding Boxes

Brendan Tierney continues a series on image classification:

In a previous post, I gave examples of how to label data using OCI Data Labeling. It was a simple approach to labeling images for input to AI Vision. In that post, we just gave each image a label to indicate whether it contained a Cat or a Dog. Yes, that's a very simple approach, and we can build image classification models and use the resulting model to predict a label for new images, which would be labeled as a Cat or a Dog with a degree of certainty. Although this simple approach can give OK-ish results, we typically want a more detailed model and predictions. For that, we can use Object Detection. It requires preparing our dataset in a slightly different way and, yes, it does take a bit more time to prepare (perhaps a lot more). But this extra time spent preparing the data should, in theory, give us a more accurate model.

This post will focus on creating a new labeled dataset using bounding boxes, and in a later post, we’ll examine the resulting model to see if it gives better or more accurate results.

Read on for the process.


Finding the ACTIVE_TRANSACTION Culprits

Thamires Lemes digs into high transaction log utilization:

The transaction log in SQL Server records all changes made to a database, allowing for data recovery and consistency. When a transaction is initiated, it acquires space in the transaction log to record its activities. Long-running transactions have the potential to hold up the transaction log and, depending on database write activity, cause errors and disruptions in the SQL Server environment.

It is important to point out that the transaction holding the transaction log might not be performing any write activity itself, but subsequent transactions that write to the transaction log will cause its utilization to increase, even if they are fast. The log space won't be released until the oldest transaction concludes its execution.

Click through for a few queries on the topic. I’d also highly recommend sp_whoisactive for this kind of work.


Using Tableau with Power BI and Fabric

Kurt Buhler crosses the streams:

If you use Power BI, Fabric, or Excel, connecting to Power BI datasets is straightforward. However, if you use other BI tools like Tableau, it's not obvious how you can leverage a Power BI semantic model in your workflow. In this article, I'll explain how to connect to and use a Power BI dataset from Tableau Desktop.

Read on to see how. Also check out the notes in the drill-down sections, as there's a lot of content in there.
