Press "Enter" to skip to content

Month: October 2023

Time Series Stationarity Testing in R

Steven Sanderson isn’t just spinning in place:

Before we delve into the ts_adf_test() function, let’s understand the concept behind it. The Augmented Dickey-Fuller (ADF) test is a crucial tool in time series analysis. It’s like the Sherlock Holmes of time series data, helping us detect whether a series is stationary or not. Stationarity is a fundamental assumption in time series modeling because many models work best when applied to stationary data.

So, why “Augmented”? Well, it’s an extension of the original Dickey-Fuller test that accounts for more complex relationships within the time series data.

Click through to see how you can use the ts_adf_test() function to get a better feel for whether a time series is stationary.
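To get a rough feel for what the test reports, here is a minimal sketch using the tseries package’s adf.test() (a different function from the healthyR.ts one in the post, but the same underlying test) on simulated data where the answer is known:

```r
# A minimal sketch of ADF stationarity testing, using tseries::adf.test()
# rather than healthyR.ts::ts_adf_test(); the idea is the same: the null
# hypothesis is a unit root (non-stationary), so a small p-value is evidence
# that the series is stationary.
library(tseries)
set.seed(42)

rw    <- cumsum(rnorm(500))  # random walk: has a unit root, non-stationary
noise <- rnorm(500)          # white noise: stationary

adf.test(rw)     # expect a large p-value: cannot reject the unit-root null
adf.test(noise)  # expect a small p-value: consistent with stationarity
```

Differencing a non-stationary series (for example, diff(rw)) and re-running the test is the usual next step before fitting most time series models.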


Running Apache Flink Jobs from HDInsight

Sairam Yeturi builds a streaming job:

Have you already created your first Apache Flink® cluster and submitted your streaming job on it with HDInsight on AKS?

Well, if you are yet to do that, let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you already have the JAR file for that job.


A Primer on Boyce-Codd Normal Form

I have a new video:

In this video, we drill into one of the two most important normal forms, learning what Boyce-Codd Normal Form (BCNF) is, how you can get to BCNF, and a practical example of it. We also learn why I cast so much shade on 2nd and 3rd Normal Forms.

Boyce-Codd Normal Form is one of the two most important normal forms, and I’m pretty happy with the way this video came together to explain how you can get from 1NF into BCNF, as well as the specific benefits this provides.
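As a toy illustration of the decomposition step (my own hypothetical schema, not the example from the video): suppose each course is taught by exactly one instructor (instructor → course), but the candidate key of the enrollment table is (student, course). The determinant is not a superkey, so the table violates BCNF and gets split.

```r
# Hypothetical enrollment relation with a BCNF violation: instructor -> course,
# but instructor is not a superkey of the relation.
enrollment <- data.frame(
  student    = c("Ann", "Ann", "Bob", "Cara"),
  course     = c("Databases", "Statistics", "Databases", "Statistics"),
  instructor = c("Smith", "Jones", "Smith", "Jones")
)

# Decompose on the violating dependency, one relation per determinant.
teaches <- unique(enrollment[, c("instructor", "course")])   # instructor -> course
attends <- unique(enrollment[, c("student", "instructor")])  # who studies under whom

# The decomposition is lossless: joining the pieces recovers the original rows.
merge(attends, teaches, by = "instructor")
```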


Monitoring Power BI Gateways with Microsoft Fabric

Tom Martens builds a solution:

No matter what, when the on-premises gateways are not working as expected, data will not refresh and DirectQuery queries will not succeed. For this reason, I consider it a good idea to track the well-being of these valuable resources. This article describes a solution built with Microsoft Fabric. It’s not necessary to use Fabric, and it’s also not necessary to build a solution on your own. If you want to track the well-being of your on-premises data gateways but do not want to build something yourself, I recommend using the solution by Rui Romano, which you can find here: https://github.com/RuiRomano/pbigtwmonitor

I built this monitoring solution focusing on the well-being of the on-premises data gateway. I might extend this solution in the future, but for now, it’s about the availability of the on-premises data gateway and the data gateway connections. Availability and analysis will follow during the next weeks.

Click through for Tom’s solution.


Cloning Tables in Microsoft Fabric

Koen Verbeeck is interested:

A while ago I had a little blog post series about cool stuff in Snowflake. I’m starting up a similar series, but this time for Microsoft Fabric. I’m not going to cover the basics of Fabric; hundreds of bloggers have already done that. I’m going to cover little bits & pieces that I find interesting, that are similar to Snowflake features, or that are an improvement over “regular” SQL Server.

To kick off this series, I’m going to start with a feature that also exists in Snowflake: zero-copy cloning. The idea is that you create a copy of a table, but instead of actually copying the data, pointers are created behind the scenes that just point to the original data. This means creating a clone is a metadata-only operation and is thus very fast. If you make updates against your clone, they will be stored separately, so for all intents and purposes it seems you created a brand-new table. Except you didn’t.

Read on to see how this works and what its current limitations are compared to Snowflake.
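If you want an intuition for “metadata-only until you write,” R’s own copy-on-modify behavior is a decent analogy (just an analogy, not how Fabric implements it):

```r
# Analogy only: assignment creates a second reference, not a second copy of the
# data; the real copy happens lazily, on the first modification.
x <- runif(1e6)
tracemem(x)   # prints a message whenever R actually duplicates this object

y <- x        # the "clone": metadata-only, both names point at the same data
y[1] <- 0     # first write to the clone -> tracemem reports the copy being made
```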


Backups Are for DR, Not HA

Kevin Hill gives us a poignant reminder:

Please continue doing your backups!

Backups are Disaster Recovery, yes…but not HA.

Some will argue with this (in the comments most likely), but I broadly define “High Availability” as a system that can recover in seconds or minutes at most. Sometimes that is automatic, sometimes manual.

I agree that backups are for DR, not HA. I’d consider log shipping an option for both HA and DR, albeit one that requires manual failover (or rigging up a script that performs the failover for you).

I disagree about replication as an HA solution. Yes, you do need to make sure that everything can replicate, but if your publisher goes down, the subscriber can continue and your data is still available for use. And if you’re a complete masochist, you can use merge replication to allow writes to continue while the publisher is down. Cleaning up after that is a mess, especially if you end up with a bunch of conflicts, but High Availability doesn’t mean Easy Mode.


T-SQL Tuesday 167 Roundup

Matthew McGiffen talks encryption:

Huge thanks to everyone who responded to my invitation to blog on Encryption and Data Protection for this month’s T-SQL Tuesday.

I got what I hoped for, which was a wide(ish) interpretation of the topic. Posts from those who are clearly advocates of encryption, and from those who have scepticism around the encryption approaches that some people take.

We’ve also got discussions about the importance of basic data governance, and the protection of intellectual property.

Read on for links to several posts on the topic of security.


New R Package: hstats

Michael Mayer has a new package:

The current version offers:

  • H statistics per feature, feature pair, and feature triple
  • multivariate predictions at no additional cost
  • a convenient API
  • other important tools from explainable ML:
    • performance calculations
    • permutation importance (e.g., to select features for calculating H-statistics)
    • partial dependence plots (including grouping, multivariate, multivariable)
    • individual conditional expectations (ICE)
  • Case-weights are available for all methods, which is important, e.g., in insurance applications.

Click through for an example of how it works, followed by some simple benchmarking to give you an idea of how it performs compared to similar tools.
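As a rough sketch of what usage looks like (based on the package announcement; check the README for the authoritative API), you fit any model and hand hstats() the fitted object plus a feature matrix:

```r
# A minimal sketch, assuming the hstats(object, X) entry point described in the
# package announcement; the default prediction function is stats::predict().
library(hstats)

# Fit a model with an explicit interaction so the H-statistics are non-trivial.
fit <- lm(Sepal.Length ~ . + Petal.Width:Species, data = iris)

H <- hstats(fit, X = iris[, -1])  # per-feature and pairwise interaction strength
H
plot(H)                           # visualize the strongest interactions
```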


A Primer on Functional Programming

Anirban Shaw gives us the skinny:

In the ever-evolving landscape of software development, there exists a paradigm that has been gaining momentum and reshaping the way we approach coding challenges: functional programming.

In this article, we delve deep into the world of functional programming, exploring its advantages, core principles, origin, and reasons behind its growing traction.

I like this as an introduction to the topic, helping explain what functional programming languages are and why they’ve become much more interesting over the past 15-20 years. Anirban hits the topic of concurrency well, showing how a functional approach with immutable data makes it easy for multiple machines to work on separate parts of the problem independently and concurrently without error. I’d also add one more bit: functional programming languages tend to be more CPU-intensive than imperative languages, so in an era of strict computational scarcity, imperative languages dominated. With strides in computer processing, we are CPU-bound less often, so trading some CPU for the benefits of FP makes a lot more sense. H/T R-Bloggers.
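A small sketch of the concurrency point in R (my example, not Anirban’s): because the worker function is pure and its inputs are immutable, the chunks can be processed independently with no locking or shared mutable state.

```r
# Pure function + immutable inputs = embarrassingly parallel work.
library(parallel)

chunks <- split(1:1e6, cut(1:1e6, 4))            # four independent chunks of work

summarise_chunk <- function(idx) sum(sqrt(idx))  # pure: output depends only on input

# mclapply() forks workers on Unix-alikes; on Windows, plain lapply() gives the
# same (serial) result because nothing here relies on shared state.
partials <- mclapply(chunks, summarise_chunk, mc.cores = 2)
Reduce(`+`, partials)
```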


Plotting Time Series Growth Rates

Steven Sanderson builds a chart:

The ts_growth_rate_vec() function is part of the healthyR.ts library, designed to work with numeric vectors or time series data. It calculates the growth rate or log-differenced growth rate of the provided data, offering valuable insights into the underlying trends and patterns.

Read on to see how this function works, as well as several examples of plotting growth rates of airline data, which exhibits both strong cycles and an overall trend.
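For a feel of the underlying calculation (this is the plain-R arithmetic, not the healthyR.ts function itself), the growth rate and its log-difference counterpart look like this on the classic AirPassengers series:

```r
# Period-over-period growth rates of AirPassengers, computed by hand.
x <- as.numeric(AirPassengers)

growth     <- 100 * diff(x) / head(x, -1)  # simple growth rate, in percent
log_growth <- 100 * diff(log(x))           # log-differenced growth rate

plot(ts(growth, start = c(1949, 2), frequency = 12),
     ylab = "Growth rate (%)", main = "AirPassengers month-over-month growth")
lines(ts(log_growth, start = c(1949, 2), frequency = 12), col = "blue")
```

The log-differenced version approximates the simple growth rate when period-over-period changes are small, and it is often easier to model because it turns multiplicative trends into additive ones.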
