
Curated SQL Posts

Quick Takes on Logistic Regression

John Cook talks about my favorite form of regression that serves to solve classification problems:

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

Read on for a few more thoughts on and around logistic regression and logits from a mathematician.
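To make the "more information than a classifier" point concrete, here is a minimal pure-Python sketch. The coefficients and feature values are made up for illustration and do not come from John's post; the idea is only that a logistic model produces a probability first, and a yes/no label is a separate thresholding step applied afterward.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value (a logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of sigmoid: map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def predict_probability(features, weights, intercept):
    """Logistic regression prediction: sigmoid of a linear combination."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Collapse a probability into a yes/no answer, discarding information."""
    return probability >= threshold


# Hypothetical fitted coefficients, chosen only for illustration.
weights = [0.8, -1.2]
intercept = -0.5

p = predict_probability([2.0, 0.5], weights, intercept)
print(f"probability of 'yes': {p:.3f}")
print(f"classified as 'yes': {classify(p)}")
```

Two observations at 0.51 and 0.99 both classify as "yes" under a 0.5 threshold, but the probabilities tell you how differently you might want to treat them.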


Arbitrary Intervals for Partitioning in Postgres

Keith Fiske does a bit of interval math:

Whether you are managing a large table or setting up automatic archiving, time based partitioning in Postgres is incredibly powerful. pg_partman’s newest versions support a huge variety of custom time intervals. Marco just published a post on using pg_partman with our new database product for doing analytics with Postgres, Crunchy Bridge for Analytics. So I thought this would be a great time to review the basic and complex options for the time based partitioning.

Read on for notes on how pg_partman works and how it manages intervals, especially for versions earlier than 5.0.
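As a rough illustration of the interval math involved, the sketch below computes half-open [lower, upper) ranges for an arbitrary fixed-length interval and finds which range a timestamp falls into. This is not pg_partman's implementation or API; the function names, the nine-hour interval, and the start date are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def partition_bounds(start, interval, count):
    """Return `count` consecutive half-open ranges [lower, upper)."""
    bounds = []
    lower = start
    for _ in range(count):
        upper = lower + interval
        bounds.append((lower, upper))
        lower = upper  # the next range starts where this one ends
    return bounds


def find_partition(bounds, ts):
    """Return the range containing `ts`, or None if no range covers it."""
    for lower, upper in bounds:
        if lower <= ts < upper:
            return (lower, upper)
    return None


# Hypothetical example: three partitions of nine hours each.
bounds = partition_bounds(datetime(2024, 1, 1), timedelta(hours=9), 3)
for lower, upper in bounds:
    print(f"FROM {lower.isoformat()} TO {upper.isoformat()}")
```

The upper bound is exclusive, so a timestamp that lands exactly on a boundary belongs to the later range.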


Indexing for Read-Scale Databases

Jose Manuel Jurado Diaz shares a customer case:

Today, I worked on a service request in which our customer has a Business Critical database with 4 vCores and Read-Scale Out enabled. Our customer noticed several performance issues using the Read-Scale Out database, and I would like to explain several lessons learned during the troubleshooting steps.

Click through for notes on troubleshooting and improving performance.


Dangling Images with Oracle 23ai Free Edition

Kellyn Gorman runs into an issue:

When I tried to connect via SQLPlus as SYSDBA, I received an EXTPROC error. It pointed clearly to the listener.ora file, where I discovered a path for the extproc still pointing to ora23c. I corrected it and started the Listener, but to no avail: I got an ORA-12547 error and realized I had a make file issue on the binaries for Oracle.

I contacted Gerald Venzl, who was very gracious and after some quick research, he came back that his folks said everything was fine with the images, so I thanked him and dug into the issue deeper. I quickly discovered this problem could happen to others, so decided I better document here for anyone who does happen upon it.

Click through for the high-level explanation and a bit more detail on dangling images.


Multi-Master Architecture in PostgreSQL

Semab Tariq describes a scale-out technique for Postgres:

Multi-master architecture has gained significant traction in the world of database management, offering a solution to traditional limitations in scalability, fault tolerance, and high availability. By allowing multiple nodes to operate as master, this architecture promises a more flexible and robust database system. However, along with these benefits come certain challenges, including data consistency, resource demands, and conflict resolution.

In this blog, we will explore what multi-master architecture is, delve into its key advantages, and discuss the potential drawbacks that come with its implementation. Also in our upcoming blogs, we will see how you can set up your first multi-master architecture with a tool called PGD (Postgres Distributed) by EnterpriseDB (EDB).

Read on to learn how it works, as well as some of the pros and cons of using it.


Counting NA Values in R

Steven Sanderson counts what doesn’t exist:

Welcome back, R enthusiasts! Today, we’re going to explore a fundamental task in data analysis: counting the number of missing (NA) values in each column of a dataset. This might seem straightforward, but there are different ways to achieve this using different packages and methods in R.

Let’s dive right in and compare how to accomplish this task using base R, dplyr, and data.table. Each method has its own strengths and can cater to different preferences and data handling scenarios.

Read on for 3 1/2 separate methods.


Classification with Random Forest

I have a new video:

In this video, I cover a powerful ensemble method for classification: random forests. We get an idea of how this differs from CART, learn the best possible metaphor for random forests, and dig into random search for hyperparameter optimization.

Click through to see the video in all its glory.


Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
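For a rough idea of the shape of the solution, here is a minimal sketch using PySpark's `withColumn` and `current_timestamp`. It is not necessarily the approach Gilbert takes; the helper name `add_load_timestamp` and the column name `LoadDateTime` are assumptions made for this example.

```python
def add_load_timestamp(df, column_name="LoadDateTime"):
    """Return a new data frame with a column holding the current timestamp.

    `df` is expected to be a pyspark.sql.DataFrame. The import sits inside
    the function so the module can be loaded where PySpark is not installed.
    """
    from pyspark.sql import functions as F

    # current_timestamp() is evaluated once per query, so every row in a
    # single write receives the same value.
    return df.withColumn(column_name, F.current_timestamp())


# Usage in a notebook where a `spark` session already exists (assumed):
#
#   df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
#   stamped = add_load_timestamp(df)
#   stamped.show(truncate=False)
```

Because data frames are immutable, the helper returns a new data frame rather than modifying the one passed in.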


Building Workers in Azure Data Factory

Martin Schoombee continues a series on orchestration in Azure Data Factory:

We’re finally ready to dive into the Data Factory components that form part of the framework, and we’re going to work our way from the bottom up. To paraphrase the previous blog post, worker pipelines perform the actual work of either moving data (from source to staging) or executing a stored procedure that will load a dimension/fact table.

Although worker pipelines can contain any number of tasks you may need, my worker pipelines that move data from a source system into the staging area follow a similar pattern with at least the following activities:

Click through for that list, as well as more information.
