July 2020 – Page 4 – Curated SQL

Incremental Loading in ETL

Published 2020-07-28 by Kevin Feasel

Tim Mitchell gives us the wherefore around incremental loads:

When moving data in an extraction, transformation, and loading (ETL) process, the most efficient design pattern is to touch only the data you must, copying just the data that was newly added or modified since the last load was run. This pattern of incremental loads usually presents the least amount of risk, takes less time to run, and preserves the historical accuracy of the data.
In this post, I’ll share what an incremental load is and why it is the ideal design for most ETL processes.

“Move less data rather than more data” is how I’d put it, but Tim does a much better job of putting it.

Comments closed

Demand Forecasting with knime

Published 2020-07-27 by Kevin Feasel

Shubham Goyal walks us through using knime for a product demand forecasting scenario:

In this blog, we are going to see, Importance of demand forecasting and how we can easily create these forecasting workflows with Knime.
Market request forecasting is a basic procedure for any business, however maybe none more so than those in buyer packaged products. Stock, production, storage, delivering, showcasing – each aspect of CPG and retail organizations’ activities are influenced by accurate forecasting. Identifying shoppers’ preferences and their likeness to buy, make these organizations settle on better choices with respect to product offerings, entering new markets and their supply chains, guarantee that stores/stocks are stocked, and limit the danger of stock shortages or overflow.

Click through for the process.

Comments closed

Exactly-Once Semantics in Kafka

Published 2020-07-27 by Kevin Feasel

Boyang Chen and Bob Barrett look at some changes to exactly-once semantics in Apache Kafka:

When using EOS, the producer and broker both have logic to determine whether it is safe for a producer to continue to send data without violating the exactly-once guarantees. Prior to Kafka 2.5, if either the producer or broker was ever not able to make this determination, the producer would enter a fatal error state. The only way to continue processing was to close the producer and create a new one. This process is generally very disruptive to client applications. For example, if a producer fails in Kafka Streams, then the associated task needs to be migrated, which causes a rebalance of the full workload. This results in throughput drop until the rebalance is complete.
To address this issue, KIP-360 added a mechanism for producers to automatically recover when they encounter these cases and continue processing. To better understand how it works, the following describes some of the situations that can cause fatal errors.

There have been several improvements to the process. Though to be honest, when I hear someone mention exactly-once in a distributed system, it sets off my spidey senses.

Comments closed

Enforcing Database Practices with Policy-Based Management

Published 2020-07-27 by Kevin Feasel

Nisarg Upadhyay hits on one of my favorite features Microsoft doesn’t care about anymore:

In this article, I am going to explain how we can enforce the SQL database best practices using Policy-Based Management. The policy-based management feature of SQL Server was introduced in SQL Server 2005. This feature was useful because it helps database administrators to define and enforce the database policies based on the organizations’ requirements.

Back when I was a DBA, I really appreciated Policy-Based Management. Unfortunately, although my job has changed a few times since then, Policy-Based Management hasn’t.

Comments closed

Calculating Cloud App Availability

Published 2020-07-27 by Kevin Feasel

Dave Bermingham gives you a way to calculate how available you should expect your application to be given SLAs:

When deploying business critical applications in the cloud you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.
When considering availability you must consider the key components that make access to your application possible, which I’ll call the availability chain. Component of the availability chain are:
– Compute
– Network
– Storage
– Application
– Dependent services
Your application is only as available as your weakest link, and your downtime increases exponentially with each additional link you add to the chain. Let’s examine each of the links.

Read on for a breakdown of these items.

Comments closed

Bug with Halloween Protection and the OUTPUT Clause

Published 2020-07-27 by Kevin Feasel

Paul White writes up a bug report:

Looking at the execution plan, it is hard to see how deleting a row (at the Clustered Index Delete) then inserting it again (at the Clustered Index Insert) could possibly result in a duplicate key in the index. Remember there is only one row, one column, and one index.
Logically, the only way this error can occur is if the Delete operator does not delete the row.

Read the whole thing. It’s probably not something you’ll ever come across yourself, hopefully.

Comments closed

Building a Windows Terminal Theme

Published 2020-07-27 by Kevin Feasel

Chrissy LeMaire has a nice-looking Windows Terminal theme:

I was sooo excited when Microsoft first made the announcement but was disappointed when I found out I’d have to run a specific version of Windows and compile the app myself. Whaat? No way, too much work. Now, it’s more widely available, so I decided to jump in and try it out. I love it and even miss Windows Terminal when I develop PowerShell on my Mac.
So here’s the Theme I’m contributing, which is based off of my favorite VS Code Theme, 1984 Unbolded. I call it Retrowave.

I went with essentially a black-and-grey theme for cmd + PowerShell and a bit more color (but still grey background) for WSL, but that’s so I can create screenshots easily without having to worry about color contrast on the printed page.

Comments closed

T-SQL Tuesday Roundup: Incident Reports

Published 2020-07-27 by Kevin Feasel

Kerry Tyler rounds up the usual suspects:

Hello T-SQL Tuesday Readers! I’m sorry for being really late in getting this post out this week.
So! A couple of weeks ago, for this month’s topic, I asked everyone to post about something that broke or went wrong, and what it took to fix it. Last week, fourteen of you responded with your stories of woe so we could all learn from your incidents and recoveries in a constructive way, like pilots do. Here’s the recap of those posts, in the order that they came in.

Read on for a summary of all of the entrants for this month.

Comments closed

Using Dynamic Datasets in Power BI

Published 2020-07-27 by Kevin Feasel

Jose Mendes implements the screening pattern in Power BI:

I recently came across the need to build a screening pattern based on the Kimball “Screening” concept. One of the desired outputs was a Power BI report that allowed a data steward to easily identify the failed screen, drill down to the detail and show the row(s) rejected by the data quality rule. To achieve this goal, I had to mimic in Power BI an SSRS functionality called dynamic dataset, which allowed me to, using a single matrix, show different source columns based on a selected screen. Feeling curious already? Then, let’s dive into the details.
A screen is designed to operate on a single input file or database table and contains the data quality condition to check. For example, there could be a Missing Customer Postcode screen which would test for any customers who are missing a postcode.

Read on for an implementation.

Comments closed

Optimizing a Poisson Survival Model

Published 2020-07-24 by Kevin Feasel

Joshua Entrop shows off optimx() in R to perform a survival analysis:

In this blog post, we will fit a Poisson regression model by maximising its likelihood function using optimx() in R. As an example we will use the lung cancer data set included in the {survival} package. The data set includes information on 228 lung cancer patients from the North Central Cancer Treatment Group (NCCTG). Specifically, we will estimate the survival of lung cancer patients by sex and age using a simple Poisson regression model. You can download the code that I will use throughout post here.

Read the whole thing. H/T R-bloggers

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Month: July 2020