Press "Enter" to skip to content

Day: July 15, 2021

The Benefits of Cluster Sampling

Muhammad Touhidul Islam explains what cluster sampling is and why it can be useful:

Cluster sampling is defined as a sampling method where multiple clusters of people are created from a population, such that each cluster reflects homogeneous characteristics and has an equal chance of being a part of the sample. In this sampling method, a simple random sample is created from the different clusters in the population. This is a probability sampling procedure.

Click through for a few examples of where this can be useful.
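If you want to see the mechanics, here is a minimal sketch of one-stage cluster sampling in Python; the population, the block-based clusters, and the cluster count are all made up for illustration and aren't from Muhammad's post.

```python
import random

def cluster_sample(population, key, num_clusters, seed=None):
    """One-stage cluster sampling: group the population by `key`,
    draw a simple random sample of clusters, and keep every member
    of the selected clusters."""
    rng = random.Random(seed)
    clusters = {}
    for person in population:
        clusters.setdefault(key(person), []).append(person)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [person for c in chosen for person in clusters[c]]

# Hypothetical population: 1,000 people spread across 20 city blocks.
population = [{"id": i, "block": i % 20} for i in range(1000)]
sample = cluster_sample(population, key=lambda p: p["block"], num_clusters=4, seed=42)
print(len(sample), "people drawn from 4 of 20 blocks")
```

The appeal is operational: you only need to visit the selected clusters rather than sample individuals scattered across the whole population.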


Using Kafka for Security Situational Awareness

Kai Waehner continues a series on using Apache Kafka for security teams:

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Click through for the high-level discussion.
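To make the "correlate and act on events" idea a bit more concrete, here is a toy sketch using the kafka-python client; the topic name, broker address, and the five-failures threshold rule are my own assumptions for illustration, not part of Kai's architecture.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; in a real deployment these come from config.
consumer = KafkaConsumer(
    "auth-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

failed_logins = {}  # user -> count of recent failures (toy state, not production-grade)

for message in consumer:
    event = message.value
    if event.get("type") == "login_failed":
        user = event.get("user", "unknown")
        failed_logins[user] = failed_logins.get(user, 0) + 1
        # Naive correlation rule: flag a user after five failed logins.
        if failed_logins[user] >= 5:
            print(f"ALERT: possible brute-force attempt against {user}")
            failed_logins[user] = 0
```

Real situational awareness pipelines would use stream processing (Kafka Streams, ksqlDB, etc.) rather than a single consumer loop, but the shape is the same: events in, correlation logic, alerts out.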


High-Performance ETL via Buffer Table

Daniel Hutmacher needs things to zoom:

It’s almost like a myth – one that I’ve heard people talk about, but never actually seen myself. The “shock absorber” is a pretty clever data flow design pattern to ingest data where a regular ETL process would choke on the throughput or spikes. The idea is to use a buffer table to capture incoming data, and then run an asynchronous process that loads that data in batches from the buffer into its intended target table.

While I’ve seen whitepapers and blog posts mention the concept loosely along with claims of “7x or 10x performance”, none of them go into technical detail on how it’s done, so I decided to try my hand at it.

I’ve compiled my findings, along with some pre-baked framework code if you want to try building something yourself. Professional driver on closed roads. It’s gonna get pretty technical.

Combine that with Eitan Blumin’s post yesterday and you’d think it were buffer week.

This shock absorber pattern works well for warehouse loading, especially when you’re trickle-loading data into columnstore indexes and don’t want to have open rowgroups slowing everything down.
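To give a feel for the pattern (and not as a substitute for Daniel's framework code, which targets SQL Server), here is a toy Python/SQLite sketch of the shock absorber: cheap single-row inserts into a buffer table on the hot path, with a separate batch mover draining it into the target. Table names and the batch size are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE buffer (id INTEGER PRIMARY KEY, payload TEXT, moved INTEGER DEFAULT 0);
    CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT);
""")

def ingest(payload):
    """Hot path: a single cheap insert into the buffer, no transformation."""
    conn.execute("INSERT INTO buffer (payload) VALUES (?)", (payload,))
    conn.commit()

def drain(batch_size=1000):
    """Background path: move one batch from buffer to target, then mark it as moved."""
    rows = conn.execute(
        "SELECT id, payload FROM buffer WHERE moved = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if rows:
        conn.executemany("INSERT INTO target (id, payload) VALUES (?, ?)", rows)
        conn.executemany("UPDATE buffer SET moved = 1 WHERE id = ?", [(r[0],) for r in rows])
        conn.commit()
    return len(rows)

for i in range(2500):
    ingest(f"event {i}")
while drain():
    pass
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0], "rows moved")
```

The key property is that the ingest path never waits on the expensive work; the batch mover runs asynchronously and can size its batches to suit the target (for example, large enough to create compressed columnstore rowgroups).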


The Benefits of a Cache Layer

Monica Rathbun expounds on the utility of Azure Cache:

One of the biggest impacts on resource consumption for Azure SQL DB is repeated data pulls by the application layer. No matter how fast those queries execute, calling the same procedure or issuing the same SQL statements hundreds, thousands, or millions of times a day can wreak havoc on database performance. Death by a thousand cuts can easily bring a system to its knees. Sometimes it’s hard for DBAs to troubleshoot these actively as the execution of the statements happens so quickly they don’t even show in tools like sp_whoisactive. It’s not until you begin to dive into things like Query Performance Insights or Query Store that you start to see the real issue.

Check it out. And if you want to get into implementation, I’ve found the Cache-Aside design pattern to be useful. Bowen Li has a rundown of several caching patterns as well.
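For a sense of what cache-aside looks like, here is a minimal Python sketch; the dict stands in for the cache tier (Azure Cache for Redis in a real deployment), and get_customer_from_db is a hypothetical placeholder for the expensive database call.

```python
import time

cache = {}
TTL_SECONDS = 300  # how long a cached entry is considered fresh

def get_customer_from_db(customer_id):
    # Imagine a stored procedure call against Azure SQL DB here; this stub just fabricates a row.
    return {"id": customer_id, "name": f"Customer {customer_id}"}

def get_customer(customer_id):
    entry = cache.get(customer_id)
    if entry and time.monotonic() - entry["at"] < TTL_SECONDS:
        return entry["value"]                      # cache hit: no database round trip
    value = get_customer_from_db(customer_id)      # cache miss: read through to the database
    cache[customer_id] = {"value": value, "at": time.monotonic()}
    return value

print(get_customer(42))   # first call hits the database
print(get_customer(42))   # second call is served from the cache
```

The point of the pattern is that those hundreds of thousands of identical reads get absorbed by the cache tier, and the database only sees traffic when an entry is missing or stale.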


Renaming a YAML Pipeline in Azure DevOps

Hamish Watson figures out what’s in a name:

I had created a pipeline using YAML – which was called InfrastructureAsCode, as the YAML file was in the root directory.

However, I wanted to move it into a folder .\InfrastructureAsCode\pipelines\… and run the YAML file from there – as I would have a non-prod and a PROD version of it (the schedule was different for each).

Click through to see how Hamish was able to resolve this.


SQL Server Baselines with the TIG Stack

Mark Wilkinson combines Telegraf, InfluxDB, and Grafana:

Lots of folks wonder why I would go through the trouble of building out a system when so many vendors have already solved the problem of collecting baseline metrics. The answer at the time was simple: cost. With my setup I could monitor close to 600 instances (including dev) for $3,000 USD per year. That includes data retention of ~2 years! Are there some administration costs as far as my time is concerned? Of course. In the beginning, things were a little rough as I learned more about InfluxDB, but once things were configured correctly the most work I’ve had to do is to expand the size of the data drive as we started collecting more metrics.

Click through for more info and check out the GitHub repo.
