Press "Enter" to skip to content

Author: Kevin Feasel

Where Git Repositories Store File Versions

Julia Evans digs into a folder:

Hello! I was talking to a friend about how git works today, and we got onto the topic – where does git store your files? We know that it’s in your .git directory, but where exactly in there are all the versions of your old files?

For example, this blog is in a git repository, and it contains a file called content/post/2019-06-28-brag-doc.markdown. Where is that in my .git folder? And where are the old versions of that file? Let’s investigate by writing some very short Python programs.

Read on to learn how you can parse it all out. And this is also reason number 3 why you don’t want to commit a large file to Git: even if you delete that file later, the contents will live in the .git folder forever, or at least until you take some manual action to excise it from Git’s history.


Service Level Agreements (RPO and RTO) and SQL Server

David Klee wants to know how much downtime is acceptable to you:

Database professionals of the world – I have a question. Has your organization defined service level agreements (SLAs) for your data estate? I’m talking specifically the Recovery Point Objective (RPO) and Recovery Time Objective (RTO), and to have these defined not in an arbitrary number of nines, but in minutes or hours. If these aren’t defined from above, your business continuity plan is doomed to fail.

Read on to learn what RPO and RTO mean, how to think in terms of RPO and RTO, and some of David’s recommendations.
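
One way to make that conversation concrete is to measure how far your current backups actually leave you exposed. Here's a rough T-SQL sketch of my own (not from David's post); it assumes backup history lives in msdb on the instance you're checking and ignores availability groups or third-party backup tooling.

```sql
-- T-SQL sketch: minutes of potential data loss per database right now,
-- based on the most recent backup of any type recorded in msdb.
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_backup_finish,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE()) AS minutes_of_exposure
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
WHERE d.name <> 'tempdb'
GROUP BY d.name
ORDER BY minutes_of_exposure DESC;
```

If the worst number on that list is bigger than the RPO the business has agreed to, the plan and reality have already diverged.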


An Overview of Postgres Data Types

Arindam Mondal categorizes various Postgres data types:

This article will show PostgreSQL Data Types with various examples.

Data types are an important part of a database; they represent the values associated with it. Choosing the right data type for a table is one of the most important tasks because it determines the kind of data we want to store in a table. While creating a table you must specify a data type for each column. A column can store a specific type of data, like integer, string, Boolean, floating-point numbers, and so on. In this article, we are going to discuss PostgreSQL data types.

The list is quite similar to what’s available in SQL Server, though there are a few differences, such as built-in support for storing network addresses.
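
For a flavor of how those types look in practice, here's a minimal PostgreSQL sketch; the table and column names are made up, but each column uses one of the type families the article covers, including the network-address types that SQL Server lacks out of the box.

```sql
-- PostgreSQL sketch; table and column names are placeholders.
CREATE TABLE device_inventory (
    device_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    device_name  text        NOT NULL,
    is_active    boolean     NOT NULL DEFAULT true,
    cpu_load     double precision,
    unit_price   numeric(10, 2),
    ip_address   inet,          -- IPv4/IPv6 host or network address
    subnet       cidr,          -- network specification
    mac_address  macaddr,
    last_seen    timestamptz
);
```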


Row Level Security Anti-Patterns and Alternatives

Ben Johnston tells us why we might not want to use row level security in SQL Server:

One of the primary reasons to implement RLS is to facilitate reporting and ease the administrative burden. This section covers some considerations for using RLS with the primary Microsoft reporting engines and gives you an idea of things to look for in your reporting engine. Some anti-patterns and alternatives to RLS are also examined.

This article goes a long way toward explaining why I find row level security so rare in the wild and never implemented it myself: most databases I’ve worked with are either transactional or hybrid OLTP/OLAP, they’re mostly multi-tenant, and they’re accessed through service accounts. That’s just a no-go across the board.
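
For context, here's a minimal T-SQL sketch of what an RLS filter predicate looks like. The dbo.Orders table and its TenantId column are hypothetical; the SESSION_CONTEXT approach is the usual workaround for the service-account problem, but it only works if the application remembers to set the context on every connection, which is part of why it's so rarely a clean fit.

```sql
-- T-SQL sketch; dbo.Orders and TenantId are hypothetical.
-- With a single service account, USER_NAME() is useless as a predicate,
-- so the application stashes the tenant in SESSION_CONTEXT instead.
CREATE FUNCTION dbo.fn_TenantFilter (@TenantId int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS int);
GO

CREATE SECURITY POLICY dbo.TenantIsolationPolicy
    ADD FILTER PREDICATE dbo.fn_TenantFilter(TenantId) ON dbo.Orders
    WITH (STATE = ON);
GO

-- The application must run this on every connection before querying dbo.Orders.
EXEC sys.sp_set_session_context @key = N'TenantId', @value = 42;
```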


Reading Parquet Files with DuckDB and R

Michaël read a Parquet file:

Querying a remote parquet file via HTTP with DuckDB.

The French statistical service (INSEE) made its first parquet file available on data.gouv.fr in June.

It’s a 470 MB file (from a 1.8 GB CSV) with 16·10⁶ rows, showing for each address in France which polling station it belongs to.

Click through for the code and results. The only thing which surprised me at all was that the performance was so fast for a remote file, unless I’m misunderstanding something. For a local file, I’d expect 16 million rows to complete in under 2 seconds for heavy aggregation on two columns in Parquet. H/T R-Bloggers.
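
Michaël drives this from R via the duckdb package, but the query itself is plain DuckDB SQL; here's a rough sketch of the remote-HTTP pattern. The URL and column name below are placeholders rather than the actual INSEE file, and recent DuckDB versions can autoload httpfs, so the INSTALL/LOAD lines may be redundant.

```sql
-- DuckDB SQL sketch; URL and column name are placeholders.
INSTALL httpfs;
LOAD httpfs;

-- DuckDB reads Parquet over HTTP with range requests, so only the file metadata
-- and the column chunks the query touches are downloaded, which explains why a
-- remote 470 MB file can be aggregated quickly.
SELECT bureau_de_vote,
       count(*) AS addresses
FROM read_parquet('https://example.org/adresses-reu.parquet')
GROUP BY bureau_de_vote
ORDER BY addresses DESC
LIMIT 10;
```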


Multi-Plot Graphs in R

Steven Sanderson needs more than one line:

Data visualization is a crucial aspect of data analysis. In R, the flexibility and power of its plotting capabilities allow you to create compelling visualizations. One common scenario is the need to display multiple plots on the same graph. In this blog post, we’ll explore three different approaches to achieve this using the same dataset. We’ll use the set.seed(123) and generate data with x and y equal to cumsum(rnorm(25)) for consistency across examples.

Click through for three common techniques.


An Overview of Microsoft Fabric Domains

Reza Rad provides an overview:

Microsoft Fabric introduced a new concept called Domains. Domains are more than just a separation of Fabric data items. They come with a whole lot of security, administration, and governance features, which brings the concept of data mesh into the world of data analytics using Microsoft Fabric. Domains are logical categorizations inside the OneLake. In this article and video, I will explain what domains are in Microsoft Fabric, why they are important, and their associated features and configurations.

Click through for both a video on the topic and a lengthy article.


Heap-Only Tuples in Postgres

Umair Shahid explains the benefit of Heap-Only Tuples in PostgreSQL:

Heap-only tuples, also known as HOT, are PostgreSQL’s answer to the update query performance issues caused by MVCC. These tuples allow PostgreSQL to mark a row as “dead” and physically reuse the space it occupies in the table. This process eliminates the need to keep multiple versions of the same row, reducing I/O and improving query performance.

Read on to see how these compare to the normal MVCC process in Postgres, as well as cases when you should and should not use them.
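
As a quick illustration of the mechanics (my own sketch, with made-up table and column names): a HOT update can only happen when no indexed column changes and the page has room for the new row version, so the classic setup is a reduced fillfactor plus updates that touch only non-indexed columns.

```sql
-- PostgreSQL sketch; table and column names are placeholders.
CREATE TABLE account_balance (
    account_id bigint PRIMARY KEY,
    balance    numeric(12, 2) NOT NULL
) WITH (fillfactor = 80);   -- leave room on each page for new row versions

UPDATE account_balance
SET balance = balance + 10
WHERE account_id = 42;      -- balance is not indexed, so this can be a HOT update

-- Compare HOT updates against total updates for the table.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'account_balance';
```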


The Search for Extended Events Information

Grant Fritchey stays on the first page:

Here’s their paraphrased (probably badly) story:

“I was working with an organization just a few weeks back. They found that Trace was truncating the text on some queries they were trying to track. I asked them if they had tried using Extended Events. They responded: What’s that? After explaining it to them, they went away for an hour or so and came back to me saying that it had fixed the problem.”

We all smiled and chuckled. But then it struck me. This wasn’t a case of someone who simply had a lot more experience and understanding of Profiler/Trace, so they preferred to use it. They had literally never heard of Extended Events.

Why?

This led Grant to perform some search engine shenanigans and what he found was curious. A couple of points with search engines, though:

  • Search engine results will differ based on your location (IP address) and whether you are signed in or not. Google is particularly selective about this stuff. It might also affect Bing, but let’s face it: if you’re using Bing to search for anything other than images, you’ve already resigned yourself to failure.
  • In my case, a search for “extended events” (without quotation marks) did show quite a few pages which I’d consider reasonable for the topic: a Microsoft Learn quickstart article on using extended events, Brent Ozar’s extended events material, a SQL Shack article on the topic, etc. A good number of these links are content from the past 5 years, as well.
  • Grant mentions the “page 1” effect in search engines, and he’s absolutely right. The vast majority of people performing a search never leave the first page of results. This is part of why Google went to an infinite scrolling approach rather than showing explicit numbered pages.
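
For anyone in the same spot as the team in Grant's anecdote, here's a minimal sketch of an Extended Events session that captures completed batches and RPC calls with their full text. The session name and file path are placeholders, and you'd almost certainly want a predicate (duration, database, application) before running this on a busy server.

```sql
-- T-SQL sketch; session name and file path are placeholders.
CREATE EVENT SESSION [CaptureFullQueryText] ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.database_name, sqlserver.client_app_name)),
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.database_name, sqlserver.client_app_name))
ADD TARGET package0.event_file
    (SET filename = N'C:\Temp\CaptureFullQueryText.xel', max_file_size = (50))
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = OFF);
GO

ALTER EVENT SESSION [CaptureFullQueryText] ON SERVER STATE = START;
```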

CI/CD for Synapse Serverless SQL Pool with SqlPackage and Azure DevOps

Rui Cunha has a tutorial for us:

Azure Synapse Analytics Serverless SQL is a query service mostly used over the data in your data lake, for data discovery, transformation, and exploration purposes. It is, therefore, normal to find in a Synapse Serverless SQL pool many objects referencing external locations, using disparate external data sources, authentication mechanisms, file formats, etc. In the context of CI/CD, where automated processes are responsible for propagating the database code across environments, one can take advantage of database-oriented tools like SSDT and the SqlPackage CLI, ensuring that this code conforms to the targeted resources.

In this article I will demonstrate how you can take advantage of these tools when implementing CI/CD for the Azure Synapse Serverless SQL engine. We will leverage SQL projects in SSDT to define our objects and implement deploy-time variables (SQLCMD variables). Through CI/CD pipelines, we will build the SQL project to a dacpac artifact, which enables us to deploy the database objects one or many times with automation.

Click through for the demonstration.
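
As a sketch of the deploy-time variable pattern the article describes (the object, container URL, and variable names here are placeholders): an object definition in the SQL project can reference a SQLCMD variable, and the pipeline supplies its per-environment value when SqlPackage publishes the dacpac.

```sql
-- Sketch only; data source name, URL, and variable name are placeholders.
-- $(DataLakeLocation) is a SQLCMD variable defined in the SQL project and
-- supplied per environment at publish time, e.g. something like:
--   SqlPackage /Action:Publish /SourceFile:Warehouse.dacpac /v:DataLakeLocation=https://mystoragedev.dfs.core.windows.net/raw ...
CREATE EXTERNAL DATA SOURCE [DataLake]
WITH (
    LOCATION = '$(DataLakeLocation)'
);
GO

CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (FORMAT_TYPE = PARQUET);
GO
```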
