Press "Enter" to skip to content

Month: September 2024

From Pandas to Polars

Ari Lamstein explains why it might be worth a switch:

I recently decided to switch from Pandas to Polars for my Python projects that use dataframes. I came to this decision while taking a workshop on Polars last week: I found its syntax to be so intuitive that I couldn’t justify continuing to try to get “better” at Pandas, despite Pandas being the more established library. The fact that Polars is faster (it’s main selling point) was, surprisingly, not a factor in my decision.

A similar transformation recently happened in R. For most of the history of R there was only one way to interact with dataframes: Base R. Then the Tidyverse came along, and offered both performance improvements and easier syntax. Eventually the Tidyverse became the primary way that many people interact with dataframes. I believe that the Tidyverse’s easier syntax is what led to its widespread adoption, and I think that something similar is likely to happen with Polars.

Click through for Ari’s thoughts on the matter. H/T R-Bloggers.

Comments closed

Indexing Vector Databases

Brendan Tierney continues a series on vector databases:

In this post on Vector Databases, I’ll explore some of the commonly used indexing techniques available in Databases. I’ll also explore the Vector Indexes available in Oracle 23c. Be sure to check that section towards the end of the post, where I’ll also include links to other articles in this series.

As with most data in a Databases, indexes are used for fast access to data. They create an organised structure (typically B+ tree) for storing the location of certain values within a table. When searching for data, if an index exists on that data, the index will be used for matching and the location of the records is used to quickly retrieve the data.

Read on to get an idea of what kinds of indexing techniques are useful in that space.

Comments closed

Data Compression and CPU Utilization

Kendra Little shares some advice:

Every time I share a recommendation to use data compression in SQL Server to reduce physical IO and keep frequently accessed data pages in memory, I hear the same concern from multiple people: won’t this increase CPU usage for inserts, updates, and deletes?

DBAs have been trained to ask this question by many trainings and a lot of online content – I used to mention this as a tradeoff to think about, myself– but I’ve found this is simply the wrong question to ask.

In this post I’ll share the two questions that are valuable to ask for your workload.

Kendra’s advice is very good, and to add my own two cents to the mix: the last place I was at did, in fact, see a pretty reasonable reduction in CPU utilization by performing page-level compression on any index where it made sense—and this was a very busy OLTP environment. The exceptions would be indexes making prominent use of things like Guids or chunks of binary, which don’t compress very well at all. In all my FTE and consulting years, I’ve never run into a circumstance in which compression caused a significant gain in CPU utilization.

Comments closed

Finding Basic Table Information via T-SQL

Andy Brownsword has a script for us:

In Management Studio we can view object details by hitting F7 in Object Explorer. It gives us basic metrics but I find it very slow to load for the details I typically need.

For that reason I though I’d share a script to turn to for metrics I commonly need. This query returns:

  • The table details (schema, name, created date)
  • The primary storage (Heap, Clustered, or Columnstore)
  • The numbmer of Nonclustered / Columnstore Indexes
  • The number of records and rough size for data / indexes

Click through for the script and an example of what it looks like.

Comments closed

SQL Server Msg 3023 during DBCC SHRINKFILE

Tom Collins gets an error:

Question : Executing the following  Database shrinkfile activity and getting the error message

use myDatabase
DBCC SHRINKFILE (N’myDatabase_log’ , 0, TRUNCATEONLY)

Msg 3023, Level 16, State 2, Line 4
Backup, file manipulation operations (such as ALTER DATABASE ADD FILE) and encryption changes on a database must be serialized. Reissue the statement after the current backup or file manipulation operation is completed.

Read on for the answer.

Comments closed

Shrinking Large Tables in Postgres

Andrew Atkinson shrinks a table:

In this post, you’ll learn a recipe that you can use to “shrink” a large table. This is a good fit when only a portion of the data is accessed, the big table has become unwieldy, and you don’t want a heavier solution like table partitioning.

This recipe has been used on tables with billions of rows, and without taking Postgres offline. How does it work?

Click through to find out.

Comments closed

Real-Time Streaming in Azure

Temidayo Omoniyi takes us through an architecture:

In today’s world, billions of data are generated daily from messaging applications like WhatsApp, financial data like the New York Stock Exchange, or video streaming platforms like YouTube. As a data engineer or solution architect, you are tasked to design a real-time streaming platform that captures the data as they are generated and stored in the necessary storage for decision-making.

This does a great job of going into detail, not only at the architectural level, but also setup and practical implementation.

Comments closed

Vacuum, MERGE, and ON CONFLICT Directives

Shane Borden looks back at a directive:

I previously blogged about ensuring that the “ON CONFLICT” directive is used in order to avoid vacuum from having to do additional work. You can read the original blog here: Reduce Vacuum by Using “ON CONFLICT” Directive

Now that Postgres has incorporated the “MERGE” functionality into Postgres 15 and above, I wanted to ensure that there was no “strange” behavior as it relates to vacuum when using merge.

Read on to see how things look now.

Comments closed

Querying a Fabric SQL Endpoint via Notebook and T-SQL

Sandeep Pawar talks about a Spark connector:

I am not sharing anything new. The spark data warehouse connector has been available for a couple months now. It had some bugs, but it seems to be stable now. This connector allows you to query the lakehouse or warehouse endpoint in the Fabric notebook using spark. You can read the documentation for details but below is a quick pattern that you may find handy.

Despite it not being anything new, it is still interesting to see the use case of writing T-SQL instead of Spark SQL.

Comments closed