Press "Enter" to skip to content

Day: April 10, 2023

Reading Multi-Sheet Excel Files in R

Steven Sanderson does a bit of Excel file reading:

Reading in an Excel file with multiple sheets can be a daunting task, especially for users who are not familiar with the process. In this blog post, we will walk through a sample function that can be used to read in an Excel file with multiple sheets using the R programming language.

Click through for the process, which makes use of the lapply() function and the readxl package.

Comments closed

An Overview of the Kappa Architecture

Amian Patnaik provides an overview:

The Kappa Architecture, introduced by Jay Kreps, co-founder of Confluent, is designed to handle real-time data processing in a scalable and efficient manner. Unlike the traditional Lambda Architecture, which separates data processing into batch and stream processing, the Kappa Architecture promotes a single pipeline for both batch and stream processing, eliminating the need for maintaining separate processing pipelines.

What’s interesting to me is that Lambda, an architecture which was an explicit product of its time (in the sense that it was a compromise architecture trying to do two things, the combination of which limited hardware and tooling didn’t allow), is still thriving today. Kappa, meanwhile, isn’t an architectural style that people throw around a lot anymore, at least in the circles I run around in.

Comments closed

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.

Comments closed

Against Triggers in PostgreSQL

Laetitia Avrot is not a fan of triggers:

My opinion comes from years of practicing as a production DBA, then as a database consultant. As such a professional, my opinion is biased because I am never called when it works! I’ve always been called when there are problems (big problems, usually) so that I see the worst developers can do and never the best. I try to be aware of that bias, but it’s not that easy.

I am sympathetic to Laetitia’s argument but ultimately don’t agree, at least in the general case. Some of these thoughts and alternatives are Postgres-specific, so I don’t have an opinion on those.

Comments closed

Ordered Columnstore Indexes in SQL Server 2022

Ed Pollack gives us the scoop on ordered columnstore indexes:

One of the more challenging technical details of columnstore indexes that regularly gets attention is the need for data to be ordered to allow for segment elimination. In a non-clustered columnstore index, data order is automatically applied based on the order of the underlying rowstore data. In a clustered columnstore index, though, data order is not enforced by any SQL Server process. This leaves managing data order to us, which may or may not be an easy task.

To assist with this challenge, SQL Server 2022 has added the ability to specify an ORDER clause when creating or rebuilding an index. This feature allows data to be automatically sorted by SQL Server as part of those insert or rebuild processes. This article dives into this feature, exploring both its usage and its limitations.

I’ve seen a couple places where ordered columnstore indexes make enough sense to use, though not as many as I had first anticipated. That might change over time, as we see additional columnstore development.

Comments closed

A Love-Hate Relationship with Triggers

Ryan Booz shares some thoughts on triggers:

By design, plain ANSI SQL is declarative (“hey database, this is the data I want, you figure out how to do it”), not procedural (“Hey database, I want this data and I want you to retrieve it like this”). Early on, there wasn’t a standard design for how to add on additional procedural-like features, although that later came with the definition of SQL/PSM sometime in the mid-90s.

However, through the late 80s and most of the 90s, database vendors were trying to keep pace with very quickly changing requirements and needs in the database space. Even though triggers weren’t officially added until the SQL:99 standard, databases like Oracle had already released their own procedural languages and features. Triggers may have been deferred in the SQL-92 standard, but the Standards team couldn’t ignore them (or the complexity that triggers add to transactional consistency).

Click through for a bit more background, some of the pros and cons of triggers, and a few cases where triggers can make sense.

Comments closed

Listing Filtered Indexes

Tom Collins has a filter for filtered indexes:

Question: How can I find SQL Server Filtered Indexes  ?    We are woking on some migrations  from SQL Server to other engine platform and part of the process includes locating a range of database objects . The identified objects will be recoded into the target engine plaform 

Click through for a query which provides this answer. Tom is even kind enough to include the filter definition.

Comments closed

Allowing Multiple Users to Use the Same Data Source in a Power BI Dataset

Gilbert Quevauvilliers answers a question:

Recently I got a question on the Power BI Community forum which was “Is it possible to allow multiple users to use the same data source for their report”?

I thought it would be a great blog post so that other users could also benefit from this knowledge, as once you know it is quite a simple process to complete

I’m mildly surprised by the question itself, as one of the core reasons for having datasets be top-level entities in Power BI is to allow people to re-use them.

Comments closed