Day: June 1, 2023

Listing Topics in Kafka without Zookeeper

The Big Data in Real World team has a quick one for us:

Kafka uses Zookeeper to manage its internal state, so it is not possible to run Kafka without Zookeeper. Even if you don’t have access to Zookeeper in your organization, there is a Zookeeper cluster running which your Kafka cluster connects to.

So how do we list topics and execute other commands if we don’t have access to Zookeeper?

Eventually, this won’t even be a question, as Kafka already has production versions using KRaft, and by Kafka 4.0, there won’t be a Zookeeper to kick around anymore.
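
For the question above, one common way to list topics without touching Zookeeper is to point the `kafka-topics.sh` tool at a broker with `--bootstrap-server`, which recent Kafka versions support. The sketch below is only an illustration and is not taken from the linked post: the broker address `localhost:9092`, the assumption that `kafka-topics.sh` is on the `PATH`, and the helper names `build_list_topics_command` and `list_topics` are all hypothetical.

```python
import subprocess
from typing import List


def build_list_topics_command(
    bootstrap_server: str, script: str = "kafka-topics.sh"
) -> List[str]:
    """Build the argv that lists topics through a broker rather than Zookeeper."""
    # --bootstrap-server takes a broker host:port; no Zookeeper address is needed.
    return [script, "--bootstrap-server", bootstrap_server, "--list"]


def list_topics(bootstrap_server: str, script: str = "kafka-topics.sh") -> List[str]:
    """Run the command and return one topic name per non-empty output line.

    Assumes the Kafka CLI tools are installed and the broker is reachable.
    """
    result = subprocess.run(
        build_list_topics_command(bootstrap_server, script),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Only print the command here; calling list_topics() needs a live cluster.
    print(" ".join(build_list_topics_command("localhost:9092")))
```

Calling `list_topics("localhost:9092")` against a reachable cluster should return the topic names as a list of strings. Other admin commands can follow the same pattern by swapping `--list` for a different action.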

Data Inconsistency in Postgres HA Clusters

Umair Shahid gives us an overview:

While PostgreSQL is known for its robustness, scalability, and reliability, data inconsistency can occur in PostgreSQL clusters, which can cause issues and impact the overall performance of the system. In this blog, we’ll define data inconsistency in PostgreSQL clusters, discuss the challenges it poses and its causes, and provide some tips on how to prevent it and resolve it if it occurs.

Click through for the article.

Building a Data Warehouse in Microsoft Fabric

Reza Rad continues a video series on Microsoft Fabric:

Microsoft Fabric Data Warehouse is a database system that stores data in OneLake and provides a medium to interact with the database using SQL commands. The Fabric Data Warehouse, which is also called Data Warehouse, or in short, Warehouse, also provides a powerful computing engine behind the scenes to account for large volumes of data and support a fast-performing database system. The term Data Warehouse comes from the fact that this is not usually a place to store transactional data for an operational system (for that, you can use Azure SQL Database). A Data Warehouse, in generic Business Intelligence terminology, is a place where you would store the data that needs to be analyzed.

Reza also explains how the warehouse differs from a lakehouse.

Microsoft Fabric and Process Unification

Paul Andrew gets to the heart of things:

Moving on and assuming you have seen the event sessions, I want to give you my point of view to help explain what Microsoft Fabric is. Firstly, let’s clear up some terminology to support this understanding. Is this software offering a resource, service, platform, or solution? To answer this question, perspective is key, perspective with a timeline (2018 to 2023). We could simply say that Microsoft Fabric is all these things. All things to all data professionals and beyond. But, to understand this, let’s consider the journey Microsoft has been on and how this technology has evolved. I believe this journey is the best way to help explain what Microsoft Fabric is, rather than focusing on all the new and shiny bits.

Click through for Paul’s take on the matter and how this whole area of “modern data warehousing” has evolved over the past several years in Azure.

Cosmos DB Serverless Scaling to 1TB

Hasan Savran shares the news:

Azure Cosmos DB’s Serverless option is a great way to save money if your application expects intermittent and unpredictable traffic with long idle times. I use serverless in developing, prototyping, and integrating with computing services such as Azure Functions.

The limitations of Azure Cosmos DB serverless were a show-stopper if your solution needed scalability or large storage. Cosmos DB announced at Build 2023 that many of the limitations of the serverless option have been lifted.

Read on for the gist of these updates.

MVCC and Vacuuming in Postgres

Ryan Booz explains one area where Postgres’s implementation differs from most other vendors:

All relational databases handle transaction isolation in some way, typically with an implementation of Multi-version Concurrency Control (MVCC). Plain ol’, mainline SQL Server uses a form of MVCC, but all older rows (currently retained for ongoing transactions) are stored in TempDB. Oracle and MySQL also do something similar, storing (essentially) diffs of the modified data outside of the table, which are merged at runtime for ongoing transactions that still need to see the older data.

Among these databases, PostgreSQL stands alone in the specific way MVCC is implemented. Rather than storing some form of the older data outside of the current table for transactions to query/merge/etc. at runtime, PostgreSQL always creates the newly modified row in-table alongside the existing, older versions that are still needed for running transactions. Yes, every UPDATE creates a new row of data in the table, even if you just change one column.

Read on to understand some of the implications of this and how it affects the way we manage databases.
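
To make the in-table versioning idea concrete, here is a toy model of a table in which every UPDATE appends a new row version and a vacuum pass later removes versions that no snapshot can still see. It is a sketch under heavy assumptions, not PostgreSQL's actual algorithm: `ToyTable` and `RowVersion` are hypothetical names, every transaction is treated as committing immediately, and the `xmin`/`xmax` fields are named after PostgreSQL's system columns while following greatly simplified visibility rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RowVersion:
    key: int
    data: Dict[str, object]
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that superseded it, if any


class ToyTable:
    """Append-only row versions; assumes every transaction commits at once."""

    def __init__(self) -> None:
        self.versions: List[RowVersion] = []
        self.next_txid = 1

    def _new_txid(self) -> int:
        txid = self.next_txid
        self.next_txid += 1
        return txid

    def insert(self, key: int, data: Dict[str, object]) -> int:
        txid = self._new_txid()
        self.versions.append(RowVersion(key, dict(data), xmin=txid))
        return txid

    def update(self, key: int, changes: Dict[str, object]) -> int:
        txid = self._new_txid()
        current = next(v for v in self.versions if v.key == key and v.xmax is None)
        current.xmax = txid  # the old version stays in the table, marked superseded
        # A whole new version is appended, even if only one column changed.
        self.versions.append(RowVersion(key, {**current.data, **changes}, xmin=txid))
        return txid

    def read(self, key: int, snapshot_txid: int) -> Optional[Dict[str, object]]:
        """Return the row as seen by a snapshot taken just after snapshot_txid."""
        for v in self.versions:
            created = v.xmin <= snapshot_txid
            still_live = v.xmax is None or v.xmax > snapshot_txid
            if v.key == key and created and still_live:
                return v.data
        return None

    def vacuum(self, oldest_snapshot_txid: int) -> int:
        """Drop versions superseded at or before the oldest snapshot still in use."""
        before = len(self.versions)
        self.versions = [
            v for v in self.versions
            if v.xmax is None or v.xmax > oldest_snapshot_txid
        ]
        return before - len(self.versions)


if __name__ == "__main__":
    table = ToyTable()
    table.insert(1, {"name": "widget", "qty": 1})
    table.update(1, {"qty": 2})
    print(len(table.versions), "versions stored before vacuum")
    print(table.vacuum(oldest_snapshot_txid=2), "dead version(s) removed")
```

In this model, a reader holding an old snapshot keeps seeing the old version until it finishes, which is why dead versions pile up until a vacuum pass can reclaim them.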