Press "Enter" to skip to content

Month: June 2023

A Primer on Regular Expressions

Steven Sanderson provides a quick guide to regular expressions:

Regular expressions, often abbreviated as regex, are powerful tools used in programming to match and manipulate text patterns. While they might seem intimidating at first, regular expressions are incredibly useful for tasks like data validation, text parsing, and pattern matching. In this blog post, we’ll explore regular expressions in the context of R programming, breaking down the concepts step by step and providing practical examples along the way. By the end, you’ll have a solid understanding of regular expressions and be ready to apply them to your own projects.

This is an extremely powerful language which can take years (decades?) to master, especially considering that there are several regular expression syntaxes and they don’t all behave the same way. But still, I’ve found that the more familiar you are with regular expressions, the simpler certain classes of problem become.

Comments closed

PyPI and Malicious Code

Steven Vaughan-Nichols gives us the story:

The Python Package Index (PyPI), is the most popular Python programming language software repository. It’s also a mess. Earlier this year, the FortiGuard team discovered zero-day malware in three PyPI packages called “colorslib,” “httpslib,” and “libhttps.”  Before that, 2022 closed with  PyTorch-nightly on Linux being poisoned with a fake dependency. More recently, PyPI had to stop new user registrations and project creations because of a flood of malicious users. PyPI isn’t the only one to notice the user trouble. The Python Software Foundation (PSF) received three subpoenas for PyPI user data. What is going on here!?

Read on to learn more about what’s happening with the most popular Python repository.

Comments closed

SQL Anti-Patterns Extended Event in SQL Server 2022

Dennes Torres finds some anti-patterns:

One of the new Extended Event available in SQL Server 2022 is the query_antipattern. This extended event allows to identify anti-patterns on the SQL queries sent to the server.  An anti-pattern in this case is some code that the SQL Server optimizer can’t do a great job optimizing the code (but cannot correct the issue automatically).

This is a very interesting possibility: Including this event in a session allow us to identify potential problems in applications. We can do this in development environments to the the problems earlier in the SDLC (Software Development Life Cycle).  Let’s replicate some examples and check how this works.

Dennes shows two examples and notes that there are five total listed in the Extended Event, but that the documentation is a bit lacking to explain their intent.

Comments closed

Showing KQL Queries

Dany Hoter looks at some KQL query plans:

Each visual on the page is going to summarize data from one or more queries and add the summarize part of the query.

If your model contains multiple tables in direct query with relations between them, the connector will generate joins between the tables.

Selecting values in filters will create multiple where conditions.

In order to see the final query and understand the performance implications of each query and the total query load created by a report, you need to use the command “.show queries” in the context of the database.

Click through for Dany’s notes on the topic, including a few tips on what to look for.

Comments closed

Listing Topics in Kafka without Zookeeper

The BIg Data in Real World team has a quick one for us:

Kafka uses Zookeeper to manage it’s internal state. So it is not possible to run Kafka without Zookeeper. Even if you don’t have access to Zookeeper in your organization, there is a Zookeeper cluster running which your Kafka cluster connects to.

So, how to list topics and execute other commands if we don’t have access to Zookeeper?

Eventually, this won’t even be a question, as Kafka already has production versions using KRaft, and by Kafka 4.0, there won’t be a Zookeeper to kick around anymore.

Comments closed

Data Inconsistency in Postgres HA Clusters

Umair Shahid gives us an overview:

While PostgreSQL is known for its robustness, scalability, and reliability, data inconsistency can occur in PostgreSQL clusters, which can cause issues and impact the overall performance of the system. In this blog, we’ll define data inconsistency in PostgreSQL clusters, discuss the challenges it poses, its causes, and provide some tips on how to prevent and resolve it if it occurs.

Click through for the article.

Comments closed

Building a Data Warehouse in Microsoft Fabric

Reza Rad continues a video series on Microsoft Fabric:

Microsoft Fabric Data Warehouse is a database system that stores data in OneLake and provides a medium to interact with the database using SQL commands. The Fabric Data Warehouse, which is also called Data Warehouse, or in short, Warehouse, also provides a powerful computing engine behind the scene to account for large volumes of data and support a fast-performing database system. The term Data Warehouse comes from the fact that this is not usually a place to store transactional data for an operational system (for that, you can use Azure SQL Database). A Data Warehouse, in generic Business Intelligence terminology, is a place where you would store the data that needs to be analyzed.

Reza also explains how the warehouse differs from a lakehouse.

Comments closed

Microsoft Fabric and Process Unification

Paul Andrew gets to the heart of things:

Moving on and assuming you have seen the event sessions, I want to give you my point of view to help explain what Microsoft Fabric is. Firstly, lets clear up call out was terminology to support this understanding. Is this software offering a resource, service, platform, or solution? To answer this question, perspective is key, perspective with a timeline (2018 to 2023). We could simply say that Microsoft Fabric is all these things. All things to all data professionals and beyond. But, to understand this, let’s consider the journey Microsoft has been on and how this technology has evolved. I believe this journey is the best way to help explain what Microsoft Fabric is, rather than focusing on all the new and shiny bits.

Click through for Paul’s take on the matter and how this whole area of “modern data warehousing” has evolved over the past several years in Azure.

Comments closed

Cosmos DB Serverless Scaling to 1TB

Hasan Savran shares the news:

Azure Cosmos DB’s Serverless option is a great way to save money if your application expects intermittent and unpredictable traffic with long idle times. I use serverless in developing, prototyping, and integrating with computing services such as Azure Functions.

     The limitation of Azure Cosmos DB serverless was a show-stopper if your solution needed scalability or a large storage. Cosmos DB announced that many of the limitations of the serverless option of Azure Cosmos DB are lifted in Build 2023.

Read on for the gist of these updates.

Comments closed