
Author: Kevin Feasel

Modern Data Warehousing with Data Lake Storage and Azure Data Factory

Josephine Bush continues a series on modern data warehousing:

In today’s data-driven world, having the right tools to manage and process large datasets is crucial. That’s where Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) come in handy, making it easier than ever to store and transform your data. In this post, I’ll show you how to set up ADLS to store your Parquet files and configure ADF to manage your data flows efficiently.

Read on for an overview of both technologies.
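If you prefer scripting the storage side, here is a minimal sketch in R using the arrow and AzureStor packages: write a Parquet file locally, then push it into an ADLS Gen2 filesystem. The account URL, access key, and filesystem name are placeholders of mine, not anything from Josephine's post.

```r
# Minimal sketch: write a data frame to Parquet and land it in ADLS Gen2.
# The account URL, access key, and filesystem name below are hypothetical.
library(arrow)
library(AzureStor)

# Write a local Parquet file from a sample data frame
write_parquet(mtcars, "mtcars.parquet")

# Connect to the ADLS Gen2 (dfs) endpoint and pick a filesystem (container)
endp <- adls_endpoint("https://mystorageacct.dfs.core.windows.net",
                      key = "<storage-account-key>")
fs   <- adls_filesystem(endp, "raw-data")

# Upload into a folder that ADF can later use as a source
upload_adls_file(fs, src = "mtcars.parquet",
                 dest = "landing/mtcars.parquet")
```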


R’s Global Regular Expression Function

Steven Sanderson has me wondering who Greg is and why he gets an expression of his own:

If you’ve ever worked with text data in R, you know how important it is to have powerful tools for pattern matching. One such tool is the gregexpr() function. This function is incredibly useful when you need to find all occurrences of a pattern within a string. Today, we’ll dig into how gregexpr() works, explore its syntax, and walk through several examples to make things clear.

Read on to learn more about the global regular expression function and how it works.
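To make the "global" part concrete, here is a quick example of my own (not from Steven's post): gregexpr() returns every match position in each string, and regmatches() extracts the matched text.

```r
# gregexpr() finds ALL matches in each string; regexpr() stops at the first
x <- c("cats and dogs and birds", "nothing to match here")

# Find words ending in "s"
m <- gregexpr("[a-z]+s\\b", x)

m[[1]]           # start positions for string 1, with a match.length attribute

regmatches(x, m) # pull out the matched substrings
#> [[1]] "cats"  "dogs"  "birds"
#> [[2]] character(0)
```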


Preventing Passwords from Getting into GitHub

Eduardo Pivaral does some work:

Healthy code should not include passwords, keys, or secrets in the source code. Sometimes, developers hard-code sensitive information while testing new features but forget to remove it afterward.

How can we validate code without including sensitive information so we can take action before we publish or share code?

Click through for a couple of options. If you do have GitHub Advanced Security (part of GitHub Enterprise Cloud), you can also create a custom pattern for secret scanning that can include passwords, database connection strings, and the like.
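As a toy illustration of the pattern-scanning idea (my own sketch, not one of Eduardo's options, and the regexes are deliberately simplistic), you can sweep a source tree for obvious offenders before committing:

```r
# A rough, illustrative secret scan -- real tools (gitleaks, GitHub secret
# scanning, pre-commit hooks) are far more thorough. Patterns are examples only.
patterns <- c(
  "password\\s*=\\s*['\"].+['\"]",   # hard-coded passwords
  "Server=.*;.*Password=.*",         # connection strings with credentials
  "AKIA[0-9A-Z]{16}"                 # AWS access key ID shape
)

files <- list.files(".", pattern = "\\.(R|r|sql|cs|py)$", recursive = TRUE)

for (f in files) {
  lines <- readLines(f, warn = FALSE)
  for (p in patterns) {
    hits <- grep(p, lines, ignore.case = TRUE)
    if (length(hits) > 0)
      cat(sprintf("%s: possible secret on line(s) %s (pattern: %s)\n",
                  f, paste(hits, collapse = ", "), p))
  }
}
```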


Synchronous and Asynchronous Replication in Postgres

Semab Tariq takes us through a pair of replication options:

In the world of database replication, choosing between synchronous and asynchronous methods can have a big impact on how reliable, consistent, and fast your data is.

This blog dives into what these methods are, how they work, and when you might want to use one over the other. Whether you’re trying to keep your data super safe or just want it to move quickly, we’ll break down everything you need to know about synchronous and asynchronous replication in PostgreSQL.

Read on for a quick overview of streaming replication and the differences between asynchronous and synchronous options.
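If you want to check which mode each standby is actually using, the primary's pg_stat_replication view reports it. Here is a minimal sketch from R, assuming a DBI connection via the RPostgres driver with sufficient privileges; the connection details are placeholders.

```r
# Query the primary for its standbys; sync_state shows 'sync', 'async',
# 'potential', or 'quorum'. Connection details below are hypothetical.
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 host     = "primary.example.com",
                 dbname   = "postgres",
                 user     = "monitor",
                 password = "<password>")

dbGetQuery(con, "
  SELECT application_name, client_addr, state, sync_state
  FROM pg_stat_replication;
")

dbDisconnect(con)
```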


Backup Storage Redundancy in Cosmos DB

Manvendra Singh talks about backups:

This article will explain backup storage redundancy for Azure Cosmos DB. Backups are a critical feature for keeping copies of our data, ensuring protection and recoverability in case of accidental deletion, unwanted updates, or any kind of disaster. But simply running backups is not enough. We must also protect those backup copies from accidental deletion or corruption and ensure proper resiliency is in place to keep them safe from unforeseen circumstances. Depending on the criticality of your data, you may want to keep backups locally or replicate them to other locations or regions.

The backup process isn’t the same as with a relational database, but backing up your data is still critical, for the same reasons you’d back up relational data.


Counting Words in a String in R

Steven Sanderson counts the ways:

Counting words in a string is a common task in data manipulation and text analysis. Whether you’re parsing tweets, analyzing survey responses, or processing any textual data, knowing how to count words is crucial. In this post, we’ll explore three ways to achieve this in R: using base R’s strsplit(), the stringr package, and the stringi package. We’ll provide clear examples and explanations to help you get started.

I, of course, would commission a 128-node Hadoop cluster and write a few dozen pages of Java code to get the answer.
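Short of commissioning the Hadoop cluster, here is roughly what the three approaches look like, in a quick sketch along the lines of the post:

```r
library(stringr)
library(stringi)

text <- "The quick brown fox jumps over the lazy dog"

# Base R: split on runs of whitespace and count the pieces
length(strsplit(text, "\\s+")[[1]])
#> 9

# stringr: count runs of non-whitespace characters
str_count(text, "\\S+")
#> 9

# stringi: purpose-built, locale-aware word counter
stri_count_words(text)
#> 9
```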
