Press "Enter" to skip to content

Category: Data

Documenting Table Columns with the Python SDK for Purview

Danaraj Ram Kumar breaks out the Python IDE:

There are several approaches to work with Microsoft Purview entities programmatically, especially when needing to perform bulk operations such as documenting a large number of tables and columns dynamically. 

This article shows how to use the Python SDK for Purview to programmatically document Purview table columns in bulk – assuming there are many tables and columns that needed to be automatically documented based off a reference tables – as in this example, the data dictionary maintained in Excel.

On the other hand, Purview REST APIs can be used to natively work with the REST APIs whereas the Python SDK for Purview is a wrapper that makes it easier to programmatically interacts with the Purview Atlas REST APIs in the backend.

Click through for sample code and explanations.

Comments closed

Regular Expressions in R

Steven Sanderson now has two problems:

Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.

Let’s dive into a scenario where we have data that needs cleaning and extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns (x1x2x3x4) with strings containing numerical values along with percentage values enclosed in parentheses. Our goal is to extract these numerical values and compute a total for each row.

Click through for a worked-out example.

Comments closed

Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Comments closed

Returning a Row when there’s No Row to Return

Erik Darling has an existential dillema:

Rather selfishly, I do this for my stored procedures, for all the reasons in the first sentence. Especially when debugging stored procedures, you’ll want to know where things potentially went wrong.

In this post, I’m going to walk through a couple different ways that I use to do this. One when you’re storing intermediate results in a temporary object, and one when you’re just using a single query.

Read on for an example of how to do this.

Comments closed

Comparing pgvector and Postgres ARRAY

Ernst-Georg Schmid makes a comp based on a mass spectrometry database:

As said in the introduction, mass spectrometry is one, if not the tool to identify unknown compounds, to quantify known compounds, and to determine the structure of molecules. But it is a lot of work, and you need reference spectra to compare against.

So, there are curated databases of validated spectra available, like MassBank JapanMassBank Europe and the NIST mass spectral libraries. Laboratories might also want to store their own libraries for future use.

However, such databases often come in their own formats and with their own retrieval software. If you need to efficiently connect spectra to other data, e.g. chemical structures or genomic data, this calls for central management and a common API.

Read on to see the comparison of the pgvector extension versus built-in functionality with ARRAY.

Comments closed

Reducing the Cost of Delete Operations in SQL Server

Ben Johnston eats the elephant:

One of the first things you learn when working with SQL Server, and other transactional based SQL systems, is that set based operations perform best. If you are querying data, a cursor pulling individual rows doesn’t perform as well as a single query. It doesn’t matter if that cursor is on the client side or the server side. A set-based operation is more efficient, runs faster, locks less, and is generally better than submitting multiple queries.

This is also generally true with delete statements. This post covers the exceptions to that rule. Large delete statements impacting many rows and large amounts of data (millions of rows and many gigs of data) can actually have decreased performance. With transactional systems, such as SQL Server, each transaction follows the ACID standard. Part of that standard ensures that transactional statements either complete or roll back fully – partial transactions are not allowed. For a delete statement, that means that all of the rows specified by the delete are removed from the table, or none are removed and the data rolls back to the original state. The delete and rollback behavior must be predictable and consistent or the data could be left in a contaminated, unreliable state. Performing very large deletes can present some challenges and needs to be treated differently in production systems.

Read on for the reasoning behind this, as well as several techniques you can use and how they compare.

Comments closed

Using the Cake Dataset

Rasmus Baath bakes a cake:

Now that I’ve got my hands on the source of the cake dataset I knew I had to attempt to bake the cake too. Here, the emphasis is on attempt, as there’s no way I would be able to actually replicate the elaborate and cake-scientifically rigorous recipe that Cook followed in her thesis. Skipping things like beating the eggs exactly “125 strokes with a rotary beater” or wrapping the grated chocolate “in waxed paper, while white wrapping paper was used for the other ingredients”, here’s my version of Cook’s Recipe C, the highest rated cake recipe in the thesis:

Click through for the ingredients and instructions, as well as Rasmus’s results in the test that counts the most: the taste test. H/T R-Bloggers.

Comments closed

Finding the Cake Dataset’s Original Source

Rasmus Baath has done a good deed for all:

In statistics, there are a number of classic datasets that pop up in examples, tutorials, etc. There’s the infamous iris dataset (just type iris in your nearest R prompt), the Palmer penguins (the modern iris replacement), the titanic dataset(s) (I hope you’re not a guy in 3rd class!), etc. While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938)1”. For me, this raised a lot of questions: Why measure the breakage angle of chocolate cakes? Why was this data collected? And what were the recipes?

Read on as Rasmus unravels the mysteries of the cake dataset with the help of several others. H/T R-Bloggers.

Comments closed

Embracing the Boring Part of Data Governance

Nikki Kelly shares some thoughts on data governance:

Data Governance – you have heard the term a million times and not once has it driven excitement in to your heart. I’d like to spend the next few minutes changing that.

Data Governance is formally defined as “… a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.”

Boring.

Nikki makes a great point that the process may feel boring but the net results are critical.

Comments closed

The Importance of Data Retention Policies

Ed Pollack shares some great advice:

It is always an afterthought. New objects are created that start off small and current. New feature development takes over and the recently architected data structures become old news. Over time, data grows and suddenly a previously small table contains millions or billions of rows.

Is all that data necessary? How long should it be retained for? If there is no answer to this question, then the actuality may be “Forever”, or more honestly “No one knows for sure.”

Retention takes on many forms and this article dives into ways in which data can be managed over time to ensure that it is fast, accurate, and readily available.

We don’t tend to think about data retention in the development phase, but it’s an important consideration and thinking about it up-front might save you disk space headaches later.

Comments closed