Press "Enter" to skip to content

Day: March 7, 2024

Removing Skew in Data with Python

Vinod Chugani kicks the lop-sided distribution to straighten it out:

Data transformations enable data scientists to refine, normalize, and standardize raw data into a format ripe for analysis. These transformations are not merely procedural steps; they are essential in mitigating biases, handling skewed distributions, and enhancing the robustness of statistical models. This post will primarily focus on how to address skewed data. By focusing on the ‘SalePrice’ and ‘YearBuilt’ attributes from the Ames housing dataset, we will provide examples of positive and negative skewed data and illustrate ways to normalize their distributions using transformations.

Read on to see what kinds of transformations are available.

Comments closed

Data Management with Open Table Formats

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Comments closed

Choosing a Data Platform Technology

Jenn Junod builds a flow chart:

From a structural perspective you can think of data as existing on a continuum or gradient: on the one side, there’s highly structured data, like customer profiles. Each customer has a known name, address, phone number, and we can expect those to fall within certain formatting constraints. For example, a phone number will always be a set of numbers and special characters. On the other side, there’s unstructured data: the contents of an email, for example, are unstructured.

The shape of the data should define what kind of software you use to store and manipulate it.

Read on for several questions along these lines and recommendations based on your answers.

Comments closed

Configuring a New Powershell Installation

Jeff Hicks starts over:

The other day on X, I was asked about what things I would setup or configure on a new PowerShell installation. This is something I actually have thought about and face all the time when I setup a new demo virtual machine. I had been meaning to build new tooling to meet this challenge, and the question provided the spark I needed to get off my butt and get it done.

Read on for some thoughts on the topic, as well as a lengthy script to get you going.

Comments closed

That the Division of Labor is Limited by the Extent of the Market

Eitan Blumin lets me quote one of my favorite chapter headings from Adam Smith:

You see, in this past decade or so, we’ve all noticed a trend where more and more businesses are looking for DBAs who know a bit about everything: RDBMS, NoSQL, Key-Value databases, Cache Databases, Full-Text Search Databases, Vector Databases, and more. In response, more and more data professionals prefer to become multi-disciplinary, doing a bit of everything… Even if it means becoming a part-time Data Engineer / Data Analyst / Data Scientist / DevOps / whatever. Otherwise, they fear that they won’t be able to find enough job opportunities.

But, there’s a cost to that approach. Sometimes a very high cost.

Eitan argues in favor of an approach with part-time experts and full-time jacks-of-many-trades. For smaller companies, I completely agree: a software development company with 5 total employees usually can’t afford a full-time DBA or networking specialist or virtualization specialist, but they can still run into problems that an expert would easily be able to solve. But I think once you get past a couple dozen IT people at a company, there’s enough scope for some critical specialization. In other words, the market within a company grows (hush, people who are going to throw Ronald Coase at me for that statement—it’s metaphorical!) and allows for additional division of labor and specialization.

Anyhow, Eitan has some tips around determining whether you’d rather be a jack-of-all-trades or a master of something.

Comments closed