Month: July 2024

A/B Testing with Survival Analysis in R

Iyar Lin combines two great flavors:

Usually when running an A/B test analysts assign users randomly to variants over time and measure conversion rate as the ratio between the number of conversions and the number of users in each variant. Users who just entered the test and those who are in the test for 2 weeks get the same weight.

This can be enough for cases where a conversion either happens or not within a short time frame after assignment to a variant (e.g., finishing an on-boarding flow).

There are, however, many instances where conversions are spread over a longer time frame. One example would be first order after visiting a site landing page. Such conversions may happen within minutes, but a large chunk could also happen within days after the first visit.

Read on for the scenario, as well as a simulation. I will note that, in the digital marketing industry, there’s usually a hard cap on the number of days during which you can attribute a conversion to some action, for exactly the reason Iyar mentions. H/T R-Bloggers.
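
To make the contrast concrete, here is a minimal Python sketch (not Iyar's R code) that compares the naive conversion rate with a Kaplan-Meier style estimate. The function names and the tiny data set are hypothetical, and the sketch assumes each user has an observed duration in days plus a flag saying whether that duration ended in a conversion.

```python
from collections import Counter


def naive_conversion_rate(converted):
    """Conversions divided by users, ignoring how long each user was observed."""
    return sum(converted) / len(converted)


def kaplan_meier_conversion(durations, converted, horizon):
    """Kaplan-Meier estimate of the probability of converting within `horizon` days.

    durations[i] is the day of conversion if converted[i] is True, otherwise
    the number of days user i has been in the test so far (censored).
    """
    events = Counter(t for t, c in zip(durations, converted) if c)
    exits = Counter(durations)  # conversions plus censored users per day
    at_risk = len(durations)
    survival = 1.0  # probability of not having converted yet
    for t in sorted(exits):
        if t > horizon:
            break
        if events[t]:
            survival *= 1 - events[t] / at_risk
        at_risk -= exits[t]  # everyone observed up to day t leaves the risk set
    return 1 - survival


# Hypothetical data: several users entered the test only a day or two ago.
durations = [1, 2, 14, 14, 3, 1, 2, 2]
converted = [True, True, False, False, True, False, False, False]

naive = naive_conversion_rate(converted)
km = kaplan_meier_conversion(durations, converted, horizon=14)
print(f"naive: {naive:.3f}, Kaplan-Meier (14 days): {km:.3f}")
```

In this made-up data the naive rate comes out lower than the Kaplan-Meier estimate, because recently assigned users count as non-converters in the naive ratio even though they have had little time to convert.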

Contrasting Data Mesh and Data Fabric

Sahil Babbar makes a comparison:

The concept of a data mesh proposes that each business domain takes charge of hosting, preparing, and delivering its own data to both its internal team and broader stakeholders. This decentralized approach empowers autonomous data teams to take full ownership and accountability for their data products and management processes.

Data fabric is a system designed to help a company manage and use its data from various storage types, like databases, tagged files, or document stores. It supports different tools and applications to easily access this data, working with technologies like Apache Kafka for real-time data streaming, ODBC for database connections, HDFS for big data storage and REST APIs for web services. It focuses on creating a unified data environment that acts as a reliable, centralized source for all organizational data. This setup ensures data is accurate, consistent, and secure, making it easy for different teams to access and manage data efficiently.

Read on to learn a bit more about the two architectures.

Removing Leading Zeroes from a String in T-SQL

Steve Stedman gets rid of leading zeroes:

When working with data in SQL Server, there may be times when you need to remove leading zeros from a string. This task can be particularly common when dealing with numerical data stored as strings, such as ZIP codes, product codes, or other formatted numbers. In this blog post, we’ll explore several methods to remove leading zeros in SQL Server.

I’m not sure I see the reason to use anything other than CAST() (or, better yet, TRY_CAST()), but Steve does show two other methods.
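
One trade-off between casting and string-based approaches is easier to see in code. The following is a Python analogue rather than T-SQL, and both helper names are hypothetical: `try_cast_int` loosely mimics the idea behind TRY_CAST() (return nothing when conversion fails), while `strip_leading_zeros` only trims characters and so also works on values that are not purely numeric.

```python
def try_cast_int(value):
    """Rough analogue of a TRY_CAST to an integer: None when conversion fails.

    Assumption: this does not model SQL Server's INT range limits.
    """
    try:
        return int(value)
    except ValueError:
        return None


def strip_leading_zeros(value):
    """Trim leading zeros as text, keeping a single '0' for all-zero input."""
    stripped = value.lstrip("0")
    if stripped or not value:
        return stripped
    return "0"


for code in ["00123", "000", "00A17"]:
    print(code, try_cast_int(code), strip_leading_zeros(code))
```

The last value shows where the two approaches diverge: a product code containing a letter cannot be cast to a number, but it can still be trimmed as a string.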

Random Walks in R with TidyDensity

Steven Sanderson goes for a walk:

A random walk is a mathematical object that describes a path consisting of a succession of random steps. It’s a cornerstone concept in fields like physics, economics, and biology. In finance, for example, the random walk hypothesis suggests that stock market prices evolve according to a random walk and thus cannot be predicted.

Read on to see how you can generate a dataset matching a random walk, as well as a comparison of techniques for generating them.
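
As a rough illustration of the definition above, here is a minimal Python sketch of a simple symmetric random walk. It is not the TidyDensity approach from the post; the function name, the step size of plus or minus one, and the seed parameter are all assumptions made for the example.

```python
import random
from itertools import accumulate


def random_walk(n_steps, seed=None):
    """Return the positions of a walk that starts at 0 and moves +1 or -1 per step."""
    rng = random.Random(seed)  # local generator so the seed does not affect global state
    steps = [rng.choice((-1, 1)) for _ in range(n_steps)]
    # The path is the running (cumulative) sum of the individual steps.
    return [0] + list(accumulate(steps))


walk = random_walk(100, seed=42)
print(walk[:10])
```

Passing a seed makes the walk reproducible, which helps when comparing generation techniques on the same path.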

Measure-Object in PowerShell

Patrick Gruenauer counts the ways:

The Measure-Object cmdlet counts objects. But it can do even more. We can calculate the sum, the average and much more. In this blog post I show a few examples with Measure-Object. Let’s dive in.

It’s a fairly straightforward cmdlet, but it gets a lot of use, combining something like wc in Linux with basic statistics on objects.
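
For readers who think in Python rather than PowerShell, the sketch below approximates the kind of summary Measure-Object produces for numeric input. It is an analogue, not the cmdlet itself: the `measure` function is hypothetical, and the result keys simply borrow the names of the properties the cmdlet reports.

```python
from statistics import mean


def measure(values):
    """Collect a count and basic statistics, loosely like Measure-Object."""
    values = list(values)
    if not values:
        # Assumption for this sketch: statistics are None when there is no input.
        return {"Count": 0, "Sum": None, "Average": None,
                "Minimum": None, "Maximum": None}
    return {
        "Count": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }


file_sizes_kb = [3, 5, 10]  # hypothetical input
print(measure(file_sizes_kb))
```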

JSON and JSONB Data Types in Postgres

Andrea Gnemmi covers a pair of data types to manage one thing:

We have all encountered the need to store non-structured or semi-structured data in an RDBMS; XML or JSON data in particular. This can be complicated, especially in the past with limited technical options, and even more complicated if we want to query this data efficiently.

Read on to learn more about the differences between JSON and JSONB, as well as mechanisms you can use to query subsets of the data.
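
One documented difference is that the json type keeps an exact copy of the input text, while jsonb stores a parsed form that does not preserve whitespace, key order, or duplicate keys (the last value wins). The Python sketch below only imitates that behavior with the standard `json` module; it does not talk to Postgres, and the variable names are made up for illustration.

```python
import json

# Hypothetical input with extra whitespace and a duplicated key.
raw = '{"b": 1,   "a": 2, "b": 3}'

# Like the json type: the text is kept exactly as it was supplied.
as_json_column = raw

# Roughly like jsonb: the text is parsed, so whitespace is gone and the
# duplicate key collapses to its last value.
as_jsonb_like = json.loads(raw)

print(as_json_column)
print(as_jsonb_like)
```

Python dictionaries keep insertion order, so this sketch does not reproduce how jsonb reorders keys; it only shows the loss of whitespace and duplicates.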

Visual Calculations and Multi-Bar Graphs

Erik Svensen builds a thing:

In this post I will guide you through creating this chart in Power BI – it is a stacked bar chart that shows the size/impact of three different measures – Sales Value, Sales Units and Avg Price in one visual.

It’s not a visualization that I would recommend but there might be use cases for it somewhere and it has been a good exercise in what we can do with visual calculations.

It’s very clever, I’ll give it that.

An Introduction to Streamlit

I have started a new video series:

In this video, I talk about Streamlit, a great Python library for building data applications quickly. We discuss what data applications are, get an idea of how Streamlit compares to other code-first data visualization techniques, and start building a demo application. I also toss in a lengthy sidebar on Python virtual environments because of how important they are.

Streamlit certainly has its foibles—many of which I’ll cover in the series—but I like it a lot as a simple way of building data applications.

Extracting Strings before a Space using R

Steven Sanderson grabs a name:

Hello, R users! Today, we’ll dive into a common text manipulation task: extracting strings before a space. This is a handy trick for dealing with names, addresses, or any text data where you need to isolate the first part of a string.

We’ll explore three approaches: using base R, stringr, and stringi. Each method offers its unique advantages, so you can choose the one that fits your style best.

Click through for the three examples. I will note that if you’re actually using this code to split names, well, names tend to be a lot trickier than we give them credit for. Keep in mind that people can have multi-part names (“Debbie Mae” or “van den Berg”), so unless you know the data all follows a specific pattern, don’t assume the data follows a specific pattern.
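
The multi-part name problem is easy to demonstrate. The post's examples are in R; the following is a small Python sketch of the same idea, with a hypothetical helper name and made-up sample names.

```python
def before_first_space(text):
    """Return everything before the first space, or the whole string if there is none."""
    return text.partition(" ")[0]


names = ["Steven Sanderson", "Debbie Mae Smith", "van den Berg", "Cher"]
for name in names:
    print(repr(name), "->", repr(before_first_space(name)))
```

"Debbie Mae Smith" comes back as "Debbie" and "van den Berg" as "van", which is exactly the kind of result that looks fine in testing and wrong in production.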
