
Author: Kevin Feasel

Finding Duplicate Rows and Values in R

Steven Sanderson de-duplicates, starting with values:

In data analysis and programming, it’s common to encounter situations where you need to identify duplicate values within a dataset. Whether you’re a beginner or an experienced programmer, knowing how to find duplicate values is a fundamental skill. In this blog post, we will explore two different approaches to accomplish this task using base R functions and the dplyr package in R. By the end, you’ll have a clear understanding of how to detect and manage duplicate values in your own datasets.

From there, we get to see various ways to de-duplicate rows in R:

In data analysis and manipulation tasks, it’s common to encounter situations where we need to identify and handle duplicate rows in a dataset. In this blog post, we will explore three different approaches to finding duplicate rows in R: the base R method, the dplyr package, and the data.table package. We’ll compare their performance using the benchmark function and provide insights on when to use each approach. So, grab your coding gear, and let’s dive in!

Finding duplicate values is the relatively tricky case; finding duplicate rows is much easier.
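As a rough illustration of the value-versus-row distinction, here is a minimal sketch in plain Python. It is not the code from the linked posts, which use base R, dplyr, and data.table; the function names and sample data are made up for this example.

```python
from collections import Counter


def duplicate_values(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    # dict.fromkeys keeps one copy of each value in original order.
    return [v for v in dict.fromkeys(values) if counts[v] > 1]


def duplicate_rows(rows):
    """Return every row that repeats an earlier row (later copies only)."""
    seen = set()
    dupes = []
    for row in rows:
        key = tuple(row)  # rows must be hashable to be compared as a unit
        if key in seen:
            dupes.append(key)
        else:
            seen.add(key)
    return dupes


# Hypothetical sample data.
values = [3, 1, 3, 2, 1, 3]
rows = [("a", 1), ("b", 2), ("a", 1), ("a", 2)]

print(duplicate_values(values))  # [3, 1]
print(duplicate_rows(rows))      # [('a', 1)]
```

Note that ("a", 2) is not a duplicate row even though both of its values appear elsewhere: a row only counts as a duplicate when every column matches.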


Multi-Source Replication in MySQL

Aisha Bukar continues a series on replication in MySQL:

MySQL’s multi-source replication allows a replica server to receive data from multiple source servers. Say you have a replica server at your workplace and multiple source servers in different locations: you need a way to receive data directly from those source servers on your replica server. This is where the multi-source replication technique comes into play. It allows you to efficiently gather data from various sources and consolidate it on your replica server.

Note that this is quite different from merge replication or peer-to-peer replication in SQL Server and there are some limits to its capabilities. That said, I could see this being really useful for performing ELT into a warehouse: use replication to keep the staging tables in sync and then run a job to perform transformations into facts and dimensions periodically.


Data Syncs between Azure SQL DB and Amazon RDS

Joey D’Antoni crosses clouds:

A while back, a client who hosts user-facing databases in Azure SQL Database had a novel problem. One of their customers had all of their infrastructure in AWS and wanted to be able to access my client’s data in an RDS instance. There aren’t many options for doing this: replication doesn’t work with Azure SQL Database as a publisher because there’s no SQL Agent. Managed Instance would have been messy from a network perspective, as well as cost prohibitive compared to Azure SQL DB serverless. An ETL tool like Azure Data Factory would have worked, but would have required a rather large amount of dev cycles to check for changed data. Enter Azure Data Sync.

Read on to see what Azure Data Sync is and how it helps solve this problem.


Migrating Column-Level Encryption to Azure SQL MI

Keshav Kiran performs a migration:

One of our customers came up with a requirement: they wanted to migrate an on-premises database to Azure SQL Managed Instance. The databases had traditional column-level encryption enabled.

The customer restored the database on the SQL Managed Instance using the backup/restore approach. When they then tried to read the encrypted column on the destination database, it showed NULL values after decryption.

Read on for the solution.


Viewing the Power BI Format Pane during On-Object Interaction

Gilbert Quevauvilliers is missing something:

I have enabled the new On-Object Interaction for the formatting pane in Power BI and while it is constantly improving there are times when I would like to have the good old formatting pane available.

I have also found that sometimes when you create a new visual there is no option to format it as shown below.

There’s a workaround for this, so check it out.


Model Diagnostics in Python

Christian Lorentzen has released a new package:

Version 1.0.0 of the new Python package for model-diagnostics was just released on PyPI. If you use (machine learning or statistical or other) models to predict a mean, median, quantile or expectile, this library offers tools to assess the calibration of your models and to compare and decompose predictive model performance scores.

This looks like a really useful package, so check it out.
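To give a feel for what "assessing calibration" means for mean predictions, here is a minimal sketch in plain Python. It deliberately does not use the model-diagnostics package or imitate its API; the function name, the equal-size binning scheme, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean


def binned_calibration(y_obs, y_pred, n_bins=2):
    """Sort by prediction, split into roughly equal bins, and return
    (mean prediction, mean observation) per bin.

    For a well-calibrated mean model, the two numbers in each pair
    should be close to each other.
    """
    pairs = sorted(zip(y_pred, y_obs))
    if not pairs:
        return []
    size = -(-len(pairs) // n_bins)  # ceiling division
    result = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        result.append((mean(p for p, _ in chunk), mean(o for _, o in chunk)))
    return result


# Hypothetical observations and predictions.
y_obs = [1, 3, 2, 6]
y_pred = [1, 2, 3, 4]

for avg_pred, avg_obs in binned_calibration(y_obs, y_pred):
    print(f"mean predicted = {avg_pred}, mean observed = {avg_obs}")
```

In this toy data, both bins show the observed mean sitting 0.5 above the predicted mean, which would suggest a model that under-predicts across its range.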


Automating Database Copy in Azure SQL Managed Instance

Sasa Popovic creates some clones:

Database copy and database move operations for Azure SQL Managed Instance are very convenient in various situations when you want to copy or move a database from one managed instance to another in an online way. What does online mean in this context? It means that the database on the destination managed instance will be identical to the source database at the moment when the operation is explicitly completed by user action. Copying a database is a size-of-data operation, so you can expect the copy to take some time. What is important and convenient is that, unlike a point-in-time restore, where the database is in the state it had at some point in the past, a database copy gives you the database in the state it was in when the operation was completed.

Read on to see how you can set this up for an Azure SQL Managed Instance.


Window Functions and Serialization in KQL

Robert Cain tries out some window functions:

The Kusto Query Language includes a set of functions collectively known as Window Functions. These special functions allow you to take a row and put it in context of the entire dataset. Examples include creating row numbers, getting a value from the previous row, or getting a value from the next row.

In order for Window Functions to work, the dataset must be serialized. In this post we’ll cover what serialization is and how to create serialized datasets. This is a foundational post, as we’ll be referring back to it in future posts that will cover some of the KQL Windowing Functions.

Read on to see how to serialize data, what the risks of serialization are, and then how to generate a row number in KQL.
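The reason a window function needs a serialized (fixed-order) dataset can be sketched in plain Python: "previous row" and "row number" only mean something once the rows have a defined order. This is an illustration of the idea rather than KQL itself, where the corresponding pieces are the serialize operator and the row_number(), prev(), and next() functions. The function name and sample rows below are hypothetical.

```python
def with_window_columns(rows, key):
    """Fix an order by sorting on `key`, then add a row number plus the
    previous and next row's value of `key` (None at the edges)."""
    ordered = sorted(rows, key=lambda r: r[key])  # the "serialization" step
    result = []
    for i, row in enumerate(ordered):
        prev_value = ordered[i - 1][key] if i > 0 else None
        next_value = ordered[i + 1][key] if i < len(ordered) - 1 else None
        result.append(
            {**row, "row_number": i + 1, "prev": prev_value, "next": next_value}
        )
    return result


# Hypothetical rows in arbitrary arrival order.
rows = [{"ts": 30}, {"ts": 10}, {"ts": 20}]

for row in with_window_columns(rows, "ts"):
    print(row)
# {'ts': 10, 'row_number': 1, 'prev': None, 'next': 20}
# {'ts': 20, 'row_number': 2, 'prev': 10, 'next': 30}
# {'ts': 30, 'row_number': 3, 'prev': 20, 'next': None}
```

Without the sort, the same input could arrive in a different order on another run and produce different row numbers and neighbors, which is the ambiguity that serialization removes.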
