
Day: September 17, 2024

Simple Data Cleanup with Pandas

Ivan Palomares Carrascosa builds a process:

Few data science projects are exempt from the necessity of cleaning data. Data cleaning encompasses the initial steps of preparing data, and its specific purpose is to retain only the relevant and useful information underlying the data, whether for subsequent analysis, for use as input to an AI or machine learning model, and so on. Unifying or converting data types, dealing with missing values, eliminating noisy values stemming from erroneous measurements, and removing duplicates are some examples of typical processes within the data cleaning stage.

As you might expect, the more complex the data, the more intricate, tedious, and time-consuming the data cleaning can become, especially when implementing it manually.

Ivan handles some of the most common types of data cleaning work and shows a simple way of implementing them.
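
As a quick sketch of what those steps can look like in pandas (the column names, bad values, and fill strategy below are made up for illustration, not taken from Ivan's article):

    import pandas as pd

    # Hypothetical messy data: duplicate rows, a missing value, and inconsistent types.
    df = pd.DataFrame({
        "customer_id": ["001", "002", "002", "003"],
        "age": ["34", "58", "58", None],
        "signup_date": ["2024-01-05", "2024-02-17", "2024-02-17", "2024-03-01"],
    })

    df = df.drop_duplicates()                                # remove duplicate rows
    df["age"] = pd.to_numeric(df["age"], errors="coerce")    # convert type; bad values become NaN
    df["signup_date"] = pd.to_datetime(df["signup_date"])    # unify the date type
    df["age"] = df["age"].fillna(df["age"].median())         # deal with missing values

    print(df.dtypes)
    print(df)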

Comments closed

Random Walks in R with RandomWalker

Steven Sanderson is going for a walk (not the after-dinner kind):

Welcome to the world of ‘RandomWalker’, an innovative R package designed to simplify the creation of various types of random walks. Developed by myself and my co-author, Antti Rask, this package is in its experimental phase but promises to be a powerful tool for statisticians, data scientists, and financial analysts alike. With a focus on Tidyverse compatibility, ‘RandomWalker’ aims to integrate seamlessly into your data analysis workflows, offering both automatic and customizable random walk generation.

Read on to learn more about the package, including why you might want to use it and the functionality you can get out of it.
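
The package itself is R and Tidyverse-oriented, so the snippet below is not its API; it is just a minimal illustration in Python of the underlying idea, a random walk as a cumulative sum of random steps:

    import numpy as np

    rng = np.random.default_rng(42)
    n_steps = 100

    # Discrete walk: each step is +1 or -1.
    discrete_walk = np.cumsum(rng.choice([-1, 1], size=n_steps))

    # Gaussian (Brownian-style) walk, the flavor often used for financial simulations.
    gaussian_walk = np.cumsum(rng.normal(loc=0.0, scale=1.0, size=n_steps))

    print(discrete_walk[:10])
    print(gaussian_walk[:10])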

Comments closed

Cloning Tables in Databricks

Chen Hirsh hogs the photocopier:

The simplest use case to explain why table cloning is helpful is this: Let’s say you have a large table, and you want to test some new process on it, but you don’t want to ruin the data for other processes, so you need a clean copy of your table (or multiple tables) to play with. Copying a large table might take time (Databricks does it very fast, but if it’s a big table it still takes time to copy the data), and what happens if you then need to change your code? You have to drop the target table, copy the source table again, and so on.

Here is where cloning can be your friend.

Read on to learn about three cloning techniques. H/T Madeira Data Solutions blog.
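
For a sense of what this looks like in practice: Delta tables in Databricks support shallow and deep clones through plain SQL. The table names below are illustrative; see Chen's post for the third technique and the trade-offs.

    # In a Databricks notebook, `spark` is already available.
    # A deep clone copies data and metadata; a shallow clone only references
    # the source table's files, so it appears almost instantly.
    spark.sql("CREATE OR REPLACE TABLE dev.sales_test DEEP CLONE prod.sales")
    spark.sql("CREATE OR REPLACE TABLE dev.sales_scratch SHALLOW CLONE prod.sales")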

Comments closed

Change Management for the DBA

Terri Hurley moved our cheese:

If you have never been involved in Change Management processes but now find yourself part of one, it may seem a bit overwhelming or confusing. However, the reasons for and benefits of these processes are simple and straightforward.

This article will explain Change Management and how DBAs are involved and can benefit from it.

Read on to learn more about how change management can work for a DBA. I’ve worked for several organizations as they’ve moved from a philosophy of “just do it” toward proper change management, typically for regulatory reasons like Sarbanes-Oxley compliance.

Comments closed

Working with Always Encrypted Data in SSIS

Rod Edwards continues a series on Always Encrypted:

So now, let’s see how it plays with another one of those common toolsets that you may use alongside your encrypted data. In this post, I’ll be talking about accessing and importing data using SSIS. Nothing fancy, just reading data from an Excel sheet and piping it into our Always Encrypted table, encrypting as we go.

I’m not saying to use Excel for housing confidential data either!… as no one does that…oh no, not anywhere, ever….</sarcasm>.

As previously, this focuses on using Azure Key Vault to secure the required encryption keys.

Considering that all corporate data is in Excel someplace (some variant of which may eventually become Feasel’s Second Law), of course that sensitive and confidential data will be in a plain Excel file that people e-mail around.
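
SSIS configures this through connection manager properties, but as a rough illustration of the same client-side idea (not Rod's SSIS setup), a connection string can enable Always Encrypted and point the driver at Azure Key Vault. The keyword names below follow the Microsoft ODBC driver for SQL Server; the server, table, and credential values are placeholders.

    import pyodbc

    conn_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=Sales;"
        "UID=app_user;PWD=<password>;"
        "ColumnEncryption=Enabled;"                      # decrypt/encrypt Always Encrypted columns transparently
        "KeyStoreAuthentication=KeyVaultClientSecret;"   # fetch the column master key from Azure Key Vault
        "KeyStorePrincipalId=<app-client-id>;"
        "KeyStoreSecret=<client-secret>;"
    )
    conn = pyodbc.connect(conn_str)

    # Reads come back decrypted; parameterized writes are encrypted on the way in.
    rows = conn.cursor().execute("SELECT TOP 5 * FROM dbo.Customers").fetchall()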

Comments closed

Checking Cumulative Update Status on SQL Server Instances

Steve Jones reminds you to check those cumulative updates:

How can I quickly get a CU patch for a system that’s out of date? I’ll discuss that situation.

You might think you get to patch every instance every few months, and you may be able to. But most of us have laggards in any decent-sized estate. Someone always wants to avoid patching, or to skip patching on the day you’ve scheduled for every other system.

Steve’s solution involves using Redgate Monitor, though you could also do this on your own or using something like dbatools to get information about your estate.
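
If you are rolling your own, the version and patch level are exposed through SERVERPROPERTY; a minimal sweep over an instance list (the instance names here are hypothetical) might look like this:

    import pyodbc

    instances = ["SQL01", "SQL02\\PROD"]  # hypothetical instance names

    for instance in instances:
        conn = pyodbc.connect(
            f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={instance};"
            "DATABASE=master;Trusted_Connection=yes;",
            timeout=5,
        )
        # CONVERT the sql_variant results so the driver returns plain strings.
        row = conn.cursor().execute(
            "SELECT CONVERT(nvarchar(128), SERVERPROPERTY('ProductVersion')) AS ProductVersion, "
            "CONVERT(nvarchar(128), SERVERPROPERTY('ProductLevel')) AS ProductLevel, "
            "CONVERT(nvarchar(128), SERVERPROPERTY('ProductUpdateLevel')) AS ProductUpdateLevel"
        ).fetchone()
        print(instance, row.ProductVersion, row.ProductLevel, row.ProductUpdateLevel)
        conn.close()

From there, compare what comes back against the latest cumulative update for each major version to find the laggards.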

Comments closed

Fixing Implicit Conversion without Changing Queries

Vlad Drumea solves a challenge:

Why wouldn’t you be able to change the query?

The two most common scenarios I’ve run into are:

  • the software vendor does not want to change the code
  • a legacy application that’s no longer maintained and nobody has access to the code base

Read on for Vlad’s solution to a fairly common problem. The real fix, of course, is to use NVARCHAR everywhere and not have to worry about VARCHAR to NVARCHAR conversion. The secondary fix is to get your queries right and make sure your data types are consistent.
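
For context on where the mismatch usually comes from: many client libraries send string parameters as NVARCHAR by default, so a filter on a VARCHAR column picks up an implicit conversion and index usage suffers. The sketch below (Python with pyodbc and made-up table and column names; this is a client-side workaround, not the server-side fix Vlad describes) shows the default behavior and one way to force a VARCHAR parameter.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=Sales;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # Default: the str parameter goes over the wire as NVARCHAR, so SQL Server
    # implicitly converts the VARCHAR column before comparing.
    cur.execute("SELECT * FROM dbo.Customers WHERE CustomerCode = ?", "ABC123")

    # Declaring the parameter type as VARCHAR avoids the implicit conversion.
    cur.setinputsizes([(pyodbc.SQL_VARCHAR, 50, 0)])
    cur.execute("SELECT * FROM dbo.Customers WHERE CustomerCode = ?", "ABC123")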

Comments closed