Press "Enter" to skip to content


Handling Bad Records with Apache Spark

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time, writing ETL jobs becomes very expensive when it comes to handling corrupt records, and in such cases ETL pipelines need a good solution for handling them: the larger the ETL pipeline, the more complex it becomes to deal with bad records along the way. Corrupt data includes:

– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.
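
As a hedged illustration of the general idea (not Divyansh's exact code), here is a PySpark sketch that parks malformed rows in a side column instead of failing the whole read; the schema, column names, and file path are all hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("bad-records-demo").getOrCreate()

# Hypothetical schema; the _corrupt_record column captures rows that fail parsing.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "PERMISSIVE")                      # alternatives: DROPMALFORMED, FAILFAST
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/data/incoming/orders.csv")                  # hypothetical path
)

df.cache()  # cache before filtering on the corrupt-record column alone (a Spark requirement)
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
good_rows = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
```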


Handling Missing Data

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.

While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make do with what’s available.

This blog covers the different kinds of missing data, and what we can do about missing data once we know what we’re dealing with. These strategies range from simple – for example, choosing models that handle missings automatically, or simply deleting problematic observations – to (probably superior) methods for estimating what those missing values may be, otherwise known as imputation.

I like the distinction Marina draws between the forms of missing data, and we also get a good set of techniques for filling in the gaps.
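
As a small illustration of the two ends of that spectrum, here is a hedged Python sketch (the post itself is tool-agnostic) contrasting listwise deletion with simple mean imputation; the toy data frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":    [34.0, np.nan, 29.0, 41.0, np.nan],
    "income": [52_000.0, 61_000.0, np.nan, 73_000.0, 58_000.0],
})

# Strategy 1: listwise deletion -- simple, but discards whole observations.
complete_cases = df.dropna()

# Strategy 2: mean imputation -- keep every row, fill gaps with the column mean.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```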


Data Exfiltration Protection when Using Azure Databricks

Bhavin Kukadia, et al., explain how to prevent users from taking data out of your Databricks cluster without authorization:

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

Click through for the architectural pattern.


An Example of Complex CSV Rule Parsing with Power Query

Cedric Charlier shows off some of the benefit of Power Query with a fairly complicated set of rules:

At the beginning, some of us thought that it would be easy to fix these issues by going back to the data quality team and asking them to correct them, but it was not so easy. Identifying the rules needing a fix would be a huge task (the CSV files are not created if the test is successful, making it impossible to address this issue in one run, among other impediments). I took the decision to get past this issue with the implementation of the following heuristic:

– if the CSV has a column DateTime then we’ll use it
– if the header is empty or no column is named DateTime then use the first column
– if the content of the selected column is not a date then try to parse it as the inner content of a JSON element.

Read on to see how.
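
Cedric implements the heuristic in Power Query; purely to illustrate the logic, here is a rough Python sketch of the same three rules. The JSON handling reflects one plausible reading of "the inner content of a JSON element", and all names are hypothetical.

```python
import json
import pandas as pd

def pick_datetime(df: pd.DataFrame) -> pd.Series:
    # Rule 1: if the CSV has a DateTime column, use it.
    # Rule 2: otherwise (empty header or no DateTime column), use the first column.
    col = df["DateTime"] if "DateTime" in df.columns else df.iloc[:, 0]

    # If the selected column already parses as dates, we are done.
    parsed = pd.to_datetime(col, errors="coerce")
    if parsed.notna().all():
        return parsed

    # Rule 3: not a date -- try to parse each value as the inner content
    # of a JSON element (here taken as the first value of a JSON object).
    def from_json(value):
        try:
            inner = next(iter(json.loads(value).values()))
            return pd.to_datetime(inner, errors="coerce")
        except (TypeError, ValueError, AttributeError, StopIteration):
            return pd.NaT

    return col.map(from_json)
```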


Loading Data from CSVs with Inconsistent Quoted Identifiers

Dave Mason has some fun with loading data from files:

BCP and OPENROWSET are long-lived SQL Server options for working with data in external files. I’ve blogged about OPENROWSET, including a recent article showing a way to deal with quoted data. One of the shortcomings I’ve never been able to overcome is an inconsistent data file with data fields in some rows enclosed in double quotes, but not all. I’ve never found a way around this limitation.

Let’s demonstrate with BCP. Below is a sample data file I’ll attempt to load into a SQL Server table. Note the data fields highlighted in yellow, which are enclosed in double quotes and contain the field terminator character (a comma). For reference, the file is also available on GitHub.

I get unduly frustrated with the implementations of various data loaders around SQL Server and how they handle quoted identifiers differently. And don’t get me started on PolyBase.
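
One workaround (not from Dave's post) is to normalize the file before it ever reaches BCP or OPENROWSET. Python's csv module copes with fields that are only sometimes quoted, so a small pre-processing step can rewrite the file with consistent quoting; the file names here are hypothetical.

```python
import csv

with open("inconsistent.csv", newline="") as src, \
     open("normalized.csv", "w", newline="") as dst:
    reader = csv.reader(src)                          # tolerates mixed quoting
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)   # re-emit every field quoted
    for row in reader:
        writer.writerow(row)
```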


Syncing Mobile Apps via Change Tracking

Davide Mauri shows how we can perform data synchronization using change tracking in Azure:

Sending data from the cloud to the app is way more tricky. You want to do it in the most efficient way, to spare bandwidth and device battery life, so you need a way to know what has changed since the last time that specific user and device synced. As data is surely stored in a database of some sort, you also need some efficient method on the database side to make sure you can quickly get everything that is new or changed and that is in the scope for that specific user/device. If your mobile application is successful, this means that you may literally have millions and millions of rows or documents to scan and check for changes.

Not an easy task: all hope is lost then? Just send back the whole data set and that’s it? Of course not! We don’t want to just be developers, but better developers, right?

Modern databases can help a lot in tackling this challenge. Azure SQL, for example, has a feature called Change Tracking that, guess what?, will take care of keeping track of changes for you.

Davide includes a lot of detail and even a sample application on GitHub.
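
The core of the pattern is a query against CHANGETABLE with the last version the device synced. As a hedged sketch (not Davide's sample app), here is what that round trip could look like from Python with pyodbc; the table, columns, and connection details are hypothetical.

```python
import pyodbc

def fetch_changes(conn_str: str, last_sync_version: int):
    """Return (new_sync_version, changed_rows) since last_sync_version."""
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        # Capture the version we are syncing up to before reading the changes.
        new_version = cur.execute(
            "SELECT CHANGE_TRACKING_CURRENT_VERSION();").fetchval()
        rows = cur.execute(
            """
            SELECT ct.Id, ct.SYS_CHANGE_OPERATION, t.Title, t.UpdatedAt
            FROM CHANGETABLE(CHANGES dbo.ToDoItems, ?) AS ct
            LEFT JOIN dbo.ToDoItems AS t ON t.Id = ct.Id;
            """,
            last_sync_version,
        ).fetchall()
        return new_version, rows
```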


Audio Analysis in R

Jeroen Ooms walks us through some audio analysis with R and the av package:

The latest version of the rOpenSci av package includes some useful new tools for working with audio data. We have added functions for reading, cutting, converting, transforming, and plotting audio data in any popular audio / video format (mp3, mkv, aac, etc).

The functionality can either be used by itself, or to prepare audio data for further analysis in R using other packages. We hope this clears an important hurdle to use R for research on speech, music, and whale mating calls.

One of the most interesting things I saw Edward Tufte demonstrate was visualizing music using the Music Animation Machine. There’s a lot of space here to experiment. H/T R-Bloggers.
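
The post itself is about R's av package; as a loose Python analogue of the kind of analysis it enables, this sketch reads a WAV file with SciPy and plots its spectrogram (the file name is made up).

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("whale_call.wav")   # hypothetical recording
if samples.ndim > 1:                             # mix stereo down to mono
    samples = samples.mean(axis=1)

freqs, times, power = spectrogram(samples, fs=rate)
plt.pcolormesh(times, freqs, power, shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.show()
```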


Generating Fake Data with R

Dave Mason takes a look at generating fake PII in R:

I’ve been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren’t data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?

In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I’ll focus on the “generator” package.

Click through to see what it does and Dave’s thoughts on the topic. It would also be possible to generate fake data in R by hitting a web API like Daniel Hutmacher’s service.
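
Dave works through R's generator package; for comparison, here is a quick hedged sketch of the same idea in Python with the Faker library (every value is synthetic, and the column choices are mine).

```python
import pandas as pd
from faker import Faker

fake = Faker()
people = pd.DataFrame({
    "name":  [fake.name() for _ in range(5)],
    "email": [fake.email() for _ in range(5)],
    "phone": [fake.phone_number() for _ in range(5)],
})
print(people)
```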


Moving Data Around in Azure Synapse Analytics

Niko Neugebauer looks at some techniques for copying data into a table in an Azure Synapse Analytics SQL Pool:

First of all, let us list some of them (and I am not even attempting to provide all of them, of course):
– INSERT INTO … SELECT FROM … (the most well known one)
– SELECT INTO … FROM … (the most well-known to perform well, since it will create a HEAP while copying most of the properties from the original table(s))
– CREATE TABLE … AS SELECT … (the old way, which must be like 10 years old on PDW/APS & Azure SQL DW, but that has never gotten into a Box Product or Azure SQL Database)
– Polybase (that will use the External Tables & externally allocated data to transfer into Azure SQL DW)
– BCP (good old tested friend that will give you a pain in the neck until you dominate it)
– OPENROWSET / BULK INSERT (some very good and very old friends with complicated histories (who remembers all the code pages and settings?) and an uncertain future, mostly because of their original restrictions, I guess)
– COPY INTO … (the brand new command that will allow you under very neat privileges to copy data from the external storage accounts, much like BULK INSERT)

In this blog post I will simply focus on those features that have not been ported (hopefully just yet): CTAS & COPY INTO.

Read on to see how these two work. Also, I too have wanted CTAS in on-premises SQL Server for years.
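
For reference, here is a hedged sketch of what those two statements look like, issued from Python with pyodbc against a dedicated SQL pool; the table, storage URL, and credential are all hypothetical placeholders.

```python
import pyodbc

conn_str = "Driver={ODBC Driver 18 for SQL Server};Server=<synapse-server>;Database=<pool>;..."  # placeholder

# CTAS: create and load the copy in a single statement.
ctas = """
CREATE TABLE dbo.SalesOrders_Copy
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.SalesOrders;
"""

# COPY INTO: bulk load from external storage into an existing table.
copy_into = """
COPY INTO dbo.SalesOrders
FROM 'https://<account>.blob.core.windows.net/landing/salesorders/*.csv'
WITH (FILE_TYPE = 'CSV', FIRSTROW = 2,
      CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>'));
"""

with pyodbc.connect(conn_str, autocommit=True) as conn:
    cur = conn.cursor()
    cur.execute(ctas)
    cur.execute(copy_into)
```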
