Data Cleanup Using Drools

Kevin Feasel

2017-07-24

Data

Rathnadevi Manivannan gives an example of using Drools to create rule-based data cleansing processes:

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.

The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well and invalid well information with modified dates.

This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.

This is an example of applying several constraints and rules to a single data set.  Each individual rule would probably be easier to do in T-SQL, but the whole bunch becomes easier to understand with a procedural language.

Related Posts

Finding Different Casings In Case-Insensitive Columns

Louis Davidson walks through different permutations of case in a data set: The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table’s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let’s just […]

Read More

The Intersection Of Multiple Averages Is The Empty Set

Adi Gaskell argues that we shouldn’t get too wrapped up in “average” behaviors: I’ve written extensively about the tremendous potential for big data in healthcare to drive enormous changes in how we keep people healthy for longer. It goes without saying however that all data is not created equal, and just having a large sample […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31