Press "Enter" to skip to content

Day: April 6, 2020

Handling Bad Records with Apache Spark

Divyansh Jain shows three techniques for handling invalid input data with Apache Spark:

Most of the time writing ETL jobs becomes very expensive when it comes to handling corrupt records. And in such cases, ETL pipelines need a good solution to handle corrupted records. Because, larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes:

– Missing information
– Incomplete information
– Schema mismatch
– Differing formats or data types

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

This is the seedy underbelly of semi-structured data: you don’t have control over the data as it comes in, so you have to control the data coming out.

Comments closed

Hive + LLAP Now Faster with ElasticMapReduce 6

Suthan Phillips has a benchmark for ElasticMapReduce 5 versus 6:

To evaluate the performance benefits of running Hive with Amazon EMR release 6.0.0, we’re using 70 TCP-DS queries with a 3 TB Apache Parquet dataset on a six-node c4.8xlarge EMR cluster to compare the total runtime and geometric mean with results from EMR release 5.29.0.

The results show that the TPC-DS queries run twice as fast in Amazon EMR 6.0.0 (Hive 3.1.2) compared to Amazon EMR 5.29.0 (Hive 2.3.6) with the default Amazon EMR Hive configuration.

The following graph shows performance improvements measured as total runtime for 70 TPC-DS queries. Amazon EMR 6.0.0 has the better (lower) runtime.

Click through for the measures and a bit more info on LLAP.

Comments closed

VARCHAR Columns and Bytecode Version Mismatch in R

Dave Mason runs through a tricky problem with SQL Server Machine Learning Services:

During my testing, I’ve found R handles CHAR and VARCHAR data within the input data set as long as the ASCII codes comprising the data is in the range from 0 to 127. This much is not surprising–those are the character codes for the ASCII table. Starting with character code 128, R begins having some trouble. 

Read on to see the problem. Dave’s advice at the end is sound (and frankly, my advice for any string data in SQL Server).

Comments closed

Applying the Principles of Site Reliability Engineering

Sheldon Hull has an essay on site reliability engineering in practice:

I’ve always been focused on building resilient systems, sometimes to my own detriment velocity wise. Balancing the momentum of delivery features and improving reliability is always a tough issue to tackle. Automation isn’t free. It requires effort and time to do correctly. This investment can help scaling up what a team can handle, but requires slower velocity initially to do it right.

How do you balance automating and coding solutions to manual fixes, when you often can’t know the future changes in priority?

This is personal experience rather than prescriptive guidance. Very interesting personal experience.

Comments closed

Tying Azure Data Factory to Source Control

Eddy Djaja explains why you really want to tie Azure Data Factory to your source control:

Azure Data Factory (ADF) is Microsoft’s ETL or more precise: ELT tool in the cloud. For more information of ADF, Microsoft puts the introduction of ADF in this link: https://docs.microsoft.com/en-us/azure/data-factory/introduction. As some have argued if ADF will replace or complement the “on-premise”  SSIS, it is uncertain and only time can tell what will happen in the future.
Unlike SSIS, the authoring of ADF does not use Visual Studio. ADF authoring uses a web browser to create ADF components, such as pipelines, activities, datasets, etc. The simplicity of authoring ADF may confuse the novice developers on how ADF components are saved, stored and published. When logging to ADF for the first time after creating an ADF, the authoring is in the ADF mode. How do we know?

Click through for the explanation and some resources on how to do it.

Comments closed

Checking JSON Structure with ADF

Rayis Imayev takes us through the solution of a tricky problem in Azure Data Factory:

Within my “ForEach” container I have also placed a Stored Procedure task and set 4 data elements from my incoming data stream as values for corresponding parameters.

However this approach will not work for all my incoming JSON events, it actually failed for the last one, since it didn’t have both “stop_time” and “last_update” data elements.

An easy way to fix this problem is to add missing data elements with empty values for the last event record, however, when we don’t have control over incoming data, we need to adjust our data processing steps.

Read on to see how Rayis solves this problem.

Comments closed