Press "Enter" to skip to content

Month: December 2019

Choosing a SQL Platform on Hadoop

Sagar Kewalramani walks us through the choices for SQL platforms on the Cloudera Data Platform:

CDW on CDP is a new service that enables you to create a self-service data warehouse for teams of Business Intelligence (BI) analysts. You can quickly provision a new data warehouse and share any data set with a specific team or department. Do you remember when you could provision a data warehouse on your own? Without infrastructure and platform teams getting involved? This was never possible. CDW fulfills this mission.

However, CDW makes several SQL engines available, and with more choice comes more opportunities for confusion.   Let’s explore the SQL engines available in CDW on CDP and talk about which is the right SQL option for the right use case.

So many choices!  Impala? Hive LLAP?  Spark? What to use when?  Let’s explore.

Infrastructure and platform teams start to get involved approximately two days after the unexpectedly large bill arrives.

That aside, this is a really nice article covering several platform technologies, including Impala, Hive LLAP, and Spark SQL.


Parquet Versus Avro

Matthew Rathbone compares the Parquet and Avro file formats:

JSON improves upon CSV as each row provides some indication of schema, but without a special header-row, there’s no way to derive a schema for every record in the file, and it isn’t always clear what type a ‘null’ value should be interpreted as.

Avro and Parquet on the other hand understand the schema of the data they store. When you write a file in these formats, you need to specify your schema. When you read the file back, it tells you the schema of the data stored within. This is super useful for a framework like Spark, which can use this information to give you a fully formed data-frame with minimal effort.

I was kind of hoping to see ORC in the comparison as well, though even when the Hortonworks-Cloudera competition was at its max, my recollection is that the differences between the two formats were pretty small (where ORC was a little faster for non-nested data and Parquet a little faster for nested data).
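To make the schema-on-read point concrete, here is a minimal PySpark sketch: both formats persist their schema with the data, so reading them back requires no header row and no type guessing. It assumes a local Spark session with the spark-avro package on the classpath (e.g. via --packages org.apache.spark:spark-avro_2.12:2.4.4); the paths are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 10.0), (2, "bob", 42.5)],
    ["id", "name", "score"],
)

# Both formats write the schema alongside the data.
df.write.mode("overwrite").parquet("/tmp/people_parquet")
df.write.mode("overwrite").format("avro").save("/tmp/people_avro")

# Reading back yields a fully formed DataFrame with the correct column types.
spark.read.parquet("/tmp/people_parquet").printSchema()
spark.read.format("avro").load("/tmp/people_avro").printSchema()
```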


From SQL Server to Cassandra

Shel Burkow has started a new series:

A subset of related tables in a relational schema can satisfy any number of queries known and unknown at design time. Refactoring the schema into one Cassandra table to answer a specific query, though, will (re)introduce all the data redundancies the original design had sought to avoid.

In this series, I’ll do just that. Starting from a normalized SQL Server design and statement of the Cassandra query, I’ll develop four possible solutions in both logical and physical models. To get there, though, I’ll first lay the foundation.

This initial article focuses on the Cassandra primary key. There are significant differences from primary keys in relational systems, and I’ll cover them in some depth. Each solution (Part III) will have a different key.

Cassandra (as well as Riak, while that was still a thing people cared about) has the concept of tables and SQL statements to work with them, but it’s quite different from a relational database, different enough that new design patterns are necessary. Just about the worst thing you could do would be to drop your relational database schema in Cassandra and call it a day.
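As a small illustration of that difference, here is a hedged sketch using the DataStax Python driver: a query-first table whose primary key is a partition key plus clustering columns controlling placement and sort order, rather than a plain uniqueness constraint. The keyspace, table, and column names are invented, and it assumes a Cassandra node on localhost.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One table per query: "orders for a customer, newest first."
# The partition key (customer_id) decides where the data lives; the clustering
# columns (order_date, order_id) decide the sort order within the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        customer_id uuid,
        order_date  timestamp,
        order_id    uuid,
        total       decimal,
        PRIMARY KEY ((customer_id), order_date, order_id)
    ) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC)
""")

cluster.shutdown()
```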


Formatting SSIS Packages

Tim Mitchell shows us different options available when formatting SSIS packages:

Most folks I know go to one extreme or the other when it comes to the visual layout of SQL Server Integration Services packages: either they don’t care a tiny bit about the appearance, or they insist on an easy-to-read layout before a project is considered complete. I am definitely in the second group, and will almost always spend the time to make sure my packages are properly laid out.

Regardless of which of these groups you are in, it is very easy to apply some simple visual formatting to your packages. In this quick tip, I’ll show you how to use the visual formatting options in SSIS to automate the layout.

I consider it quite important: it gives a future viewer an idea of your vision as a developer. If I can’t read what you’re doing, that does not bode well.


Automating JSON to Tables

Dave Mason has a stored procedure for us:

The code for my stored procedure is below. It has a single input parameter for a string of JSON data. The name/value pairs of the first “row” of JSON data are parsed to obtain column names and types for the result set. A query string for the OPENJSON function is constructed and executed via EXEC. JSON’s support for data types is pretty sparse compared to SQL Server (notice the big CASE expression where I attempt to do some mapping). If you’re using SQL 2016, you’ll have to make an edit for the STRING_AGG function.

Click through for the code and demos.
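Dave’s procedure is T-SQL; as a rough conceptual sketch of the same idea (not his code), here is a Python function that inspects the first JSON “row”, guesses SQL Server types, and builds an OPENJSON query string. The type mapping is deliberately crude, mirroring the big CASE expression he mentions.

```python
import json

def build_openjson_query(json_text: str) -> str:
    first_row = json.loads(json_text)[0]

    def sql_type(value):
        # JSON only knows strings, numbers, booleans, and null, so the mapping
        # to SQL Server types is necessarily lossy.
        if isinstance(value, bool):
            return "bit"
        if isinstance(value, int):
            return "int"
        if isinstance(value, float):
            return "float"
        return "nvarchar(4000)"  # strings, nulls, and anything else

    with_clause = ", ".join(
        f"[{name}] {sql_type(value)}" for name, value in first_row.items()
    )
    return f"SELECT * FROM OPENJSON(@json) WITH ({with_clause});"

sample = '[{"Id": 1, "Name": "Anne", "Score": 9.5, "Active": true}]'
print(build_openjson_query(sample))
# SELECT * FROM OPENJSON(@json) WITH ([Id] int, [Name] nvarchar(4000), [Score] float, [Active] bit);
```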


Deferred Compilation and Compatibility Level 140 Query Hints

Milos Radivojevic shows that table variable deferred compilation does not care about your QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140 hint:

The plan is created under CL 140, but the estimated number of rows for the table variable is not 1 but the actual number. Even if you were to specify the FORCE_LEGACY_CARDINALITY_ESTIMATION hint, the query would still be deferred compiled and the behavior of the table variable would be the same.

Table variable deferred compilation respects settings at the database scope and ignores all hints except one.

Click through to learn what that one hint is.


Multi-Column Transformations with Power Query

Imke Feldmann shows how we can apply a function to multiple columns at once using Power Query:

You can apply simple transformations to multiple columns at once in Power Query using the UI only. In this article I show how you can apply advanced transformations on multiple columns at once instead. You can also use this to apply custom functions. And lastly, for the lazy/efficient fans of custom M-functions: you will get a new “TranformAllMyColumnsAtOnceHowILikeIt” function as well.

Read on for a few examples, including one using custom code.
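Imke’s solution is written in M; purely as an analogue of the “one function, many columns at once” idea, here is a rough pandas sketch. The column names, values, and the cleanup function are all made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B"],
    "Jan": ["1,5", "2,0"],
    "Feb": ["3,0", "4,5"],
})

# One custom function, applied to every column in the list at once.
def to_number(col: pd.Series) -> pd.Series:
    return col.str.replace(",", ".", regex=False).astype(float)

numeric_cols = ["Jan", "Feb"]
df[numeric_cols] = df[numeric_cols].apply(to_number)
print(df.dtypes)
```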


Parameters, Variables, and ForEach Loops in ADF

Cathrine Wilhelmsen has a few more posts in the Azure Data Factory series for us. First up is on parameters:

We can build dynamic solutions!

Creating hardcoded datasets and pipelines is not a bad thing in itself. It’s only when you start creating many similar hardcoded resources that things get tedious and time-consuming. Not to mention, the risk of manual errors goes drastically up when you feel like you create the same resource over and over and over again.

After that is variables:

Parameters are external values passed into pipelines. They can’t be changed inside a pipeline. Variables, on the other hand, are internal values that live inside a pipeline. They can be changed inside that pipeline.

Parameters and variables can be completely separate, or they can work together. For example, you can pass a parameter into a pipeline, and then use that parameter value in a set variable or append variable activity.

And the latest post in the series is all about ForEach loops:

By default, the foreach loop tries to run as many iterations as possible in parallel. You can choose to run them sequentially instead, for example if you need to copy data into a single table and want to ensure that each copy finishes before the next one starts.

If you choose to run iterations in parallel, you can limit the number of parallel executions by setting the batch count. The default number is 20 and the max number is 50.

This has been a very nice series, and it looks like there is a little bit more to go.
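For reference, the ForEach knobs Cathrine describes live in the activity’s pipeline JSON. Here is a rough sketch of that shape, written as a Python dict rather than the exact pipeline/ARM schema, with made-up names and an empty inner activity list.

```python
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        # The collection to iterate over, typically fed by a pipeline parameter.
        "items": {
            "value": "@pipeline().parameters.TableList",
            "type": "Expression",
        },
        "isSequential": False,  # default: iterations run in parallel
        "batchCount": 10,       # cap on parallel executions (default 20, max 50)
        "activities": [
            # e.g. a Copy activity that references @item() for the table name
        ],
    },
}
```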


Sticky Partitioning in Kafka 2.4

Justine Olshan takes us through sticky partitioning in Kafka 2.4:

The sticky partitioner addresses the problem of spreading out records without keys into smaller batches by picking a single partition to send all non-keyed records. Once the batch at that partition is filled or otherwise completed, the sticky partitioner randomly chooses and “sticks” to a new partition. That way, over a larger period of time, records are about evenly distributed among all the partitions while getting the added benefit of larger batch sizes.

It looks like this is an improvement with few downside tradeoffs.
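The sticky partitioner itself lives in the Java client, but the behavior is easy to see in a toy Python simulation (not the actual client code): keyless records stick to one randomly chosen partition until a batch completes, then the partitioner picks a new partition.

```python
import random
from collections import Counter

NUM_PARTITIONS = 6
BATCH_SIZE = 100  # stand-in for "batch is full or otherwise completed"

def simulate(num_records: int) -> Counter:
    counts = Counter()
    current = random.randrange(NUM_PARTITIONS)
    records_in_batch = 0
    for _ in range(num_records):
        counts[current] += 1
        records_in_batch += 1
        if records_in_batch == BATCH_SIZE:
            # Stick to a new partition for the next batch.
            current = random.choice(
                [p for p in range(NUM_PARTITIONS) if p != current]
            )
            records_in_batch = 0
    return counts

# Over many batches the distribution evens out, but each batch stays large.
print(simulate(60_000))
```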


Fixing the Small File Problem in Hadoop

Guy Shilo takes us through the Hadoop Archive format:

It [HDFS] has a hard time handling many small files. The memory footprint of the namenodes becomes high, as they have to keep track of many small blocks, and the performance of scans goes down.

The best way to fix this situation is, of course, to avoid it in the first place. This can be done when designing the application or the pipeline that inserts the data into HDFS, for example by bundling many files into one container such as a SequenceFile, Avro, or a Hadoop archive (.har file).

The Hadoop archive is a somewhat overlooked option that I want to demonstrate today. You will see that it can be very useful in some cases but not so great in others.

Read the whole thing before giving it a try, as there are some downsides.
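For a taste of what creating an archive looks like, here is a minimal sketch that shells out to the hadoop archive command from Python. The paths are invented, and it assumes the hadoop client is on the PATH and can reach the cluster.

```python
import subprocess

parent = "/data/raw"      # parent directory of the many small files
source = "events"         # source directory, relative to the parent
dest = "/data/archived"   # where events.har will be written

subprocess.run(
    ["hadoop", "archive", "-archiveName", "events.har",
     "-p", parent, source, dest],
    check=True,
)

# The archived files remain readable through the har:// filesystem, e.g.
#   hadoop fs -ls har:///data/archived/events.har
```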
