Copying Cassandra Data to HDFS

Landon Robinson shows how you can use Spark to extract data from Cassandra and move it into HDFS:

Cassandra is a great open-source solution for accessing data at web scale, thanks in no small part to its low-latency performance. And if you’re a power user of Cassandra, there’s a high probability you’ll want to analyze the data it contains to create reports, apply machine learning, or just do some good old fashioned digging.

However, Cassandra can prove difficult to use as an analytical warehouse, especially if you’re using it to serve data in production around the clock. But one approach you can take is quite simple: copy the data to Hadoop (HDFS).

Read on to learn how.
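
For a rough sense of what that copy can look like, here is a minimal PySpark sketch (not Landon's exact code): it assumes the DataStax spark-cassandra-connector is available on the cluster, and the host, keyspace, table, and HDFS path are all placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read a Cassandra table via the spark-cassandra-connector
# and land a snapshot in HDFS as Parquet. All names here are placeholders.
spark = (
    SparkSession.builder
    .appName("cassandra-to-hdfs")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")   # the connector's data source name
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)

# Write the extract to HDFS for reporting and machine learning workloads.
df.write.mode("overwrite").parquet("hdfs:///data/cassandra/my_table")
```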

Arrays in Azure Data Factory

Kevin Feasel

2019-06-10

Cloud, ETL

Rayis Imayev takes us through arrays in Azure Data Factory:

Currently, there are 3 data types supported in ADF variables: String, Boolean, and Array. The first two are pretty easy to use: Boolean for logical binary results and String for everything else, including the numbers (no wonder there are so many conversion functions in Azure Data Factory that we can use).

I’ve also blogged about using Variables in Azure Data Factory:
– Setting Variables in Azure Data Factory Pipelines
– Append Variable activity in Azure Data Factory: Story of combining things together  
– System Variables in Azure Data Factory: Your Everyday Toolbox 
– Azure Data Factory: Extracting array first element

Click through for arrays and follow up with those other posts from there.
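
If you manage factories from code, here is a hypothetical sketch of declaring all three supported variable types on a pipeline with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline, and variable names are placeholders, not anything from Rayis's post.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, VariableSpecification

# Hypothetical sketch: a pipeline with one variable of each supported type.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

pipeline = PipelineResource(
    variables={
        "FileName": VariableSpecification(type="String", default_value="input.csv"),
        "IsFullLoad": VariableSpecification(type="Boolean", default_value=True),
        "TablesToLoad": VariableSpecification(type="Array",
                                              default_value=["dbo.Orders", "dbo.Customers"]),
    },
    activities=[],  # Set Variable / Append Variable activities would go here
)

client.pipelines.create_or_update("my-resource-group", "my-data-factory",
                                  "VariableDemoPipeline", pipeline)
```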

Building an ARM Template for Azure Data Factory

Kevin Feasel

2019-06-05

Cloud, ETL

Andy Leonard takes the first steps to building an Azure Data Factory pipeline using Azure Resource Manager Templates:

Azure Resource Manager, or ARM, “allows you to provision your applications using a declarative template.” So says the Azure Quickstart Templates page. ARM templates are JSON and allow administrators to import and export Azure resources using varying management patterns. I really like ARM templates for implementing infrastructure as code in Azure. In this post I show a very simple example of how to use ARM templates to export and then import a basic ADF (Azure Data Factory) pipeline.

The sample code doesn’t do that much by itself, but it does open up a new world of automation.
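
As a rough companion to the import step, here is a hedged Python sketch that deploys an exported ARM template with azure-mgmt-resource; the template file name, resource group, and parameter values are placeholders rather than Andy's actual template.

```python
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import (Deployment, DeploymentMode,
                                                  DeploymentProperties)

# Hedged sketch: push a previously exported ADF ARM template back into Azure.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

with open("arm_template.json") as f:
    template = json.load(f)

poller = client.deployments.begin_create_or_update(
    "my-resource-group",
    "adf-pipeline-deployment",
    Deployment(properties=DeploymentProperties(
        mode=DeploymentMode.INCREMENTAL,          # leave unrelated resources alone
        template=template,
        parameters={"factoryName": {"value": "my-data-factory"}},
    )),
)
poller.result()  # block until the deployment completes
```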

Minimal Logging with FastLoadContext

Kevin Feasel

2019-05-31

ETL, T-SQL

Paul White takes us through another way to perform minimally logged bulk loads with SQL Server:

This post provides new information about the preconditions for minimally logged bulk load when using INSERT...SELECT into indexed tables.

The internal facility that enables these cases is called FastLoadContext. It can be activated from SQL Server 2008 to 2014 inclusive using documented trace flag 610. From SQL Server 2016 onward, FastLoadContext is enabled by default; the trace flag is not required.

Without FastLoadContext, the only index inserts that can be minimally logged are those into an empty clustered index without secondary indexes, as covered in part two of this series. The minimal logging conditions for unindexed heap tables were covered in part one.

Click through for a highly informative article.
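
For a feel of the mechanics, here is a hedged sketch (not Paul's demo) that runs an INSERT...SELECT into an indexed table from Python via pyodbc, enabling trace flag 610 on the older versions; the connection string and table names are placeholders, and whether the rows are actually minimally logged still depends on the preconditions Paul lays out.

```python
import pyodbc

# Hedged sketch: server, database, and table names are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=Sandbox;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# On SQL Server 2008-2014, FastLoadContext requires trace flag 610;
# on 2016 and later it is enabled by default and this step is unnecessary.
cur.execute("DBCC TRACEON(610, -1);")

# INSERT...SELECT into a table with a clustered index. Whether the new rows
# are minimally logged depends on the preconditions described in the article
# (recovery model, sorted input, page allocations, and so on).
cur.execute("""
    INSERT INTO dbo.Target (Id, Payload)
    SELECT Id, Payload
    FROM dbo.Staging
    ORDER BY Id;
""")
```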

Populating a Data Vault Model with Azure Data Factory

Rayis Imayev gives us an example of ELT into a Data Vault model using Azure Data Factory:

To make a full transition from the existing  DW model to an alternative Data Vault I removed all Surrogate Keys and other attributes that are only necessary to support Kimball data warehouse methodology. Also, I needed to add necessary Hash keys to all my Hub, Link and Satellite tables. The target environment for my Data Vault would be SQL Azure database and I decided to use a built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data sourcing keys and composite hash keys of satellite tables attributes (HDIFF).

Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.
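
As a plain-Python illustration of the hash key idea (this is not Mapping Data Flow's crc32 expression syntax, and the column values are made up): one crc32 over the business key gives the hub hash key, and one over the concatenated descriptive attributes gives the satellite hash diff.

```python
import zlib

# Illustration only: crc32 over pipe-delimited values, mirroring the HK/HDIFF idea.
def crc32_hash(*values: str) -> int:
    return zlib.crc32("||".join(values).encode("utf-8"))

hub_customer_hk = crc32_hash("CUST-0042")                       # business key -> HK
sat_customer_hdiff = crc32_hash("Jane Doe", "Oslo", "Premium")  # attributes -> HDIFF

print(hub_customer_hk, sat_customer_hdiff)
```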

Amazon Redshift ETL Tips

Kevin Feasel

2019-05-29

Cloud, ETL

The Blendo team shares a few tips around ETL’ing data to Amazon Redshift:

2. The WLM Method
Use Amazon Redshift’s WLM (workload management) for defining a dedicated queue for the ETL process. Configuring the ETL queue with a small number of slots will help in avoiding excessive COMMITs. Also, avoid COMMITing separately for each transaction since commits are expensive.
Instead, surround multiple steps of the ETL process by a BEGIN…END statement. You can perform COMMIT only after all transformation logic is executed.

Click through for the set of tips.
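
To make the BEGIN…END tip concrete, here is a sketch (cluster endpoint, credentials, IAM role, and table names are all placeholders) that runs a small load-and-merge batch as one transaction via psycopg2, so Redshift commits only once.

```python
import psycopg2

# Sketch: all connection details and object names are placeholders.
conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="etl_user", password="...")
conn.autocommit = True  # manage the transaction explicitly with BEGIN/END
cur = conn.cursor()

cur.execute("BEGIN;")
cur.execute("""
    COPY staging_orders
    FROM 's3://my-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS PARQUET;
""")
cur.execute("""
    DELETE FROM orders
    USING staging_orders
    WHERE orders.order_id = staging_orders.order_id;
""")
cur.execute("INSERT INTO orders SELECT * FROM staging_orders;")
cur.execute("END;")  # END is equivalent to COMMIT: one commit for the whole batch

conn.close()
```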

Loading Data Into SnowflakeDB

Dan Bilsborough shows a couple ways of loading data into SnowflakeDB from Azure:

Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
– User stages, which each user will have by default
– Table stages, which each table will have by default
– Internal named stages, meaning staged within Snowflake

Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.

Read on to see what comes next.
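
For a taste of the internal named stage route, here is a minimal sketch assuming the snowflake-connector-python package; the account, warehouse, file, stage, and table names are placeholders. The flow is: create the stage, PUT the file into it, then COPY INTO the target table.

```python
import snowflake.connector

# Sketch: connection details and object names are placeholders.
conn = snowflake.connector.connect(account="my_account", user="etl_user",
                                   password="...", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="STAGING")
cur = conn.cursor()

# An internal named stage is a database object, so privileges can be granted on it.
cur.execute("CREATE STAGE IF NOT EXISTS daily_load_stage "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);")

# Upload a local file into the stage, then load it into the target table.
cur.execute("PUT file:///data/exports/orders.csv @daily_load_stage AUTO_COMPRESS = TRUE;")
cur.execute("COPY INTO orders FROM @daily_load_stage/orders.csv.gz;")

conn.close()
```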

Azure Data Factory: Mapping and Wrangling Data Flows

Kevin Feasel

2019-05-14

Cloud, ETL

Cathrine Wilhelmsen explains the difference between Mapping Data Flows and Wrangling Data Flows in Azure Data Factory:

Now, we all know that the consultant answer to “which should I use?” is It Depends ™ 🙂 But what does it depend on?

To me, it boils down to a few key questions you need to ask:
– What is the task or problem you are trying to solve?
– Where and how will you use the output?
– Which tool are you most comfortable using?

Read on to see how they both work.

Syncing Google Calendar and Outlook with Microsoft Flow

Kevin Feasel

2019-05-08

ETL

Eugene Meidinger shows how to keep a couple of calendars in sync:

Back when I worked a normal job, I had two calendars: Office 365 for work and Google for home. Now that I work for myself, that’s a lot more complicated. Sometimes a customer will create an account for me in their network. Sometimes I’ll partner with other consultants and work as part of their team. And of course, I’ve got my own work email at [email protected].

I need all of these calendars to consolidate to one place. My natural inclination and personal preference is to put it all into Google. Now, there are sync apps available, but this sort of problem is a perfect use case. A calendar event is created in Outlook, a flow is triggered, and that information is transferred to Google.

I was just complaining about this yesterday and then I see the post this morning. I’m pleased though simultaneously concerned that Eugene has bugged my hotel room.

Data Cleansing Options with Azure

James Serra tries to give you an answer of when you should use different Azure services for data cleansing:

Clean the data and optionally aggregate it as it sits in the source system. The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures). The only benefit with this option is that if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are:
– the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed it again, and it may not even still exist in the source system
– you would put extra stress on the source system when doing the cleaning, which could affect end users of the system
– it could take a long time to clean the data, as the source system may not have fast performance
– you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it
Strongly advise against this option.

Read on for additional options and James’s recommendations.
