Press "Enter" to skip to content

June 19, 2020

Building Data Pipelines with Apache NiFi

The Hadoop in Real World team takes a look at Apache NiFi:

NiFi is an easy-to-use tool that prefers configuration over coding.

However, NiFi is not limited to data ingestion. NiFi can also handle data provenance, data cleaning, schema evolution, data aggregation, transformation, job scheduling, and more. We will discuss these in more detail in an upcoming post with a real-world data flow pipeline.

Hence, we can say NiFi is a highly automated framework for gathering, transporting, maintaining, and aggregating data of various types from various sources to a destination in a data flow pipeline.

Click through for an example with instructions. The feeling is pretty close to Informatica or SQL Server Integration Services, so if you’re an old hand at one of those, you’ll get into this pretty easily.


Data Visualization in R

Dan Fitton provides an introductory overview to several visualization tools in R:

The other way to communicate data with R is to produce an interactive dashboard or web application within R using Shiny. Whereas Markdown reports are most useful for explanatory analysis, Shiny, in my opinion, is useful for exploratory data analysis. This is when you want to display information for investigative purposes, allowing the user to gain greater familiarity by having the ability to interact with data, filter it, and dig deeper into the underlying details.

Shiny is incredibly flexible, providing the user the capability of turning their R code and objects, including tables, plots, and analysis, into a comprehensive and interactive web page or app, without requiring a fully-fledged web development skillset. Although there is a steep learning curve, the freedom and precision Shiny brings means that for the most part you are limited only by your skillset rather than the tool itself.

I’ve seen some really useful Shiny dashboards. Dan is correct that it can take a lot of work to get them right, but when you do, the results can be outstanding.


Breaking out of Azure Data Factory ForEach Activities

Andy Leonard is planning a jailbreak:

“What if something fails inside the ForEach activity’s inner activities, Andy?”

That is an excellent question! I’m glad you asked. The answer is: The ForEach activity continues iterating.

I can now hear some of you asking…

“What if I want the ForEach activity to fail when an inner activity fails, Andy?”

Another excellent question, and you’ve found a post discussing one way to “break out” of a ForEach activity’s iteration.

Read on for the process.


NVARCHAR Everywhere

I get to put on my contrarian hat:

In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.

Here’s the short version. When you create a table and need a string column, you have a couple of options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store this string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.

I have a video which goes into detail, plus a bunch of words. Plus mice and banjos. 🐭🪕
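If you want to see the core issue in action before pressing play, here is a minimal T-SQL sketch; the table and data are mine for illustration, not from the video:

-- Illustrative only: VARCHAR stores one byte per character using the
-- column's code page, so characters outside that code page are lost.
-- NVARCHAR stores UTF-16 and keeps them.
CREATE TABLE dbo.StringTest
(
    StringVarchar  VARCHAR(50),
    StringNvarchar NVARCHAR(50)
);

INSERT INTO dbo.StringTest (StringVarchar, StringNvarchar)
VALUES (N'こんにちは', N'こんにちは');

SELECT StringVarchar, StringNvarchar
FROM dbo.StringTest;
-- On a typical Latin1 collation, StringVarchar comes back as '?????',
-- while StringNvarchar returns こんにちは intact.

SQL Server 2019’s UTF-8 collations complicate this picture by letting VARCHAR store Unicode as well, which is part of why the honest answer remains “it depends.”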


The Origins of SentryOne Plan Explorer

Jason Hall gives us a bit of history:

Greg saw a need in his own work, and I was seeing a need in the field with our customers, for a way to go beyond identifying high-impact queries. DBAs and developers needed a way to tune queries surfaced by SentryOne SQL Sentry’s Top SQL without fiddling with a lot of extra tools to get there. We were already building integration with SQL Server Management Studio (SSMS), which included graphical query plans, so the original thought was to extend that integration from SentryOne with a link that opened plans in SSMS from Top SQL in SQL Sentry.

It seemed like an elegant solution that would allow us to reuse some code, but it wasn’t long before Brooke Philpott discovered that we wouldn’t be able to get what we needed this way. That particular part of SSMS wasn’t exposed to us in the manner we needed. Par for the course, we weren’t going to let that stop us from filling the need. Greg and Brooke dug into the problem to discover a mix of documentation, flow controls, and ingenuity that would provide the foundation for building our own query plan visuals.

Read on for the story and a bit about how the product has morphed through the years.


Securing the Data Prep Area

Tim Mitchell explains why you should limit access to your staging area:

First things first, let’s define what a data prep area is. Data preparation (prep) is a common phase of extract, transform, and load (ETL) operations in which data is temporarily written for cleansing, deduplication, reshaping, or other data modifications. Also sometimes referred to as a landing area or a staging area, this is a common design pattern when moving data from a data store optimized for online transaction processing (OLTP) to a data model more friendly to analytics or reporting.

The data prep area really is a lot like a restaurant kitchen: it’s sometimes chaotic, it’s not consumer friendly, and there is a legitimate risk of consuming half-prepared goods.

Tim lays out why that is, so check it out.
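As a quick sketch of one way to lock the kitchen door in SQL Server — the schema and role names here are hypothetical, not from Tim’s post — you can isolate the prep area in its own schema and deny access to ordinary readers:

-- Hypothetical names: keep landing tables in a dedicated schema.
CREATE SCHEMA stage AUTHORIZATION dbo;
GO

CREATE TABLE stage.CustomerLanding
(
    CustomerID   INT           NOT NULL,
    CustomerName NVARCHAR(100) NULL  -- raw data, not yet cleansed
);
GO

CREATE ROLE ReportReaders;  -- hypothetical reporting role
GO

-- Keep consumers out of the kitchen: report readers never see
-- half-prepared data.
DENY SELECT, INSERT, UPDATE, DELETE ON SCHEMA::stage TO ReportReaders;
GO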


Tips for Reducing Cloud Costs

Manas Narkar has a few tips for reducing the amount of money you spend on cloud infrastructure:

Cost optimization is a continuous process that evolves as you build your solutions. It starts with the initial architecture and continues throughout the entire solution lifecycle. Getting the architecture right will save you a lot of effort and money down the road. Having said that, you should regularly review your architectural approach and selection of services to adapt to business changes.

A fully cost-optimized system optimizes cloud resources without sacrificing performance and efficiency. When it comes to cost optimization, you can use several tools and techniques. The information below lists some of the core principles that you can apply to any cloud solution.

Costing items in the cloud is a good bit different from doing so on-premises, to the point where entirely different architectures succeed.


Azure Data Studio June 2020 Release

Alan Yu announces a new release of Azure Data Studio:

The Data Virtualization extension for Azure Data Studio is now updated with more functionality and a new logo. This update allows you to use the data virtualization wizard to virtualize MongoDB and Teradata data sources into your SQL Server. This new functionality is available for SQL Server 2019 instances running CU5 or later.

To install the extension, search for Data Virtualization in the extension viewlet in Azure Data Studio and click install.

Of course I’m going to clip the bit about PolyBase.
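For a taste of what the wizard produces under the covers, here is a rough T-SQL sketch of virtualizing a MongoDB collection via PolyBase. Every name and connection string below is a placeholder, and the sketch assumes a database master key already exists:

-- Placeholder names throughout.
CREATE DATABASE SCOPED CREDENTIAL MongoCredential
WITH IDENTITY = 'mongo_user', SECRET = 'mongo_password';
GO

CREATE EXTERNAL DATA SOURCE MongoDbSource
WITH
(
    LOCATION = 'mongodb://mongoserver:27017',
    CREDENTIAL = MongoCredential
);
GO

CREATE EXTERNAL TABLE dbo.Orders
(
    _id    NVARCHAR(24) NOT NULL,  -- MongoDB ObjectId surfaced as a string
    Amount FLOAT
)
WITH
(
    LOCATION = 'salesdb.orders',   -- database.collection on the Mongo side
    DATA_SOURCE = MongoDbSource
);
GO

-- Query the MongoDB collection with plain T-SQL.
SELECT TOP (10) * FROM dbo.Orders;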
