Press "Enter" to skip to content

Category: Architecture

Authentication in Hadoop with Apache Ozone

Xiaoyu Yao explains how we can use Apache Ozone to perform service account authentication for a Hadoop cluster:

Like Hadoop delegation tokens, Ozone security token has a token identifier along with a signed signature from the issuer. Ozone manager issues delegation token and block tokens for users or client applications authenticated with Kerberos. The signature of the token can be validated by token validators to verify the identity of the issuer. This way, a valid token holder can use the token to perform operations against the cluster services as if they have Kerberos tickets of the issuer. 

Read on for the high-level overview.

Comments closed

Building Metadata for an ADF Pipeline

Paul Andrew continues a series on Azure Data Factory and metadata-driven pipelines:

Welcome back friends to part 2 of this 4 part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.

Let’s start with a nice new shiny Azure SQLDB database and schema. This can easily be scaled up as our calls from Data Factory increase and ultimately the solution we are using the framework for grows.

Soon we will get to see the Azure Data Factory power in action.

Comments closed

A Metadata-Driven Framework for ADF Pipelines

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, overtime changes in the technology we use means the way we now deliver this orchestration has to change as well, especially in Azure. On that basis and using my favourite Azure orchestration service; Azure Data Factory (ADF) I’ve created an alpha metadata driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory so now as a technical exercise I’ve rolled my own simple processing framework. Mainly, to understand how easily we can make it with the latest cloud tools and fully exploiting just how dynamic you can get a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.

Comments closed

Fun with Metaphors: Data Lakehouses

Ben Lorica, et al, have a new metaphor to try out:

Over the past few years at Databricks, we’ve seen a new data management paradigm that emerged independently across many customers and use cases: the lakehouse. In this post we describe this new paradigm and its advantages over previous approaches.

The Data Lake’s Aristotelian counterpart is the Data Swamp. I’m working on a similar comp for the Data Lakehouse (Data Swampboat? Data Swamphouse is too easy), but in the meantime, that one person who goes and slaughters your application’s performance by butchering the data in your Data Lakehouse? That’s a Data Jason.

1 Comment

From SQL Server to Cassandra

Shel Burkow has started a new series:

A subset of related tables in a relational schema can satisfy any number of queries known and unknown at design time. Refactoring the schema into one Cassandra table to answer a specific query, though, will (re)introduce all the data redundancies the original design had sought to avoid.

In this series, I’ll do just that. Starting from a normalized SQL Server design and statement of the Cassandra query, I’ll develop four possible solutions in both logical and physical models. To get there, though, I’ll first lay the foundation.

This initial article focuses on the Cassandra primary key. There are significant differences from those in relational systems, and I’ll cover it in some depth. Each solution (Part III) will have a different key.

Cassandra (as well as Riak, while that was still a thing people cared about) has the concept of tables and SQL statements to work with them, but it’s quite different from a relational database, different enough that new design patterns are necessary. Just about the worst thing you could do would be to drop your relational database schema in Cassandra and call it a day.

Comments closed

Use SQL for XML and JSON Creation

Lukas Eder argues that if you’re storing the data in SQL and you need to get data from a database into JSON or XML format, just use SQL for that:

In English: We need a list of actors, and the film categories they played in, and grouped in each category, the individual films they played in.

Let me show you how easy this is with SQL Server SQL (all other database dialects can do it these days, I just happen to have a SQL Server example ready:

Lukas makes a great point and has a FAQ to follow up on it. If there’s a reason for mapping at a higher layer—if you’re actually adding value rather than building out a set of converters—that’s one thing, but if you’re just accepting a data set and returning a JSON blob…well, your database product can do that too.

Comments closed

Star Schemas and Power BI

Alberto Ferrari explains why star schemas are so important to Power BI:

A common question among data modeling newbies is whether it is better to use a completely flattened data model with only one table, or to invest time in building a proper star schema (you can find a description of star schemas in Introduction to Data Modeling). As coined by Koen Verbeeck, the motto of a seasoned modeler should be “Star Schema all The Things!”

The goal is to demonstrate that a report using a flattened table returns inaccurate numbers, whereas using a star schema turns it into a sound analytical system.

Read on for the example.

Comments closed

Being a SQL Server Product Owner

Kevin Chant has an interesting role:

Now, I have had a few people ask me what a Product Owner actually does. Some say that it sounds like an architect role.

In reality, the role is one that’s mainly related to newer working practices like Scrum.

A Product Owners list of responsibilities include talking to all the stakeholders for you team in the business and organise the priorities on your backlog board.

The concept makes sense, though this is the first time I’ve heard of such a role for a tool the engineers use rather than a product offered for sale.

Comments closed

Against Premature Re-Architecture

Cyndi Johnson has a good rant:

One of my biggest pet peeves in software development is the compulsion that so many developers have to rip up the foundation and completely build something over again, pretty much from scratch.

I’ve been that developer plenty of times. It’s easy to walk in, see that there are some problems, and want to raze everything. Sometimes that’s a reasonable answer, but every apparent mismatch or hack was put in to solve a particular business rule, many of which are lost to the mists of time. Burning down and starting over loses a lot of that information, so rebuilding is something you do with caution.

Comments closed

The Flexible Data Lake

Neil Stokes explains how you can optimize a Hadoop-based data lake:

There are many details, of course, but these trade-offs boil down to three facets as shown below.

Big refers to the volume of data you can handle with your environment. Hadoop allows you to scale your storage capacity – horizontally as well as vertically – to handle vast volumes of data.

Fast refers to the speed with which you can ingest and process the data and derive insights from it. Hadoop allows you to scale your processing capacity using relatively cheap commodity hardware and massively parallel processing techniques to access and process data quickly.

Cheap refers to the overall cost of the platform. This means not just the cost of the infrastructure to support your storage and processing requirements, but also the cost of building, maintaining and operating the environment which can grow quite complicated as more requirements come into play.

The bottom line here is that there’s no magic in Hadoop. Like any other technology, you can typically achieve one or at best two of these facets, but in the absence of an unlimited budget, you typically need to sacrifice in some way.

Software development is full of trade-offs, and data lakes are no different. Read the whole thing.

Comments closed