
Category: Architecture

Using the Azure Architecture Icons

Steve Jones tries out some of the Azure Architecture Icons:

The icons are SVG, so while they work in PowerPoint, adding them to something like this post in OpenLiveWriter doesn’t work. However, I could make a quick diagram and capture an image of it.

Not great, but it shows I can put icons on a page with arrows.

Going one step further, I’ve been digging into Diagrams by mingrammer lately. With it, you use Python to generate diagrams, and there are quite a few Azure icons in there, as well as AWS, on-prem, etc.

Here’s a quick example of what you can do, taken from an upcoming talk of mine:

There are some limitations based on the underlying library, such as how you can’t connect cluster to cluster—meaning I can’t draw a line from “Logging” to “Storage\Logs”; I have to draw it from a particular element (Loki) to a particular element (Elasticsearch). In a lot of traditional reference architecture diagrams, though, that isn’t a problem.
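
For anyone who wants to try the same approach, here’s a minimal sketch of the kind of script the library expects (the services and layout are my own illustration, not Steve’s diagram); it assumes the diagrams package is installed along with a local Graphviz:

```python
# Minimal sketch using the diagrams package (pip install diagrams; Graphviz required).
# The Azure services shown here are illustrative; swap in whatever nodes you need.
from diagrams import Cluster, Diagram
from diagrams.azure.compute import FunctionApps
from diagrams.azure.database import SQLDatabases
from diagrams.azure.storage import BlobStorage

with Diagram("Sample Azure Flow", show=False, filename="sample_azure_flow"):
    ingest = FunctionApps("Ingest")

    with Cluster("Storage"):
        landing = BlobStorage("Landing")
        curated = SQLDatabases("Curated")

    # Edges connect element to element, not cluster to cluster (the limitation noted above).
    ingest >> landing >> curated
```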

Calculating Cloud App Availability

Dave Bermingham gives you a way to calculate how available you should expect your application to be given SLAs:

When deploying business critical applications in the cloud you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.

When considering availability, you must consider the key components that make access to your application possible, which I’ll call the availability chain. The components of the availability chain are:

– Compute
– Network 
– Storage
– Application
– Dependent services

Your application is only as available as your weakest link, and your expected downtime compounds with each additional link you add to the chain. Let’s examine each of the links.

Read on for a breakdown of these items.
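
To make the arithmetic concrete: for components in a serial chain, composite availability is the product of the individual figures, so every extra link pulls the total down. A quick sketch with made-up SLA numbers:

```python
# Hypothetical SLA figures; real numbers come from each provider's SLA documents.
slas = {
    "Compute": 0.9999,
    "Network": 0.9999,
    "Storage": 0.9999,
    "Application": 0.999,
    "Dependent services": 0.999,
}

composite = 1.0
for availability in slas.values():
    composite *= availability  # serial chain: every link must be up for the app to be up

minutes_per_year = 365 * 24 * 60
downtime_minutes = (1 - composite) * minutes_per_year
print(f"Composite availability: {composite:.4%}")
print(f"Expected downtime: {downtime_minutes:.0f} minutes per year")
```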

Cassandra Monitoring and Data Modeling

Instaclustr has put up a couple interesting posts on Cassandra. First, Anup Shirolkar explains how we can monitor Cassandra installations:

Cassandra is developed in Java and is a JVM based system. Each Cassandra node runs a single Cassandra process. JVM based systems are enabled with JMX (Java Management Extensions) for monitoring and management. Cassandra exposes various metrics using MBeans which can be accessed through JMX. Cassandra monitoring tools are configured to scrape the metrics through JMX and then filter, aggregate, and render the metrics in the desired format. There are a few performance limitations in the JMX monitoring method, which are referred to later. 

Metrics management in Cassandra is performed using the Dropwizard Metrics library. The metrics are collected per node in Cassandra; however, they can be aggregated by the monitoring system.
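
If you just want to eyeball those per-node metrics without standing up a full monitoring stack, nodetool (which ships with Cassandra and reads the same JMX MBeans) is the quickest route. A rough sketch, assuming nodetool is on the PATH and JMX is listening on its default port:

```python
# Rough sketch: shell out to nodetool, which queries Cassandra's JMX MBeans for you.
import subprocess

def nodetool(*args: str) -> str:
    # Assumes nodetool is on PATH and the node's JMX port (7199 by default) is reachable.
    result = subprocess.run(["nodetool", *args], check=True, capture_output=True, text=True)
    return result.stdout

print(nodetool("tpstats"))                    # thread pool statistics
print(nodetool("tablestats", "my_keyspace"))  # per-table metrics (keyspace name is made up)
```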

On the development side, the Instaclustr team walks us through data modeling guidelines:

The ultimate goal of Cassandra data modeling and analysis is to develop a complete, well-organized, and high-performance Cassandra cluster. Following the five Cassandra data modeling best practices outlined will hopefully help you meet that goal:

1. Cassandra is not a relational database; don’t try to model it like one
2. Design your model to meet 3 fundamental goals for data distribution
3. Understand the importance of the Primary Key in the overall data structure
4. Model around your queries but don’t forget about your data
5. Follow a six-step structured approach to building your model.

Because Cassandra’s query language (CQL) looks so much like SQL, it’s easy to forget that the data is stored completely differently and that design decisions are quite different from what we see in the relational world.
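
To put point 4 from the list into practice, here’s a small hypothetical sketch using the DataStax Python driver (cassandra-driver): the table is designed around the query “latest readings for a sensor,” with the partition key handling distribution and the clustering column handling ordering. The keyspace, table, and replication settings are my own, not Instaclustr’s:

```python
# Hypothetical query-first table design using the DataStax Python driver
# (pip install cassandra-driver); assumes a Cassandra node on localhost.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Modeled around the query "show the most recent readings for one sensor":
# sensor_id is the partition key (even data distribution), reading_time is the
# clustering column (ordering within the partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id    uuid,
        reading_time timestamp,
        value        double,
        PRIMARY KEY ((sensor_id), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

cluster.shutdown()
```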

When to Have Multiple Azure Data Factories

Paul Andrew explains how to become a factory mogul:

The obvious and easy reason for having multiple Data Factories could be that you simply want to separate your business processes. Maybe they all have separate data delivery requirements and it just makes management of data flows easier to handle. For example:

– Sales
– Finance
– HR

They could have different data delivery deadlines, process on different schedules, and share no underlying connections.

You may also have multiple projects underway that mean you want to keep teams isolated.

But that’s not the only reason, so click through to learn several other reasons why you might have multiple Azure Data Factory instances running.
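
As a rough sketch of what that separation can look like in code (using the azure-mgmt-datafactory SDK; the resource group, region, naming convention, and subscription ID are placeholders of mine):

```python
# Hypothetical sketch: one Data Factory per business area, matching the split above.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

for area in ["Sales", "Finance", "HR"]:
    adf_client.factories.create_or_update(
        resource_group_name="rg-data-platform",   # placeholder resource group
        factory_name=f"adf-{area.lower()}-prod",  # placeholder naming convention
        factory=Factory(location="westeurope"),
    )
```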

Diving into Kubernetes: a Workshop

Chris Adkin has been busy:

I have not blogged for a while; it was my hope to produce part 5 in the series on creating a Kubernetes cluster for production-grade Big Data Clusters. However, there is a very good reason for this: I have been working on a one-day workshop to be delivered at SQL Bits in September. The material can be found here; enjoy!

I’ve only looked at the module listings, but Chris does a great job putting long-form articles together, so I’ve already added it to my todos.

Data Exfiltration Protection when Using Azure Databricks

Bhavin Kukadia, et al., explain how to prevent users from taking data from your Databricks cluster without authorization:

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or it processes the data in the service provider’s network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of the fastest growing Data & AI service on Azure. We’ve come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it’s time that we share it out broadly.

Click through for the architectural pattern.

Schema Management for Spark Applications

Walaa Eldin Moustafa takes us through some of the things that LinkedIn has learned about schema management with Apache Spark:

At LinkedIn, the Hive Metastore is the source-of-truth catalog for all Hadoop data. The Hive Metastore is managed by Dali. Dali is a data access and processing platform that is integrated with compute engines and ETL pipelines at LinkedIn to ensure consistency and uniformity in the access and storage of data. Dali utilizes the Hive Metastore to store data formats, data locations, partition information, and table information. Among other features, Dali also manages the definition of SQL views, as well as storing and accessing those definitions from the Hive Metastore.

Read on for a good explanation of the how as well as the why.
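
For a feel of the mechanics involved, here is a plain PySpark sketch against a Hive Metastore (this is not LinkedIn’s Dali layer, and the database and table names are invented):

```python
# Sketch: with Hive support enabled, Spark resolves schemas, locations, and partition
# information from the Hive Metastore rather than from the files themselves.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore-schema-demo")
    .enableHiveSupport()   # use the Hive Metastore as the catalog
    .getOrCreate()
)

# Hypothetical table; DESCRIBE pulls column names, types, and location from the metastore.
spark.sql("DESCRIBE FORMATTED my_db.page_views").show(truncate=False)

# The catalog API exposes the same metadata programmatically.
for col in spark.catalog.listColumns("page_views", dbName="my_db"):
    print(col.name, col.dataType, "(partition column)" if col.isPartition else "")
```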

Authentication in Hadoop with Apache Ozone

Xiaoyu Yao explains how we can use Apache Ozone to perform service account authentication for a Hadoop cluster:

Like Hadoop delegation tokens, an Ozone security token has a token identifier along with a signed signature from the issuer. The Ozone Manager issues delegation tokens and block tokens for users or client applications authenticated with Kerberos. The signature of the token can be validated by token validators to verify the identity of the issuer. This way, a valid token holder can use the token to perform operations against the cluster services as if they held the issuer’s Kerberos tickets.

Read on for the high-level overview.
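
To illustrate the general shape of that identifier-plus-signature design, here is a conceptual sketch; it is not Ozone’s actual token format, wire encoding, or key handling:

```python
# Conceptual sketch only: a token is an identifier plus a signature the issuer computes
# and a validator can re-compute to verify who issued it and whether it is still fresh.
import hashlib
import hmac
import json
import time

ISSUER_SECRET = b"issuer-key-material"  # hypothetical; real systems manage keys carefully

def issue_token(owner: str, max_age_s: int = 3600) -> dict:
    identifier = {"owner": owner, "issued_at": int(time.time()), "max_age_s": max_age_s}
    payload = json.dumps(identifier, sort_keys=True).encode()
    signature = hmac.new(ISSUER_SECRET, payload, hashlib.sha256).hexdigest()
    return {"identifier": identifier, "signature": signature}

def validate_token(token: dict) -> bool:
    payload = json.dumps(token["identifier"], sort_keys=True).encode()
    expected = hmac.new(ISSUER_SECRET, payload, hashlib.sha256).hexdigest()
    fresh = time.time() - token["identifier"]["issued_at"] < token["identifier"]["max_age_s"]
    return hmac.compare_digest(expected, token["signature"]) and fresh

token = issue_token("analytics-service")
assert validate_token(token)
```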

Building Metadata for an ADF Pipeline

Paul Andrew continues a series on Azure Data Factory and metadata-driven pipelines:

Welcome back, friends, to part 2 of this 4-part blog series. In this post we are going to deliver on some of the design points we covered in part 1 by building the database to house our processing framework metadata.

Let’s start with a nice, shiny new Azure SQLDB database and schema. This can easily be scaled up as our calls from Data Factory increase and, ultimately, as the solution we are using the framework for grows.

Soon we will get to see the power of Azure Data Factory in action.
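
For a sense of what such framework metadata can look like, here is a hypothetical, heavily simplified table created through pyodbc; the connection details, table, and columns are mine, not Paul’s actual schema:

```python
# Hypothetical minimal metadata table for a processing framework.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<metadata-db>;"
    "UID=<user>;PWD=<password>"  # placeholder connection details
)
conn.execute("""
IF OBJECT_ID('dbo.PipelineProcesses') IS NULL
CREATE TABLE dbo.PipelineProcesses (
    ProcessId    INT IDENTITY(1, 1) PRIMARY KEY,
    StageId      INT           NOT NULL,  -- execution order within the framework
    PipelineName NVARCHAR(200) NOT NULL,  -- the ADF pipeline to call
    Enabled      BIT           NOT NULL DEFAULT 1
)
""")
conn.commit()
```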

A Metadata-Driven Framework for ADF Pipelines

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, over time, changes in the technology we use mean the way we now deliver this orchestration has to change as well, especially in Azure. On that basis, and using my favourite Azure orchestration service, Azure Data Factory (ADF), I’ve created an alpha metadata-driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory, so now, as a technical exercise, I’ve rolled my own simple processing framework, mainly to understand how easily we can build it with the latest cloud tools and to fully exploit just how dynamic you can make a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.
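
To give a sense of the “dynamic” part, here is a sketch (not Paul’s framework) that drives pipeline runs from metadata rows using the azure-mgmt-datafactory SDK; every name, parameter, and ID below is a placeholder:

```python
# Sketch: read pipeline names from metadata and trigger each one via the ADF SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# In a real framework these rows would come from the metadata database.
processes = [
    {"pipeline": "PL_Stage_Sales", "parameters": {"LoadDate": "2020-08-01"}},
    {"pipeline": "PL_Stage_Finance", "parameters": {"LoadDate": "2020-08-01"}},
]

for process in processes:
    run = adf_client.pipelines.create_run(
        resource_group_name="rg-data-platform",   # placeholder
        factory_name="adf-framework-prod",        # placeholder
        pipeline_name=process["pipeline"],
        parameters=process["parameters"],
    )
    print(process["pipeline"], "->", run.run_id)
```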
