Architecting A Power BI Environment

Reza Rad explains different architectural patterns for a Power BI implementation:

Implementing a Power BI solution is not just about developing reports, creating a data model, or using visuals. Power BI, like any other technology, can be used in a correct or incorrect way. Any technology can be used more effectively if it harnesses the right architecture. The right architecture can be achieved after gathering requirements and designing the aspects and components of the technology to fit them. In this post, you will learn about some of the most common architectures for using Power BI. You will learn about using Power BI under different architecture guidelines:

  • Sharing architecture

  • Self-service architecture

  • Enterprise architecture

Read on to learn more about these three patterns.

A Frugal Stretch Database Alternative

Chris Bell shares a version of Stretch databases for people with budgets:

Stretch databases were going to provide “Cost-effective” availability for cold data, and unlike typical cold data storage, our data would always be online and available to query. Applications would not need to be modified to work with the seamless design of the stretch database. Run a query, and the data was there, being pulled from the cloud when needed. On-premises data maintenance would be streamlined by reducing the local footprint of the data files as well as the size of backups! It was even going to be possible to keep data secure via encrypted connections to the cloud and, in theory, make a migration to the cloud even easier.

It was destined to be a major win!

Then the price was mentioned.

Do you know anyone using stretch databases today?

Yeah, me neither.

It’s an interesting workaround with several moving parts.

Use Cases For Apache Kafka

Amy Boyle shows a few scenarios where New Relic uses Apache Kafka:

The Events Pipeline team is responsible for plumbing some of New Relic’s core data streams, specifically event data. These are fine-grained nuggets of monitoring data that record a single event at a particular moment in time. For example, an event could be an error thrown by an application, a page view on a browser, or an e-commerce shopping cart transaction.

In this post, we show how we built our Kafka pipeline so that it stitches together microservices and serves as a changelog and “durable cache,” all with the idea of processing data streams as smoothly and effectively as possible at our scale. In an upcoming post, we’ll share thoughts on how we manage topic partitions in this pipeline.

If you’re wondering if Kafka might be right for you, check out this post for several scenarios which fit.

Event Sourcing On Kafka

Adam Warski shows how you can use Apache Kafka as your event sourcing data source:

There are a number of great introductory articles, so this is going to be a very brief introduction. With event sourcing, instead of storing the “current” state of the entities that are used in our system, we store a stream of events that relate to these entities. Each event is a fact: it describes a state change that occurred to the entity (past tense!). As we all know, facts are indisputable and immutable. For example, suppose we had an application that saved a customer’s details. If we took an event sourcing approach, we would store every change made to that customer’s information as a stream, with the current state derived from a composition of the changes, much like a version control system does. Each individual change record in that stream would be an immutable, indisputable fact.

Having a stream of such events, it’s possible to find out what’s the current state of an entity by folding all events relating to that entity; note, however, that it’s not possible the other way round — when storing the current state only, we discard a lot of valuable historical information.

Event sourcing can peacefully co-exist with more traditional ways of storing state. A system typically handles a number of entity types (e.g. users, orders, products, …), and it’s quite possible that event sourcing is beneficial for only some of them. It’s important to remember that it’s not an all-or-nothing choice, but an additional possibility when it comes to choosing how state is managed in our application.

It’s a helpful article and works hand in hand with a CQRS pattern.
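To make the "folding" idea concrete, here is a minimal sketch in Python. It does not use Kafka at all: the event stream is just an in-memory list, and the `Customer`, `CustomerCreated`, and `EmailChanged` types are hypothetical names invented for this illustration, not anything from the linked article.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Optional, Sequence, Union


@dataclass(frozen=True)
class Customer:
    """Current state of a customer, derived from events (hypothetical entity)."""
    name: str
    email: str


@dataclass(frozen=True)
class CustomerCreated:
    """Fact: a customer was created (hypothetical event type)."""
    name: str
    email: str


@dataclass(frozen=True)
class EmailChanged:
    """Fact: a customer's email was changed (hypothetical event type)."""
    new_email: str


Event = Union[CustomerCreated, EmailChanged]


def apply_event(state: Optional[Customer], event: Event) -> Optional[Customer]:
    """Return the new state after one event; never mutates the old state."""
    if isinstance(event, CustomerCreated):
        return Customer(name=event.name, email=event.email)
    if isinstance(event, EmailChanged):
        if state is None:
            raise ValueError("EmailChanged arrived before CustomerCreated")
        return replace(state, email=event.new_email)
    raise TypeError(f"unknown event type: {type(event).__name__}")


def current_state(events: Sequence[Event]) -> Optional[Customer]:
    """Fold the whole event stream, in order, into the current state."""
    return reduce(apply_event, events, None)


events = [
    CustomerCreated(name="Ada", email="ada@example.com"),
    EmailChanged(new_email="ada@newmail.example"),
]
```

Folding all of `events` gives the latest state, while folding a prefix such as `events[:1]` gives the state as it was earlier. That second query is what you lose when you store only the current state.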

Avoid Scalar Functions In Computed Columns

Daniel Hutmacher shows why you should not include scalar functions inside computed column definitions:

Scalar functions can be a real headache when you’re performance tuning. For one, they don’t parallelize. In fact, if you use a scalar function in a computed column, it will prevent any query that uses that table from going parallel – even if you don’t reference that column at all!

Read on for a demonstration.

Azure And The Kappa Architecture

Jared Zagelbaum describes the Kappa architecture and points out how there’s limited built-in support in Azure for it:

You can’t support kappa architecture using native cloud services. Cloud providers, including Azure, didn’t design streaming services with kappa in mind. Running streams with a TTL greater than 24 hours is more expensive, and generally, the max TTL tops out around 7 days. If you want to run kappa, you’re going to have to run Platform as a Service (PaaS) or Infrastructure as a Service (IaaS), which adds more administration to your architecture. So, what might this look like in Azure?

Read the whole thing.

Lambda Architecture In Azure

Jared Zagelbaum describes the Lambda architecture pattern and explains how you can use tooling in Azure to implement it:

Lambda is an organic result of the limitations of existing tools. Distributed systems architects and developers commonly criticize its complexity – and rightly so. Those of us who have worked extensively in Extract-Transform-Load and symmetric multiprocessing systems see red flags when code is replicated in multiple services. Ensuring data quality and code conformity across multiple systems, whether massively parallel processing (MPP) or symmetric multiprocessing (SMP), has the same best practice: the fewest times you reproduce code is always the correct number of times.

We reproduce code in lambda because different services in MPP systems are better at different tasks. The maturity of tools historically hasn’t allowed us to process streams and batch in a single tool. This is starting to change, with Apache Spark emerging as a single preferred compute service for stream and batch querying, hence the timing of Azure Databricks. However, on the storage side, the data lake, which was meant to be an immutable store, can in practice become the dreaded swamp when governance or testing fails, which is not uncommon. A fundamentally different assumption about how we process data is required to combat this degradation. Enter: the kappa architecture, which we’ll examine in the next post of this series.

Interesting reading.

Data Lake Zones

Melissa Coates walks us through the different layers of a data lake:

As we are approaching the end of 2017, many people have resolutions or goals for the new year. How about a goal to get organized…in your data lake?

The most important aspect of organizing a data lake is optimal data retrieval.

Click through for a great visual showing the various zones in a data lake.

Functional Programming And Microservices

Bobby Calderwood might win me over on microservices with talk like this:

This view of microservices shares much in common with object-oriented programming: encapsulated data access and mutable state change are both achieved via synchronous calls, the web of such calls among services forming a graph of dependencies. Programmers can and should enjoy a lively debate about OO’s merits and drawbacks for organizing code within a single memory and process space. However, when the object-oriented analogy is extended to distributed systems, many problems arise: latency which grows with the depth of the dependency graph, temporal liveness coupling, cascading failures, complex and inconsistent read-time orchestration, data storage proliferation and fragmentation, and extreme difficulty in reasoning about the state of the system at any point in time.

Luckily, another programming style analogy better fits the distributed case: functional programming. Functional programming describes behavior not in terms of in-place mutation of objects, but in terms of the immutable input and output values of pure functions. Such functions may be organized to create a dataflow graph such that when the computation pipeline receives a new input value, all downstream intermediate and final values are reactively computed. The introduction of such input values into this reactive dataflow pipeline forms a logical clock that we can use to reason consistently about the state of the system as of a particular input event, especially if the sequence of input, intermediate, and output values is stored on a durable, immutable log.

It’s an interesting analogy.
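A toy sketch of the functional view, in Python, may help. Everything here is an assumption made for illustration: the log is an immutable tuple held in memory, `running_total` and `average` stand in for the nodes of a dataflow graph, and the "logical clock" is simply the number of inputs consumed so far. For simplicity it recomputes values from the log rather than pushing updates reactively.

```python
from typing import Dict, Tuple

# The log is an immutable tuple of input values; appending returns a new log.
Log = Tuple[int, ...]


def append(log: Log, value: int) -> Log:
    """Return a new log with one more input; the old log is left untouched."""
    return log + (value,)


def running_total(inputs: Log) -> int:
    """Pure function: intermediate value derived only from the inputs."""
    return sum(inputs)


def average(inputs: Log) -> float:
    """Pure function: downstream value built on top of running_total."""
    return running_total(inputs) / len(inputs) if inputs else 0.0


def state_as_of(log: Log, clock: int) -> Dict[str, float]:
    """Derive every downstream value as of logical time `clock`,
    where the clock counts how many inputs have been consumed."""
    inputs = log[:clock]
    return {"total": running_total(inputs), "average": average(inputs)}


log: Log = ()
for value in (10, 20, 60):
    log = append(log, value)
```

Because the functions are pure and the log is never modified, `state_as_of(log, 2)` always gives the same answer, no matter how many inputs arrive later. That is the sense in which the sequence of inputs acts as a clock for reasoning about system state.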

Caching Strategy

Kevin Gessner explains some caching concepts used at Etsy:

A major drawback of modulo hashing is that the size of the cache pool needs to be stable over time.  Changing the size of the cache pool will cause most cache keys to hash to a new server.  Even though the values are still in the cache, if the key is distributed to a different server, the lookup will be a miss.  That makes changing the size of the cache pool—to make it larger or for maintenance—an expensive and inefficient operation, as performance will suffer under tons of spurious cache misses.

For instance, if you have a pool of 4 hosts, a key that hashes to 500 will be stored on pool member 500 % 4 == 0, while a key that hashes to 1299 will be stored on pool member 1299 % 4 == 3. If you grow your cache by adding a fifth host, the cache pool calculated for each key may change. The key that hashed to 500 will still be found on pool member 500 % 5 == 0, but the key that hashed to 1299 will be on pool member 1299 % 5 == 4. Until the new pool member is warmed up, your cache hit rate will suffer, as the cache data will suddenly be on the ‘wrong’ host. In some cases, pool changes can cause more than half of your cached data to be assigned to a different host, slashing the efficiency of the cache temporarily. In the case of going from 4 to 5 hosts, only 20% of cache keys will be on the same host as before!

It’s interesting reading.
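The arithmetic in that excerpt is easy to check with a short Python sketch. The function names are made up for this illustration, and consecutive integers stand in for evenly distributed key hashes, which is an assumption rather than anything Etsy describes.

```python
from typing import Iterable


def pool_member(key_hash: int, pool_size: int) -> int:
    """Modulo hashing: pick a pool member from a key's hash value."""
    return key_hash % pool_size


def fraction_unmoved(old_size: int, new_size: int, key_hashes: Iterable[int]) -> float:
    """Fraction of keys that map to the same pool member before and after
    the pool is resized from old_size to new_size."""
    hashes = list(key_hashes)
    unmoved = sum(
        1 for h in hashes
        if pool_member(h, old_size) == pool_member(h, new_size)
    )
    return unmoved / len(hashes)


# Consecutive integers used as a stand-in for evenly spread hash values.
sample_hashes = range(100_000)
share_on_same_host = fraction_unmoved(4, 5, sample_hashes)
```

With this sample, `share_on_same_host` comes out at 0.2, matching the 20% figure quoted above: a key stays put only when its hash gives the same remainder modulo 4 and modulo 5, which happens for 4 out of every 20 consecutive hash values.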

