
Category: Architecture

BigQuery Versus Redshift

Kiyoto Tamura compares Google’s BigQuery and Amazon’s Redshift for cloud-based warehousing:

Neither service is truly “set and forget” and requires a dedicated engineer to learn the service and maintain it. You can use various tools to automate many aspects of the operation, but someone will have to maintain automation scripts and workflows.

That said, here are things that I’ve heard first-hand from talking to users

The bottom line there is that Redshift is a bit more mature than BigQuery today, but keep an eye on both of them.

Comments closed

Warehouse History

Kennie Pontoppidan delves into various aspects of collecting and storing history in warehouses:

In T2 history we have the two attributes ValidFromDate and ValidToDate. We can choose two different strategies for updating the values of these: using system time (load time) or business time. If we use system time for the T2 splits, the data warehouse history is dependent on when we load data. This makes it impossible to reload data in the data warehouse without messing up the data history. If we allow our load ETL procedures to use timestamps for business time (when data was really valid) for T2 history, we get the opportunity to reload data. But the cost of this flexibility is a much more complicated design for T2 splits. We also need to keep track of this metadata on the source system attributes.

Part of a warehouse’s value is its ability to replay historical data, but you can only do that if you store the data correctly (and query it correctly!).
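
As a rough sketch of the idea (the key, attribute, and date names below are purely illustrative, not Kennie’s design), a business-time T2 split closes the current row at the business-effective timestamp and appends the new version, so the split points come from when the data was really valid rather than from when a particular load happened to run:

```python
from datetime import datetime

# Illustrative only: a Type 2 split keyed on business time rather than load time.
# Each row carries ValidFromDate / ValidToDate, mirroring the attributes in the post.
OPEN_END = datetime(9999, 12, 31)

def t2_split(history, business_key, new_attributes, business_valid_from):
    """Close the current row at the business-time boundary and append the new version."""
    for row in history:
        if row["Key"] == business_key and row["ValidToDate"] == OPEN_END:
            row["ValidToDate"] = business_valid_from  # close out the old version
    history.append({
        "Key": business_key,
        **new_attributes,
        "ValidFromDate": business_valid_from,
        "ValidToDate": OPEN_END,
    })

# The split boundary is business time (when the data was really valid), not the time
# this load happened to run, which is the property that makes reloads feasible.
history = [{"Key": 42, "City": "Aarhus",
            "ValidFromDate": datetime(2015, 1, 1), "ValidToDate": OPEN_END}]
t2_split(history, 42, {"City": "Copenhagen"}, datetime(2016, 3, 1))
```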

Comments closed

SQL Data Partners Podcast: The Wide World Of Data

Carlos L. Chacon was nice enough to interview me on his podcast:

The expansion of data sets and increased expectations of businesses for analysis and modeling of data has led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use–before somebody else puts them to sub-optimal use.  I am joined by Kevin Feasel who walks us through some of the technologies available and sorts out under what circumstances we want to consider using each one.

Show notes are on the SQL Data Partners podcast site.  My presentation slides are available online.  And if I get just a few more people to dig Aphyr as much as I do, the world will be a better place.

Comments closed

Lambda And Kappa

Alex Woodie has a story on two competing data architectures:

Jay Kreps, the co-creator of Apache Kafka and CEO of Confluent, was one of the first big data architects to espouse an alternative to the Lambda architecture, which he did with his 2014 O’Reilly story “Questioning the Lambda Architecture.” While Kreps appreciated some aspects of the Lambda architecture—in particular how it deals with reprocessing data—he stated that the downside was just too great.

“The Lambda architecture says I have to have Hadoop and I have to have Storm and I’m going to implement everything in both places and keep them in sync. I think that’s extremely hard to do,” Kreps tells Datanami. “I think one of the biggest things hurting stream processing is the amount of complexity that you have to incur to build something. That makes it slow to build applications that way, hard to roll them out, and hard to make them reliable enough to be a key part of the business.”

I wonder if we’re seeing the next generation of Kimball v Inmon here, or if one will absolutely dominate.

Comments closed

Lambda Architecture

Sebastiao Correia discusses Lambda architecture:

The batch layer stores all the data with no constraint on the schema. The schema-on-read is built in the batch views in the serving layer. Creating schema-on-read views requires algorithms to parse the data from the batch layer and convert them in a readable way. This allows input data to freely evolve as there is no constraint on their structure. But then, the algorithm that builds the view is responsible to manage the structural change in order to still deliver the same view as expected. 

This shows a coupling between the data and the algorithms used for serving the data. Focusing on data quality is therefore not enough and we may ask the question of the algorithm quality. As the system lives and evolves, the algorithms may become more and more complex. These algorithms must not be regarded as black boxes, but a clear understanding of what they are doing is important if we want to have a good data governance. Moreover, during the batch view creation, data quality transformations could be done so as to provide data of better quality to the consumer of the views.

Lambda is an interesting architectural concept, as it tries to solve the age-old “fast or accurate?” problem with “both.”  Get your fast estimates streamed through a speed layer, but your accurate, slow calculations handled through the serving layer.  Definitely check out this article.
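
To make the serving-layer side of that concrete, here is a minimal sketch (all names and numbers are invented) of merging the two layers: trust the batch views up to their high-water mark and fall back to the speed layer’s approximate figures for anything newer.

```python
def merge_views(batch_view, speed_view, batch_high_water_mark):
    """Combine batch and speed layers into one answer, illustrating the Lambda idea.

    batch_view and speed_view map an hourly bucket to an event count; the batch layer
    is authoritative up to its high-water mark, and the speed layer fills in whatever
    has arrived since the last batch run.
    """
    merged = dict(batch_view)
    for bucket, approx_count in speed_view.items():
        if bucket > batch_high_water_mark:   # only trust the speed layer for fresh data
            merged[bucket] = approx_count
    return merged

# Batch layer recomputed through hour 10; speed layer covers hours 10 through 12.
batch = {9: 1_204, 10: 1_310}
speed = {10: 1_297, 11: 1_350, 12: 980}
print(merge_views(batch, speed, batch_high_water_mark=10))
# {9: 1204, 10: 1310, 11: 1350, 12: 980}
```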

Comments closed

Data Warehouse Design Tips

Dustin Ryan has part one of a two-part series on data warehouse design best practices:

2. Store additive measures in the data warehouse.

The best type of measures to store in the data warehouse are those measures that can be fully aggregated. A measure that can be fully aggregated is a measure that can be summarized by any dimension or all dimensions and still remain meaningful. For instance, a Sales Amount measure can be summarized by Product, Date, Geography, etc. and still provide valuable insight for the customer.

Measures that cannot be fully aggregated, such as ratios or other percentage type calculations should be handled in the semantic model or the reporting tool. For example, a measure such as Percentage Profit Margin stored in a table cannot be properly aggregated. A better option would be to store the additive measures that are the base for the Percentage Profit Margin, such as Revenue, Cost, Margin, etc. These base measures can be used to calculate the ratio in a query, semantic model, or reporting tool.

The first five tips are non-controversial and act as a good baseline for understanding warehousing with SQL Server.  Do check it out.
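
To see why tip #2 matters, here is a small worked example with made-up numbers: averaging stored Percentage Profit Margin values gives a different answer than recomputing the ratio from the additive Revenue and Cost base measures.

```python
# Two products whose profit margins we might be tempted to store as percentages.
rows = [
    {"Product": "A", "Revenue": 1000.0, "Cost": 900.0},   # 10% margin
    {"Product": "B", "Revenue": 100.0,  "Cost": 50.0},    # 50% margin
]

# Wrong: aggregating the ratio itself (a simple average of the stored percentages).
avg_of_margins = sum((r["Revenue"] - r["Cost"]) / r["Revenue"] for r in rows) / len(rows)

# Right: aggregate the additive base measures, then compute the ratio in the query.
total_revenue = sum(r["Revenue"] for r in rows)
total_cost = sum(r["Cost"] for r in rows)
margin_from_bases = (total_revenue - total_cost) / total_revenue

print(f"average of stored margins: {avg_of_margins:.1%}")    # 30.0%
print(f"margin from base measures: {margin_from_bases:.1%}")  # 13.6%
```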

Comments closed

In-Memory OLTP Using Ignite

Babu Elumalai explains how to use Apache Ignite to build an in-memory OLTP system on top of Amazon’s DynamoDB:

Business users have been content to perform analytics on data collected in Amazon Redshift to spot trends. But recently, they have been asking AWS whether the latency can be reduced for real-time analysis. At the same time, they want to continue using the analytical tools they’re familiar with.

In this situation, we need a system that lets you capture the data stream in real time and use SQL to analyze it in real time.

In the earlier section, you learned how to build the pipeline to Amazon Redshift with Firehose and Lambda functions. The following illustration shows how to use Apache Spark Streaming on EMR to compute time window statistics from DynamoDB Streams. The computed data can be persisted to Amazon S3 and accessed with SparkSQL using Apache Zeppelin.

There are a lot of technologies at play here and it’s worth a perusal, even though I’m going to keep recommending that you use a relational database like SQL Server for OLTP work in all but the most extreme of circumstances.
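
If you want a feel for the windowing step, here is a minimal PySpark Streaming sketch. It swaps in a socket source as a stand-in for the DynamoDB Streams plumbing (the fiddly part of the real pipeline), so treat the hostnames, durations, and line format as assumptions rather than anything from the article.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# A minimal sketch of the "time window statistics" step from the pipeline.
sc = SparkContext(appName="window-stats-sketch")
ssc = StreamingContext(sc, batchDuration=10)           # 10-second micro-batches
ssc.checkpoint("/tmp/window-stats-checkpoint")         # required for windowed state

events = ssc.socketTextStream("localhost", 9999)       # stand-in for the change stream

# Count events per key over a sliding 5-minute window, updated every 10 seconds.
# Lines are assumed to be comma-delimited with the key in the first field.
counts = (events
          .map(lambda line: (line.split(",")[0], 1))
          .reduceByKeyAndWindow(lambda a, b: a + b,
                                lambda a, b: a - b,
                                windowDuration=300,
                                slideDuration=10))

# In the article's pipeline the results land in S3 and get queried with SparkSQL;
# here we just print each micro-batch.
counts.pprint()

ssc.start()
ssc.awaitTermination()
```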

Comments closed

Building A Prediction Engine

Richard Williamson explains how to build a prediction engine using technologies such as Spark, Kudu, Impala, and Kafka:

We’ll aim to predict the volume of events for the next 10 minutes using a streaming regression model, and compare those results to a traditional batch prediction method. This prediction could then be used to dynamically scale compute resources, or for other business optimization. I will start out by describing how you would do the prediction through traditional batch processing methods using both Apache Impala (incubating) and Apache Spark, and then finish by showing how to more dynamically predict usage by using Spark Streaming.

Of course, the starting point for any prediction is a freshly updated data feed for the historic volume for which I want to forecast future volume. In this case, I discovered that Meetup.com has a very nice data feed that can be used for demonstration purposes. You can read more about the API here, but all you need to know at this point is that it provides a steady stream of RSVP volume that we can use to predict future RSVP volume.

This is pretty dense, but it is a great look at one potential architecture leveraging Spark and several tools in the Hadoop ecosystem.
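
For a taste of the streaming-regression piece (this is not Richard’s code; the source, features, and parameters are all stand-ins), PySpark’s StreamingLinearRegressionWithSGD updates its weights as each micro-batch arrives and can predict against the same stream:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

# The input stream is assumed to deliver lines like "label,f1,f2", where the label is
# the volume we want to predict and the features describe the current window.
sc = SparkContext(appName="streaming-regression-sketch")
ssc = StreamingContext(sc, batchDuration=60)              # one-minute micro-batches

def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

training = ssc.socketTextStream("localhost", 9999).map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.01, numIterations=50)
model.setInitialWeights(Vectors.dense([0.0, 0.0]))        # two features in this sketch

model.trainOn(training)                                    # weights update per batch
predictions = model.predictOnValues(
    training.map(lambda lp: (lp.label, lp.features)))      # (actual, predicted) pairs
predictions.pprint()

ssc.start()
ssc.awaitTermination()
```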

Comments closed

Integrating Lambda With Relational Databases

Bob Strahan shows how to integrate AWS Lambda with relational databases running on EC2:

Here are a few reasons why you might find this capability useful:

  • Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.

  • Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.

  • Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.

  • Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.

His specific example is around Oracle/Postgres, but I’d imagine you could do the same on SQL Server with the CLR.
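
For flavor, here is a minimal sketch of the Lambda side of the “outbound streaming” use case above; the event shape and the SNS topic are my assumptions rather than anything from Bob’s post.

```python
import json
import os

import boto3

# A database trigger invokes this Lambda function with details of the modified row;
# the function forwards the change to an SNS topic for downstream consumers.
sns = boto3.client("sns")
TOPIC_ARN = os.environ["NOTIFICATION_TOPIC_ARN"]   # assumed environment variable

def handler(event, context):
    """Receive a change event from the database trigger and publish it downstream."""
    message = {
        "table": event.get("table"),
        "operation": event.get("operation"),   # e.g. INSERT / UPDATE / DELETE
        "row": event.get("row"),
    }
    response = sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"{message['operation']} on {message['table']}",
        Message=json.dumps(message),
    )
    return {"messageId": response["MessageId"]}
```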

Comments closed

Business Logic

Ed Elliott hits a classic architectural argument—whether business logic should be in stored procedures:

Stackoverflow is a specific use case and they decided to use .Net so they have a specific set of problems to deal with in terms of performance. They deploy (as I understand it) 10 times a day so if they need to change a query then they can quickly and easily – how quickly can you modify code and get it to production to fix a problem causing downtime on your mission critical app written in powerbuilder 20 years ago? (I jest but you get the point)

I like Ed’s back-and-forth arguing, as there are legitimate cases for both sides and the best answer is almost always somewhere in between for line-of-business apps.  I have three points that I tend to mention whenever this discussion comes up.

First, a lot of “business logic” is actually data logic.  Check constraints, foreign key constraints, unique key constraints, and even primary key constraints (for non-surrogate primary keys) are business rules, but they’re business rules around how the data is shaped and it’s a lot better to use your database system to maintain those rules.

Second, validation rules should be everywhere.  The fancy JavaScript library should do validation, the server-side business logic should do validation, and the database should do validation.  You don’t know what’s going to skip one or more of these layers, and your database is the final gatekeeper preventing bad data from sneaking into your system.

Third, at the margin, go where your maintenance developers are most comfortable.  If they’re really good with C# but not good with SQL, the marginal business logic (the stuff you could really go either way on) should stay in the app tier; if your maintainers have really strong SQL skills but are lagging on the .NET side, I’d stick the marginal logic in stored procedures.
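
To illustrate the first two points, here is a tiny self-contained sketch; it uses SQLite only so it runs end to end, and the table and rule are invented.  The application tier validates for a friendlier error message, but the CHECK constraint stops bad data even when that validation gets skipped.

```python
import sqlite3

# "A lot of business logic is data logic": the CHECK constraint encodes a rule about
# the shape of the data, and the database enforces it even when an application layer
# forgets to. The same idea applies to check and foreign key constraints in SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        Customer  TEXT    NOT NULL,
        Quantity  INTEGER NOT NULL CHECK (Quantity > 0)   -- a data rule, not app code
    )
""")

def validate_order(quantity):
    """Application-tier validation: the same rule, enforced earlier for a nicer error."""
    if quantity <= 0:
        raise ValueError("Quantity must be positive")

# The normal path validates in the app tier first, then writes the row.
validate_order(10)
conn.execute("INSERT INTO Orders (Customer, Quantity) VALUES (?, ?)", ("Acme", 10))

# An import script or ad hoc fix might skip the app-tier validation, but the database
# is the final gatekeeper preventing bad data from sneaking in.
try:
    conn.execute("INSERT INTO Orders (Customer, Quantity) VALUES (?, ?)", ("Acme", -5))
except sqlite3.IntegrityError as err:
    print(f"rejected by the database: {err}")
```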

Comments closed