Press "Enter" to skip to content

Category: Architecture

Choosing A Data Platform

Lukas Eder discusses when to use a relational database versus some non-relational database:

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said so, sometimes you do have a niche problem. For instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.

If you want a checklist, here’s how I would approach this question (ceteris paribus and limiting myself to about 100 words):

  1. Are you dealing with streaming millions of rows per second, or streaming from tens of thousands of endpoints concurrently?  Kafka and the Hadoop streaming stack.
  2. Is your problem something that you’ve already solved with a relational database, and your solution works well enough?  Relational database.
  3. Are there multiple “paths” to get to interesting data?  Relational database.
  4. Shopping carts (write-heavy, focused on availability over consistency)?  Riak/Cassandra/Dynamo at large scale, else relational database.
  5. Type duplication?  Relational database.
  6. Petabytes of data being analyzed asynchronously?  Hadoop.
  7. Other data platforms if they fit specific niche requirements around data structure.

There’s a lot more to this discussion than a simple numbered list, but I think it’s reasonable to start with relational databases and move away if and only if there’s a compelling reason.
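
On the graph point specifically: the many-to-many table Eder mentions is just an edge table, and a recursive common table expression covers basic traversal.  A quick sketch, with made-up table and column names:

CREATE TABLE dbo.Node
(
    NodeID INT NOT NULL PRIMARY KEY,
    Label  VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Edge -- the many-to-many relationship table
(
    FromNodeID INT NOT NULL REFERENCES dbo.Node (NodeID),
    ToNodeID   INT NOT NULL REFERENCES dbo.Node (NodeID),
    CONSTRAINT PK_Edge PRIMARY KEY (FromNodeID, ToNodeID)
);

-- Everything reachable from node 1, assuming an acyclic graph (cycle handling takes more work)
WITH Reachable AS
(
    SELECT ToNodeID FROM dbo.Edge WHERE FromNodeID = 1
    UNION ALL
    SELECT e.ToNodeID
    FROM dbo.Edge AS e
    INNER JOIN Reachable AS r ON e.FromNodeID = r.ToNodeID
)
SELECT DISTINCT ToNodeID FROM Reachable;

Dedicated graph databases earn their keep on deep traversals and pathfinding at scale; for plenty of workloads, an edge table like this is enough.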


Docker On Windows Server

Elton Stoneman walks us through how to run Docker on Windows Server 2016:

There are two Windows Base images on the Docker Hub – microsoft/nanoserver and microsoft/windowsservercore. We’ll be using an IIS image shortly, but you should start with Nano Server just to make sure all is well – it’s a 250MB download, compared to 4GB for Server Core.

docker pull microsoft/nanoserver  

Check the output and if all is well, you can run an interactive container, firing up PowerShell in a Nano Server container:
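
Something along these lines will do it:

docker run -it microsoft/nanoserver powershell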

Docker will also run on Windows 10 Pro, Enterprise, or Education editions.  That’s sad news for people whose free upgrade landed them on the Home edition.


Creating Partitioned Views

Erik Darling describes partitioned views:

Hooray. Now you have to type less.

Partitioned views don’t need a scheme or a function, there’s no fancy syntax to swap data in or out, and there’s far less complexity in figuring out RIGHT vs LEFT boundaries, and leaving empty partitions, etc. and so forth. You’re welcome.

A lot gets made out of partition level statistics and maintenance being available to table partitioning. That stuff is pretty much automatic for partitioned views, because you have no choice. It’s separate tables all the way down.

Partitioned views, AKA SQL Server 2000 partitioning.  I think my favorite use case for them today is to serve as a combination of hot data in a memory-optimized table and cold data on disk.
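
If you haven’t built one before, the whole trick is CHECK constraints plus UNION ALL.  A minimal sketch, with invented object names:

CREATE TABLE dbo.Sales2015
(
    SaleDate DATE          NOT NULL,
    SaleID   INT           NOT NULL,
    Amount   DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_Sales2015 PRIMARY KEY (SaleDate, SaleID),
    CONSTRAINT CK_Sales2015 CHECK (SaleDate >= '20150101' AND SaleDate < '20160101')
);

CREATE TABLE dbo.Sales2016
(
    SaleDate DATE          NOT NULL,
    SaleID   INT           NOT NULL,
    Amount   DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_Sales2016 PRIMARY KEY (SaleDate, SaleID),
    CONSTRAINT CK_Sales2016 CHECK (SaleDate >= '20160101' AND SaleDate < '20170101')
);
GO

CREATE VIEW dbo.Sales
AS
SELECT SaleDate, SaleID, Amount FROM dbo.Sales2015
UNION ALL
SELECT SaleDate, SaleID, Amount FROM dbo.Sales2016;
GO

-- The CHECK constraints let the optimizer skip base tables that cannot hold the rows you asked for
SELECT SUM(Amount)
FROM dbo.Sales
WHERE SaleDate >= '20160301' AND SaleDate < '20160401';

The hot/cold split is just a matter of what sits under each branch of the UNION ALL: a memory-optimized table on one side, a disk-based table on the other.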


Immutable Servers

Diana Tkachenko describes a pattern for reducing “prod doesn’t look like stage” types of errors:

The immutable server pattern makes use of disposable components for everything that makes up an application that is not data. This means that once the application is deployed, nothing changes on the server – no scripts are run on it, no configuration is done on it. The packaged code and any deploy scripts are essentially baked into the server. No outside process is able to modify the contents after the server has been deployed. For example, if you were using Docker containers to deploy your code, everything the application needs would be in the Docker image, which you then use to create and run a container. You cannot modify the image once it’s been created, and if any changes do need to take place, you would create a new image and work with that one.

In our case, we use AWS Amazon Machine Images (AMIs) to accomplish the same thing. We make heavy use of Amazon Linux machines, which are Red Hat-based, and thus package the code into RPMs. The RPMs define all the dependencies for running the application, the code itself, and any startup scripts to run on bootup. The RPM is then installed on a clean base image of Amazon Linux, and an image is taken, resulting in an AMI. This AMI is synonymous with “immutable server” – it cannot be changed once it is created. The AMI is then deployed into an Auto Scaling Group (ASG) and attached to the Elastic Load Balancer (ELB). In this post, I’ll guide you through a closer look at every step of this Immutable Server deploy pipeline. I’ll then go into how and why we embedded planned failures into this system. At the end, I’ll share the insights we’ve gained into the pros and cons of deploying in this way.

This is a very interesting concept.  I’ve heard of no-patch servers (where, instead of patching live servers, you spin up a new VM with the operating system updates and spin down the old one), but this takes the idea one step further.


Instance Segregation

Kennie Pontoppidan discusses why you don’t want your transactional system on the same instance as your warehouse:

Mixing OLTP and data warehouse workload types in the same SQL Server instance is not a good idea. Why? Well, because the workload types don’t mix. As described above, an ideal buffer cache for an OLTP system contains pages used in searches as well as pages for the most used customer/items etc. Running a few data warehouse queries will change the buffer cache to contain (mostly) all dimension data. This will make OLTP operations slow. On the other hand, the ideal buffer cache in a data warehouse will contain all dimension data, which makes aggregations fast because all dimension lookups will be done in memory. OLTP operations might flush out parts of this dimension data from the cache, hurting the performance for data warehouse queries. The two workload types will keep on fighting about the content of the buffer cache, making both systems suffer.

Except in very small systems (where it doesn’t much matter), this is sound advice.
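
If you want to watch the fight happen, check which databases are holding the buffer pool before and after a big warehouse query.  A standard DMV query along these lines shows it:

SELECT
    DB_NAME(database_id) AS DatabaseName,
    COUNT(*) * 8 / 1024  AS BufferPoolMB -- data pages are 8 KB apiece
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY BufferPoolMB DESC;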


Cache Eviction Policies

Dan Luu has a great article from a couple years ago on when a random cache eviction policy might be preferable to Least Recently Used:

Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn’t so bad. That random eviction isn’t bad can be surprising – if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it’s going to cause a miss every time if the loop doesn’t fit. A random eviction policy degrades gracefully as the loop gets too big.

In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?

Here are the relative miss rates we get for SPEC CPU with a Sandy Bridge-like cache (8-way associative, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.

Dan writes at a depth I appreciate and on topics I often don’t understand (particularly when he gets into CPU engineering details).


Over-Engineering

Dave Copeland discusses over-engineering problems:

The main problem with an over-engineered solution is that it takes longer to ship than is necessary. By definition, we are doing more than is necessary, and that will take longer to ship. There’s almost never a reason to prefer longer ship-times over shorter ones, all things being equal.

The more serious problem with over-engineering is the carrying cost.

A carrying cost is a cost the team bears for having to maintain software and infrastructure. Each feature requires tests, monitoring, and maintenance. Each new feature is made in the context of those that came before it. This is why a feature that might’ve taken one week when the project was new requires a month to make in a more mature project.

Read the whole thing and simplify your solutions.


Locality

Kyle Kingsbury explains that sequential, serializable, and strictly serializable consistency models cannot provide locality:

We often speak of locality as a property of subhistories for a particular object x: “H|x is strictly serializable, but H is not”. This is a strange thing to say indeed, because the transactions in H may not meaningfully exist in H|x. What does it mean to run [(A enq y 1) (A enq x 1)] on x alone? If we restrict ourselves to those transactions that do apply to a single object, we find that those transactions still serialize in the full history.

So in a sense, locality is about the scope of legal operations. If we take single enqueue and dequeue operations over two queues x and y, the space of operations on the composite system of x and y is just the union of operations on x and those on y. Linearizability can also encompass transactions, so long as they are restricted to a single system at a time. Our single-key, multi-operation transactions still satisfied strict serializability even in the composite system. However, the space of transactions on a composite system is more than the union of transactions on each system independently. It’s their product.

Here’s the part where I pretend that of course I understand what Kyle wrote…  Seriously, though, this is a very interesting read.


Technical Debt

Daniel Hutmacher takes on the idea of technical debt:

When you think of technical debt, you may think only of classic shortcuts like making assumptions about the data, not using a TRY-CATCH block or perhaps hard-coding a manual correction into a stored procedure or view.

But I would argue that not paying attention to performance is just as much a technical debt. And rather than just crashing with an error message, performance issues are not always easy to just fix in production when your business users are working late to meet their deadlines, or when your web requests are timing out. Start thinking of performance as an important part of your development process – half the job is getting the right data in the right place, the other half is making sure that your solution will handle double or triple the workload, preferably under memory pressure conditions with other workloads running at the same time.

Read the whole thing.


Lambda Architecture Primer

James Serra explains the Lambda architecture:

A brief explanation of each layer:

Data Consumption: This is where you will import the data from all the various source systems, some of which may be streaming the data.  Others may only provide data once a day.

Stream Layer: It provides for incremental updating, making it the more complex layer.  It trades accuracy for low latency, looking at only recent data.  Data in here may be only seconds behind, but the trade-off is the data may not be clean.

Batch Layer: It looks at all the data at once and eventually corrects the data in the stream layer.  It is the single version of the truth, the trusted layer, where there is usually lots of ETL and a traditional data warehouse.  This layer is built using a predefined schedule, usually once or twice a day, including importing the data currently stored in the stream layer.

Presentation Layer: Think of it as the mediator, as it accepts queries and decides when to use the batch layer and when to use the speed layer.  Its preference would be the batch layer as that has the trusted data, but if you ask it for up-to-the-second data, it will pull from the stream layer.  So it’s a balance of retrieving what we trust versus what we want right now.

I hate the fact that this is named “lambda.”  That’s a term which is way too overloaded in computer science.  You have the architecture, lambda functions, and AWS Lambda, all of which are utterly different and yet end up in the same conversation.  This ends up confusing people unless you very specifically say things like “We’re going to use the AWS Lambda service to create lambda functions to feed data from sensors into our lambda architecture.”  And even then people still get confused.
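
Naming aside, the presentation layer’s job is easy to picture in relational terms: serve the trusted batch output, topped up with whatever the speed layer has collected since the last batch load.  A hypothetical sketch, with every object name invented:

SELECT EventDate, EventCount
FROM batch.DailyEventCounts  -- trusted, rebuilt once or twice a day
UNION ALL
SELECT CAST(EventTime AS DATE) AS EventDate, COUNT(*) AS EventCount
FROM speed.RecentEvents      -- low-latency, not yet fully cleaned
WHERE EventTime > (SELECT MAX(LoadedThrough) FROM batch.LoadWatermark)
GROUP BY CAST(EventTime AS DATE);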
