Dan Luu has a great article from a couple years ago on when a random cache eviction policy might be preferable to Least Recently Used:
Once upon a time, my computer architecture professor mentioned that using a random eviction policy for caches really isn’t so bad. That random eviction isn’t bad can be surprising – if your cache fills up and you have to get rid of something, choosing the least recently used (LRU) is an obvious choice, since you’re more likely to use something if you’ve used it recently. If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it’s going to cause a miss every time if the loop doesn’t fit. A random eviction policy degrades gracefully as the loop gets too big.
In practice, on real workloads, random tends to do worse than other algorithms. But what if we take two random choices and just use LRU between those two choices?
Here are the relative miss rates we get for SPEC CPU1 with a Sandy Bridge-like cache (8-way associative, 64k, 256k, and 2MB L1, L2, and L3 caches, respectively). These are ratios (algorithm miss rate : random miss rate); lower is better. Each cache uses the same policy at all levels of the cache.
Dan writes at a depth I appreciate and on topics I often don’t understand (particularly when he gets into CPU engineering details).
The main problem with an over-engineered solution is that it takes longer to ship than is necessary. By definition, we are doing more than is necessary, and that will take longer to ship. There’s almost never a reason to prefer longer ship-times over shorter ones, all things being equal.
The more serious problem with over-engineering is the carry cost.
A carrying cost is a cost the team bears for having to maintain software and infrastructure. Each feature requires tests, monitoring, and maintenance. Each new feature is made in the context of those that came before it. This is why a feature that might’ve taken one week when the project was new requires a month to make in more mature project.
Read the whole thing and simplify your solutions.
We often speak of locality as a property of subhistories for a particular object x: “H|x is strictly serializable, but H is not”. This is a strange thing to say indeed, because the transactions in H may not meaningfully exist in H|x. What does it mean to run
[(A enq y 1) (A enq x 1)]on x alone? If we restrict ourselves to those transactions that doapply to a single object, we find that those transactions still serialize in the full history.
So in a sense, locality is about the scope of legal operations. If we take single enqueue and dequeue operations over two queues x and y, the space of operations on the composite system of x and y is just the union of operations on x and those on y. Linearizability can also encompass transactions, so long as they are restricted to a single system at a time. Our single-key, multi-operation transactions still satisfied strict serializability even in the composite system. However, the space of transactions on a composite system is more than the union of transactions on each system independently. It’s their product.
Here’s the part where I pretend that of course I understand what Kyle wrote… Seriously, though, this is a very interesting read.
When you think of technical debt, you may think only of classic shortcuts like making assumptions about the data, not using a TRY-CATCH block or perhaps hard-coding a manual correction into a stored procedure or view.
But I would argue that not paying attention to performance is just as much a technical debt. And rather than just crashing with an error message, performance issues are not always easy to just fix in production when your business users are working late to meet their deadlines, or when your web request are timing out. Start thinking of performance as an important part of your development process – half the job is getting the right data in the right place, the other half is making sure that your solution will handle double or triple the workload, preferably under memory pressure conditions with other workloads running at the same time.
Read the whole thing.
A brief explanation of each layer:
Data Consumption: This is where you will import the data from all the various source systems, some of which may be streaming the data. Others may only provide data once a day.
Stream Layer: It provides for incremental updating, making it the more complex layer. It trades accuracy for low latency, looking at only recent data. Data in here may be only seconds behind, but the trade-off is the data may not be clean.
Batch Layer: It looks at all the data at once and eventually corrects the data in the stream layer. It is the single version of the truth, the trusted layer, where there is usually lots of ETL and a traditional data warehouse. This layer is built using a predefined schedule, usually once or twice a day, including importing the data currently stored in the stream layer.
Presentation Layer: Think of it as the mediator, as it accepts queries and decides when to use the batch layer and when to use the speed layer. Its preference would be the batch layer as that has the trusted data, but if you ask it for up-to-the-second data, it will pull from the stream layer. So it’s a balance of retrieving what we trust versus what we want right now.
I hate the fact that this is named “lambda.” That’s a term which is way too overloaded in computer science. You have the architecture, lambda functions, and AWS lambda, all of which are utterly different and yet end up in the same conversation. This ends up confusing people unless you very specifically say things like “We’re going to use the AWS lambda service to create lambda functions to feed data from sensors into our lambda architecture.” And even then people still get confused.
AWS RDS MySQL: Ideally we would have gone with MySQL RDS as our backend, considering Amazon does a great job in managing and upgrading relational database as a service, providing multi-AZ support for high availability. However, the main drawback to RDS was the storage limit of 6TB. Our requirement at the time, was closer to 10TB.
AWS Aurora: AWS Aurora would have met the storage needs, but it was in beta at that time.PostgreSQL: PostgreSQL is a powerful open source, object-relational database system, but we did not have much in house expertise using PostgreSQL. In the DC, our primary backend databases were Oracle and MySQL. Moreover, choosing PostgreSQL would have eliminated the option of a seamless migration to Aurora in future, as Aurora is based on the MySQL engine.
From there, they also explain some technical issues they found in migrating data. Read the whole thing. If you’re coming into this series blind, they also have part 1 and part 2 of the series, giving more of an architectural overview of their billing system.
Now the only (and most important) thing remaining in the Data Center was the Oracle database. The dataset that remained in Oracle was highly relational and we did not feel it to be a good idea to model it to a NoSQL-esque paradigm. It was not possible to structure this data as a single column family as we had done with the customer-facing subscription data. So we evaluated Oracle and Aurora RDS as possible options. Licensing costs for Oracle as a Cloud database and Aurora still being in Beta didn’t help make the case for either of them.While the Billing team was busy in the first two acts, our Cloud Database Engineering team was working on creating the infrastructure to migrate billing data to MySQL instances on EC2. By the time we started Act III, the database infrastructure pieces were ready, thanks to their help. We had to convert our batch application code base to be MySQL-compliant since some of the applications used plain jdbc without any ORM. We also got rid of a lot of the legacy pl-sql code and rewrote that logic in the application, stripping off dead code when possible.Our database architecture now consists of a MySQL master database deployed on EC2 instances in one of the AWS regions. We have a Disaster Recovery DB that gets replicated from the master and will be promoted to master if the master goes down. And we have slaves in the other AWS regions for read only access to applications.
Read the whole thing. Their architectural requirements probably won’t be yours (unless you’re working at a company at the scale of Netflix), but it’s quite interesting seeing how they solve their problems.
You create VMs for each tenant, essentially doing a “lift and shift” of the current on-premise solution. This provides the best isolation possible and it’s regularly done on-premises, but it’s also the one that doesn’t enable cutting costs, since each tenant has it’s own server, sql, license and so on. Sometimes this is the only allowable option if you have in your client contract that their data will be hardware-isolated from other clients. Some cons: table updates must be replicated across all the servers (i.e. updating reference tables), there is no resource sharing, and you need multiple backup strategies across all the servers.
Read on for a few other strategies. There aren’t any cloud-only details here; you could implement the same strategies on-premises.
Unlike most SQL databases, which default to weaker isolation levels for performance reasons, VoltDB chooses to provide strong serializable isolation by default: the combination of serializability’s multi-object atomicity, and linearizability’s real-time constraints.
Serializability is the strongest of the four ANSI SQL isolation levels: transactions must appear to execute in some order, one at a time. It prohibits a number of consistency anomalies, including lost updates, dirty reads, fuzzy reads, and phantoms.
If you use VoltDB, it sounds like upgrading to 6.4 is a good idea.
During a data center failover like the exampleabove, we could have a “late arrival,” i.e. the stream processor might see the AdClickEvent possibly a few minutes after the AdViewEvent. A poorly written stream processor might deduce that the ad was a low-quality ad when instead the ad might have actually been good. Another anomaly is that the stream processor might see the AdClickEvent before it sees the corresponding AdViewEvent. To ensure that the output of the stream processor is correct there has to be logic to handle this “out of order message arrival.”
In the example above, the geo-distributed nature of the data centers makes it easy to explain the delays. However delays can exist even within the same data center due to GC issues, Kafka cluster upgrades, partition rebalances, and other naturally occurring distributed system phenomena.
This is a pretty long article and absolutely worth a read if you are looking at streaming data.