For an early- and mid-stage startup, a monolithic database is absolutely the appropriate architecture choice. With a small team and a small company, a single shared database made it simple to get started. Moving fast meant being able to make rapid changes across the entire system. A shared database made it very easy to join data between different tables, and it made transactions across multiple tables possible. These are pretty convenient.
As we have gotten larger, those benefits have become liabilities. It has become a single point of failure, where issues with the shared database can bring down nearly all of our applications. It has become a performance bottleneck, where long-running operations from one application can slow down others. Finally, and most importantly, the shared database has become a coupling point between teams, slowing down our ability to make changes.
I have my misgivings (as you’d expect from a database snob), particularly because I value highly the benefits of normalization and see sharded systems as a step backwards in that regard. But even with that said, there are absolutely benefits to slicing out orthogonal sections of data; the point of disagreement is in those places in which two teams’ entities and attributes overlap.
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
Check it out for a high-level architectural view of contemporary warehousing choices. I prefer having both systems in play: the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.
We are in process of doing a migration from an ancient creaky server to a shiny new VM. Rather than just rebuild it and restore everything, we’re taking the (painful) opportunity to clean things up and improve several systems.
As part of this, we’re replicating data from the old server to the new server, so that we can migrate processes piecemeal, so that rollback is not “OH CRAP TURN IT OFF TURN IT OFF ROLL BACK TO THE OLD SERVER”.
But we ran into a weird problem. On the target server, we had a many-to-many table that sits between, let’s say, stores and orders. We have a stores table, we have an orders table, and this one (call it STORE_ORDERS for simplicity) is just a linking table between the two. ID, stores_id, orders_id. Everything scripted identically between the two databases (aside from the NOT FOR REPLICATION flag).
This is a case where action A works fine and action B works fine, but the combination of actions A and B leads to sadness.
The organization will need to take a step back to understand better their existing status. Are they just starting out? Are other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.
As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.
I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.” It’s the place that you promise management of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes. The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant. It will also, to my estimate, be the more common version.
Today, it is time to consider the astonishing next step, that a single socket system is the best choice for a transaction processing systems. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.
We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.
Joe knows a lot more about this than I do, but I’m very hesitant about this for two reasons. First, scale. When we start looking at hundreds of concurrent requests, would a single-socket machine really work? I don’t know to answer to that, but in my simplistic “more is better than fewer” rule of thumb, I’d err on the side of caution, especially if it isn’t my money paying for this.
Second, there are batch processes and large background activities which occur even on extremely transactional OLTP systems. Think about running CHECKDB or ETL processing or troubleshooting/monitoring processes. These are going to be processes which benefit from parallelism, and if you’re seriously limiting core counts (which a single socket would necessarily do), you might end up in a bad way when they run even if your “normal” workload performs a little better.
I went and looked on the server, and sure enough in Management Studio I saw one database named “FinanceDB” and a database named “[FinanceDB]”.
This was on a SQL 2008R2 instance, but as a test I created a database named [test] on my local SQL 2014 instance and sure enough it worked!
The source of the problem at the client was the LiteSpeed maintenance plan. Even though the backup task was set to backup all user databases, it wasn’t picking up the square-bracketed database.
I’d go a bit further and say that you should avoid everything but alpha-numeric characters and maybe underscore for databases, tables, views, and all other database objects.
Before I jump into the technical details of the Service Broker architecture, I think it helps to have a real-world analogy of what Service Broker is and does. In the last installment, I used the example of ordering something from Amazon.com. This time, I’d like to use an analogy that’s somewhat timely: taxes.
Each year, we fill out that 1040 or 1040EZ form and we send it to the Internal Revenue Service. Maybe we eFile, maybe we mail it in, it doesn’t matter. That form is received by the IRS and goes into a queue, awaiting review. At some point, days, maybe weeks later, our tax return is processed. If all goes well, our return is approved and the IRS cuts us a check. That is a Service Broker application.
When I first started learning Service Broker, it seemed like there were a lot of abstract notions (mostly because I didn’t know anything about message queues). The pieces all start to come together once you get into an application.
For this example, let’s assume that we have a physical server with 512 Gigs of RAM and two physical NUMA nodes (and two CPU sockets). We have a VM running in that machine which has a low CPU requirement, but a large working set. Because of this we have 4 cores and 360 Gigs of RAM presented to the VM.
The answer is not trivial, making this an interesting question.
Looking into what was needed, I quickly realized there was a LOT of data, guess 50+ years of news broadcasts will do this. Consider this was in the early 2000s, some innovation was needed from anything I had coded before. Obviously LIKE searches was out of the question, full text search was not available. So what to do?
Basically I decided to break down each broadcast to words into a separate table, the entire application fit in 2 tables: Story and Words.
This is a case in which thinking about the grain of data helps solve an otherwise-intractable problem.
You can’t, and shouldn’t log everything, because logging events can slow you down. And you shouldn’t always log to a database, either– you can keep logs in the application tier as well, no argument here.
But most applications periodically do ‘heavy’ or batch database work. And when those things happen, it can make a lot of sense to log to the database. That’s where this logging comes in.
Bonus points if you feed this kind of logging into Splunk (or your logging analysis tool of choice) and integrate it with application-level logging.