We are in process of doing a migration from an ancient creaky server to a shiny new VM. Rather than just rebuild it and restore everything, we’re taking the (painful) opportunity to clean things up and improve several systems.
As part of this, we’re replicating data from the old server to the new server, so that we can migrate processes piecemeal, so that rollback is not “OH CRAP TURN IT OFF TURN IT OFF ROLL BACK TO THE OLD SERVER”.
But we ran into a weird problem. On the target server, we had a many-to-many table that sits between, let’s say, stores and orders. We have a stores table, we have an orders table, and this one (call it STORE_ORDERS for simplicity) is just a linking table between the two. ID, stores_id, orders_id. Everything scripted identically between the two databases (aside from the NOT FOR REPLICATION flag).
This is a case where action A works fine and action B works fine, but the combination of actions A and B leads to sadness.
The organization will need to take a step back to understand better their existing status. Are they just starting out? Are other departments which are doing the same thing, perhaps in the local organization or somewhere else in the world? Once the organization understands their state better, they can start to broadly work out the strategy that the Data Lake is intended to provide.
As part of this understanding, the objective of the Data Lake will need to be identified. Is it for data science? Or, for example, is the Data Lake simply to store data in a holding pattern for data discovery? Identifying the objective will help align the vision and the goals, and set the scene for communication to move forward.
I would like to popularize the term Data Swamp for “that place you store a whole bunch of data of dubious origin and value.” It’s the place that you promise management of course you can get the data back…as long as they never actually ask for it or are okay with reading terabytes of flat files from backup tapes. The Data Swamp is the Aristotelian counterpart to the Data Lake, Goofus to its Gallant. It will also, to my estimate, be the more common version.
Today, it is time to consider the astonishing next step, that a single socket system is the best choice for a transaction processing systems. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for a very large majority of requirements.
We should also factor in that the second generation hyper-threading (two logical processors per physical core) from Nehalem on has almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first generation HT in Willamette and Northwood which was problematic, and the improved first generation in Prescott which was somewhat better in the positive aspects, and had fewer negatives.
Joe knows a lot more about this than I do, but I’m very hesitant about this for two reasons. First, scale. When we start looking at hundreds of concurrent requests, would a single-socket machine really work? I don’t know to answer to that, but in my simplistic “more is better than fewer” rule of thumb, I’d err on the side of caution, especially if it isn’t my money paying for this.
Second, there are batch processes and large background activities which occur even on extremely transactional OLTP systems. Think about running CHECKDB or ETL processing or troubleshooting/monitoring processes. These are going to be processes which benefit from parallelism, and if you’re seriously limiting core counts (which a single socket would necessarily do), you might end up in a bad way when they run even if your “normal” workload performs a little better.
I went and looked on the server, and sure enough in Management Studio I saw one database named “FinanceDB” and a database named “[FinanceDB]”.
This was on a SQL 2008R2 instance, but as a test I created a database named [test] on my local SQL 2014 instance and sure enough it worked!
The source of the problem at the client was the LiteSpeed maintenance plan. Even though the backup task was set to backup all user databases, it wasn’t picking up the square-bracketed database.
I’d go a bit further and say that you should avoid everything but alpha-numeric characters and maybe underscore for databases, tables, views, and all other database objects.
Before I jump into the technical details of the Service Broker architecture, I think it helps to have a real-world analogy of what Service Broker is and does. In the last installment, I used the example of ordering something from Amazon.com. This time, I’d like to use an analogy that’s somewhat timely: taxes.
Each year, we fill out that 1040 or 1040EZ form and we send it to the Internal Revenue Service. Maybe we eFile, maybe we mail it in, it doesn’t matter. That form is received by the IRS and goes into a queue, awaiting review. At some point, days, maybe weeks later, our tax return is processed. If all goes well, our return is approved and the IRS cuts us a check. That is a Service Broker application.
When I first started learning Service Broker, it seemed like there were a lot of abstract notions (mostly because I didn’t know anything about message queues). The pieces all start to come together once you get into an application.
For this example, let’s assume that we have a physical server with 512 Gigs of RAM and two physical NUMA nodes (and two CPU sockets). We have a VM running in that machine which has a low CPU requirement, but a large working set. Because of this we have 4 cores and 360 Gigs of RAM presented to the VM.
The answer is not trivial, making this an interesting question.
Looking into what was needed, I quickly realized there was a LOT of data, guess 50+ years of news broadcasts will do this. Consider this was in the early 2000s, some innovation was needed from anything I had coded before. Obviously LIKE searches was out of the question, full text search was not available. So what to do?
Basically I decided to break down each broadcast to words into a separate table, the entire application fit in 2 tables: Story and Words.
This is a case in which thinking about the grain of data helps solve an otherwise-intractable problem.
You can’t, and shouldn’t log everything, because logging events can slow you down. And you shouldn’t always log to a database, either– you can keep logs in the application tier as well, no argument here.
But most applications periodically do ‘heavy’ or batch database work. And when those things happen, it can make a lot of sense to log to the database. That’s where this logging comes in.
Bonus points if you feed this kind of logging into Splunk (or your logging analysis tool of choice) and integrate it with application-level logging.
I think one of the single biggest challenges I face as I attempt to warehouse data that originates as a SharePoint list is the handful of miscellaneous descriptive fields, such as approval status, request status, or something similar. Typically, this fields are setup as Choice fields in the SharePoint list so they have a known range of values, but its still a pain to have to build a dimension for each one.
Enter the junk dimension. Ever since I learned about this concept it has made my life so much easier. What the junk dimension does is perform a cross join against the different fields and creates a row for every possible combination of fields.
Junk dimensions are nice for those low-cardinality attributes which are important but don’t really fit anywhere else. The important thing to remember about a junk dimension is that you don’t want it to be too large: if you have 5 attributes, each of which has 8 possible values, you have 8^5 (32,768) rows. That’s not so bad, but make it 10 attributes and now your table has 1,073,741,824 rows, and that’s a lot of rows for a single dimension. If you find yourself in that scenario, you might want to create two junk dimensions (bringing you back to 2 dimensions with 32K rows), review your design to see if all those attributes are necessary, or review your design to see if your “junk” dimension is hiding a real dimension.
In a recent blog post entitled Is Logical Data Modeling Dead?, Karen Lopez (b | t) comments on the trends in the data modeling discipline and shares her own processes and preferences for logical data modeling (LDM). Her key point is that LDMs are on the decline primarily because they (and their creators) have failed to adapt to changing development processes and trends.
I love all things data modeling. I found data models to be a soothing and reassuring roadmap that underpinned the requirements analysis and spec writing of the Dev team, as well as a supremely informative artifact of the Dev process which I would constantly refer to when writing new T-SQL code and performing maintenance. However, as time has passed, I have been surprised by how far it has fallen out of favor.
This is an interesting discussion. I’m not sure I’ve ever created a true logical data model. I’ve worked with systems which could potentially take advantage of them, but they never hit the top of the priority list.