Looking into what was needed, I quickly realized there was a LOT of data, guess 50+ years of news broadcasts will do this. Consider this was in the early 2000s, some innovation was needed from anything I had coded before. Obviously LIKE searches was out of the question, full text search was not available. So what to do?
Basically I decided to break down each broadcast to words into a separate table, the entire application fit in 2 tables: Story and Words.
This is a case in which thinking about the grain of data helps solve an otherwise-intractable problem.
You can’t, and shouldn’t log everything, because logging events can slow you down. And you shouldn’t always log to a database, either– you can keep logs in the application tier as well, no argument here.
But most applications periodically do ‘heavy’ or batch database work. And when those things happen, it can make a lot of sense to log to the database. That’s where this logging comes in.
Bonus points if you feed this kind of logging into Splunk (or your logging analysis tool of choice) and integrate it with application-level logging.
I think one of the single biggest challenges I face as I attempt to warehouse data that originates as a SharePoint list is the handful of miscellaneous descriptive fields, such as approval status, request status, or something similar. Typically, this fields are setup as Choice fields in the SharePoint list so they have a known range of values, but its still a pain to have to build a dimension for each one.
Enter the junk dimension. Ever since I learned about this concept it has made my life so much easier. What the junk dimension does is perform a cross join against the different fields and creates a row for every possible combination of fields.
Junk dimensions are nice for those low-cardinality attributes which are important but don’t really fit anywhere else. The important thing to remember about a junk dimension is that you don’t want it to be too large: if you have 5 attributes, each of which has 8 possible values, you have 8^5 (32,768) rows. That’s not so bad, but make it 10 attributes and now your table has 1,073,741,824 rows, and that’s a lot of rows for a single dimension. If you find yourself in that scenario, you might want to create two junk dimensions (bringing you back to 2 dimensions with 32K rows), review your design to see if all those attributes are necessary, or review your design to see if your “junk” dimension is hiding a real dimension.
In a recent blog post entitled Is Logical Data Modeling Dead?, Karen Lopez (b | t) comments on the trends in the data modeling discipline and shares her own processes and preferences for logical data modeling (LDM). Her key point is that LDMs are on the decline primarily because they (and their creators) have failed to adapt to changing development processes and trends.
I love all things data modeling. I found data models to be a soothing and reassuring roadmap that underpinned the requirements analysis and spec writing of the Dev team, as well as a supremely informative artifact of the Dev process which I would constantly refer to when writing new T-SQL code and performing maintenance. However, as time has passed, I have been surprised by how far it has fallen out of favor.
This is an interesting discussion. I’m not sure I’ve ever created a true logical data model. I’ve worked with systems which could potentially take advantage of them, but they never hit the top of the priority list.
SQL Server offers two flavors of optimistic locking for traditional disk-based tables: Read Committed Snapshot Isolation (RCSI), and Snapshot Isolation. They are each great tools to reduce blocking and make applications faster, particularly in databases that have lots of tiny reads and writes that need to be quick, or mixed-use patterns (lots of little reads and writes + larger reporting queries).
Both isolation levels are available in SQL Server Standard Edition and Web Edition as well as Enterprise. That’s right, they’re cheap and easy. Each of them are controlled by database level settings, and default to disabled when you install SQL Server.
The moral of the story: both of these are awesome, both have potential drawbacks, and both need testing. I’ve had good experiences with RCSI, but even then, maybe about 1% of procedures need specific locking hints (either NOLOCK or an explicit lock) to maintain previous application behavior and to deal with the problem Kendra brought up. Moral of the story: test, test, test.
This architecture can be used when your organization does not value their secondary data center the same as the primary. It is a best practice to have matched or similar hardware between your primary and disaster recovery sites but that is not always possible. When costs need to be reduced it is better to have one failover server that you know can handle the work load rather than two servers which are under powered. Under powered hardware can easily become an effective outage if they cause timeouts as soon as a production work load is placed on them.
For many small to medium-sized enterprises, this might be the easiest route to sell to management—it’s hard to get management to have “redundant” servers which normally don’t get used.
Here I look at five tools that provide online modeling services, ranging from commercial products to free, open-source solutions. They include Vertabelo, GenMyModel, dbDiffo, WWW SQL Designer, and DbDesigner.net. Each one takes a different approach to delivering its services and completing basic tasks, such as adding tables, columns, or relationships. You’ll find that some tools are more feature-rich and user-friendly than others, but each one has its own advantages and charms.
That said, none of the tools provide the level of functionality you get with an advanced on-premises solution such as PowerDesigner or ER/Studio, but not everyone needs such an extensive set of features—or the price tag that goes with them. In fact, all five online tools come either completely free or have a free version available, making it possible to try all of them without having to commit one way or the other. Not surprisingly, the free versions associated with the commercial products have limitations on their use.
The hat I’d throw into the ring is draw.io. It’s a Visio look-alike, so it’s more useful for high-level strategic diagrams than a true model.
The Availability Group will handle the data synchronization between the data centers. Asynchronous mode is recommend due to potential network latency hindering primary site performance. With the combination of these two features, you meet HA with the FCI’s automatic failover locally and DR with manual failover of the Availability Group between sites.
Derik’s conclusion is important: this isn’t the type of thing a brand new accidental DBA should try to build; there are a lot of moving parts here. But if you want a robust solution and have the support people available to handle the relative complexity, this is a good option.
In conclusion, we see a significant reduction in disk space and IO when sparse columns are used, and they perform slightly better than non-sparse columns in our simple data modification tests (note that retrieval performance should also be considered; perhaps the subject of another post).
Sparse columns have a lot of potential value, but in my experience, they fall short in one huge way: you cannot compress tables with sparse columns. Given that both sparse columns and data compression are things which benefit from scale, it’s important to make the right choice upfront.
With that said, when I setup a Scrub server, it is in a very secure area where the data access is very, very limited. For example, in my current company, the server is in a separate domain from Production and QA/Dev. Only DBAs are allowed to access this server. If you have multiple DBAs at your location, you may want to even limit which DBAs have access to this server. Our goal is to automate the entire scrubbing process so no one has to access the data including copying backup files from Production and to a shared scrub location for QA/Dev to retrieve.
Scrub servers are a way of stripping personally identifiable or sensitive information from production data so developers can safely use the data in lower, less secure environments.