For an early- or mid-stage startup, a monolithic database is absolutely the appropriate architectural choice. With a small team and a small company, a single shared database made it simple to get started. Moving fast meant being able to make rapid changes across the entire system. A shared database made it very easy to join data between different tables, and it made transactions across multiple tables possible. Those are real conveniences.
As we have gotten larger, those benefits have become liabilities. It has become a single point of failure, where issues with the shared database can bring down nearly all of our applications. It has become a performance bottleneck, where long-running operations from one application can slow down others. Finally, and most importantly, the shared database has become a coupling point between teams, slowing down our ability to make changes.
I have my misgivings (as you’d expect from a database snob), particularly because I highly value the benefits of normalization and see sharded systems as a step backwards in that regard. Even so, there are absolutely benefits to slicing out orthogonal sections of data; the point of disagreement is where two teams’ entities and attributes overlap.
I may blog about that solution in the future, but with the Future of SharePoint event rapidly approaching, my BI Focal collaborator, Jason Himmelstein, convinced me that there was something more interesting we could do with this. How about near-real-time monitoring of Twitter conversations for the event? All of the pieces were in place.
We rolled up our sleeves, and in relatively short order, had a solution. Jason has written about the experience on his SharePoint Longhorn blog, and he has included the videos that we put together, so I can be a little less detailed in this post.
The question is, what is the right time period to use? The answer: it depends on the size of your partitions. Generally, for managed tables in U-SQL, you want to target about 1 GB per partition. So if you are bringing in, say, 800 MB per day, then daily partitions are about right. If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.
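As a back-of-the-envelope check, the ~1 GB-per-partition guideline can be turned into a quick calculation. This is my own illustrative sketch (the helper name and logic are assumptions, not part of U-SQL):

```python
# Rough sizing helper for the ~1 GB-per-partition guideline above.
# Hypothetical illustration; U-SQL itself has no such function.

TARGET_PARTITION_GB = 1.0

def suggest_granularity(daily_ingest_gb: float) -> str:
    """Suggest a partition time grain for a given daily ingest volume."""
    if daily_ingest_gb <= 0:
        raise ValueError("daily ingest must be positive")
    if daily_ingest_gb <= TARGET_PARTITION_GB:
        # e.g. 800 MB/day: one daily partition is close to the 1 GB target
        return "daily"
    # e.g. 20 GB/day: ~20 target-sized chunks per day, so hourly is about right
    return "hourly"

print(suggest_granularity(0.8))   # 800 MB per day
print(suggest_granularity(20.0))  # 20 GB per day
```

The same idea extends to coarser grains: if you ingest well under 1 GB per day, weekly or monthly partitions may be a better fit.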
In this post, I’d like to take a look at two common scenarios that people run into. The first is a full re-compute of a partition’s data, and the second is a partial re-compute of a partition. The examples I will be using are based on the U-SQL Ambulance Demos on GitHub and will be added to the solution for ease of consumption.
The ability to reprocess data is vital in any ETL or ELT process.
When creating Apache Spark applications, the basic structure is pretty much the same: with sbt you need the same build.sbt, the same imports, and the same skeleton application. All that really changes is the main entry point, that is, the fully qualified class name. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.
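To illustrate the idea that only the fully qualified class name varies, here is a minimal sketch of skeleton generation. This is my own illustration, not the shell scripts the post describes; the templates, version defaults, and function names are assumptions:

```python
# Minimal sketch of Spark/sbt skeleton generation.
# Illustrative only: the templates and defaults below are assumptions,
# not the actual scripts from the linked post.

BUILD_SBT = """name := "{name}"
version := "0.1.0"
scalaVersion := "{scala_version}"
libraryDependencies += "org.apache.spark" %% "spark-core" % "{spark_version}"
"""

MAIN_SCALA = """package {package}

import org.apache.spark.{{SparkConf, SparkContext}}

object {obj} {{
  def main(args: Array[String]): Unit = {{
    val conf = new SparkConf().setAppName("{name}")
    val sc = new SparkContext(conf)
    // application logic goes here
    sc.stop()
  }}
}}
"""

def render_skeleton(fqcn: str, name: str,
                    scala_version: str = "2.11.8",
                    spark_version: str = "1.6.1") -> dict:
    """Render build.sbt and the main class from a fully qualified class name."""
    package, _, obj = fqcn.rpartition(".")
    return {
        "build.sbt": BUILD_SBT.format(name=name,
                                      scala_version=scala_version,
                                      spark_version=spark_version),
        "Main.scala": MAIN_SCALA.format(package=package, obj=obj, name=name),
    }

files = render_skeleton("com.example.WordCount", "word-count")
print(files["Main.scala"])
```

Because the entry point is the only variable, bumping Spark or Scala versions across projects reduces to changing the defaults in one place, which is exactly the kind of upgrade the scripts automate.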
Check these out if you’re interested in Spark.
Good-bye, Business Intelligence Edition
The biggest surprise to me was the removal of the Business Intelligence edition, which was introduced in SQL Server 2012. Truthfully, it never seemed to fit in the environments where I worked, so I guess its removal makes sense. Hopefully, fewer licensing options will make it easier for people to understand their licensing and pick the edition that works best for them.
2016 looks to be a great version for BI.
Learn why SQL Server’s table partitioning feature doesn’t make your queries faster, and may even make them slower.
In this 20-minute video, I’ll show you my favorite articles, bugs, and whitepapers online that explain where table partitioning shines and why you might want to implement it, even though it won’t solve your query performance problems.
Check out the video.
I am not taking into account mirroring or Availability Groups (AGs). I honestly am not sure how those would affect the process.
As with any time you run DBCC SHRINKFILE, this is going to shred your indexes. Take that into account and re-index as needed.
Kenneth shows screenshots, has a step-by-step checklist, and includes common errors. This is a great explanation.
In my circles, there are a number of people complaining about the lack of features in Standard Edition. I do agree that Always Encrypted should be in every edition, as the lack of strong data encryption is a problem that continues to confound IT; putting Always Encrypted in all editions would be a good start toward wide ISV adoption of the feature.
However, even without Always Encrypted, Microsoft added a LOT of new features to Standard Edition. Let’s list them (in no particular order):
There’s a pretty good amount of value in upgrading, even if you’re living on Standard Edition.