The Lambda Architecture is pretty well documented – online1 as well as in the book I just mentioned2. For a quick overview, Lambda Architecture is basically a system where the raw data is always stored, and never thrown away. All information that’s derived from this raw data is always recomputed – often stated as query = function(all data). This provides for a fool-proof architecture that’s rigorously simple (compared to classic RDBMS solutions), made up of three layers:
Admittedly, about half of this went over my head, but there are some good book and webpage recommendations to learn more about lambda architecture and Data Vault.
Last month’s post “For Technical Interviews, Don’t Ask Questions, Show Screenshots” was a surprise hit, and lots of folks asked for more details about the types of screenshots I’d show. Over the next few weeks, I’ll share a few more.
Normally I’d show this query as a screenshot, but for easier copy/pasting into comments, I’m showing it as code here.
I’d say to the job candidate, “You’ve been asked to take a quick look at this code as part of a deployment. Explain what the business purpose of the code is, and tell me if there’s anything that concerns you.”
I like where Brent is going with this series and plan to incorporate some of these into my in-person interviews.
SQL Server 2016 will be generally available on June 1, 2016. This will allow you to build mission-critical, and business critical intelligent applications with the most secure database1, the highest performance data warehouse2, end-to-end mobile BI on any device, in-database advanced analytics, in-memory capabilities optimized for all workloads, and a consistent experience from on-premises to cloud. These capabilities are built-in to SQL Server for industry-leading low cost of ownership.
Interesting Standard versus Enterprise bits for me:
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
Check it out for a high-level architectural view of contemporary warehousing choices. I prefer having both systems in play: the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.
Over the weekend there was a discussion on the MVP distribution list about the sys.dm_tran_database_transactions DMV and how one cannot use it to accurately determine how much log an operation has generated because it doesn’t provide a roll-up of the sub-transaction metrics to the outer transaction. This makes the output somewhat non-intuitive.
The discussion prompted me to write some code I’ve been meaning to do since 2012, when SQL Server 2012 introduced a field in LOP_BEGIN_XACT log records that tracks the transaction ID of the parent transaction, allowing the hierarchy of transactions to be investigated.
The actual code is at the bottom of the article, and is available in a zip file here.
It provides two stored procs, sp_SQLskillsAnalyzeLog and sp_SQLskillsAnalyzeLogInner, with the former making use of the latter, and the latter calling itself recursively.
I’d consider this squarely in the advanced troubleshooting realm. Definitely read the whole thing and Paul’s code.
The SQLPS Trello board worked out so well, that when the topic of getting overdue some improvements into SSMS was brought up, Microsoft suggested that setting up another board for SQL Server Management Studio might work well too.
If you use either of these products, I strongly suggest you join us and give Microsoft your feedback. They actively participate on both boards which is just amazing.
I’ve never been a huge Trello fan, but I do admit that I’m looking for a good tool for personal planning, so maybe I’ll give it another try.
There’s a really nice mathematical way that I found on Stack Overflow, using the golden ratio (from Wikipedia). And it’s actually set-based. I don’t have the requisite mathematical skills to evaluate the correctness or precision. Note that it returns afloat value result, the upside of which is that you can calculate much higher values. The downside is the loss of precision that comes with float.
This is a bit of a brain teaser but if you learn the techniques, they can definitely come in handy in the future. I like the third solution because it’s a reminder that writing code is just as much about understanding the domain as it is understanding the syntax.
Let’s have a look at the biggest news: DocumentDB now has protocol support for MongoDB. From the industry perspective, this is great news. MongoDB is one of the most easily-recognised NoSQLs databases. Now that DocumentDB can be a great supporting act for MongoDB, it means that architects have a broader range of tools to support business needs in the enterprise data architecture, whilst increasing business capability using the Azure cloud. How does it work? Well, or MongoDB at the wire protocol level, which means that the existing MongoDB drivers will function against DocumentDB. For IT departments, it means that enterprises can persist data in DocumentDB behind the scenes. With it, it brings the reliability, stability and resilience of the Azure cloud. It also means that these technologies are accessible for small to medium enterprises in a way that they can afford, backed up with the stability and support from Microsoft Azure that they may find difficult to service on their own.
DocumentDB has a long way to go before it catches up to MongoDB, but these are improvements that close the gap a little bit.
We have a lovely Availability Group that holds A LOT of data that is broken into partitions. We have 42 partitions and they are usually moving information around daily between them. The index rebuilds on them were making our logs HUGE because the the Availability Group was taking too long to catch up, we tried both Synchronous and Asynchronous mode. We would see all kinds of errors. We were doing horrible things like auto shrinking our transaction log after the indexing finished and ignoring alarms during the time the database was rebuilding. We had requested more and more space from our storage team and sometimes the job wouldn’t even finish because it ran out of space. Our first idea was to split out the index rebuilds so that we could do one partition at a time.
I’d call this a bit of a hack, but if it’s documented and it does the job better than any known alternative, it’s a good decision.