Since we are prognosticating, I want to take a guess at one of the constraints limiting the future. I present you with Meidinger’s law:
An industry’s growth is constrained by how much your junior dev can learn in two years.
Let me explain. On my team, one of our developers’ just left for a different company. We also have a college student who will be going full time in May, upon graduation. How long do you think it’s going to take the new guy to get up to speed?
And how long do you think he’s going to stay?
This I think is a useful dictum which explains a pretty good amount of industry movement.
The worst are by far the string representation, as there is no actual check on the contents. It can literally contain everything. And is ’01/02/2018′ the first of February 2018 (like any sane person would read, because days come before months), or the 2nd of January? So if you have to store dates in your data warehouse, avoid strings at all costs. No excuses.
The integer representation – e.g. 20171208 – is really popular. If I recall Kimball correctly, he said it’s the one exception where you can use smart keys, aka surrogate keys that have a meaning embedded into them. I used them for quite some time, but I believe I have found a better alternative: using the actual date data type.
I bounce back and forth, but I’m sympathetic to Koen’s argument, which you can read by clicking through.
Yes, you have read it… I see dates stored as NVARCHAR(10) and NCHAR(10) on daily basis, please don’t ask me why.
This case is even worse, because DATE takes 3 bytes where NCHAR(10) takes 20 bytes, yes Ladies and Gentlemen more than 6 times more space to store the same data.
But wait! how can you be certain that those ten characters are actually a valid date? You can’t, unless you reinvent the wheel and validate that those dates are obviously valid dates and pay the performance penalty of doing it.
You’d think that picking the right data type for something would be fairly easy and then you find a table with a few dozen NVARCHAR(MAX) columns.
Thank you to everyone that took the time to write and contribute, I enjoyed reading about how you conquered your challenges, here is a round-up in no particular order.
There is, as always, plenty of reading available.
Thank you to all of you for taking time to contribute. I like the suggestion made by Glenda Gable on being partners to help with accountability on our goals. If anyone feels up to this just leave a comment below, we can set up a slack channel or something similar to help each other.
There were quite a few entries this month.
Firstly, I want to thank every person who took part. SIXTY TWO blog posts got generated, including a few first-time #tsql2sday contributors as well as first-time bloggers. I am fairly glowing to have been a part of it, and I hope the other contributors are too.
Secondly, from my own experience in writing a post, I know it feels terrible when you start worrying about who to pick. There are many people I could have included, but I hope I have made my appreciation for them clear elsewhere. Not that I want to speak on your behalf, but I’ll assume that the same applies to many other contributors.
That’s a lot of reading.
This month’s T-SQL Tuesday was all about Big Data. See what the community has to say about Big Data with this collection of articles ranging from deep technical walk-throughs to musings about Big Data’s impact on our industry and the data professional.
Click through to see the participants.
But now we run into a problem: there are certain ports which need to be open for Polybase to work. This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs. This goes back to the issue we see with spinning up data nodes in Docker: ports are not available. If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
As I mention in the post, I’d much rather build my own Hadoop cluster; I don’t think you save much maintenance time in the long run going with HDInsight.
Querying Cosmos DB is more powerful and versatile. The CreateDocumentQuery method is used to create an IQueryable<T> object, a member of System.Linq, which can output the query results. The ToList() method will output a List<T> object from the System.Collections.Generic namespace.
Derik also shows how to import the data into Power BI and visualize it. It’s a nice article if you’ve never played with CosmosDB before.
When CLR came in, people said it was a T-SQL killer. I remember a colleague of mine telling me that he didn’t need to learn T-SQL, because CLR meant that he would be able to do it all in .Net. Over time, we’ve learned that CLR is excellent for all kinds of things, but it’s by no means a T-SQL killer. It’s excellent for a number of reasons – CLR stored procedures or functions have been great for things like string splitting and regular expressions – and we’ve learned its place now.
I don’t hear people talking about NoSQL like they once did, and it’s been folded somehow into BigData, but even that seems to have lost a little of its lustre from a year or two ago when it felt like it was ‘all the rage’. And yet we still have data which is “Big”. I don’t mean large, necessarily, just data that satisfies one of the three Vs – volume, velocity, variety.
Rob brings an interesting perspective to the topic, particularly as one of the early Parallel Data Warehouse bloggers.