We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results. And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.
WHAT IS DRUID?
Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.
This first post is mostly about Druid, which sounds like it might eventually become a very interesting technology for implementing Kimball-style warehouse models but for the whole “Joins? We don’t need no steenkin’ joins” philosophy. But when used as one engine component (as mentioned in the post), I can see it being quite useful.
Although most analytics applications today still leverage older data warehouse and OLAP technologies on-premises, the pace of the cloud shift is significantly increasing. Infrastructure is getting better and is almost invisible in mature markets. Cloud fears are subsiding as more organizations witness the triumphs of early adopters. Instant, easy cloud solutions continue to win the hearts and minds of non-technical users. Cloud also accelerates time to market allowing for innovation at faster speeds than ever before. As data and analytics professionals, be sure to make time to learn a variety of cloud and hybrid analytics tools.
Exploring novel technologies across various ecosystems in the cloud world is usually as simple as spinning up a cloud image or service to get started. There are literally zillions of free and low cost resources for learning. As you dive into a new world of data, you will find common analytics architectures, design patterns, and types of technologies (hybrid connectivity, storage, compute, microservices, IoT, streaming, orchestration, database, big data, visualization, artificial intelligence, etc.) being used to solve problems.
It’s worth reading the whole thing.
Looking at this data, the first thing I thought was untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which is used to help make data tidy, easier to work with. Specifically, I thought of the
spread()function, where I could break things up. Once data was spread into appropriate columns, I figure I can operate on the data a bit better.
Sarah has also made the data set available in case you’re interested in following along.
When we read the list of coefficients, here is how we interpret them:
The intercept is the starting point – so if you knew no other information it would be the best guess.
Each coefficient multiplies the corresponding column to refine the prediction from the estimate. It tells us how much one unit in each column shifts the prediction.
When you use a categorical variable, in R the intercept represents the default position for a given value in the categorical column. Every other value then gets a modifier to the base prediction.
Linear regression is easy, but the real value here is Steph’s explanation of logistic regression coefficients.
One of the promises of Azure SQL Data Warehouse is the ability to have petabyte scale. The ability to quickly scale data, and have that data scale independently of compute resources. So when one I my clients emailed me yesterday with this screenshot, needless to say I was concerned.
As you can see, when the properties screen shows a max size of 5 GB.
Click through for the reason why.
The error message is a little misleading but let’s save the debugging debate for another day. The key observation is “Guildford” data is not available, simply because it comes after “Camberley” in the list. Whilst we want to see errors in a Query, we do not want them causing data loss.
As I mentioned at the beginning of this article, using the Remove Errors function would prevent the loss of Guildford data. However, the user needs to handle errors as Unknown Members and conform to a typical Kimball Data Warehouse.
I am sure there are many ways to fulfil the requirement, but here is how I approached it:
Read on for the resolution.
What does all that mean? No idea. Clearly there is JSON that’s returned here and can be deserialized to gather meanings. Is this useful? I think graphs solve a certain set of problems very well, and more efficiently than relational systems. Certainly I could implement a graph structure relationally, but at scale I’m not sure the queries would be as easy to write or run as quickly.
I don’t know if I’d use a graph structure in any of the problems we try to solve in the SQLServerCentral app, but who knows. Maybe we would if we could.
Steve leaves this with more questions than answers, but he does give a very simple repro script if you want to futz about with graphs.
Basically, it escapes any occurrence of the second parameter within the first parameter. So when would we be using it in dynamic SQL? Well, probably the most common way I’ve used it is when I’m building a list of commands I want to run.
Click through for more details, including valid quote characters.
We have three types of physical join algorithms in SQL Server: hash, nested loops and merge. Adaptive join allows SQL Server automatically choose an actual physical algorithm on the fly between the first two – hash (HM) and nested loops (NL).
NL has two join strategies – naive nested loops join (inner loop scans the whole inner table or index) and index nested loops join (index on the join column of the inner table is used to find necessary rows and then those rows are applied to the outer row, also called Nested Loops Apply). Typically, the second one performs very well if you have rather small input on the outer side and indexed rather big input on the inner side.
HM is more universal and uses hash algorithms to match rows, so no indexes are necessary. You may refer to my blog post Hash Join Execution Internals for more details.
Adaptive Join starts execution as a Hash Join. It consumes all the input of the build phase and looks at the adaptive join threshold, if the number of rows is more or equal this threshold it will continue as a hash join. However, if the number of rows is less than this threshold, it will switch to a NL.
If you want to get a better understanding of how adaptive joins works, Dmitry’s post is a great start.