Apache Druid is a distributed, high-performance columnar store for real-time analytics on large datasets. Druid's core design combines ideas from OLAP analytics, time-series databases, and search systems into a single operational analytics system. Druid is best suited to data with high-cardinality columns, or to queries involving heavy aggregation or GROUP BY.
Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.
Naturally, I would expect my model to be unbiased, at least in intention, and hence any leftovers on either side of the regression line that did not make it onto the line are expected to be random, i.e. without any particular pattern.
That is, I expect my residual errors to follow a bland, normal distribution.
In R, you can do this elegantly with just two lines of code.
1. Plot a histogram of residuals
2. Add a quantile-quantile plot, with a line that passes through the first and third quartiles.
There are several more techniques for analyzing residuals in the post, so check it out.
This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and its ability to generate massive datasets, up to 100 terabytes in size. TPC data is intended for benchmarking, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.
The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial and would take a very long time on modest hardware. Thankfully, someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to set up, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done, you can delete the cluster, leaving the data in place.
Most of the instructions should carry over to on-premises or non-HDInsight Hadoop clusters, though some changes will be needed to accommodate differences in HDInsight.
I love it when someone sends me a repro script, but in this case I didn’t need to run it. The first thing I did was to look at the two numbers given for row size, and to subtract the smaller one from the larger one: 724 – 710 = 14 bytes of difference.
That bit of information alone gave me an immediate guess of what was going on.
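If you want to see where numbers like that come from, here is a minimal sketch (not the original repro, and against a hypothetical dbo.SomeTable) of one way to pull row sizes:

```sql
-- A minimal sketch, not the original repro: inspect min/max/average row sizes
-- for a hypothetical dbo.SomeTable using the DETAILED scan mode.
SELECT index_id,
       min_record_size_in_bytes,
       max_record_size_in_bytes,
       avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID(N'dbo.SomeTable'), NULL, NULL, 'DETAILED');
```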
Click through for the solution as well as a more detailed explanation of one of the trickier scenarios.
Recently I had to look up the definitions for a bunch of SQL objects and didn’t want to retrieve them manually in SSMS (with Create Scripts) or Visual Studio (by searching for the object name in my TFS repository).
Since laziness and automation are the basis of well-done engineering work, I wanted to create a list where I could simply click on the object I needed and see its definition right away, without any external tool or having to code something separately, of course.
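If all you need is the raw text, a minimal sketch along these lines (using the standard catalog views, not necessarily the author’s approach) pulls every module’s definition in one query:

```sql
-- A minimal sketch, not the author's exact solution: list the definition of
-- every programmable object (procedures, views, functions, triggers) at once.
SELECT s.name AS schema_name,
       o.name AS object_name,
       o.type_desc,
       m.definition
FROM sys.sql_modules AS m
JOIN sys.objects AS o ON o.object_id = m.object_id
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
ORDER BY s.name, o.name;
```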
Click through for the solution, which is short and sweet.
In SSMS, you’re used to being able to click the “Cancel” button on your query, and having your work rolled back.
You’re also used to being able to kill a query, and have it automatically roll back.
Neither of those is true with resumable index creation. In both cases, whether you kill the index creation statement or just hit the Cancel button in SSMS to abort your request, the index creation is simply paused until you’re ready to come back to it. (Or, it’s ready to come back to haunt you, as we saw above.)
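As a rough illustration of that behavior (a sketch with made-up object names, assuming SQL Server 2019 or later for resumable index creation):

```sql
-- A sketch with hypothetical names: create the index as a resumable operation.
CREATE INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    WITH (ONLINE = ON, RESUMABLE = ON);

-- If the statement is killed or cancelled, it does not roll back; it shows up
-- here as a paused operation instead.
SELECT name, state_desc, percent_complete
FROM sys.index_resumable_operations;

-- To genuinely throw the work away, you have to abort it explicitly;
-- ALTER INDEX ... RESUME would instead pick it back up where it paused.
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders ABORT;
```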
There are some good things to keep in mind here.
This is going to be a bit of a brainstorming post that comes from an interesting question that I was asked today…
“I’ve got a table with an ID code field; now, some of the rows have a value in that field and some are NULL. How can I go about filling in those NULL values with a valid code, but at the same time avoid introducing duplicates?”
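Before clicking through, here is one naive sketch of how it could be approached, assuming the code is an integer and anything above the current maximum counts as a valid, unused code:

```sql
-- A naive sketch (not David's solution), assuming dbo.SomeTable.IDCode is an
-- integer and any value above the current maximum is a valid, unused code.
DECLARE @MaxCode int = (SELECT ISNULL(MAX(IDCode), 0) FROM dbo.SomeTable);

WITH RowsToFill AS
(
    SELECT IDCode,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM dbo.SomeTable
    WHERE IDCode IS NULL
)
UPDATE RowsToFill
SET IDCode = @MaxCode + rn;
```

Under concurrent inserts this needs more care (locking, or drawing from a sequence), which is part of what makes the question interesting.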
Click through for David’s solution.
The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.
To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.
This isn’t quite a step-by-step guide but does spur on ideas.
I used to go to the Emoticons (Emoji) 1F600–1F64F page of unicode-table.com to copy and paste characters, code points, or check the encoding chart at the bottom of each character page (the “hex” column of both the “UTF-16BE” and “UTF-16LE” rows has proven most useful).
But not anymore. Now, I just hit: Ctrl + 0.
When I do that, I get a list of 188,657 code points. Each row contains the official code point value (“U+HHHH”), the integer value, the hex value (“0xHHHH”), the character itself, the UTF-16 Little Endian byte sequence (how it is actually stored, and what you get if you convert an NVARCHAR value to VARBINARY), the surrogate pair values, the T-SQL notation (which does not require using a “_140_” collation), the HTML notation (“&#xHHHH;”), and finally the C-style notation (“\xHHHH”, as used in C / C++ / C# / Java / etc.). I can copy and paste any of those values and use them in queries, emails, blog posts, .NET code, and so on.
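The real shortcut queries a prebuilt table, but as a rough sketch of the kind of output involved (covering only a small BMP range, with column names invented for illustration):

```sql
-- A rough sketch, not Solomon's actual query: list the Arrows block (U+2190
-- through U+21FF) with its hex value, the character, and its UTF-16LE bytes.
-- Code points above U+FFFF would need surrogate pairs or a supplementary-
-- character-aware collation for NCHAR().
WITH Numbers AS
(
    SELECT TOP (112)
           8592 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS cp  -- 8592 = U+2190
    FROM sys.all_objects
)
SELECT cp AS code_point,
       CONVERT(varchar(12), CAST(CAST(cp AS int) AS varbinary(4)), 1) AS hex_value,
       NCHAR(cp) AS [character],
       CONVERT(varbinary(2), NCHAR(cp)) AS utf16le_bytes
FROM Numbers;
```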
Click through to see how Solomon does this.