Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as a Enterprise Ready Real time Database Warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities – ACID support, materialized views, SQL constraints and Query result cache to name a few. Additionally, we continued to build and improve on the performance enhancements introduced in earlier releases.
In this blog, we will provide an update on our performance benchmark blog, comparing performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in benchmark is that all tables are by default transactional and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints. The hive-testbench schema has been enhanced to declare Primary-Foreign key, not null and unique constraints.
Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.
Part of the
vtreatphilosophy is to assume after the
vtreatvariable processing the next step is a sophisticated supervised machine learningmethod. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.
Read on to see a couple quick functions which help with this screening.
Sqoop performed so much better almost instantly, all you needed to do is to set the number of mappers according to the size of the data and it was working perfectly.
Since both Spark and Sqoop are based on the Hadoop map-reduce framework, it’s clear that Spark can work at least as good as Sqoop, I only needed to find out how to do it. I decided to look closer at what Sqoop does to see if I can imitate that with Spark.
By turning on the verbose flag of Sqoop, you can get a lot more details. What I found was that Sqoop is splitting the input to the different mappers which makes sense, this is map-reduce after all, Spark does the same thing. But before doing that, Sqoop does something smart that Spark doesn’t do.
Read on to see what in particular Sqoop does, and how you can use that in your Spark code.
Just in case you haven’t try to install PowerShell Core Preview.3, in Windows, which became available on the evening of the 10th of December. If are doing a clean installation, meaning that it was previously manually uninstall, or that this is your first installation, then you are fine.
The issue with installing PowerShell Core Preview.3 is when you do an upgrade over a previous version: either Preview.1 or Preview.2.
That’s a common occupational hazard when working with preview tools. But Max has your solution.
Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster.
This is going on my “try this out when I have time” list. So expect a full report sometime in the year 2023.
In my current project, one of the user requirement was to have a filter on the Year Slicer. To explain in detail, we have various measures to show metrics for current and earlier measures. For example 2016,2017 and 2018. In 2016, we always have blank values for Last Year metrics, having empty values don’t tell the story well. So to tell the story, we need to pull three years worth of data but display only two years in the Slicer. The easiest way to handle this situation would have had a visual level filter on the Year slicer.
Power BI Slicers doesn’t support Visual Slicers. However, with the help of Selection Pane and Sync slicers, I did a quick workaround.
Read on to see how to use hidden slicers to control what’s displayed to the user in visible slicers.
The base algorithm works as follows: SQL Server compares the first rows from both sorted inputs. It then continues comparing the next rows from the second input as long as the values match the first input’s value.
Once the values no longer match, SQL Server increments the row of whichever input has the smaller value – it then continues performing comparisons and outputting any joined records. (For more detailed information, be sure to check out Craig Freedman’s post on merge joins.)
Beware: Bert has animations which might cause you to space out for a few hours.
In my previous post I shared a SQL Server 2017 graph database of US capitals. Graphs are a computer science core competency and present some interesting challenges for programmers. Most notable of these challenges is finding the shortest path between nodes. Dijkstra’s algorithm is a commonly taught algorithm for finding shortest path. Dijkstra’s is often asked about during entry level developer interviews and it is a great algorithm to implement when learning a new language since it requires utilizing loops, logic, and data structures.
Here’s my implementation of Dijkstra’s algorithm using PowerShell, traversing a graph of US capitals. Rather than manage our own graph nodes and edges, we’ll utilize graph tables and queries in SQL Server. There’s a lot of different ways to implement this in PowerShell, my first cut of this ended up using a hash table so I could perform random access. There’s a give-and-take with custom PowerShell objects, which sacrifice random access for some other benefits.
Click through for the code.
Since a full backup reads every data page it makes sense that we can ask SQL Server to calculate and compare each checksum during this operation. Even a differential backup reads all of the pages for any extent (a group of 8 data pages) if any one page in it changes so it will validate some pages that didn’t change.
Read on for a demonstration.
As I mentioned, we have implemented KSQL on top of the Kafka Streams API. This means that every KSQL query is compiled into a Kafka Streams application. Therefore, KSQL queries follow the same execution model of Kafka Streams applications.
A query can be executed on multiple instances, and each instance will process a portion of the input data from the input topic(s) as well as generate portions of the output data to output topic(s). Based on this execution model and depending on how we want to run our queries, currently, KSQL provides two deployment options.
Read on for those options.