Ex-Googler (and current Amazon Web Services employee) Tim Bray argues “there is a real cost to this continuous widening of the base of knowledge a developer has to have to remain relevant.” RedMonk analyst Stephen O’Grady takes this a step further: “It could be that we’re approaching the too-much-of-a-good-thing stage. In which case, the logical outcome will be a gradual slowing of fragmentation followed by gradual consolidation.”
In other words, niche data stores that do one thing really well are giving way to more generally applicable databases that can serve a broader range of enterprise needs.
The second part of Keep’s sentence above, however, spells out another reason HBase is struggling: It’s really hard to use.
I have a statement which is 90% serious and 10% joke: a database product is truly mature once it supports SQL. So what’s the answer for HBase? The current attempt at an answer is Phoenix, which is…SQL for HBase.
Support subscription revenue during the quarter was up sharply from $13.1 million to $27.6 million, an increase of 110 percent compared to the first quarter of 2015, which was Hortonworks’ first quarter as a public company following an IPO in late 2014. Professional services revenue accounted for $13.7 million in revenue, a 49 percent increase.
Hortonworks holds about 40% of the Hadoop market share, with Cloudera holding another 40%.
That’s the basics. Peeling back the onion reveals other distinct differences, which strengthen the case for a Hadoop-RDBMS coexistence strategy. RDBMS has the backing of the biggest names in the software industry, and as such has fostered an install base of IT talent probably second to none. RDBMS integrate very well with other systems, and represent a very mature technology with venerable, 40-year-old roots. RDBMS are baked into the very fabric of just about every mid- to large-sized IT organization in the world. Believe it: RDBMS aren’t going away any time soon, nor should they.
Relational databases have a strong mathematical footing which provides unparalleled data integrity. Hadoop has a strong mathematical footing which provides near-linear scale out. The key is knowing the problem you need to solve and how to integrate the results.
The question is: what is the right time period to use? The answer is that it depends on the size of your partitions. Generally, for managed tables in U-SQL, you want to target about 1 GB per partition. So, if you are bringing in, say, 800 MB per day, then daily partitions are about right. If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.
In this post, I’d like to take a look at two common scenarios that people run into. The first is a full re-compute of a partition’s data and the second is a partial re-compute of a partition. The examples I will be using are based on the U-SQL Ambulance Demos on GitHub and will be added to the solution for ease of your consumption.
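The sizing rule above is simple arithmetic, and a minimal Python sketch can make it concrete. It assumes only the two granularities the excerpt mentions (daily and hourly) and the roughly 1 GB target; the function names and the "closest to target on a ratio scale" tie-breaking are illustrative choices, not part of U-SQL.

```python
import math

# Target from the excerpt: about 1 GB per partition for managed U-SQL tables.
TARGET_GB = 1.0

# Only the two granularities the excerpt discusses; extend as needed.
PERIODS_PER_DAY = {"daily": 1, "hourly": 24}


def partition_size_gb(daily_volume_gb, period):
    """Average partition size if a day's data is split into this period."""
    return daily_volume_gb / PERIODS_PER_DAY[period]


def suggest_period(daily_volume_gb, target_gb=TARGET_GB):
    """Pick the period whose partition size is closest to the target.

    Closeness is measured on a ratio scale (assumption), so 0.5 GB and
    2 GB count as equally far from a 1 GB target.
    """
    def distance(period):
        size = partition_size_gb(daily_volume_gb, period)
        return abs(math.log(size / target_gb))

    return min(PERIODS_PER_DAY, key=distance)


if __name__ == "__main__":
    for volume in (0.8, 20.0):  # the two daily volumes from the excerpt
        period = suggest_period(volume)
        size = partition_size_gb(volume, period)
        print(f"{volume} GB/day -> {period} ({size:.2f} GB per partition)")
```

With 20 GB per day, hourly partitions average a little under 1 GB each, which is why the excerpt recommends them over daily partitions.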
The ability to reprocess data is vital in any ETL or ELT process.
When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.
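To illustrate why this is easy to automate, here is a rough Python sketch of the same idea. It is not the linked shell scripts: the function names are hypothetical, the version numbers in the usage example are placeholders to replace with whatever you actually run, and it assumes the conventional sbt layout (src/main/scala) with a plain SparkContext skeleton.

```python
def render_build_sbt(app_name, app_version, scala_version, spark_version):
    """Return build.sbt text; versions are parameters so upgrades are one edit."""
    lines = [
        f'name := "{app_name}"',
        f'version := "{app_version}"',
        f'scalaVersion := "{scala_version}"',
        'libraryDependencies += "org.apache.spark" %% "spark-core" % '
        f'"{spark_version}" % "provided"',
    ]
    # Blank lines between settings keep older sbt versions happy.
    return "\n\n".join(lines) + "\n"


def render_main(fqcn):
    """Return a skeleton Scala object for a fully qualified class name."""
    package, _, obj = fqcn.rpartition(".")
    header = f"package {package}\n\n" if package else ""
    return (
        header
        + "import org.apache.spark.{SparkConf, SparkContext}\n\n"
        f"object {obj} {{\n"
        "  def main(args: Array[String]): Unit = {\n"
        f'    val conf = new SparkConf().setAppName("{obj}")\n'
        "    val sc = new SparkContext(conf)\n"
        "    // application logic goes here\n"
        "    sc.stop()\n"
        "  }\n"
        "}\n"
    )


def source_path(fqcn):
    """Map the fully qualified class name to its conventional sbt source path."""
    return "src/main/scala/" + fqcn.replace(".", "/") + ".scala"


if __name__ == "__main__":
    fqcn = "com.example.WordCount"  # hypothetical entry point
    # Placeholder versions, shown only to demonstrate the call.
    print(render_build_sbt("word-count", "0.1.0", "2.11.8", "2.0.0"))
    print(source_path(fqcn))
    print(render_main(fqcn))
```

The only input that varies between applications is the fully qualified class name; everything else is boilerplate plus version strings.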
Check these out if you’re interested in Spark.
An evolution of the three previous scenarios that provides multiple options for the various technologies. Data may be harmonized and analyzed in the data lake or moved out to an EDW when more quality and performance are needed, or when users simply want control. ELT is usually used instead of ETL (see Difference between ETL and ELT). The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.
Hub-and-spoke should be your ultimate goal. See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
Check it out for a high-level architectural view of contemporary warehousing choices. I prefer having both systems in play: the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.
HDFS is a distributed file system that works differently from what we’re used to on the Windows side of things; the general principle is to use cheap commodity hardware and replicate data across it to provide availability and prevent data loss. With that in mind, it is a great fit for storing a lot of data cheaply for archiving purposes, or for storing large quantities of data that also need to be processed in bulk.
For more information please visit: https://msdn.microsoft.com/en-us/library/mt143171.aspx
Now if you want to try it out for yourself, make sure you install the PolyBase Engine (from the SQL Server setup) and feel free to try the modified code sample below.
PolyBase is, without a doubt, my favorite SQL Server 2016 feature. I am excited to put this through its paces in a production environment.
As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.
I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies. If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.
Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.
To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
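For readers unfamiliar with the two evaluation numbers quoted above, the following is a minimal Python sketch of how a Pearson correlation (for the regression model) and an AUC (for the classifier) are computed. It is not the authors' R Server code, and the small arrays in the usage example are made-up toy values, not taxi data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between predicted and actual values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the data
    size, so it is only suitable for small illustrations.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy values for illustration only.
    actual_tip = [0.0, 1.5, 2.0, 3.0, 5.0]
    predicted_tip = [0.4, 1.2, 2.5, 2.8, 4.1]
    print("correlation:", pearson(predicted_tip, actual_tip))

    tipped = [1, 1, 0, 1, 0]
    tip_score = [0.9, 0.8, 0.7, 0.4, 0.2]
    print("AUC:", auc(tipped, tip_score))
```

An AUC of 0.98 therefore means the classifier ranks a tipped ride above an untipped one in about 98% of such pairs.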
If you’re looking for a data set for exploration, this is certainly a good one.
We are committed to continuously updating the JDBC driver to bring more feature support for connecting to SQL Server, Azure SQL Database, and Azure SQL DW. Please stay tuned for upcoming releases that will have additional feature support. This applies to our wide range of client drivers including PHP 7.0, Node.js, ODBC, and ADO.NET which are already available.
Don’t forget Hadoop integration (e.g., via Sqoop) while you’re at it…