Star-Schema Benchmark With Hive + Druid

Carter Shanklin and Slim Bouguerra run a Hadoop OLAP system running Hive and Druid against the Star-Schema Benchmark battery of queries:

How did we arrive at the query used to build the OLAP index? There is a systematic procedure:

  1. The union of all dimensions used by the SSB queries is included in the index.
  2. The union of all measures is included in the index. Notice that we pre-compute some products in the index.
  3. Druid requires a timestamp, so the date of the transaction is used as the timestamp.

You can see that building the index requires knowledge of the query patterns. Either an expert in the query patterns architects the index, or a tool is needed to analyze queries or to dynamically build indexes on the fly. A lot of time can be spent in this architecture phase, gathering requirements, designing measures and so on, because changing your mind after-the-fact can be very difficult.

One thing I don’t like so much is that they removed the ORDER BY clauses from some of the queries, as making this change makes it more difficult to use these results for “it’s totally not a comparison so don’t sue us Oracle” purposes.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Read More

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Categories