Here is how we can go about combining the Products and the newly created SubCategory table for a more usable and elegant solution. I have highlighted the key steps, you can examine the solution file to go step by step in the Power BI Query.
Note: Don’t be scared of the M code, 99% of these steps were generated using the Query ribbon – I have included those screenshots as well.
When developing BI models, it’s important to keep things as simple as possible. Your desired end user likely does not have the necessary skill level to wade through normalized table designs, so make it easy for them to get their jobs done.
The other way to automate this stored procedure is by attaching it to the queue itself so that it Service Broker directly executes the procedure in a process called activation. With activation Service Broker starts an application whenever there is work to do (e.g. when messages are in the queue). When we enable activation for a queue, SQL Server spawns a queue monitor process to keep an eye on the queue and check every few seconds to see whether there are messages available for processing. There are 2 types of activation:
Internal activation – A stored procedure like the one above is attached to the queue to process messages automatically as they arrive.
External activation – An external program is notified that messages need to be processed.
Colleen focuses on internal activation, which calls a stored procedure for each message received. Check her post out.
Your DataFrame is skewed if most of its rows are located on a small number of partitions, while the majority of the partitions remain empty. You really should avoid such a situation. Why? This makes your application virtually not parallel – most of the time you will be waiting for a single task to finish. Even worse, in some cases you can run out of memory on some executors or cause an excessive spill of data to a disk. All of this can happen if your data is not evenly distributed.
To deal with the skew, you can repartition your data using distribute by. For the expression to partition by, choose something that you know will evenly distribute the data. You can even use the primary key of the DataFrame!
It’s interesting to see how cluster by, distribute by, and sort by can have such different performance consequences.
Kafka Streams: Kafka Streams was introduced as part of thetech preview release of the Confluent Platform few months ago and is now available through Apache Kafka 0.10.0.0. Kafka Streams is a library that turns Apache Kafka into a full featured, modern stream processing system. Kafka Streams includes a high level language for describing common stream operations (such as joining, filtering, and aggregating records), allowing developers to quickly develop powerful streaming applications. Kafka Streams offers a true event-at-a-time processing model, handles out-of-order data, allows stateful and stateless processing and can easily be deployed on many different systems— Kafka Streams applications can run on YARN, be deployed on Mesos, run in Docker containers, or just embedded into existing Java applications.
There are some nice improvements in this latest version of Kafka.
In order for SQL Server to read and update data, the data must be in the buffer. SQL Server does not work directly with the data in the files on disk. Once the pages of data are in the buffer, they can be used for multiple queries. This means that the data doesn’t have to be retrieved from disk every time it’s needed, thereby decreasing the amount of I/O work required.
You may have seen this yourself when selecting all the rows of a large table twice. The second time, the query can run much faster because the data does not have to be copied from the disk to the buffer. If you run another query from a different large table, it may cause the pages from the first table to be removed to make room. If there is not enough memory, pages will have to be read from disk more frequently causing your queries to be slow.
Even with extremely fast SSDs and flash storage arrays, RAM is still typically an order of magnitude faster, so having enough RAM and using it wisely is critical to a well-functioning SQL Server instance.
Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the
loadRDSfunction pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R in-memory representation of the object), and so you can’t read the data with any software other than R.
The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files with R, it must spend signfican time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.
Given the big speedup in read time, I can see this file format being rather useful. I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file. So instead, it’d end up closer to something like Avro or Parquet: formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
It’s not fast. The above piece of T-SQL took ~4 seconds to execute. This is on an Azure A3 VM. Not a great machine admittedly, but the R code, which just returns the first 6 rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.
I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else, probably R Studio, maybe Visual Studio, and move it over once tested and working. I’d much rather see something like
I agree with the sp_execute_external_script mess. It’s the worst of dynamic SQL combined with multiple languages (T-SQL for the stored procedure & R for the contents, but taking care to deal with T-SQL single-quoting). Still, even with these issues, I think this will be a very useful tool for data analysts, particularly when dealing with rather large data sets on warehouse servers with plenty of RAM.
The code is executed as an external script, specifying that the language used should be R. @script contains the R code, which is a simple command to take the mean of the data coming from the InputDataSet. @Input_Data_1 contains the location of the data to be processed. In this case the data set is a table containing Amazon review data, where the overall field is the rating field. Of course the R code could of course be more complicated, but I was hoping that this example was generic enough that many people would be able to duplicate it and run their first R code.
This is quite a bit easier to install in RTM(ish) than it was back in CTP 3, so good job Microsoft.
From the above observation, a natural next step for us was to explore the possibility of automatically generating this handwritten code at runtime, which we are calling “whole-stage code generation.” This idea is inspired by Thomas Neumann’s seminal VLDB 2011 paper onEfficiently Compiling Efficient Query Plans for Modern Hardware. For more details on the paper, Adrian Colyer has coordinated with us to publish a review on The Morning Paper blog today.
The goal is to leverage whole-stage code generation so the engine can achieve the performance of hand-written code, yet provide the functionality of a general purpose engine. Rather than relying on operators for processing data at runtime, these operators together generate code at runtime and collapse each fragment of the query, where possible, into a single function and execute that generated code instead.
The possibility of getting an order of magnitude better performance is certainly enticing.
Grafana provides a powerful and customizable dashboard builder for visualizing time series data. Ambari installs Grafana v2.6 as a Master Component of AMS and adds a datasource for AMS to Grafana. The dashboard builder is supported through a Metadata API in AMS that allows easy discovery of metrics, applications and hosts which are the key components that formalize an API call to AMS. There has been significant work put into creating templated dashboards for Hadoop ecosystem services tailored towards analyzing issues and performance bottlenecks on the Hadoop cluster. The following is an image of the dashboard builder highlighting the metric name drop down with type ahead and auto complete along with options to apply aggregate functions as needed based on whether the metric is a GAUGE or a COUNTER.
This is the beginning of a good visualization system for Hadoop metrics.