The second preconception I hear the most is the hype. Many yet-to-be practitioners expect deep nets to give them a mythical performance boost just because it worked in other fields. Others are inspired by impressive work in modeling and manipulating images, music, and language – three data types close to any human heart – and rush headfirst into the field by trying to train the latest GAN architecture. The hype is real in many ways. Deep learning has become an undeniable force in machine learning and an important tool in the arsenal of any data modeler. Its popularity has brought forth essential frameworks such as tensorflow and pytorch that are incredibly useful even outside deep learning. Its underdog to superstar origin story has inspired researchers to revisit other previously obscure methods like evolutionary strategies and reinforcement learning. But it’s not a panacea by any means. Aside from lunch considerations, deep learning models can be very nuanced and require careful and sometimes very expensive hyperparameter searches, tuning, and testing (much more on this later in the post). Besides, there are many cases where using deep learning just doesn’t make sense from a practical perspective and simpler models work much better.
It’s a very interesting article, pointing out that deep learning solutions work better than expected on smaller data sizes, but there are areas where it’s preferable to choose something else.
With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:
For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
LLAP daemons are aware about of the columns and records that are being processed which enables you to enforce fine-grained access control.
It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you to run other YARN applications concurrently without reserving a cluster specifically for LLAP.
Read on for more details, including the bootstrap action you need to take and how to use LLAP once you have it configured.
The previous post looked at how to process SSAS Tabular models with TMSL. Since SQL Server adds new JSON capabilities in 2016, let’s look at how to convert TMSL JSON to a Table with OPENJSON. OPENJSON is a new function in SQL Server 2016 that, per Microsoft:
OPENJSON is a table-valued function that parses JSON text and returns objects and properties from the JSON input as rows and columns.
In short, OPENJSON converts JSON text to a table. Since TMSL is JSON, this function can convert a SSAS Tabular processing script into a table. This could be useful if you wanted to document a SSAS processing schedule.
That’s an interesting use of OPENJSON.
Hey, there is a time to go level up and instead of drawing Spirals, Fractals and other cool stuff I decided to go 3D!
So, the first my try will be drawing 3D cubes.
As you know, SQL is not an Object Orienting Programming language, and I can’t just simply create an Object “Cube” with certain properties. To create a Cube I need a Stored Procedure:
Click through for a touch of Picasso in your database.
Wolf only had 700 complaints, but 166,900 records were estimated for return. He is looking much worse than reality shows.
So, what is happening is that there are 3 possible employee results for complaints. It is rather simple. CE is taking the total amount of records(500,701) and dividing by 3 assuming that all 3 will have roughly the same amount of records. We see that along with the estimated number of records being the same, the execution plan operators are the same. For such a variation in amount of records, there must be a better way.
I rarely create filtered statistics, in part because I don’t have a good idea of exactly which values people will use when searching. But one slight change to Wolf’s scenario might help: having a filter where name = Sunshine and a filter where name <> Sunshine (or name is null). That might help a case where there’s extreme skew with one value and the rest are much closer to uniformly distributed.
As changes are made to rows in the temporal memory-optimized table, before being transferred to the history table on disk, they are first migrated to an internal memory-optimized staging table. That means when you query the “history table”, you could be retrieving rows from both the on-disk history table, and internal staging table. Because no custom indexing was possible on the internal staging table, there could be performance implications when executing queries against historical data. Microsoft addressed these potential performance issues in SQL 2016 SP1 (detailed in this CAT blog post).
The internal staging table only gets flushed to the on-disk history table when it reaches 8% of the size of the temporal table. Given the current capacities of Windows Server 2016 (24TB memory), it’s now possible to have very large memory-optimized tables. 8% of one of those large memory-optimized tables could be quite large, which will affect query performance, if predicates don’t match available indexes.
Read on for some sobering thoughts on the topic.
We can then use this file in the
setup.exefrom the SQL Server install media. To put a little more color on that: the
.inifile is really just a collection of command-line arguments to
setup.exe; you could also list them all out in-line, but that would be tedious and silly. Here’s a couple major selling points of creating your own config file:
Slipstream updates (SP’s, CU’s), instead of having it go out to MSFT update servers (or *aghast* sticking with the original RTM bits, you heathen you!)
Specify drive letters / default file locations: sure, this may be considered old-hat if you’re running super slick storage, but I still find it makes management a bit easier if I know where my MDFs, LDFs, TempDB, & backups will always be.
Take advantage of 2016’s better TempDB setup options (# files, size & growth)
Read the whole thing.