Author: Kevin Feasel

HDInsight 3.6 Available

Ashish Thapliyal points out some Hive improvements in HDInsight 3.6:

Create a new Hive table from scratch or alter a table

Create a new table by clicking the ‘+’ icon, which opens the create table wizard. Enter a table name and column names, and choose a data type from the dropdown. You can pick the following advanced Hive settings directly from the UI:

  • Transactional: Turn on transaction support in Hive by checking this flag. Note that the table must be bucketed and stored using an ACID-compliant format (such as ORC).

  • Location: Hive stores the data for managed tables in the Hive warehouse directory in HDFS, which is configured in hive-site.xml with the property hive.metastore.warehouse.dir. The default location is /apps/hive/warehouse; it can be changed using the Location text field.

  • File Format: The default file format for the CREATE TABLE statement is ORC. Choose a format from the file format dropdown.

  • Row Format: Select row format details such as the field terminator, lines terminator, and stored file type.

  • Tables can be altered to add new columns or to change a column’s name or data type.

  • Tables can also be renamed and altered.
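
Under the covers, the wizard is generating standard Hive DDL. As a rough sketch (table and column names here are invented), a transactional table must be bucketed and stored in an ACID-compliant format such as ORC:

    -- Hypothetical DDL behind the wizard's options: a bucketed,
    -- ORC-backed, transactional table with an explicit location.
    CREATE TABLE page_views (
        user_id   BIGINT,
        page      STRING,
        view_time TIMESTAMP
    )
    CLUSTERED BY (user_id) INTO 8 BUCKETS
    STORED AS ORC
    LOCATION '/apps/hive/warehouse/page_views'
    TBLPROPERTIES ('transactional' = 'true');

    -- Altering the table afterward: add a column, then rename the table.
    ALTER TABLE page_views ADD COLUMNS (referrer STRING);
    ALTER TABLE page_views RENAME TO page_views_v2;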

Read on for more improvements, including a graphical plan viewer and improved autocomplete.

Comments closed

MERGE In Hive

Carter Shanklin notes that Hive now has the ability to run MERGE statements:

As scalable as Apache Hadoop is, many workloads don’t work well in the Hadoop environment because they need frequent or unpredictable updates. Updates using hand-written Apache Hive or Apache Spark jobs are extremely complex.  Not only are developers responsible for the update logic, they must also implement all rollback logic, detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.

This approach is too complicated and can’t meet reasonable SLAs for most applications. For many, Hadoop became just a place for analytics offload — a place to copy data and run complex analytics where they can’t interfere with the “real” work happening in the EDW.

This post mostly describes the gains rather than showing code, but it does show that Hive developers are looking at expanding beyond common Hadoop warehousing scenarios.
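
For reference, Hive’s MERGE follows the ANSI SQL syntax; a hypothetical upsert against an ACID table (table and column names invented) would look something like this:

    -- Apply staged changes to an ACID target table in one statement:
    -- delete flagged rows, update matches, insert new rows.
    MERGE INTO customers t
    USING customer_updates s
        ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.is_deleted = 1 THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email, last_updated = s.last_updated
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.last_updated);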

Comments closed

More Isn’t Better With Data Collection

Andy Leonard argues that more data is not better data:

The Problem I am Trying To Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (fooledbyrandomness.com | @nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and spoken about data quality in data warehouse solutions, saying:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”

Taleb states more data includes a disproportionate amount of bad data, and that bigger data results in more spurious correlations. In other words, more is not better – it’s worse.

It’s an idea worth grappling with.  The other side of the argument is that for some problems, you won’t know what you need until you need it.

Comments closed

Guessing At Foreign Key Relationships

Daniel Hutmacher has put together a script to try to find hidden foreign key relationships in a database:

Now, before you go crazy with this stuff, remember, it’s not a magic bullet, but rather some automation help to save you some coding and to help you review your data model. The script doesn’t change the database, it only prints out its suggestions, and this is totally by design.

  • For this to work, you’ll obviously need proper primary keys or unique indexes on your referenced tables.

  • We’re working on the assumption that the referencing and referenced column names are the same. Go ahead and change the script to suit your naming standards (look for the comment in the CTE).

  • The script has no domain knowledge of your database, so some of the suggestions are probably going to be downright silly.

This is a good first pass approach, especially if you have a larger database completely lacking in relational integrity.
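
To give a flavor of the technique (this is a much-simplified sketch of the idea, not Daniel’s script), you can match column names against primary key columns elsewhere in the database:

    -- Naive candidate-FK finder: any column sharing a name with a
    -- primary key column on a different table is a candidate.
    WITH pk_cols AS
    (
        SELECT ic.object_id, c.name
        FROM sys.indexes AS i
        JOIN sys.index_columns AS ic
            ON ic.object_id = i.object_id
           AND ic.index_id = i.index_id
        JOIN sys.columns AS c
            ON c.object_id = ic.object_id
           AND c.column_id = ic.column_id
        WHERE i.is_primary_key = 1
    )
    SELECT OBJECT_NAME(col.object_id) AS referencing_table,
           col.name                   AS shared_column,
           OBJECT_NAME(pk.object_id)  AS referenced_table
    FROM sys.columns AS col
    JOIN pk_cols AS pk
        ON pk.name = col.name
       AND pk.object_id <> col.object_id;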

Comments closed

Recompiling Memory-Optimized Procedures On AGs

Ned Otter takes note of how natively compiled stored procedures differ from traditional stored procedures, especially on Availability Group secondary nodes:

As of SQL 2016, the database engine automatically updates statistics for memory-optimized tables (documentation here), but recompilation of native modules must still be performed manually. But hey, that’s way better than SQL 2014, when you couldn’t recompile at all; you had to drop/recreate the native module. And natively compiled stored procedures don’t reside in the plan cache, because they are executed directly by the database engine.

This post attempts to determine if the requirement to manually recompile native modules is any different for AG secondary replicas.

The results are interesting.
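
For reference, the manual recompilation Ned describes is a call to sp_recompile against the native module (the module name below is hypothetical):

    -- Flag a natively compiled stored procedure for recompilation;
    -- the module is recompiled on its next execution.
    EXEC sys.sp_recompile N'dbo.InsertOrderNative';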

Comments closed

Thinking About NULL

Duncan Greaves on NULL:

NULL exists because the following general conditions apply:

Existence – The attribute does not exist in the domain, or our understanding of the domain is wrong. This means there is a missing entity in our domain model, or entities are mixed in a table. E.g., a table contains hair colour for a car entity, or number of pregnancies for male patients.

Missing – The information has not been given at the time a row was created. E.g., a customer may decline to give their age.

Not Yet – Data is contingent upon an unknown event in the future. E.g., termination date or date of death.

Does not apply – Not applicable for this instance of a record. E.g., hair colour for bald people.

Placeholders – Indicates that we know a bit of data exists, but we don’t know what it is; in this case, keeping a NULL is useful for CUBE or ROLLUP queries.

In real-world applications of data structures, NULLs are often unavoidable. However, NULL confuses users, and designers and DBAs (generally) hate it. It complicates reporting, ETL, business intelligence, and data science initiatives. As such, users need to be aware of the design and query compromises they need to make.

I think there’s significance in what NULL represents, but it’s a concept with its fair share of complexity.  Read the whole thing.
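
That last point about CUBE and ROLLUP deserves a quick illustration. Against a hypothetical sales table, GROUPING() lets you distinguish a NULL stored in the data from the placeholder NULL that ROLLUP generates:

    -- is_rollup_row = 1 marks the NULL that ROLLUP produced for the
    -- grand total row; is_rollup_row = 0 with a NULL region means a
    -- real NULL stored in the data.
    SELECT region,
           SUM(amount)      AS total_amount,
           GROUPING(region) AS is_rollup_row
    FROM dbo.Sales
    GROUP BY ROLLUP(region);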

Comments closed

Collations

Robert Sheldon has an article on collations:

The ideal solution is to choose a collation when setting up SQL Server that can be used for all your user databases and character columns. Using one collation removes any issues you might encounter when querying the data in different ways. It can also be the best approach in terms of performance if multiple collations impact your queries. However, this approach works only if the same language and collation settings are appropriate for all your users and applications—or at least a good majority of them.

If you support multi-cultural environments, you’ll need to take into account a number of considerations. To begin with, you should pick collations that support the most users, and you should use Unicode data types where possible because they can help avoid code page conversion issues. Just keep in mind the storage requirements that come with Unicode’s two bytes per character.

My inclination is to say Unicode everywhere possible.  There are cases in which Unicode doesn’t fit, but it’s easy to use, and if you have enough data to worry about the extra bytes Unicode characters take up, Unicode compression is available.
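
As a quick, hypothetical illustration: collations can be set per column, and row compression (which brings Unicode compression for NVARCHAR(n) columns as of SQL Server 2008 R2) softens the storage cost:

    -- Unicode for user-facing text; an explicit collation on one
    -- legacy column; row compression to get Unicode compression on
    -- the NVARCHAR column.
    CREATE TABLE dbo.Customers
    (
        CustomerID INT IDENTITY(1, 1) PRIMARY KEY,
        FullName   NVARCHAR(100) NOT NULL,
        LegacyCode VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    )
    WITH (DATA_COMPRESSION = ROW);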

Comments closed

Microsoft R Open 3.3.3

David Smith reports that Microsoft R Open 3.3.3 is now available:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.3.3, upgrades the installer, and updates the bundled packages.

R 3.3.3 makes just a few minor fixes compared to R 3.3.2 (see the full list of changes here), so you shouldn’t encounter any compatibility issues when upgrading from MRO 3.3.2. For CRAN packages, MRO 3.3.3 points to a CRAN snapshot taken on March 15, 2017, but as always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

Click through for more details.  As a side note, CRAN R 3.4 is scheduled for release this month, so given their recent cadence, I’d guess MRO 3.4 to be out late this year.

Comments closed

Network Navigator Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Network Navigator Power BI Custom Visual.  You may find the need to use the Network Navigator when you’re trying to find links between different attributes in a dataset. It does this by visualizing each attribute as a node, and the strength of activity between those nodes can be represented in multiple ways.

Click through to get to Devin’s video.  This visual looks interesting for graphical analysis, like trying to tease out common connections or discovering dependencies.

Comments closed