Press "Enter" to skip to content

Author: Kevin Feasel

Spark 2.0 Technical Preview

Reynold Xin gives a preview of Apache Spark 2.0:

One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with a focus on two areas: (1) standard SQL support and (2) unifying the DataFrame/Dataset API.

On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces Spark applications use, these extended SQL capabilities drastically reduce the effort of porting legacy applications over to Spark.

There’s some great stuff coming out of Databricks.  Spark 2.0 looks to be an exciting product.
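
As a concrete illustration of the expanded SQL support, here is a minimal sketch (the table and columns are hypothetical) of the sort of subquery the new ANSI SQL parser accepts, e.g. passed to spark.sql() in 2.0:

-- Scalar subquery in a WHERE clause; customers is a hypothetical table.
SELECT customer_id, name
FROM customers
WHERE total_spend > (SELECT AVG(total_spend) FROM customers);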


Comparing Table Contents

Kenneth Fisher discusses methods for comparing table contents:

Roll your own
I.e., write one T-SQL query to compare the structure of the two tables (this one will always be the same) and another to compare the data (new each time).

  • Benefits

    • Free.
    • You get practice in writing T-SQL (yes, I’m stretching there.)

  • Flaws

    • Can take quite a bit of time to write. (Don’t forget to check for NULLs separately.)
    • Mistakes happen.
    • Performance may not be spectacular.
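
For the data half of the roll-your-own option, a minimal sketch (dbo.TableA and dbo.TableB are hypothetical stand-ins for the two tables) is a pair of EXCEPT queries; conveniently, set operators treat NULLs as equal, which covers the separate NULL checks mentioned above:

-- Rows in TableA that are missing from (or differ in) TableB:
SELECT * FROM dbo.TableA
EXCEPT
SELECT * FROM dbo.TableB;

-- And the reverse direction:
SELECT * FROM dbo.TableB
EXCEPT
SELECT * FROM dbo.TableA;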

Good comparison.  Check it out.


Data Types

Grant Fritchey implores you to use the correct data type:

Far too often, then, the easy answer is to just change the column to varchar. You can trim the time and output in exactly the format needed by the business. Problem solved, and it was easy…

Well, until someone inputs “Janry 20, 2009”, slightly mangling the spelling, and suddenly your report looks all messed up. Or, they ask you to start filtering on just the last two weeks, regardless of when the report was run, and you now can’t easily do date math on the column. Even after you get over that problem with a little formatting using CAST (along with ISDATE to try to catch all those other typos that are in the system now), you notice that the performance is really slow, so you go to put an index on the column, and now you have an index key that is 50 bytes wide instead of the 3 bytes that the DATE data type would have been, making the index less efficient (not to mention that sorting the data is going to put February ahead of January, requiring yet more formatting).

Read the whole thing.
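
To make the comparison concrete, here is a minimal sketch (table and column names are hypothetical) of what the DATE-typed version buys you:

-- A DATE column rejects 'Janry 20, 2009' outright, keeps the index key
-- at 3 bytes, and sorts chronologically instead of alphabetically.
CREATE TABLE dbo.SalesReport
(
    SaleID int IDENTITY(1,1) PRIMARY KEY,
    SaleDate date NOT NULL
);

-- "Just the last two weeks" becomes simple, sargable date math:
SELECT SaleID, SaleDate
FROM dbo.SalesReport
WHERE SaleDate >= DATEADD(DAY, -14, CAST(GETDATE() AS date));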


New Ambari Version

Paul Codding announces Ambari 2.2.2:

Grafana is deployed, managed, and pre-configured to work with the Ambari Metrics service. We are including a curated set of dashboards for core HDP components, giving operators at-a-glance views of the same metrics Hortonworks Support & Engineering review when helping customers troubleshoot complex issues.

Metrics displayed on each dashboard can be filtered by time, component, and contextual information (YARN queues, for example) to provide greater flexibility, granularity, and context.

Ambari is really shaping up to be a nice framework for managing a Hadoop cluster.  I’m excited to see improved monitoring capabilities.


Query Store Storage Options

Erin Stellato looks at Query Store storage options, specifically MAX_STORAGE_SIZE:

Now, there are catalog views that allow you to view the Query Store data.  You can copy that data into another database using SELECT INTO, and then do comparisons, but wouldn’t it be nice to have some kind of export option?  There’s a Connect item for that:

Export Query Store tables separately from the database tables: https://connect.microsoft.com/SQLServer/feedback/details/2620017/export-query-store-tables-separately-from-the-database-tables

If you think this is something that would be useful for Query Store, please upvote it!  Again, Query Store is available in ALL editions of SQL Server 2016, so this is definitely a feature you CAN use and will want to use!  This potential option won’t make it into RTM, but with the change in how SQL Server is releasing CUs, the more important the SQL Server team sees this option as being (as a result of votes), the faster it might make it into a release.

Query Store is one of the most exciting features for database administrators to hit in quite a while.  There will be some V1 pains, but this feature is well worth the upgrade to 2016.
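
Until something like that export option exists, the SELECT INTO workaround Erin mentions works fine; a minimal sketch, with hypothetical database names and an arbitrary size cap:

-- Cap Query Store's footprint (SQL Server 2016 syntax):
ALTER DATABASE MyDatabase
SET QUERY_STORE (OPERATION_MODE = READ_WRITE, MAX_STORAGE_SIZE_MB = 512);

-- Snapshot a catalog view into another database for later comparison:
SELECT *
INTO ArchiveDB.dbo.query_store_runtime_stats_snapshot
FROM MyDatabase.sys.query_store_runtime_stats;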


Setting Up A Linked Server To Oracle

Jon Morisi steps in to show how to set up a linked server connection to an Oracle database:

In this dialog box, the “TNS Service Name” drop down box should display your entries from the tnsnames.ora file.  Next, enter your Oracle User ID and click “Test Connection”, at which point you’ll be prompted for your password.  Everything should test successfully at this point.

Now would be a good time to restart.  Unfortunately, yes you need to restart…

You can do an additional test via sqlplus.  Open a Windows command prompt and enter the following:

sqlplus user/pass@[addressname]

(Where addressname is one of your connections from tnsnames.ora)

I readily admit that I’m glad I don’t need to work with Oracle.  Nonetheless, if you do need to integrate the two, this step-by-step guide will show you how.
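
Once the sqlplus test succeeds, creating the linked server itself comes down to a couple of system stored procedure calls; a minimal sketch, assuming the OraOLEDB.Oracle provider, with the linked server name, TNS alias, and credentials as placeholders:

-- Create the linked server against an entry from tnsnames.ora:
EXEC master.dbo.sp_addlinkedserver
    @server = N'ORA_LINK',
    @srvproduct = N'Oracle',
    @provider = N'OraOLEDB.Oracle',
    @datasrc = N'MyTnsAlias';

-- Map all local logins to a remote Oracle login:
EXEC master.dbo.sp_addlinkedsrvlogin
    @rmtsrvname = N'ORA_LINK',
    @useself = N'False',
    @locallogin = NULL,
    @rmtuser = N'oracle_user',
    @rmtpassword = N'oracle_password';

-- Quick test via four-part naming (schema and table are placeholders):
SELECT * FROM ORA_LINK..ORACLE_USER.SOME_TABLE;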


Spark Accumulators

Prithviraj Bose explains accumulators in Spark:

However, the logs can be corrupted. For example, the second line is a blank line, the fourth line reports some network issues and finally the last line shows a sales value of zero (which cannot happen!).

We can use accumulators to analyse the transaction log to find out the number of blank logs (blank lines), the number of times the network failed, any product that does not have a category, or even the number of times zero sales were recorded. The full sample log can be found here.
Accumulators are applicable to any operation which is:
1. Commutative: f(x, y) = f(y, x), and
2. Associative: f(f(x, y), z) = f(x, f(y, z)).
For example, the sum and max functions satisfy the above conditions, whereas average does not.
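
A quick arithmetic check of that last claim: summing 1, 2, and 3 gives 6 no matter how the values are grouped, but averaging incrementally gives avg(avg(1, 2), 3) = avg(1.5, 3) = 2.25, while avg(1, avg(2, 3)) = avg(1, 2.5) = 1.75, and neither matches the true mean of 2. That is why a distributed average is typically accumulated as a (sum, count) pair rather than as a running average.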

Accumulators are an important way of measuring just how messy your semi-structured data is.


Database Detachments And File Permissions

Daniel Hutmacher looks at what happens when you detach a database:

On most database servers, the SQL Server service account is granted full control of the directories that host the database files. It goes without saying that the service account SQL Server runs under should be able to create, read, write, and delete database files. Looking at a sample database on my local server, the .mdf and .ldf files don’t actually inherit permissions from their folder, although the permissions are very similar to those of the folder.

This all makes sense once you read the explanation, but it’s not intuitive behavior.  Read Daniel’s gotcha near the end of the post.
