Press "Enter" to skip to content

Author: Kevin Feasel

Converting CSV To Parquet

Praveen Sripati shows how to use Spark Dataframes to convert a CSV file into a Parquet format:

In the previous blog, we looked at on converting the CSV format into Parquet format using Hive. It was a matter of creating a regular table, map it to the CSV data and finally move the data from the regular table to the Parquet table using the Insert Overwrite syntax. In this blog we will look at how to do the same thing with Spark using the dataframes feature.

Most of the code is basic setup; writing to Parquet is really a one-liner.

Comments closed

Comparing File Formats In Hadoop

Andrew Peterson points out performance comparisons for various Hadoop file formats:

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.

Comments closed

Returning Defult Rows

Christopher Huntley wants to return a default record when there are no results:

Or if you’re ready to take it to the ╰[ ⁰﹏⁰ ]╯level then change the column to NChar and use the hex of your favorite emoji like:

DECLARE @testtable1 TABLE (
testid int identity (1,1),
testvalue  nchar (255))

–use the below for the final query

SELECT
ISNULL((SELECT testvalue from @testtable1 where testvalue > 101),  NCHAR(0xD83D)+ NCHAR(0xDE20) ) as testvaluethatworks

There are a few other alternatives, such as loading results into a temp table and inserting a default row if the temp table is empty.

Comments closed

Emoji In SQL Server

Daniel Janik wants to have a ninja cat riding a T-Rex for a database name:

There it is! The ninja cat database! You can see that even IntelliSense shows the ninja cat. Cool, right? How does it show in Object Explorer?

DOH! There’s obviously something strange going on here. Let’s validate the sys.databases table:

If full emoji support is the thing keeping you from moving to SQL Server, you might have to wait until the next version.

Comments closed

Columnstore Improvements

Warner Chaves discusses improvements in columnstore indexes in SQL Server 2016:

SQL Server first introduced Columnar Storage with the SQL 2012 Enterprise release. In this release, Columnstores were read-only indexes, so it required to drop the index, load the table or partition and then rebuild the index to refresh it with the latest data.

SQL Server 2014 upgraded Columnstores with full read-write capabilities, allowing the Columnstore to become the ‘clustered’ index for the table and hold all the data instead of just being one more index on top of row-organized data. 2014 also introduced many improvements to batch operations so more pieces of an execution plan could take advantage of this faster processing mode.

Read on to see changes in 2016.

Comments closed

Resource Governor

Mark Wilkinson walks through Resource Governor:

Now that we know which statistics we can gather, we need to actually start gathering them. While the DMVs for Resource Governor are great, they will only give you an aggregate of the usage information since the last time the statistics were reset, or the last time services were restarted.

In most cases it makes sense to store your data in a separate table so you can calculate differentials between two time periods. For our example we are only going to be interested in request counts and CPU usage. For this, we will create the following table

Read the whole thing.

Comments closed

Bugfixes To SQLCover

Ed Elliott notes that there is a new version of SQLCover out:

There have been a couple of fixes in SQLCover this week, kindly submitted by John Mclusky (https://github.com/jmclusky):

Code coverage not reported correctly for CTEs at the end of a stored procedure if the ‘with’ is immediately preceded with a semicolon

and

DeclareTableVariableStatement statements cannot be covered, so report falsely as missing coverage

Go check out SQLCover.

Comments closed

Data Lake Planning

Melissa Coates discusses some of the planning involved with creating a data lake:

Does a Data Lake Replace a Data Warehouse?

I’m biased here, and a firm believer that modern data warehousing is still very important. Therefore, I believe that a data lake, in an of itself, doesn’t entirely replace the need for a data warehouse (or data marts) which contain cleansed data in a user-friendly format. The data warehouse doesn’t absolutely have to be in a relational database anymore, but it does need a semantic layer which is easy to work with that most business users can access for the most common reporting needs.

On this question, my answer is “Absolutely not.”  Data warehouses are designed to answer specific, known business questions.  They’re great for regulatory reporting, quarterly reports to shareholders, weekly reports to management, etc.  Data lakes are designed for ad hoc analysis of information.  Read the whole thing.

Comments closed

Datetime Conversion Change

Dan Guzman notes a change in behavior in how datetime fields are upconverted:

SQL Server 2016 and Azure SQL Database V12 use the raw datetime internal value without rounding during conversion to another temporal type. The value is rounded only once during conversion, to the target type precision. The end result will be the same as before SQL Server 2016 when the target type precision is 3 or less. However, the converted value will be different when the target type precision is greater than 3 and the internal time unit interval is not evenly divisible by 3 (i.e. rounded source datetime millisecond value is 3 or 7). Note the non-zero microseconds and nanoseconds in the script results below and that rounding is based on the target type precision rather than the source.

This is a good thing on net, but be aware of this if you try to compare datetime versus datetime2 values.

Comments closed

SQL Server 2016 CU2 And Spinlocks

Arvind Shyamsundar has a nice post on SQL Server 2016’s Cumulative Update 2 and how it reduces spinlock contention:

The developers then took a long hard look at how to make this more efficient on such large systems. As described before, since operations on the cache are read-intensive, there was a thought to leverage reader-writer primitives to optimize locking. However, any changes to this spinlock had to be validated extensively before releasing publicly as they may have a drastic impact if incorrectly implemented.

The implementation of the reader / writer version of this spinlock was an intricate effort and was done carefully to ensure that we do not accidentally affect any other functionality. We are glad to say that the final outcome, of what started as a late night investigation in the SQLCAT lab, has finally landed as an improvement which you can use! If you download and install Cumulative Update 2 for SQL Server 2016 RTM, you will observe two new spinlocks in the sys.dm_os_spinlock_stats view:

  • LOCK_RW_CMED_HASH_SET
  • LOCK_RW_SECURITY_CACHE

These are improved reader/writer versions of the original spinlocks. For example, LOCK_RW_CMED_HASH_SET is basically the replacement for CMED_HASH_SET, the spinlock which was the bottleneck in the above case.

Click through for the full story.

Comments closed