Flattening JSON Data With Databricks

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.

JSON Output And SSIS

Stacia Varga works around an oddity in the way SSIS reads JSON outputs:

What happened? The T-SQL produces the correct results in SQL Server Management Studio (SSMS). However, in SSIS, the same T-SQL statement in an OLE DB Source in a Data Flow Task produces two rows of data which adds a line feed into the flat file and renders the JSON unusable.

The problem is visible even before sending output to the flat file.

Click the link to see how Stacia solves this problem.

JSON Data In SSIS

Stacia Varga shows a few methods for handling JSON data in SQL Server Integration Services:

And then I had to write about it in my book Introducing Microsoft SQL Server 2016 (which is free to download) when JSON support was added to SQL Server 2016. But I still didn’t have clients using JSON. It was interesting to me that I could use SQL Server to work with JSON data, but it was still theoretical to me rather than practical.

Therefore, I never thought much about how I would handle it in SQL Server Integration Services (SSIS). I just didn’t have a reason.

Until now. This seems to be the year that I am bumping into JSON left and right. It’s everywhere!

Read on for those methods as well as Stacia’s recommendation.

Loading JSON-Based Data Into SQL Server From .NET

Chris Koester has a quick example demonstrating one way take JSON data from .NET code and load it into SQL Server:

Next we need to create a stored procedure that will accept JSON text as a parameter and insert it into the table. Two important points here:

  • JSON text must use the NVARCHAR(MAX) data type in SQL Server in order to support the JSON functions.

  • The OPENJSON function is used to convert the JSON text into a rowset, which is then inserted into the previously created table.

The whole process is quite easy; check it out.

Using JSON_MODIFY To Modify Existing JSON

Jovan Popovic shows off the JSON_MODIFY function in SQL Server:

Recently I found this question on stack overflow. The problem was in appending a new JSON object to the existing JSON array:

UPDATE TheTable
SET TheJSON = JSON_MODIFY(TheJSON, 'append $', N'{"id": 3, "name": "Three"}')
WHERE Condition = 1;

JSON_MODIFY function should take the array value from TheJSON column (the first argument), append the third argument into the first argument, and write the appended array back in TheJSON column.

However, the unexpected results in this case is the fact that JSON_MODIFY didn’t appended a JSON object {"id": 3, "name": "Three"}to the array. Instead, JSON_MODIFY appended a new JSON string literal  "{\"id\": 3, \"name\": \"Three\"}" to the end of the array.

This might be surprising result if you don’t know how JSON_MODIFY function works.

Read on to see how JSON_MODIFY works and why this doesn’t quite do what the poster thought.

Multi-Object JSON Arrays In SQL Server

Kevin Feasel

2017-12-19

JSON

Bert Wagner shows how to build JSON arrays in SQL Server:

When using FOR JSON PATH, ALL rows and columns from that result set will get converted to a single JSON string.

This creates a problem if, for example, you want to have a column for your JSON string and a separate column for something like a foreign key (in our case, HomeId). Or if you want to generate multiple JSON strings filtered on a foreign key.

The way I chose to get around this is to use CROSS APPLY with a join back to our Home table — this way we get our JSON string for either Cars or Toys created but then output it along with some additional columns.

Impedance mismatch?  What impedance mismatch?

DataRow To JSON With Powershell

Rob Sewell shows how to convert a .NET DataRow into its JSON form using Powershell:

I wanted to be able to Mock $variable. I wrapped the code above in a function, let’s call it Run-Query

Which meant that I could easily separate it for mocking in my test. I ran the code and investigated the $variable variable to ensure it had what I wanted for my test and then decided to convert it into JSON using ConvertTo-Json

Read on to see the fun mess that ConvertTo-Json made and then Rob’s simplification.

Matrix Transposition In T-SQL

Phil Factor has some fun transposing a matrix using T-SQL:

What I’m doing is simply converting the table into its JSON form, and then using this to create a table using the multi-row VALUES  syntax which paradoxically allows expressions. The expression I’m using is JSON_Value, which allows me do effectively dictate the source within the table, via that JSON Path expression, and the destination. As it is an expression, I can do all sorts of manipulation as well as a transpose.  I could, if I wanted, (in SQL 2017)provide that path parameter as a variable. This sort of technique can be used for several other reporting purposes, and it is well-worth experimenting with it because it is so versatile.

That is not at all what I would have thought up; very interesting approach.  I’d probably just be lazy and shell out to R Services.

JSON Parsing Performance

Kevin Feasel

2017-11-14

JSON

Jovan Popovic has a couple posts showing JSON parsing performance in SQL Server 2017.  First up, he makes the statement that JSON parsing can be 10x faster than XML:

I’m running 10000 iterations and getting the average spent time.

Average time to parse JSON and get the value using JSON_VALUE() function is around 3 microseconds while the equivalent action with XML typed variable and value() method takes between 30 and 40 microseconds.

This shows that parsing JSON “plain text” is 10x faster than parsing strongly typed XML variable.

He also compares JSON parsing to spatial type parsing:

I’m running 100.000 iterations and getting the average spent time in both cases. The results are:

  1. Spatial type needs around 11-14 microseconds to get the X coordinate
  2. JSON parser needs around 1-2 microseconds to get the value from X property (including CAST to float)

We can see that parsing JSON text is much faster than the equivalent operation in Spatial type.

JSON uses NVARCHAR type that is the most optimized type in SQL Server because it is used in most of the queries in SQL Server. JSON parser is based on T-SQL parser that is also very optimized. Spatial has a type, but it is still some binary serialized object that needs to be deserialized when we need to access the fields. In this case, scanning NVARCHAR is much faster than deserializing binary data into spatial.

On this side, Jovan does note that there are benefits to using spatial types, like performing spatial calculations.

Multi-Structured Data In U-SQL

Kevin Feasel

2017-09-06

JSON, U-SQL

Melissa Coates shows us how to use U-SQL to normalize JSON files in which different rows may have differing structures:

Handling the varying formats in U-SQL involves a few steps if it’s the first time you’ve done this:

  1. Upload custom JSON assemblies  [one time setup]

  2. Create a database   [one time setup]

  3. Register custom JSON assemblies   [one time setup]

  4. Upload JSON file to Azure Data Lake Store [manual step as an example–usually automated]

  5. Run U-SQL script to “standardize” the JSON file(s) into a consistent CSV column/row format

Melissa then shows us how to do this step-by-step.

Categories

June 2018
MTWTFSS
« May  
 123
45678910
11121314151617
18192021222324
252627282930