Spark: DataFrame To RDD For Data Cleansing

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.
For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.
As a result, our schema would look like:
root
 |– headers: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
 |– requestBody: string (nullable = true)
Instead of
root
 |– headers: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
 |– requestBody: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
When trying to explode a “string” type, we will get a miss type error:
org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10

Click through to see how to handle this scenario cleanly.

Visualization Over Kafka And KSQL

Shant Hovsepian shows off a data visualization tool which can read Kafka Streams data:

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We’ve been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they will create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been a misnomer until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

Click through for a couple repos and demos.

Using The Public Role

Kenneth Fisher explains the public role in SQL Server:

A common misunderstanding is that the CONNECT permission lets you do more than just connect to a database. It doesn’t. Connection only. So how come there are some things that everyone can do once they are connected to a database? Well, it’s the public role. Everyone is a member and that can’t be changed. In fact, you can’t even disable it. Oh, and I should point out that every database has one.

So what does that mean? If you have a table that you want everyone to have read access to you could grant the permission in public.

I never use the public role for anything, and so it’s a benign role.  I strongly dislike database security tools which flag the public role as a risk, mostly because I made the mistake once of believing the tool and had to start granting things like CONNECT to each new login.

Executing Dynamic SQL: EXEC vs sp_executesql

Kevin Feasel

2018-05-24

T-SQL

Tara Kizer gives a scenario where using sp_executesql can result in lower CPU utilization than EXEC:

For frequently executed queries, “EXEC sp_executesql @sql”is a good choice but at the risk of encountering parameter sniffing issues.

For queries that aren’t executed very often, “EXEC (@sql)” can be used, but you could end up with a bloated plan cache with lots of single-use execution plans.

Let’s compare the performance difference between “EXEC (@sql)” and “EXEC sp_executesql @sql” for a frequently executed query.

For application code, I tend to lean on sp_executesql very heavily to create parameterized queries.

Some Waits Just Need Ignoring

Paul Randal explains that not all SQL Server wait types are pernicious:

Wait statistics analysis is one of my favorite things to talk about because it’s so incredibly useful for performance tuning and can dramatically shorten the time it takes to zero in on the root cause of a performance problem. But you have to do it correctly. You can’t just do a SELECT * FROM sys.dm_os_wait_stats. Various people have published scripts online to aggregate and display wait statistics in an actionable way, and my script is one of the most popular (latest version is always in this post).

One question I’m often asked is why does my script have a list of wait types that it specifically filters out? The answer is that those wait types are what I call ‘benign’ – they’re usually not a problem but happen frequently enough from regular SQL Server operations that they would show up as the top waits and so would obscure the waits that you can do something about.

Read on for the rest of the story.

Scoping And Powershell Script Blocks

Shane O’Neill looks at variable scope within script blocks in Powershell:

Now what happens if we don’t want to use switch? What happens if we tried to do this with script blocks instead?

I’ve created 3 different script blocks here just for ease of use. If you want to create a single script block to do this then I encourage it! Let me know how you get on.

We’re also passing in the parameter into the script block by passing it in the brackets of the .InvokeReturnAsIs() method.

Read on and learn along with Shane.

VLFs: How Many Are Too Many?

Randolph West looks at a baseline for the maximum number of Virtual Log Files for a database:

In a transaction log with too many or too few VLFs we might experience performance issues under a normal workload, as well as during the backup and restore process.

So what is the “right” amount? In customer engagements, I follow a guideline proposed by Glenn Berry of SQLskills.com in his Diagnostic Information Queries, to keep the number of VLFs at or below 200. In my opinion, any number higher than that is cause for concern.

On the other hand, Brent Ozar Unlimited has a popular script called sp_Blitz which proposes a maximum VLF count of 1000. To Brent, a number higher than that is cause for concern.

I tend toward the lower number, but if you have a smoothly-functioning environment and some databases have 700 or 900 VLFs, I probably wouldn’t give it a second thought.

Tracking Deployment Details

Andy Leonard tells a story whose moral is that you need to keep track of what you deploy:

But this had to be done.
Right now.

I thanked Geoff and hung up the phone. I then made another judgment call and exercised yet more of my ETL Architect authority. I assigned the PrUAT ticket to myself, logged into PrUAT, executed the patch, copied the output of the execution to the Notes field of the ticket (as we’d trained all DBAs and Release Management people to do), and then manually verified the patch was, in fact, deployed to PrUAT.

I closed the ticket and called my boss. “Done. And verified,” I said. My boss replied, “Good,” and hung up. He passed the good news up the chain.

A funny thing happened the next morning. And by “funny,” I mean no-fun-at-all. My boss called and asked, “Andy? I thought you said the patch was was deployed to PrUAT.” I was a little stunned, grappling with the implications of the accusation. He continued, “The process failed again last night and vendor checks were – again – not cut.” I finally stammered, “Let me check on it and get back to you.”

It’s a good story and really sells the idea that you have to track deployment details, including when you’re doing manual deployments.

Categories

May 2018
MTWTFSS
« Apr Jun »
 123456
78910111213
14151617181920
21222324252627
28293031