Press "Enter" to skip to content

Author: Kevin Feasel

ETL with Spark and Hive

Emrah Mete gives us an example of using Apache Spark for ETL into Apache Hive:

Now let’s move on to building the sample application. In the example, we will first send the data from our Linux file system to the storage layer of the Hadoop ecosystem, HDFS (the Extract step). Then we will read that data back with Spark, apply a simple Transformation, and write the result to Hive (the Load step). Hive is an infrastructure that lets us query data stored in the Hadoop ecosystem; with it, we can easily query the data in our big data environment using SQL.

Most of what relational database professionals do is pretty much what you do with Spark and Hive. There are differences in implementation and in the level of programming familiarity required, but the workflows are quite similar.
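
As a rough sketch of the flow Emrah describes (the file path, database, and table names here are all hypothetical), the read-transform-write cycle in Spark with Hive support looks something like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object HdfsToHiveEtl {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark write tables that Hive can query directly
    val spark = SparkSession.builder()
      .appName("HdfsToHiveEtl")
      .enableHiveSupport()
      .getOrCreate()

    // Extract: read the file previously copied from the Linux FS into HDFS
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/orders.csv") // hypothetical path

    // Transform: a simple filter and projection
    val highValue = orders
      .filter(col("amount") > 1000)
      .select("order_id", "customer_id", "amount")

    // Load: save as a Hive table, queryable with plain SQL afterwards
    highValue.write.mode("overwrite").saveAsTable("sales.high_value_orders")

    spark.stop()
  }
}
```

Once the job finishes, the table is queryable with plain SQL (SELECT * FROM sales.high_value_orders) from any Hive-connected client.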


Choosing Between Management Studio and Azure Data Studio

Brent Ozar gives us the lay of the tooling land:

Every time there’s a new release of SQL Server or SQL Server Management Studio, you can grab the latest version of SSMS and keep right on keepin’ on. Your job still functions the same way using the same tool, and the tool keeps getting better.

And it’s free. You don’t have to ask the boss for upgrade money. You can just download it, install it, and take advantage of things like the cool new estimated-vs-actual numbers in execution plans (which also cause presenters everywhere to curse, knowing that they have to redo a bunch of screenshots).

I spend a lot of time jumping back & forth between SQL Server and Postgres, and lemme just tell you, the tooling options on the other side of the fence are a hot mess.

Yeah, Management Studio is the best of the bunch. I’m using Azure Data Studio more at home but still need a couple of plugins to use it often at work. And those two beat pretty much every other tool I’ve ever worked with.


Scripting with Variables in Bash

Kellyn Pot’vin-Gorman shows how easy it can be to write Bash scripts with variables:

Let’s start with a use case of deploying an Azure database. When a customer is making the decision to build it out, there is specific information needed to deploy it, and this will continue to change as the Azure catalog is updated with new offerings. For our example, we’ll stick to a very small snippet of code, as the values we dynamically create will be reused throughout the script. This example will skip past the actual server creation, etc., and just focus on the user database creation. The server, zone, and subscription are all set in the default steps earlier on so as not to have to repeat them throughout each resource deployment step.

There’s a lot to Bash, and its programming guide runs to quite a few sheets of paper (ask me how I know), but this is one of those places where you can get a nice benefit easily.
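
Kellyn’s post uses Bash, but the underlying pattern (set the shared values once, then build every deployment command from them) carries over to any language. Here is an analogous sketch in Scala, shelling out to the Azure CLI with made-up resource names:

```scala
import scala.sys.process._

object DeployDb {
  def main(args: Array[String]): Unit = {
    // Shared values, set once and reused across every deployment step
    val resourceGroup = "demo-rg"        // hypothetical resource group
    val serverName    = "demo-sqlserver" // hypothetical server name
    val dbName        = "demo-db"        // hypothetical database name

    // Build the az CLI invocation from the variables above
    val cmd = Seq(
      "az", "sql", "db", "create",
      "--resource-group", resourceGroup,
      "--server", serverName,
      "--name", dbName,
      "--service-objective", "S0"
    )

    // Run the command, streaming its output to the console
    val exitCode = cmd.!
    if (exitCode != 0) sys.error(s"Deployment of $dbName failed")
  }
}
```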


Using dbatools for Inventory Analysis

Andreas Schubert gives us a way to learn more about our SQL Server inventories with dbatools:

With the multitude of environments that I am operating, it’s impossible to remember every server, every database, or the multiple different ways they interact with each other. Therefore, one of the first things I do when taking over a consulting engagement is to map out all those different bits of information.

Since the environments usually change pretty fast, my goal is to automate this process as much as possible.

In this series of posts, I will try to show you how I am implementing this. Of course, your requirements or implementations may differ, but hopefully this blog post can give you some ideas about your tasks too.

Click through for a script. There are also some good comments.


READPAST In Action

Erik Darling shows how READPAST is no panacea:

Locking hints can be really handy in these situations, especially the READPAST hint. The documentation for it says that it allows you to skip over row level locks (that means you can’t skip over page or object level locks).

What it leaves out is that your READPAST query may also need to try to take row level shared locks.

Read on for an example as well as an alternative which ends up being better in this case.
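
Erik’s demo is worth reading in full; just to show the shape of the hint, here’s a minimal sketch run from Scala over JDBC, with a hypothetical table and connection string:

```scala
import java.sql.DriverManager

object ReadPastDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string; adjust for your own instance
    val url  = "jdbc:sqlserver://localhost;databaseName=TestDb;user=sa;password=changeme"
    val conn = DriverManager.getConnection(url)
    try {
      // READPAST skips rows that other sessions hold row-level locks on;
      // page- and object-level locks will still block the query
      val rs = conn.createStatement().executeQuery(
        "SELECT id, payload FROM dbo.LockingDemo WITH (READPAST);"
      )
      while (rs.next()) println(rs.getInt("id"))
    } finally conn.close()
  }
}
```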


SQL Server 2019 CTP 2.5

The SQL Server team has a new CTP out:

We’re excited to announce the monthly release of SQL Server 2019 community technology preview (CTP) 2.5. SQL Server 2019 is the first release of SQL Server to closely integrate Apache Spark™ and the Hadoop Distributed File System (HDFS) with SQL Server in a unified data platform.

This is a big one for me: lots of changes in Big Data Clusters, PolyBase on Linux, and a Java SDK. Looks like I am going to be pretty busy.


Why Optimize for Ad Hoc Workloads

Randolph West explains why optimize for ad hoc workloads should be enabled by default:

Enabling the optimize for ad hoc workloads configuration setting will reduce the amount of memory used by all query plans the first time they are executed. Instead of storing the full plan, a stub is stored in the plan cache. Only when that plan executes again is the full plan stored in memory. What this means is that any plan which runs more than once pays a small overhead on its second execution.

Read the whole argument. I don’t know that I’ve seen an instance yet where this setting was a really bad choice.
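
If you want to flip the setting on, it’s a server-level sp_configure option. A quick sketch of enabling it from Scala over JDBC (connection details hypothetical):

```scala
import java.sql.DriverManager

object EnableAdHocOptimization {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string; adjust for your own instance
    val url  = "jdbc:sqlserver://localhost;user=sa;password=changeme"
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      // It is an advanced option, so expose those first
      stmt.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
      // First execution now caches only a stub; the full plan is cached on the second
      stmt.execute("EXEC sp_configure 'optimize for ad hoc workloads', 1; RECONFIGURE;")
    } finally conn.close()
  }
}
```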


Testing an Event-Driven System

Andy Chambers takes us through how to test an event-driven system:

Each distinct service has a nice, pure data model with extensive unit tests, but now with new clients (and consequently new requirements) coming thick and fast, the number of these services is rapidly increasing. The testing guardian angel who sometimes visits your thoughts during your morning commute has noticed an increase in the release of bugs that could have been prevented with better integration tests.

Finally after a few incidents in production, and with velocity slowing down due to the deployment pipeline frequently being clogged up by flaky integration tests, you start to think about what you want from your test suite. You set off looking for ideas to make really solid end-to-end tests. You wonder if it’s possible to make them fast. You think about all the things you could do with the time freed up by not having to apply manual data fixes that correct for deploying bad code.

At the end of it all, hopefully you’ll arrive here and learn about the Test Machine.

Check it out. Testing these types of systems is certainly possible, but it can be a bit difficult because of the additional layers of complexity.
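
The core idea behind the Test Machine (treat the whole system as a function from input events to output events, then assert on the outputs) can be sketched without any Kafka machinery at all. A toy illustration in Scala, with made-up event types standing in for serialized topic records:

```scala
// Toy event model; the real thing would be Kafka topics and serialized records
sealed trait Event
final case class OrderPlaced(orderId: String, amount: BigDecimal) extends Event
final case class InvoiceIssued(orderId: String, amount: BigDecimal) extends Event

object BillingService {
  // The service under test, modeled as a pure function over event streams
  def handle(inputs: List[Event]): List[Event] =
    inputs.collect { case OrderPlaced(id, amount) => InvoiceIssued(id, amount) }
}

object BillingServiceTest extends App {
  // Feed a known input stream, then assert on the full output stream
  val in  = List(OrderPlaced("o-1", BigDecimal(42)))
  val out = BillingService.handle(in)
  assert(out == List(InvoiceIssued("o-1", BigDecimal(42))), s"unexpected output: $out")
  println("billing test passed")
}
```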


Monads and Monoids and Functors

Anmol Sarna explains the concept of a monad:

In functional programming, a monad is a design pattern that allows structuring programs generically while automating away boilerplate code needed by the program logic.

To simplify the above definition a bit more, we can think of monads as wrappers: you just take an object and wrap it with a monad.

Let’s just be clear on one thing: a monad is not a class or a trait, nor is it dedicated only to the Scala language. It is a concept related to functional programming.

This also includes a few examples in Scala.
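
To make the wrapper framing concrete, here’s the standard Option monad in Scala: flatMap does the unwrap-and-rewrap boilerplate, so the happy path reads straight through, and a for comprehension is just sugar over the same calls.

```scala
object MonadDemo extends App {
  // A lookup that may fail, so the result comes back wrapped in Option
  val inventory = Map("apple" -> 3, "pear" -> 0)

  def stock(item: String): Option[Int] = inventory.get(item)
  def halve(n: Int): Option[Int]       = if (n % 2 == 0) Some(n / 2) else None

  // flatMap sequences the wrapped computations; any None short-circuits
  val result: Option[Int] = stock("apple").flatMap(n => halve(n + 1))

  // The same pipeline with for-comprehension sugar
  val sugared: Option[Int] = for {
    n <- stock("apple")
    h <- halve(n + 1)
  } yield h

  println(result)  // Some(2)
  println(sugared) // Some(2)
}
```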


CosmosDB Continuation Tokens

Hasan Savran walks us through the idea of a continuation token in CosmosDB:

In CosmosDB, the TOP option is required and its default value is 100. You can change the default value by sending a different value in the request header “x-ms-max-item-count“. If you have 40,000 rows in your Orders table and run the same query in CosmosDB, you will get 100 rows (documents) rather than 40,000. CosmosDB returns all kinds of metadata with the data; you can find this metadata in the response headers. One of those headers is “x-ms-continuation“, and it is responsible for delivering the rest of the rows of your query. If you would like to get the next set of results, you can take the “x-ms-continuation“ value from the response headers and attach it to your next request. The CosmosDB SDK does this automatically for you: the SDK checks for the x-ms-continuation value when you check the HasMoreResults property. If this property is true, that means CosmosDB returned a continuation token.
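
The loop the SDK runs on your behalf is easy to picture. Below is a sketch in Scala with a simulated fetchPage standing in for the real REST call; the real token travels in the x-ms-continuation header, and everything else here is made up:

```scala
object ContinuationDemo extends App {
  // One page of results plus the token for the next page, if any
  final case class Page[A](items: List[A], continuation: Option[String])

  // Simulated data store standing in for CosmosDB documents
  private val docs     = (1 to 250).map(i => s"order-$i").toList
  private val pageSize = 100 // CosmosDB's default x-ms-max-item-count

  // Hypothetical stand-in for the REST call: the token is just an offset here;
  // the real service sends and receives it via the x-ms-continuation header
  def fetchPage(token: Option[String]): Page[String] = {
    val offset = token.map(_.toInt).getOrElse(0)
    val items  = docs.slice(offset, offset + pageSize)
    val next   =
      if (offset + pageSize < docs.length) Some((offset + pageSize).toString)
      else None
    Page(items, next)
  }

  // Keep requesting pages until no continuation token comes back
  @annotation.tailrec
  def fetchAll(token: Option[String] = None, acc: List[String] = Nil): List[String] = {
    val page = fetchPage(token)
    page.continuation match {
      case Some(next) => fetchAll(Some(next), acc ++ page.items)
      case None       => acc ++ page.items
    }
  }

  println(s"fetched ${fetchAll().length} documents in pages of $pageSize")
}
```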

I have fanciful notions of SQL Server offering something similar—think of a grid built from a query. Get the first 50 rows from the result set and store that off in tempdb somewhere, using the “continuation token” (which might just be the full name in tempdb) and auto-trashing after a certain amount of time.
