Press "Enter" to skip to content

Curated SQL Posts

Stored Functions in MySQL

Robert Sheldon continues a series on MySQL:

In the previous three articles in this series, I focused on creating basic database objects that you can use to get started with MySQL. You learned how to build an initial database and then add tables, views, and stored procedures. In this article, I cover one more important type of object, the stored function, a routine that is stored in a database and can be invoked on-demand, similar to a user-defined scalar function in SQL Server or other database systems.

I’m not familiar enough with MySQL’s stored functions to know if they have the same performance limitations as scalar UDFs in SQL Server (specifically, needing to execute the function body once for each row), but based on a comment at the end of Robert’s post, it does seem that way.
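For reference, here’s a minimal sketch of creating and invoking a stored function, run from Python via mysql-connector-python; the connection details, schema, and function are all invented for illustration:

    # Sketch: create and call a simple MySQL stored function.
    # Assumes a reachable MySQL server and the mysql-connector-python
    # package; the credentials, schema, and function are all invented.
    import mysql.connector

    conn = mysql.connector.connect(user="app", password="secret",
                                   host="localhost", database="demo")
    cur = conn.cursor()

    # A deterministic scalar function, analogous to a scalar UDF
    # in SQL Server: inputs in, one value out.
    cur.execute("""
        CREATE FUNCTION order_total(price DECIMAL(10,2), qty INT)
        RETURNS DECIMAL(12,2) DETERMINISTIC
        RETURN price * qty;
    """)

    # Invoked on demand, per row, like any built-in function.
    cur.execute("SELECT order_total(19.99, 3);")
    print(cur.fetchone()[0])   # 59.97

    conn.close()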


Partial Update Operations in Cosmos DB

Hasan Savran partially deflates the partial update bubble:

Partial Update was one of the most wanted features by Cosmos DB customers. In a regular update operation, you need to send the whole JSON document to Cosmos DB. This can be silly if your data model is large and you want to update one field in it. With a regular update, your request object will be large because you need to send the whole data model. Regular Update operation needs more resources from the client/SDK and network bandwidth.

You might think that partial updates might cost fewer request units. Unfortunately, this is not the case, because Cosmos DB still needs to open the JSON document, change the necessary properties, and save the data. Cosmos DB uses almost the same amount of CPU and memory whether the operation is a regular update or a partial update.

That a partial update costs just about as much as a full write does reduce its value. Still, there is some value in reducing bandwidth requirements or in making changes when you don’t know the entire contents of the document up front.
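The bandwidth savings are easy to see in the SDK. Here’s a minimal sketch using the azure-cosmos Python package’s patch_item method (available since version 4.3); the account, container, and document here are placeholders:

    # Sketch: patch a single property instead of replacing the document.
    # Assumes the azure-cosmos package (4.3+); endpoint, key, and the
    # document shape are placeholders.
    from azure.cosmos import CosmosClient

    client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
    container = client.get_database_client("sales").get_container_client("orders")

    # Only the patch operations go over the wire, not the whole JSON
    # document -- though the server-side RU cost is about the same.
    container.patch_item(
        item="order-1001",
        partition_key="customer-42",
        patch_operations=[
            {"op": "replace", "path": "/status", "value": "shipped"},
        ],
    )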


Thoughts on Technical Interview Questions

Steve Jones shares some thoughts:

Redgate had a discussion recently among our developers about our interview process and questions. There has been a standard question asking candidates about 2D arrays, but as one developer pointed out, we don’t use these in our code base. So, why do we ask candidates about this topic?

The developers came up with a different question, actually a series of questions that ask about a class and then how to test parts of this class. We mostly work in C# in a DevOps culture, so this seemed like a good idea. They proposed a scenario with a few questions and then asked current developers to solve the questions and give feedback on the language, structure, and difficulty of the problem.

Read on for Steve’s thoughts. It’s been a minute since I’ve given an interview (a plus side to having a really stable pair of database teams the past couple of years), but one of the things I enjoy doing is taking screenshots of Management Studio in various phases of work and asking, "What do you see here? There are no right or wrong answers." I say the latter because I don’t want you to enumerate every string you see on the screen; I want you to explain what information of importance you’ve caught.

One big tip for interviewers: instead of algorithmic or gotcha questions, show actual code at the 25th, 50th, 75th, and 95th percentiles of difficulty within your code base, focusing on things a person could understand with about 20-30 lines of code and zero context. “Difficulty” can mean that the code was tough to write, is tough to maintain, or that you have introduced common (and sometimes uncommon) errors into an otherwise real segment of code. For database developers, that might include things like invalid NULL checks, incorrect assignments, etc. Ask the person to perform a code review and point out what they see that is interesting. That way, you get an opportunity to check their technical bona fides in a realistic but relatively low-pressure scenario by simulating the activities that the person would actually do on the job.
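As a hypothetical illustration of the planted-error idea for database developers, the review snippet can be as small as a query with a NULL comparison that never matches:

    # Hypothetical review snippet: ask the candidate what's wrong here.
    # The planted bug is "= NULL", which is never true in SQL; it
    # should be "IS NULL", so this query always returns zero rows.
    unshipped_orders_query = """
        SELECT order_id, order_date
        FROM orders
        WHERE shipped_date = NULL
        ORDER BY order_date;
    """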

I have more advice but I’ll save that for another day.


Movie Color Swaps in R

Mark White does some coloration switcharoos:

I also love film, and I started thinking about ways I could generate color palettes from films that use color beautifully. There are a number of packages that can generate color palettes from images in R, but I wanted to try writing the code myself.

I also wanted not just to generate a color palette from an image, but to swap it with a different color palette from a different film. This is similar to neural style transfer with TensorFlow, but much simpler. I’m one of those people who likes to joke that OLS is undefeated; I generally praise the use of simpler models over more complex ones. So instead of a neural network, I use k-means clustering to transfer the color palette of one still frame from a film onto a frame from a different movie.

There are some interesting outcomes in the post, including a mashup of 2001: A Space Odyssey’s color scheme onto Arrival, as well as Kill Bill and Dr. Strangelove. The latter reminds me of a still from the credits sequence to a 1970s movie. H/T R-Bloggers.
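The core of the technique is compact. Here’s a minimal sketch (in Python rather than R, using scikit-learn’s k-means) that learns a palette from one frame and repaints another; the file names and palette size are arbitrary:

    # Sketch: transfer one frame's k-means palette onto another frame.
    # Assumes numpy, Pillow, and scikit-learn; file names are placeholders.
    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    K = 8  # palette size

    source = np.asarray(Image.open("source_frame.jpg").convert("RGB"), dtype=float)
    target = np.asarray(Image.open("target_frame.jpg").convert("RGB"), dtype=float)

    # Learn a K-color palette from each frame's pixels.
    src_km = KMeans(n_clusters=K, n_init=10).fit(source.reshape(-1, 3))
    tgt_km = KMeans(n_clusters=K, n_init=10).fit(target.reshape(-1, 3))

    # Pair the palettes by brightness so dark maps to dark, light to light.
    src_order = np.argsort(src_km.cluster_centers_.sum(axis=1))
    tgt_order = np.argsort(tgt_km.cluster_centers_.sum(axis=1))
    palette_map = dict(zip(tgt_order, src_order))

    # Repaint: each target pixel takes its cluster's matched source color.
    labels = tgt_km.predict(target.reshape(-1, 3))
    mapped = np.array([palette_map[label] for label in labels])
    recolored = src_km.cluster_centers_[mapped].reshape(target.shape)

    Image.fromarray(recolored.astype(np.uint8)).save("recolored.png")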


File Format Throwdown

Tomaz Kastrun tries out several file formats in Azure Data Lake Storage (Gen2):

The CSV data format is an old one and very common for data tasks like import, export, and storage. But when it comes to the performance of creating, reading, and writing CSV files, how does it stand up against some other formats?

We will be benchmarking CRUD operations with different data formats, from CSV to ORC, Parquet, Avro, and others, using simple Azure data storage operations like create, write, read, and transform.

It’s important to remember that Parquet and ORC are intended to solve radically different problems than Avro. Parquet and ORC are columnar formats intended for fast, efficient aggregation, whereas Avro is intended for efficient row storage. CSV is intended for easy-to-work-with row storage.

Then, Tomaz follows up with some R:

We created Azure blob storage, connected securely using Python, and started uploading files to the blob store from SQL Server. Alongside, we compared the performance of different file types: ORC, AVRO, Parquet, CSV, and Feather. Coming to a conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with the R language. The goal is to see if the CSV file format can be replaced by a file type that is better in both performance and storage.

The Feather file format, by the way, comes from Apache Arrow and works especially well with Python and R. You might not get the same performance benefits in other languages, depending on its library support.
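If you want a rough version of the comparison locally before involving Azure storage, a few lines of pandas (with pyarrow installed for Parquet and Feather support) will do; the data frame here is synthetic:

    # Sketch: rough local write/read timings for CSV, Parquet, Feather.
    # Assumes pandas plus pyarrow; the million-row frame is synthetic.
    import time
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 10),
                      columns=[f"c{i}" for i in range(10)])

    formats = [
        ("csv",     lambda: df.to_csv("t.csv", index=False), lambda: pd.read_csv("t.csv")),
        ("parquet", lambda: df.to_parquet("t.parquet"),      lambda: pd.read_parquet("t.parquet")),
        ("feather", lambda: df.to_feather("t.feather"),      lambda: pd.read_feather("t.feather")),
    ]

    for name, write, read in formats:
        t0 = time.perf_counter(); write()
        t1 = time.perf_counter(); read()
        t2 = time.perf_counter()
        print(f"{name:8}  write {t1 - t0:6.2f}s  read {t2 - t1:6.2f}s")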


Reading Cosmos DB Data into Power BI

Gauri Mahajan loads Cosmos DB data into Power BI:

As we are going to report Cosmos DB data with Power BI, the two items we need in place are a Cosmos DB instance as well as an installation of Power BI. It is assumed that an instance of Cosmos DB – SQL API is already created with some sample data. It is also assumed that the latest version of Power BI Desktop is already installed on the local machine. One can create some sample data using the built-in scripts in a Cosmos DB instance. One can explore the data using the Data Explorer on the dashboard of the Cosmos DB instance as shown below.

Read on for the process. Stories like this are why I discount the ability of document databases to change fluidly from one document to the next—as soon as you want to analyze things across documents, you suddenly need schema and structure.


Azure Redis Cache Geo-Replication

Arun Sirpal shows how to set up geo-replication in Azure Redis Cache:

The concept of a geo-replicated partnership between a primary and a secondary node is very similar to what you may have seen with Azure SQL DB, where the primary handles all reads and writes and the changes are then pushed to the secondary (asynchronously). This is no different with Redis.

Read on to see what limitations exist and how you can set up geo-replication.
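If you’d rather script it than click through the portal, the link is created through the management plane rather than through Redis itself. Something along these lines with the azure-mgmt-redis SDK should work, though the exact method and property names here are from memory and worth verifying against the SDK reference:

    # Rough sketch: link two premium-tier caches for geo-replication
    # via the azure-mgmt-redis SDK. Resource names, and possibly some
    # of these identifiers, are assumptions -- verify against the docs.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.redis import RedisManagementClient

    client = RedisManagementClient(DefaultAzureCredential(), "<subscription-id>")

    poller = client.linked_server.begin_create(
        resource_group_name="my-rg",
        name="primary-cache",                 # handles all reads/writes
        linked_server_name="secondary-cache",
        parameters={
            "linked_redis_cache_id": "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Cache/Redis/secondary-cache",
            "linked_redis_cache_location": "westus",
            "server_role": "Secondary",       # receives async replication
        },
    )
    poller.result()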
