Press "Enter" to skip to content

Author: Kevin Feasel

Chesterton’s Fence in Development Terms

Pete Warden picked a blog post title I couldn’t refuse:

This script came to mind as I was thinking back over the year for a few reasons. One of them was that I spent a non-trivial amount of time writing and debugging it, despite its small size and the apparent simplicity of the problem it tackled. Even in apparently glamorous fields like machine learning, 90% of the work is nuts and bolts integration like this. If anything you should be doing more of it as you become more senior, since it requires a subtle understanding of the whole system and its requirements, but doesn’t look impressive from the outside. Save the easier-to-explain projects for more junior engineers, they need them for their promotion packets.

The reason this kind of work is so hard is precisely because of all the funky requirements and edge cases that only become apparent when code is used in production. As a young engineer my first instinct when looking at a snarl of complex code for something that looked simple on the surface was to imagine the original authors were idiots. I still remember scoffing at the Diablo PC programmers as I was helping port the codebase to the Playstation because they used inline assembler to do a simple signed to unsigned cast. My lead, Gary Liddon, very gently reminded me that they had managed to ship a chart-topping game and I hadn’t, so maybe I had something to learn from their approach?

I am a huge fan of the concept which, made brief, states that if you do not understand why something is the case, don’t change it. If you do understand it, maybe change it but be prudent about it. It’s also something I have often trouble with, as my natural inclination toward code bases is to use the cleansing power of fire to burn it all down.

Comments closed

Using ksqlDB to Read Twitter Data

Robin Moffatt has a quick demonstration of ksqlDB:

I’m going to show you how to use ksqlDB to do the following:

– Configure the live ingest of a stream of data from an external source (in this case, Twitter)
– Filter the stream for certain columns
– Create a new stream populated only by messages that match a given predicate
– Build aggregate materialised views, and use pull queries to directly fetch the state from these

Let’s dive in! As always, you’ll find the full test rig for trying this out yourself on GitHub.

Comments closed

Negative Port Numbers in SQL Server

Jason Brimhall takes us through a bit of a mystery:

Take quick note of the port number I have circled in red. This doesn’t match the original query at all. In fact, it doesn’t come anywhere close to the actual port number. In addition, the port number shown here is a negative value. Obviously a negative port is not correct as TCP/IP ports only range from 0-65535. So what is happening here?

Read on for the answer.

Comments closed

Azure Synapse Analytics Result Set Caching

Niko Neugebauer takes us through result set caching in Azure SQL Data Warehouse Azure Synapse Analytics:

I just put some result on the output, because as you can imagine there are some certain limits on the amount of the output that will be cached and that will be not. Besides the basic logical stuff, such as having deterministic functions only (functions which output will not be varying depending on the execution), not using System Objects or UDFs (and it seems that scalar UDF inlining is not a part of Azure SQL DW yet), no row-level security or column-level security enabled, the main thing and which seems to be pretty good decision as far as I am concerned – the row size larger than 64KB won’t be cached period.

Read on to see what Niko has learned, including cache performing and limitations. Between this and the data pools in SQL Server Big Data Clusters, Microsoft’s spent some time thinking about data caching in cloud-based versions of SQL Server.

Comments closed

Powershell 7 Release Candidate

Max Trinidad has a nice update for us:

The moment everyone has been waiting for some time is here, PowerShell Release Candidate is available for download. This a “Go Live” release officially supported in production by Microsoft.

Everyone in the Microsoft PowerShell Team, with the help of the community, has done an excellent job with the evolution of this new version of PowerShell. Read all about it on the PowerShell DevBlogs recent post “Announcing the PowerShell 7.0 Release Candidate“.

Make sure to read all previous posts as they perfectly outlined under the “Why is PowerShell 7 so awesome?” section of the release candidate post.

Click through for more details. One of the nice things in this RC is a consistent Out-Gridview experience, so it’s not just for Windows anymore.

Comments closed

Using the Power BI Color Picker

David Eldersveld walks us through the Power BI color picker:

The new color picker allows colors in RGB format in addition to the hex color format that Power BI has used exclusively until now.

The new one also easily allows users to choose from a wider selection of shades and tones. This builds upon the simpler selection of hues and tints in the original.

In case you don’t know what David means, there is an excellent explanation of each term.

Comments closed

A Diagnostic Book for SQL Server

Emanuele Meazzo has a new years’s gift for us:

Welcome to 2020! I wanted to start this year by giving to all my fellow consultants another way to troubleshoot our beloved SQL Servers; I’ve already talked about diagnostic notebooks in the past, and now, since Azure Data Studio has implemented the feature, I wanted to group them into a Diagnostic Book.

As the name implies, a jupyter book is no other than a collection of notebooks (and markdown files) that groups everything in a coherent space, with an index and navigation options alike.

I think this sort of collection of notebooks (a, uh, note-book), if put together well, makes it easier to learn a new environment and understand key problems than a big Scripts.txt file or a folder full of scripts.

Comments closed

Repartitioning and Coalescing in Spark

Divyansh Jain contrasts repartitioning and coalescing in Spark:

What is Coalesce?

The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids full shuffle, instead of creating new partitions, it shuffles the data using Hash Partitioner (Default), and adjusts into existing partitions, this means it can only decrease the number of partitions.

What is Repartitioning?

The repartition method can be used to either increase or decrease the number of partitions in a DataFrame. Repartition is a full Shuffle operation, whole data is taken out from existing partitions and equally distributed into newly formed partitions.

Read on to learn good reasons to use both.

Comments closed

Performance Tuning Load of Partitioned Hive Tables on S3 with Spark

Dmitry Tolpeko walks us through a performance problem in Spark:

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake, it is quite simple but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet, while the actual Spark job took just 38 minutes to run and the remaining time was spent on loading data into a Hive partition.

Let’s see what is the reason of such behavior and how we can improve the performance.

Read on to see how.

Comments closed