Press "Enter" to skip to content

Month: May 2022

Movie Color Swaps in R

Mark White does some coloration switcharoos:

I also love film, and I started thinking about ways I could generate color palettes from films that use color beautifully. There are a number of packages that can generate color palettes from images in R, but I wanted to try writing the code myself.

I also wanted to not just generate a color palette from an image, but then swapping it with a different color palette from a different film. This is similar to neural style transfer with TensorFlow, but much simpler. I’m one of those people that likes to joke how OLS is undefeated; I generally praise the use of simpler models over more complex ones. So instead of a neural network, I use k-means clustering to transfer a color palette of one still frame from a film onto another frame from a different movie.

There are some interesting outcomes in the post, including a mashup of 2001: A Space Odyssey’s color scheme onto Arrival, as well as Kill Bill and Dr. Strangelove. The latter reminds me of a still from the credits sequence to a 1970s movie. H/T R-Bloggers.

Comments closed

File Format Throwdown

Tomaz Kastrun tries out several file formats in Azure Data Lake Storage (Gen2):

CSV data format is an old format and very common for data tasks, like import, export or storing. And when it comes performance of creating CSV file, reading and writing CSV files, how does it still stand against some other formats.

We will be looking at benchmarking the CRUD operations with different data formats; from CSV to ORC, Parquet, AVRO and others with the simple Azure data storage operations, like Create, Write, read and transform.

It’s important to remember that Parquet and ORC are intended to solve radically different problems than Avro. Parquet and ORC are columnar datasets intended to aggregate quickly and efficiently, whereas Avro is intended for efficient row storage. CSV is intended for easy-to-work-with row storage.

Then, Tomaz follows up with some R:

we have created Azure blob storage, connected secure connection using Python and started uploading files to blob store from SQL Server. Alongside, we compared the performance of different file types. ORC, AVRO, Parquet, CSV and Feather. Coming to conclusion, CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with R language. The goal is to see, if CSV file format can be replaced by a file type that better, both in performance and storage.

The Feather file format, by the way, comes from Apache Arrow and works especially well with Python and R. You might not get the same performance benefits in other languages, depending on its library support.

Comments closed

Reading Cosmos DB Data into Power BI

Gauri Mahajan loads Cosmos DB data into Power BI:

As we are going to report Cosmos DB data with Power BI, the two items we need in place are a Cosmos DB instance and well as an installation of Power BI. It is assumed that an instance of Cosmos DB – SQL API is already created with some sample data. It is also assumed that the latest version of Power BI Desktop is already installed on the local machine. One can create some sample data using the built-in scripts in a Cosmos DB instance. One can explore the data using the Data Explorer on the dashboard of the Cosmos DB instance as shown below.

Read on for the process. Stories like this are why I discount the ability of document databases to change fluidly from one document to the next—as soon as you want to analyze things across documents, you suddenly need schema and structure.

Comments closed

Azure Redis Cache Geo-Replication

Arun Sirpal shows how to set up geo-replication in Azure Redis Cache:

The concept of a geo-replicated partnership between a primary and secondary node is very similar to that of something you may have seen with Azure SQL DB, where the primary handles all R/W and then the changes are pushed to secondary ( async). This is no different with Redis.

Read on to see what limitations exist and how you can set up geo-replication.

Comments closed

Explaining Key Terms in Category Theory

Gulshan Singh describes three tricky terms for newcomers to functional programming:

Monoid is based on an associative function. Formally, a functor is a type F[A] with an operation
map with type (A => B) => F[B]. In functional programming one typically only deals with one category, the category of types. A functor is an interface with one method i.e a mapping of the category to category. Monads basically is a mechanism for sequencing computations. A monad is a way to wrap stuff, then operate on the wrapped stuff without unwrapping it.

If that wasn’t too clear, check out the post for more detail. And if you want a whole lot more detail, Bartosz Milewski’s YouTube series (and book) on category theory are great resources for dozens of hours of learning.

Comments closed

Azure SQL DB ARM Template Conflicts with Azure AD Administration

Joao Antunes points out a potential timing issue around combining Azure Active Directory administration with Azure SQL Database ARM templates:

ARM templates are widely used when we need to repeatedly deploy solutions/infrastructures in the cloud. Leveraging the concept of infrastructure as code ARM templates are a powerful resource to ease our daily job, however we might face some challenges when using them.

When we are creating several resources within the same template – using Json or Bicep – it’s crucial to make sure that all resources are created in the right order, ensuring that all depending on resources are fully provisioned before you move to the next operation.

Error (internal server errors) and conflicts  can occur during our ARM template deployment and it could be difficult to troubleshoot or understand the root cause of them.

Read on for one annoying error and its fix.

Comments closed

Debugging Powershell Modules in Visual Studio Code

David Wilson takes us through debugging Powershell code in Visual Studio Code:

Microsoft has a created a great Powershell extension for VS Code that makes it easy to work with Powershell files. Once the plugin is installed you get intellisense, syntax highlighting, and even visual debugging for files that have a Powershell extension. This debugger can be easily started by pressing F5 when you’re in a Powershell script. Stopping execution on a particular line is as easy as setting a breakpoint by clicking next to a line number.

If you’re coming at debugging from Visual Studio, the VS Code is a little more complicated than what’s in Visual Studio, at least in terms of things to set up first. Once you’ve tried it out a couple of times, though, it’s a pretty convenient experience.

Comments closed

Data Products in Data Mesh

Paul Andrew takes us through a thought process:

In the context of an idealistic data mesh architecture, establishing a working definition of a data product seems to be very real problem for most. What constitutes a data product seems to be very subjective, circumstantial in terms requirements and interlaced with platform technical maturity. AKA, a ‘minefield’ to navigate in definitional terms.

To help get my thoughts in order (as always) here is my currently thinking and definition for a data mesh – data product.

Read on for Paul’s thoughts.

Comments closed