Press "Enter" to skip to content

Day: April 28, 2023

Understanding the Kafka Partitioner

Bill Bejeck talks partitions:

Apache Kafka is the de facto standard for event streaming today. Part of what makes Kafka so successful is its ability to handle tremendous volumes of data, with a throughput of millions of records per second, not unheard of in production environments. One part of Kafka’s design that makes this possible is partitioning.  

Kafka uses partitions to spread the load of data across brokers in a cluster, and it’s also the unit of parallelism; more partitions mean higher throughput. Since Kafka works with key-value pairs, getting records with the same key on the same partition is essential.  

Read on to learn a bit about how that partitioning works and why it’s important for application design, especially across multiple programming languages.

Comments closed

Creating a Clickable Word Cloud with Shiny

Mandy Norrbo builds a word cloud:

Word clouds are a visual representation of text data where words are arranged in a cluster, with the size of each word reflecting its frequency or importance in the data set. Word clouds are a great way of displaying the most prominent topics or keywords in free text data obtained from websites, social media feeds, reviews, articles and more. If you want to learn more about working with unstructured text data, we recommend attending our Text Mining in R course

Usually, a word cloud will be used solely as an output. But what if you wanted to use a word cloud as an input? For example, let’s say we visualised the most common words in reviews for a hotel. Imagine we could then click on a specific word in the word cloud, and it would then show us only the reviews which mention that specific word. Useful, right?

Read on to see how you can create one of these.

Comments closed

Hybrid ML and Rules-Based Fraud Detection

Ayodeji Ogunlami mixes approaches:

In developing this hybrid system, sets of rules are required as well as a machine learning model. I would be making use of a vehicle insurance dataset from Kaggle in this demonstration.

The dataset can be downloaded from this link: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection

The ML model would be built using a random forest classifier on Azure Databricks using Pyspark.

This seems to be the most sensible approach, especially given how rare actual fraud incidents are and what that imbalance does to classification algorithms.

Comments closed

Knitting R-Markdown Files into Google Docs

Benjamin Smith makes a Google Doc:

RMarkdown is a powerful framework for writing a documents that contain a mixture of text, code, and the output of the code. Popular output formats for RMarkdown Documents (.Rmd) include HTML, PDF and Word Documents. It is also possible to output RMarkdown documents as part of a static website using blogdown package and is (still!) possible to publish RMarkdown documents to WordPress sites as well (like this one)!

Recently, I started to look into the possibility of outputting an .Rmd file as a Google Doc, but I was unable to locate any out-of-box solutions. After looking into the issue I developed a small function that makes it possible!

Click through for that function.

Comments closed

Portfolio Management for Creating a Technology Strategy

Kevin Sookocheff busts out the 2×2 matrix:

Application Portfolio Management (APM) draws inspiration from financial portfolio management, which has been around since at least the 1970s. By looking at all applications and services in the organization and analyzing their costs and benefits, you can determine the most effective way to manage them as part of a larger overall strategy. This allows the architect or engineering leader to take a more strategic approach to managing their application portfolio backed by data. Portfolio management is crucial for creating a holistic view of your team’s technology landscape and making sure that it aligns with business goals.

This is for C-levels and VPs rather than individual contributors, but acts as a good way of thinking about a portfolio of applications and what to do with each.

Comments closed

Converting JSON to a Relational Schema with KQL

Devang Shah does some flattening and moving:

In the world of IoT devices, industrial historians, infrastructure and application logs, and metrics, machine-generated or software-generated telemetry, there are often scenarios where the upstream data producer produces data in non-standard schemas, formats, and structures that often make it difficult to analyze the data contained in these at scale. Azure Data Explorer provides some useful features to run meaningful, fast, and interactive analytics on such heterogenous data structures and formats. 

In this blog, we’re taking an example of a complex JSON file as shown in the screenshot below. You can access the JSON file from this GitHub page to try the steps below.

Click through for the example, which is definitely non-trivial.

Comments closed