Month: October 2024

In this blog, I’m going to delve into ingesting and transforming application logs into a Log Analytics workspace using the Log Ingestion API. Before we jump into the details, let’s explore the different ingestion techniques available for each type of application log.

If applications and services log information to text files instead of standard logging services such as Windows Event Log or Syslog, Custom Text Logs can be leveraged to ingest those text file logs into a Log Analytics workspace.

If applications and services log information to JSON files instead of standard logging services such as Windows Event Log or Syslog, Custom JSON Logs can be leveraged to ingest those JSON file logs into a Log Analytics workspace.

If an application understands how to send data to an API, then you can leverage the Log Ingestion API to send the data to a Log Analytics workspace.

Custom Text/JSON Logs are out of scope for this blog; I might write a separate blog dedicated to these techniques later. In this blog, my focus will be on streaming data to a Log Analytics workspace using the Log Ingestion API and transforming the data for optimal usage.

Note: This blog aims to demonstrate how to ingest logs using the log ingestion API. To keep things straightforward, I’ll refer to our public documentation.

Click through for details on the process.
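
As a rough illustration of what "sending data to an API" looks like here, the following sketch builds (but does not send) an HTTP request for the Log Ingestion API. The endpoint, data collection rule ID, stream name, token, and record fields are all placeholder assumptions, and the function name is hypothetical; the linked post and the public documentation remain the authority on the real setup.

```python
import json
import urllib.request


def build_ingestion_request(endpoint, dcr_immutable_id, stream_name, token, records):
    """Build a POST request for the Log Ingestion API without sending it.

    All arguments are placeholders supplied by the caller; nothing here is
    taken from the linked post.
    """
    url = (
        f"{endpoint.rstrip('/')}/dataCollectionRules/{dcr_immutable_id}"
        f"/streams/{stream_name}?api-version=2023-01-01"
    )
    # The request body is a JSON array of records matching the stream's schema.
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values for illustration only.
request = build_ingestion_request(
    endpoint="https://example-dce.eastus-1.ingest.monitor.azure.com",
    dcr_immutable_id="dcr-00000000000000000000000000000000",
    stream_name="Custom-AppLogs_CL",
    token="<access token>",
    records=[{"TimeGenerated": "2024-10-07T12:00:00Z", "Message": "service started"}],
)
```

Passing the request to urllib.request.urlopen would perform the upload, assuming a valid token and an existing data collection rule.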

Comments closed

Tablespaces in Oracle and PostgreSQL

Published 2024-10-07 by Kevin Feasel

Umair Shahid explains how tablespaces work in Oracle and PostgreSQL:

Tablespaces play an important role in database management systems, as they determine where and how database objects like tables and indexes are stored. Both Oracle and PostgreSQL have the concept of tablespaces, but they implement them differently based on the overall architecture of each database.

Oracle’s tablespaces are an integral part of the database that provide various functionalities, including separating data types, managing storage, and optimizing performance. PostgreSQL, on the other hand, takes a more simplified approach, using tablespaces primarily to control where physical files are stored.

This blog aims to provide a comprehensive comparison between Oracle and PostgreSQL tablespaces, covering their architecture, creation, and practical use cases, with the goal of helping DBAs better understand their capabilities and limitations.

Read on to learn more about how tablespaces work in each platform and how they differ.

An Overview of Differential Privacy

Published 2024-10-04 by Kevin Feasel

Zachary Amos covers a topic of note:

Data analytics tools allow users to quickly and thoroughly analyze large quantities of material, accelerating important processes. However, individuals must ensure they maintain privacy while doing so, especially when working with personally identifiable information (PII).

One possibility is to perform de-identification methods that remove pertinent details. However, evidence has suggested such options are not as effective as once believed. People may still be able to extract enough information from what remains to identify particular parties.

Read on to learn a bit more about the impetus behind differential privacy and a few of the techniques you can use to get there. The real trick with differential privacy is adding the right kind of noise: enough to prevent an end user from unearthing the information needed to identify a specific individual, but not so much that it distorts the distribution of the data.
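
To make the idea of "the right kind of noise" a little more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is not taken from the linked article; the function names, the sample data, and the epsilon value are all assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-centered Laplace sample.

    The difference of two independent exponential draws with rate 1/scale
    follows a Laplace distribution with the given scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy the predicate."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1, so the
    # query's sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38]  # hypothetical data
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon means a larger noise scale and stronger privacy, at the cost of less accurate answers; averaged over many draws, the noisy count still centers on the true count.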

Splitting Data into Equally-Sized Groups in R

Published 2024-10-04 by Kevin Feasel

Steven Sanderson splits out some data:

As a beginner R programmer, you’ll often encounter situations where you need to divide your data into equal-sized groups. This process is crucial for various data analysis tasks, including cross-validation, creating balanced datasets, and performing group-wise operations. In this comprehensive guide, we’ll explore multiple methods to split data into equal-sized groups using different R packages and approaches.

Click through for methods and examples.
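
For readers who want the general idea before clicking through, here is a small sketch of one way to split data into groups whose sizes differ by at most one. It is written in Python rather than R and does not reproduce any of the methods in the linked post; the function name is hypothetical.

```python
def split_into_groups(items, n_groups):
    """Split items into n_groups contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional item each.
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


groups = split_into_groups(list(range(10)), 3)
```

With ten items and three groups, the first group receives four items and the other two receive three each.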

Spaces in Microsoft Fabric Delta Table Names

Published 2024-10-04 by Kevin Feasel

Sandeep Pawar is looking for a bit more space:

One of the annoying limitations of Direct Lake (or rather, of the SQL endpoint) was that you could not have spaces in table and column names in the delta table. It was supported in the delta table, but the table was not queryable in the SQL endpoint, which meant you had to rename all the tables and columns in the semantic model with business-friendly names (e.g. rename customer_name to Customer Name). Tabular Editor and Semantic Link/Labs were helpful for that.

But at #FabConEurope, support for spaces in table names was announced and is supported in all Fabric engines. You have to use the backtick to include spaces, as shown below.

Read on to learn more about how you can create these, what the limitations are, and then you can decide whether it’s worth it to have spaces in table names.

Transforming Queries Based on Human Intent

Published 2024-10-04 by Kevin Feasel

Andrei Lepikhov and Alena Rybakina ask a question:

As usual, this project was prompted by multiple user reports with typical complaints, like ‘SQL Server executes the query times faster’ or ‘Postgres doesn’t pick up my index’. The underlying issue that united these reports was frequently used VALUES sequences, typically transformed in the query tree into a SEMI JOIN.

I also want to argue one general question: Should an open-source DBMS correct user errors? I mean optimising a query even before the search for an optimal plan begins, eliminating self-joins, subqueries, and simplifying expressions – everything that can be achieved by proper query tuning. The question is not that simple since DBAs point out that the cost of query planning in Oracle overgrows with the complexity of the query text, which is most likely caused, among other things, by the extensive range of optimisation rules.

My short answer is, yes. SQL is a 4th generation language, meaning that end users describe the results they need but leave it to the engine to determine how to get there. As performance tuners, we may understand some of the foibles of the database engine and how it does (or does not) perform these translations, but in an ideal world, every unique representation of an end state for a given query should have the same, maximally optimized internal way of getting there. This is impossible in practice, but it should be a guiding principle for engine behavior.

Testing a Stored Procedure with tSQLt

Published 2024-10-04 by Kevin Feasel

Olivier Van Steenlant runs a test:

In the previous data recipe, Create a Test Class & First Unit Test with tSQLt, we created our very first T-SQL Unit Test to test the database collation. In this data recipe, we will test the execution of a Stored Procedure. Specifically, we will validate what happens when a new User is added to the user dimension.

Click through to see how it all works.

Working with the Apache Flink Table API

Published 2024-10-03 by Kevin Feasel

Martijn Visser takes us through the Flink Table API:

Apache Flink® offers a variety of APIs that provide users with significant flexibility in processing data streams. Among these, the Table API stands out as one of the most popular options. Its user-friendly design allows developers to express complex data processing logic in a clear and declarative manner, making it particularly appealing for those who want to efficiently manipulate data without getting bogged down in intricate implementation details.

At this year’s Current, we introduced support for the Flink Table API in Confluent Cloud for Apache Flink® to enable customers to use Java and Python for their stream processing workloads. The Flink Table API is also supported in Confluent Platform for Apache Flink®, which launched in limited availability and supports all Flink APIs out of the box.

This introduction highlights its capabilities, how it integrates with other Flink APIs, and provides practical examples to help you get started. Whether you are working with real-time data streams or static datasets, the Table API simplifies your workflow while maintaining high performance and flexibility. If you want to go deeper into the details of how Table API works, we encourage you to check out our Table API developer course.

Read on to learn more about how the Table API works in comparison to other interfaces.

An Overview of LightGBM

Published 2024-10-03 by Kevin Feasel

Vinod Chugani continues a series on tree-based classification techniques:

LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, this powerful algorithm is known for its unique ability to handle large volumes of data with significant ease compared to traditional methods.

In this post, we will experiment with the LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies—Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages. Through this post, we will compare their performance and characteristics.

Read on to learn more about LightGBM as an algorithm, as well as how to use it.
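
As a rough picture of what Gradient-based One-Side Sampling does, the sketch below keeps the rows with the largest gradient magnitudes, randomly samples from the remainder, and up-weights the sampled rows to compensate. This is a simplified illustration and not LightGBM's implementation; the function name, parameter names, and gradient values are assumptions.

```python
import random


def goss_sample(gradients, top_rate, other_rate, rng):
    """Return (row_index, weight) pairs selected in the style of GOSS."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = round(top_rate * n)
    other_k = round(other_rate * n)
    top = order[:top_k]
    rest = order[top_k:]
    sampled = rng.sample(rest, min(other_k, len(rest)))
    # Up-weight the sampled small-gradient rows so they stand in for the
    # whole small-gradient population.
    weight = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in sampled]


rng = random.Random(42)
grads = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, -0.4, 0.01, 0.6, -0.03]  # hypothetical
selected = goss_sample(grads, top_rate=0.2, other_rate=0.3, rng=rng)
```

The design choice is that rows with large gradients are the ones the model currently fits worst, so they are always kept, while the well-fit rows are only sampled.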

Plotting the ROC Curve in Microsoft Fabric

Published 2024-10-03 by Kevin Feasel

Tomaz Kastrun gets plotting:

The ROC (Receiver Operating Characteristic) curve is a graph that shows how a classifier performs by plotting the true positive and false positive rates. It is used to evaluate the performance of binary classification models by illustrating the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at various threshold settings.

Read on to see how you can generate one in a Microsoft Fabric notebook. Tomaz also plots a density function for additional fun.
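
The TPR and FPR trade-off described above can be computed by hand in a few lines. The following is a minimal sketch with made-up labels and scores, not the code from the linked post; the function names are hypothetical, and it assumes at least one positive and one negative label.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, highest threshold first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Approximate the area under the ROC points with the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


labels = [1, 1, 0, 1, 0, 0]               # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical model scores
points = roc_points(labels, scores)
auc = area_under_curve(points)
```

For this toy data the area under the curve comes to 8/9, or about 0.889, because eight of the nine positive/negative pairs are ranked in the correct order.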
