Press "Enter" to skip to content

October 5, 2020

Stream Processing with ksqlDB

Michael Drogalis takes us through how stream processing works with ksqlDB:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.

Click through for a demo with animations.
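If you want to poke at this yourself before (or after) clicking through, here is a minimal sketch of submitting statements to a ksqlDB server over its REST API from Python. The server address (localhost:8088), the `pageviews` topic, and the column names are all assumptions made for illustration, not anything from the linked post:

```python
# Minimal sketch: submit ksqlDB statements over the REST API.
# Assumes a ksqlDB server on localhost:8088 and a Kafka topic named
# "pageviews" carrying JSON-encoded events -- adjust to your environment.
import json
import requests

KSQLDB_URL = "http://localhost:8088/ksql"  # statement endpoint
HEADERS = {"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"}

statements = """
    CREATE STREAM pageviews_stream (user_id VARCHAR, page VARCHAR)
        WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

    CREATE TABLE pageviews_per_user AS
        SELECT user_id, COUNT(*) AS views
        FROM pageviews_stream
        GROUP BY user_id
        EMIT CHANGES;
"""

# Each persistent query is compiled into a Kafka Streams topology that
# runs continuously on the ksqlDB server, reading from and writing back
# to Kafka topics.
resp = requests.post(
    KSQLDB_URL,
    headers=HEADERS,
    data=json.dumps({"ksql": statements, "streamsProperties": {}}),
)
resp.raise_for_status()
print(resp.json())
```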


Reasons Data Science Projects Fail

Ryohei Fujimaki summarizes some of the reasons why data science projects can fail:

According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to a production environment.

Even more critical, the economic downturn caused by the COVID-19 pandemic has placed increased pressure on data science and BI teams to deliver more with less. In this down market, organizations are reassessing which AI/ML models they should develop, how to optimize resources and how to best use valuable budget dollars for maximum impact. In this type of environment, AI/ML project failure is simply not acceptable.

That 85% sounds suspiciously like the percentage of failed business intelligence and data warehouse projects, as well as the percentage of failed big data projects. It’s close enough that I’m tempted to posit an overarching idea: projects built on consolidating multiple independent data systems across several business units are liable to fail about 5/6 of the time.


Delta Lake DML Internals

Tathagata Das, et al., take us through how Delta Lake handles update, delete, and merge operations:

`DELETE` works just like `UPDATE` under the hood. Delta Lake makes two scans of the data: the first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a `DELETE` operation successfully, the old data files are not deleted — they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the `VACUUM` command.

Click through for a video as well as a blog post with the details.
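If you want to see the tombstoning behavior for yourself, here is a minimal PySpark sketch using the delta-spark package. The table path, schema, predicate, and retention window are invented for illustration and aren’t from the linked post:

```python
# Minimal sketch: DELETE on a Delta table, time travel to the old version,
# then VACUUM. Requires `pip install pyspark delta-spark`.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

builder = (
    SparkSession.builder.appName("delta-delete-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"  # hypothetical Delta table location

# Create a tiny Delta table to work against (half old rows, half new).
df = spark.range(10).withColumn(
    "event_date",
    F.when(F.col("id") < 5, F.lit("2019-12-31")).otherwise(F.lit("2020-06-01")),
)
df.write.format("delta").mode("overwrite").save(path)

# DELETE rewrites only the files that contain matching rows; the originals
# stay on disk and are tombstoned in the transaction log.
events = DeltaTable.forPath(spark, path)
events.delete("event_date < '2020-01-01'")

# The pre-delete version is still readable via time travel...
old_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(old_version.count())  # 10 rows, including the deleted ones

# ...until VACUUM physically removes files older than the retention window
# (the 7-day default, written out here explicitly as 168 hours).
events.vacuum(retentionHours=168)
```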


View Native Analysis Services Queries in Power Query

Chris Webb gives us an update on Power Query:

If you’re familiar with the topic of query folding in Power Query, you’ll know that the View Native Query right-click option in the Applied Steps pane of the Power Query Editor can be used to show the native query that is run against the data source. You may also know that there are some data sources where query folding does take place but where View Native Query remains greyed out.

Read on to see which sources now support View Native Query and to see an example of this in action.


ADF and Self-Hosted Integration Runtime Config Errors

Teo Lachev points out a common issue with using the Azure Data Factory self-hosted integration runtime:

You’ve set up the Azure Data Factory self-hosted integration runtime to access on-prem data sources. You create a linked service, click Test Connection, and then get greeted with an error saying the security context can’t be passed. On the on-prem VM, you use the Integration Runtime Configuration Manager and get a similar error, or something to the effect that JSON can’t be parsed. You spend a few hours trying everything that comes to mind, such as checking firewalls and connectivity from SSMS, but nothing helps.

Read on for the solution.
