Day: February 27, 2025

Orchestrating Data Pipelines in R with maestro

Will Hipson moves some data:

If you look at data orchestration tools today you are bombarded with a dizzying array of software platforms that claim unsurpassed processing capability, AI-readiness, elegant UIs, etc. Apache Airflow is just one example of a popular orchestration platform that scales to meet virtually any orchestration need. And while these claims may be true, I argue it is rarely the case that these gargantuan platforms are needed in the first place. For most data engineers, you probably only need to process a moderate amount of data at a moderate time scale. Moreover, if you’re an R user, you don’t want to have to define your data pipelines using drag-and-drop tools or learn another programming language. Not only will this reduce cloud costs but also development time costs.

Click through to see why Will developed maestro and how it works. H/T R-Bloggers.

Table Compaction in Apache Spark

Miles Cole groups things together:

If there's anything that data engineers agree about, it’s that table compaction is important. Often one of the first big lessons that folks learn early on is that not compacting tables can cause serious performance issues: you’ve gotten your lakehouse pilot approved, it’s been running for a couple of months in production, and you find that both reads and writes are getting slower and slower while your data volumes have not increased drastically. Guess what: you almost surely have a “small file problem”.

What engineers won’t always sing the same tune on is how and when to perform table compaction.

Read on for a dive into the power of compaction (converting a large number of small files into a small number of large files) and plenty of tips along the way.
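
To make the "many small files into few large files" idea concrete, here is a minimal sketch of planning a compaction pass. It is not Miles's method or anything Spark does internally: the function name `plan_compaction`, the 128 MB target, and the file sizes are all made up for illustration, and the grouping is a simple next-fit heuristic.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Group file sizes into batches; each batch would be rewritten as one larger file.

    Hypothetical helper for illustration only. Uses a next-fit heuristic:
    keep adding files to the current batch until the next one would exceed the target.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    current_total = 0.0
    for size in file_sizes_mb:
        # Close the current batch if this file would push it past the target size.
        if current and current_total + size > target_mb:
            batches.append(current)
            current, current_total = [], 0.0
        current.append(size)
        current_total += size
    if current:
        batches.append(current)
    return batches


# Made-up input: 100 files of 4 MB each.
small_files = [4.0] * 100
plan = plan_compaction(small_files, target_mb=128.0)
print(len(small_files), "files ->", len(plan), "files")
```

In a real lakehouse you would lean on the table format's own compaction tooling rather than hand-rolling this; the sketch only shows why the file count drops so sharply once small files are grouped toward a target size.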

Thoughts on Scaling Elasticsearch

Vivek Kumar can’t stop at one:

As modern applications evolve to serve increasing needs for real-time data processing and retrieval, the need for scalability grows, too. Elasticsearch is an open-source, distributed search and analytics engine that is very efficient at handling large data sets and high-velocity queries. However, scaling Elasticsearch effectively can be nuanced, since it requires a proper understanding of the underlying architecture and of the performance tradeoffs involved.

Click through for those considerations and the trade-offs you might see.

Step Outputs to Help Troubleshoot Failed SQL Agent Jobs

Jim Evans gives us a reminder:

When troubleshooting SQL Agent jobs, often the Job history output is truncated or poorly formatted, making it hard to read. This is especially true when calling SSIS Packages, running jobs like DBCC CheckDB or when running T-SQL code that returns a lot of output. Are there options to get more readable Job output to aid in troubleshooting?

There are a few settings here that we can use to make troubleshooting SQL Agent jobs a little bit easier. In addition to these, it’s also a good idea to retain more history for longer, especially if you’re not in a position to track those job outputs each day.

Microsoft Fabric February 2025 Feature Round-Up

Patrick LeBlanc tells us what’s new:

There are a lot of exciting features for you this month! Here are some highlights: In Power BI, Explore from Copilot visual answers which lets you do easy ad-hoc exploration. In Data Warehouse, Browse files with OPENROWSET (Preview) and Copilot for Data Warehouse Chat (Preview). For Data Science, AI Skill is now conversational.

These are just some of the great features this month; keep reading to learn about everything that’s happened in Fabric.

Click through for the full report.
