Author: Kevin Feasel

PayPal’s Data Contract Template Open Sourced

Jean-Georges Perrin makes an announcement:

A data contract is a binding agreement between the consumers and producers of data. You can see it as a data schema on steroids or data schema++. The goal of the contract is to set expectations between the parties. It can be built as fit-for-purpose where the consumers and producer agree on what it should contain or can serve as a brochure for any consumer willing to access the data offered by this (data) product.

Click through to learn more about data contracts and then check out the contract template itself on PayPal’s GitHub repo.
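
To make the "data schema on steroids" idea concrete, here is a minimal Python sketch of a contract that pairs column rules with expectations a bare schema does not carry, such as an owner and a freshness target. Every name in it (ColumnRule, DataContract, max_staleness_hours, the payments dataset) is hypothetical and chosen for illustration; it does not reflect the layout of PayPal's template.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ColumnRule:
    """Expectations for one column. All field names are hypothetical."""
    name: str
    dtype: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    """A toy contract: schema rules plus metadata a bare schema lacks."""
    dataset: str
    owner: str
    max_staleness_hours: int
    columns: tuple[ColumnRule, ...]

    def violations(self, record: dict[str, Any]) -> list[str]:
        """Return human-readable problems found in a single record."""
        problems = []
        for rule in self.columns:
            value = record.get(rule.name)
            if value is None:
                # Optional columns may be absent; required ones may not.
                if rule.required:
                    problems.append(f"{rule.name}: missing required value")
            elif not isinstance(value, rule.dtype):
                problems.append(
                    f"{rule.name}: expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
        return problems


# Illustrative contract a producer might publish (assumed values).
contract = DataContract(
    dataset="payments",
    owner="producer-team@example.com",
    max_staleness_hours=24,
    columns=(
        ColumnRule("txn_id", str),
        ColumnRule("amount", float),
        ColumnRule("note", str, required=False),
    ),
)
```

A consumer could call contract.violations(record) on incoming rows and reject or quarantine any row that returns a non-empty list, while the producer uses the same object as the published statement of what it promises.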

Diffify Updates

Myles Mitchell celebrates a year of diffify:

We’ve just passed an important milestone for diffify: our app for tracking Python and R package releases has just turned 1 year old! To mark this exciting occasion we are delighted to announce an “anniversary update” featuring numerous quality of life improvements. This post will outline the latest changes and tease at some exciting developments in the works…

Check out these recent changes and a little bit of what’s on the horizon.

Documenting Group Policy Objects with PowerShell

Patrick Gruenauer builds a report:

Active Directory Group Policies (GPOs) enable you to control user and computer settings. It is important to document them. In this blog post I am going to show you two PowerShell commands which create a GPO HTML Report. Let’s dive in.

To store all GPO Settings from all GPOs in one file run this command. Don’t forget to provide your domain name and the path of the report file.

Click through for that code snippet, as well as another one which builds an HTML report for each GPO.

Adding Help to Your PowerShell Code

Robert Cain helps those who help themselves:

Having good help is vital to the construction of a module. It explains not only how to use a function, but the purpose of the module and even more.

Naturally I’ve included good help text in the ArcaneBooks module, but as I was going over the construction of the ArcaneBooks module I realized I’d not written about how to write help in PowerShell. So in this post and the next I’ll address this very topic.

Read on for Robert’s thoughts on the topic, including standard ways to add content comments so PowerShell’s built-in Get-Help cmdlet works for you.

WaitTime in Power BI

Chris Webb explains what a new metric means:

What does WaitTime represent? Here’s the technical explanation: it’s the wait time on the query thread pool in the Analysis Services engine before the query starts to run. But what does this mean for you as someone trying to tune DAX queries in Power BI?

Chris provides an explanation of exactly that. This tracking of noisy neighbors is interesting, as it would provide insight if you’re noticing variance in dataset refresh times.

Azure Synapse Analytics April 2023 Updates

Ryan Majidimehr has an update for us:

Low Shuffle Merge optimization for Delta tables is now available in Apache Spark 3.2 and 3.3 pools. You can now update a Delta table with advanced conditions using the Delta Lake MERGE command. It can update data from a source table, view, or DataFrame into a target table. The current algorithm of the MERGE command is not optimized for handling unmodified rows. With Low Shuffle Merge optimization, unmodified rows are excluded from expensive shuffling execution and written separately.

To learn more about this new command, read Low Shuffle Merge optimization on Delta tables.

Looks like a bit of work on Data Explorer pools and a little bit on Spark pools and Synapse Link to Cosmos DB to round out the month.

Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. Worst Best decision of my life.

Extending a tidyAML and Shiny App

Steven Sanderson wraps up a series on Shiny and tidyAML. Part 3 extends options for regression:

As data science continues to be a sought-after field, creating a reliable and accurate model is essential. While there are various machine learning algorithms available, the process of selecting the correct algorithm can be complex. The {tidyAML} package, part of the tidymodels suite, offers an easy-to-use, consistent interface for building machine learning models. In this post, we will explore a Shiny application that utilizes tidyAML to build a machine learning model.

Today I have updated the tidyAML shiny app to include the ability to set the .parsnip_fns parameter of the fast_regression() function, which takes values such as linear_reg.

And part 4 includes classification:

This is a Shiny app for building models using the {tidyAML} package, which is based on the tidymodels package in R. The app allows you to upload your own data or choose from one of two built-in datasets (mtcars or iris) and select the type of model you want to build (regression or classification).

Let’s take a closer look at the code.

This was an interesting series, for sure.

Performing a Cloud Adoption Security Review

Daniel Margetic takes a look:

Security is an ongoing journey of incremental progress and maturity, and not a static destination. The Cloud Adoption Framework provides security guidance for this journey by providing clarity to the processes and best practices. This guidance is based on real world experiences of our customers, of Microsoft’s own security journey and lessons learned, and the work with other organizations like NIST (National Institute of Standards and Technology) or CIS (Center for Internet Security).

The outcome is manifested in the Cloud Adoption Framework Secure Methodology which provides a vision of the complete end state of your security journey and follows the Zero Trust principle (assume breach, verify explicitly, use least privilege access).

This assessment gives you the opportunity to self-assess your security journey of your cloud adoption against this secure methodology.

Read on to learn more about how CASRs work and how you can perform one yourself.

Creating Your First PySpark Application

Dustin Vannoy gives us a primer on Apache Spark:

Get hands on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own PySpark application. Pro tip: Search for the Spark equivalent of functions you use in other programming languages (including SQL). Many will exist in the pyspark.sql.functions module.

In addition to the code listing, Dustin has a video walking us through the process.
