Day: May 2, 2023

Extending a tidyAML and Shiny App

Steven Sanderson wraps up a series on Shiny and tidyAML. Part 3 extends options for regression:

As data science continues to be a sought-after field, creating a reliable and accurate model is essential. While there are various machine learning algorithms available, the process of selecting the correct algorithm can be complex. The {tidyAML} package, part of the tidymodels suite, offers an easy-to-use, consistent interface for building machine learning models. In this post, we will explore a Shiny application that utilizes tidyAML to build a machine learning model.

Today I have updated the tidyAML Shiny app to include the ability to set the .parsnip_fns parameter of the fast_regression() function, which takes values like linear_reg.

And part 4 includes classification:

This is a Shiny app for building models using {tidyAML}, which is based on the tidymodels package in R. The app allows you to upload your own data or choose from one of two built-in datasets (mtcars or iris) and select the type of model you want to build (regression or classification).

Let’s take a closer look at the code.

This was an interesting series, for sure.

Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. Worst Best decision of my life.

Creating Your First PySpark Application

Dustin Vannoy gives us a primer on Apache Spark:

Get hands on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own PySpark application. Pro tip: Search for the Spark equivalent of functions you use in other programming languages (including SQL). Many will exist in the pyspark.sql.functions module.

In addition to the code listing, Dustin has a video walking us through the process.
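To make that concrete, here is a rough PySpark sketch of the read / transform / write shape Dustin describes. This is not his code from the video; the file paths and column names are illustrative, and the NYC Taxi schema varies a little depending on where you download it.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-taxi-first-app").getOrCreate()

# Read: the taxi data is commonly distributed as Parquet (path is illustrative).
trips = spark.read.parquet("/data/nyc_taxi/yellow_tripdata.parquet")

# Transform: drop obviously bad rows and derive a trip-duration column,
# using pyspark.sql.functions per the pro tip above.
cleaned = (
    trips
    .where(F.col("fare_amount") > 0)
    .withColumn(
        "trip_minutes",
        (F.unix_timestamp("tpep_dropoff_datetime")
         - F.unix_timestamp("tpep_pickup_datetime")) / 60.0,
    )
    .withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
)

# Write: save the curated output partitioned by pickup date.
cleaned.write.mode("overwrite").partitionBy("pickup_date").parquet("/data/nyc_taxi/curated/")

spark.stop()

Run it with spark-submit or paste it into a notebook, then swap in whichever free dataset you pick afterward.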

Performing a Cloud Adoption Security Review

Daniel Margetic takes a look:

Security is an ongoing journey of incremental progress and maturity, and not a static destination. The Cloud Adoption Framework provides security guidance for this journey by providing clarity to the processes and best practices. This guidance is based on real world experiences of our customers, of Microsoft’s own security journey and lessons learned, and the work with other organizations like NIST (National Institute of Standards and Technology) or CIS (Center for Internet Security).

The outcome is manifested in the Cloud Adoption Framework Secure Methodology, which provides a vision of the complete end state of your security journey and follows the Zero Trust principles (assume breach, verify explicitly, use least privilege access).

This assessment gives you the opportunity to self-assess the security journey of your cloud adoption against this secure methodology.

Read on to learn more about how CASRs work and how you can perform one yourself.

Modifying Dynamic Format Strings in Power BI

Gilbert Quevauvilliers tries out a neat feature:

I was recently testing out and using the great new Power BI feature of dynamic format strings.

What I found is that currently it is not possible, or at least not easy, to modify an existing dynamic format string.

In this blog post below I show you how I managed to modify the dynamic format string, so that you do not need to DELETE it and re-create it!

Click through to learn how. This works really well for currency-style scenarios, like the one Gilbert shows. I could also see it providing different levels of precision based on how large the value is.

Sparse Columns in SQL Server

Chad Callihan occasionally inserts something:

Have you ever maxed out the SQL Server table column limit yet still needed more columns? Hopefully not considering SQL Server has a max limit of 1024 columns per table. But as I found out, it’s possible for someone to reach out and ask for even more. Sparse columns are an option to consider when you can’t get enough. Let’s take a look at what sparse columns are and how they can be used.

Sparse columns have very little utility, except in the most “I don’t think you’re doing it right” scenarios. Still, if you happen to end up in that scenario, there is a way out. That said, I’d really want to understand the nature of the data in that problem and, knowing only the amount of detail in the scenario that I do, would lean toward storing the data either in an unpivoted fashion (one row per entity/attribute pair in an EAV-style “additional attributes” table) or as a JSON string and letting the client sort it out.
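For reference, here is a quick T-SQL sketch of the two shapes side by side: a table using sparse columns versus the unpivoted, EAV-style alternative mentioned above. Table and column names are made up for illustration.

-- Sparse columns: NULLs consume no storage, at the cost of extra overhead on non-NULL values.
CREATE TABLE dbo.ProductAttributes_Sparse
(
    ProductID int NOT NULL PRIMARY KEY,
    Color     varchar(20)   SPARSE NULL,
    WeightKg  decimal(10,2) SPARSE NULL
);

-- Unpivoted, EAV-style alternative: one row per entity/attribute pair.
CREATE TABLE dbo.ProductAdditionalAttributes
(
    ProductID      int NOT NULL,
    AttributeName  sysname NOT NULL,
    AttributeValue nvarchar(4000) NULL,
    CONSTRAINT PK_ProductAdditionalAttributes PRIMARY KEY (ProductID, AttributeName)
);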

Log Tokenization and Reduction in Azure Data Explorer

Brian Bønk tries out some new functions:

Before the release described below, the ADX service had a good handful of features to help with anomaly detection and clustering on semi-structured data.

With functions like basket() and autocluster(), the service can find patterns based on common values across the columns. The problem with these functions is that they are not able to parse free-text columns and extract tokens and repeatable patterns.

Yes, you could use the diffpatterns_text() function, but that is not strong enough to cover the real diversity of free-text log records.

It’s interesting that the end result is looking for log entries whose shape differs from the norm. That’s a clever approach to log file analysis.

Tips for Using psql

Ryan Booz shows off a tool:

Having access to the psql command-line tool is essential for any developers or DBAs that are actively working with and connecting to PostgreSQL databases. In our first article, we discussed the brief history of psql and demonstrated how to install it on your platform of choice and connect to a PostgreSQL database.

In this article we’ll get you up and running with all of the essential things you need to know to start on your journey to becoming a psql power-user. From basic command syntax to the most common (and helpful) meta-commands, it’s all covered throughout the rest of the article.

Also check out the comments for a link to a pager which works with psql.
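If you have not spent much time in psql, a handful of meta-commands go a long way. The sampler below comes from psql's built-in help rather than the article itself, so the set the authors highlight may differ.

\?             show help for all meta-commands
\l             list databases on the server
\c dbname      connect to a different database
\dt            list tables in the current search path
\d table_name  describe a table's columns, indexes, and constraints
\x auto        use expanded output for wide result sets
\timing on     print execution time after each query
\q             quit psql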
