Press "Enter" to skip to content

Month: June 2022

PHI De-Identification in Databricks with NLP

Amir Kermany, et al., share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.

To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.

Comments closed

Building Custom ggplot2 Palettes

Nicola Rennie busts out the beret and fancy palette board:

Choosing which colours to use in a plot is an important design decision. A good choice of colour palette can highlight important aspects of your data, but a poor choice can make it impossible to interpret correctly. There are numerous colour palette R packages out there that are already compatible with {ggplot2}. For example, the {RColorBrewer} or {viridis} packages are both widely used.

If you regularly make plots at work, it’s great to have them be consistent with your company’s branding. Maybe you’re already doing this manually with the scale_colour_manual() function in {ggplot2} but it’s getting a bit tedious? Or maybe you just want your plots to look a little bit prettier? This blog post will show you how to make a basic colour palette that is compatible with {ggplot2}. It assumes you have some experience with {ggplot2} – you know your geoms from your aesthetics.
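Nicola's post is all R, so click through for the real thing; as a rough Python analogue of the manual starting point she describes, here is a minimal sketch using plotnine (a ggplot2 port), with a made-up "brand" palette standing in for your company's colours:

```python
from plotnine import ggplot, aes, geom_point, scale_color_manual, theme_minimal
from plotnine.data import mtcars

# Hypothetical brand colours; in practice these come from your style guide.
brand_palette = ["#1B9E77", "#D95F02", "#7570B3"]

p = (
    ggplot(mtcars, aes("wt", "mpg", color="factor(cyl)"))
    + geom_point(size=3)
    + scale_color_manual(values=brand_palette)  # the manual step the post automates
    + theme_minimal()
)
p.save("branded_scatter.png")
```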

Click through to see how you can build a palette and use it across multiple ggplot2 charts.

Comments closed

Creating Line Charts in Excel

Amy Esselman builds a line chart:

A line chart is a simple graph that is familiar to most audiences. Lines are great for showing continuous data, such as plotting how the value of something changes over time. In this post, we will cover how to create a line chart in Excel, using a sample dataset from a community exercise: table takeaways. The information is about an annual corporate fundraiser to provide meals to those in need. You can download the file here to follow along as we build the line chart. 

It might be that I’ve spent too much time in Power BI, but creating charts in Excel seems a lot harder than it needs to be. This is especially true once you throw some unused columns into the mix.

Comments closed

Determining Why Constraints Are Untrusted

Tom Zika adopts a zero-trust constraint architecture:

Okay, so you went through the effort of fixing them, but the next day your constraints are not trusted again. What gives?

If you are sure none of the DBAs or developers is doing this to spite you, the most common culprit is a BULK INSERT or bulk copy tool (bcp).

One of its parameters is -h (hints), and one of those hints is CHECK_CONSTRAINTS.
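For context (this detail is mine, not from the quote): by default, bcp and BULK INSERT skip CHECK and foreign key validation during the load, which is what marks the constraints untrusted; supplying the CHECK_CONSTRAINTS hint makes the load validate rows as they arrive. A minimal sketch of invoking bcp that way from Python, with placeholder server, database, table, and file names:

```python
import subprocess

# Placeholder connection details; -T uses Windows integrated security.
subprocess.run(
    [
        "bcp", "dbo.Orders", "in", r"C:\loads\orders.dat",
        "-S", "MyServer",
        "-d", "SalesDb",
        "-T",
        "-c",                       # character-format data file
        "-h", "CHECK_CONSTRAINTS",  # validate constraints during the load
    ],
    check=True,
)
```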

Read on to see how this can mess everything up, as well as how you can track and fix it. There are cases, particularly in extremely high-write systems, where you don’t necessarily need the constraint to be enforced but want it to be there for documentation purposes; in that case, the constraints are usually disabled rather than simply untrusted. The other time I see people purposefully using untrusted constraints is when old data is garbage and essentially unfixable, but they want new data to be correct. Most of the time, though, constraints are untrusted because nobody noticed the problem.

Comments closed

Connecting to Azure SQL DB over VPN

Reitse Eskens has some routing issues:

To make sure the on-premises connection uses the VPN and the private endpoint, we need to make sure the on-premises DNS (it’s always DNS) recognizes the traffic and redirects it to the VPN connection. But whatever we tried on the firewall, the traffic kept going the wrong way. It did have something to do with the on-premises DNS setup in the end.

When we tried to connect to the Azure SQL instance by IP address, it threw an error because the instance wasn’t found. You can only connect to it with the FQDN (dbname.database.windows.net).
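As a quick illustration of the DNS angle (mine, not from Reitse's post): you can check which address the FQDN resolves to from the on-premises side, since with a working private endpoint setup it should come back as the private IP rather than the public one. A minimal Python sketch with a placeholder database name:

```python
import socket

# Placeholder FQDN; substitute your own logical server name.
fqdn = "dbname.database.windows.net"
print(fqdn, "resolves to", socket.gethostbyname(fqdn))
```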

Click through to see what the problem was and how Reitse solved it.

Comments closed

SSIS — RPC Server is Unavailable

Jon Morisi does some troubleshooting:

I just spent a long slog sorting out why I could not connect to my SSIS instance remotely.  I work in a very secure environment requiring network approval for any and all ports.  According to the following article, I was under the impression that a request to open incoming traffic on port 135, to a specific IP, would allow SQL Server Management Studio, on that specific IP, to connect remotely to SSIS:

https://docs.microsoft.com/en-us/sql/sql-server/install/configure-the-windows-firewall-to-allow-sql-server-access?redirectedfrom=MSDN&view=sql-server-ver16#BKMK_ssis

After opening port 135, I was receiving the error message in the title of this article.

If you find yourself in this situation, read on to see how Jon was able to solve the problem.

Comments closed

Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow. For example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility, as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers, including GitHub, GitLab, Bitbucket, Azure DevOps, and AWS CodeCommit.
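As a rough sketch of what this looks like in practice (assumptions: the Jobs API 2.1 git_source fields, plus placeholder workspace URL, repo, cluster ID, and token, none of which come from the announcement), a job can be created with a notebook task that pulls from a Git branch:

```python
import requests

payload = {
    "name": "nightly-etl",
    "git_source": {
        "git_url": "https://github.com/example-org/example-repo",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {
                # Path within the repo, resolved from Git rather than the workspace.
                "notebook_path": "notebooks/etl",
                "source": "GIT",
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())
```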

Read on to see how it works.

Comments closed

Understanding the Poisson Distribution

Achim Zeileis shows off my favorite statistical distribution:

The Poisson distribution has many distinctive features, e.g., both its expectation and variance are equal and given by the parameter λ. Thus, E(Y) = λ and Var(Y) = λ. Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Or the Poisson distribution can be approximated by a normal distribution when λ is large. See Wikipedia (2022) for further properties and references.

Here, we leverage the distributions3 package (Hayes et al. 2022) to work with the Poisson distribution in R. In distributions3, Poisson distribution objects can be generated with the Poisson() function. Subsequently, methods for generic functions can be used to print the objects; extract mean and variance; evaluate the density, cumulative distribution, or quantile function; or simulate random samples.
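The post itself works in R with distributions3; as a hedged Python analogue, scipy.stats exposes the same quantities (λ = 3 below is just for illustration):

```python
from scipy.stats import poisson

lam = 3.0
dist = poisson(mu=lam)

print(dist.mean(), dist.var())  # expectation and variance both equal lambda
print(dist.pmf(2))              # P(Y = 2)
print(dist.cdf(4))              # P(Y <= 4)
print(dist.ppf(0.95))           # 95% quantile
print(dist.rvs(size=5))         # five simulated draws
```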

Read on for a detailed tutorial. H/T R-bloggers.

Comments closed

Saving and Loading a Keras Model

Jason Brownlee made it to a savepoint in time:

Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk.

In this post, you will discover how you can save your Keras models to file and load them up again to make predictions.

After reading this tutorial you will know:

– How to save model weights and model architecture in separate files.

– How to save model architecture in both YAML and JSON format.

– How to save model weights and architecture into a single file for later use.
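As a minimal sketch of the calls involved (assuming TensorFlow 2.x with its bundled Keras, and a toy model standing in for a trained network; note that YAML export has been removed from recent TensorFlow releases, so only the JSON path is shown here):

```python
from tensorflow.keras.models import Sequential, load_model, model_from_json
from tensorflow.keras.layers import Dense

# Toy stand-in for a trained network.
model = Sequential([Dense(1, input_shape=(4,), activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Architecture, weights, and optimizer state in a single file.
model.save("model.h5")
restored = load_model("model.h5")

# Architecture (JSON) and weights saved separately.
with open("model.json", "w") as f:
    f.write(model.to_json())
model.save_weights("model_weights.h5")

with open("model.json") as f:
    rebuilt = model_from_json(f.read())
rebuilt.load_weights("model_weights.h5")
```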

Read on for an updated step-by-step tutorial.

Comments closed