Press "Enter" to skip to content

Month: January 2020

Choosing Categorical Features with Python

Mesfin Gebeyaw shows how to use Multiple Correspondence Analysis to filter categorical variables for an analysis:

A general guide to interpreting the multiple correspondence analysis plot shown above for business insights would be to make a note as to how close input categorical features are to the target variable customer churn and to each other. For instance, senior citizens, customers with fiber optic internet service, those with month to month contractual agreements, and single customers or customers with no dependents are being related to a short tenure with the company and a propensity of high risk to churn. On the other hand, customers with more than a year contract, those with DSL internet service, younger customers, and customers with multiple lines are being related to a long tenure with the company and a higher tendency to stay with the company.

Read the whole thing.
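I don't know which MCA implementation Mesfin uses, but the heart of the technique is correspondence analysis applied to a one-hot indicator matrix. Here is a rough numpy sketch on an invented churn-style table (column names, levels, and data are all made up) showing how the category coordinates behind such a plot are computed; levels that land near the target's levels are the candidates worth keeping:

```python
import numpy as np
import pandas as pd

# Hypothetical churn-style categorical data; columns and levels are made up
df = pd.DataFrame({
    "contract": ["month-to-month", "one-year", "two-year", "month-to-month"],
    "internet": ["fiber", "dsl", "dsl", "fiber"],
    "churn":    ["yes", "no", "no", "yes"],
})

# Indicator (disjunctive) matrix: one 0/1 column per category level
dummies = pd.get_dummies(df)
Z = dummies.to_numpy(dtype=float)

# Correspondence matrix with row and column masses
P = Z / Z.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)

# SVD of the standardized residuals gives the MCA dimensions
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
_, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Category coordinates on the first two dimensions; levels plotting close
# to a churn level are candidate predictors to keep
coords = np.diag(1 / np.sqrt(c)) @ Vt.T[:, :2] * sing[:2]
print(pd.DataFrame(coords, index=dummies.columns, columns=["dim1", "dim2"]))
```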

Flink in Cloudera Streaming Analytics

Dinesh Chandrasekhar announces support for Apache Flink in Cloudera Streaming Analytics:

We cannot hold our excitement anymore! For the last few months, our Data-in-Motion engineering teams have been working hard to deliver a compelling and critical part of our Cloudera DataFlow (CDF) story. To enhance our Stream Processing and Analytics narrative within the overall Data-in-Motion platform, we give you support for Apache Flink with the general availability of Cloudera Streaming Analytics (CSA).

Cloudera Streaming Analytics, powered by Apache Flink, is a new product offering within the Cloudera DataFlow (CDF) platform that provides real-time stateful processing of IoT-scale data streams and complex events for predictive insights. Cloudera DataFlow (as seen in the picture below) is a comprehensive edge-to-cloud real-time streaming data platform. As one of the key pillars of CDF, stream processing & analytics is important for processing millions of data points and complex events coming from various streaming sources. Over the years, we have supported several streaming engines, but the addition of Flink now makes CDF an extremely compelling platform for processing high volumes of streaming data at high scale.

This adds support for Flink; it looks like Spark Streaming and Kafka Streams are also supported, though Cloudera is pushing Flink as the first option rather than as one among equals.
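If you want a feel for Flink without standing up CSA, recent versions of PyFlink let you sketch a stateful aggregation locally. This is only a toy example with invented sensor readings, not CSA-specific code:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Local streaming TableEnvironment
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical IoT readings registered as an in-memory table
readings = t_env.from_elements(
    [("sensor-1", 21.5), ("sensor-2", 30.1), ("sensor-1", 22.0)],
    ["sensor_id", "temperature"],
)
t_env.create_temporary_view("readings", readings)

# Stateful aggregation: running max temperature per sensor,
# emitted as a changelog stream
result = t_env.sql_query(
    "SELECT sensor_id, MAX(temperature) AS max_temp "
    "FROM readings GROUP BY sensor_id"
)
result.execute().print()
```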

Using AI Builder in Power Automate

Leila Etaati takes us through a text classification problem:

Text classification is one of the important tasks, with the aim of classifying texts based on their allocated tags. In the previous blog, the process of how to create a text classification model in Power Apps using AI Builder has been explained.

In this blog post, you will see how to use the created text classification model in Power Automate (Microsoft Flow).

Read on for the demo.
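AI Builder keeps the model itself behind a low-code interface, but if you're curious what tag-based text classification looks like in code, here is a tiny scikit-learn sketch (the sample texts and tags are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: texts with their allocated tags
texts = ["refund my order", "app crashes on login",
         "great service, thanks", "cannot reset password"]
tags = ["billing", "bug", "praise", "bug"]

# TF-IDF features feeding a simple linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, tags)

# Predict the tag for a new piece of text
print(model.predict(["my payment was charged twice"]))
```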

Database Compatibility Level and Query Store

Erin Stellato gives us a moment of zen:

A question I’ve gotten a few times when teaching relates to database compatibility level and Query Store. I was talking to a client yesterday about post-upgrade plans and implementing Query Store, and the topic came up again. They wanted to know what compatibility level the database needed in order to use Query Store.

The quick answer: it doesn’t matter.

Read on for a demonstration.
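If you'd like to script a quick version of the check yourself, here is a rough pyodbc sketch; the server, driver, and TestDB database name are placeholders. The point is that Query Store turns on and collects data even at an old compatibility level:

```python
import pyodbc

# Placeholder connection string; point it at a test instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# Drop the compatibility level down, then turn Query Store on anyway
cur.execute("ALTER DATABASE [TestDB] SET COMPATIBILITY_LEVEL = 110;")
cur.execute("ALTER DATABASE [TestDB] SET QUERY_STORE = ON;")

# Confirm Query Store is active despite the old compatibility level
cur.execute(
    "SELECT actual_state_desc FROM TestDB.sys.database_query_store_options;"
)
print(cur.fetchone())
```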

Quoted Data and OPENROWSET

Dave Mason wants to remove quoted identifiers from a flat file:

I haven’t shown all the columns, but you get the idea–every column in the result set has data enclosed in double quotes. That’s exactly how it appears in the source data file.

Dave has a method which works for plenty of versions of SQL Server. If you’re using 2017 or later, the FIELDQUOTE parameter was added to solve this problem, though to be fair, I haven’t actually tried it to see if it works as expected.
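As an aside, if you just need to inspect or pre-clean such a file outside of SQL Server, Python's standard csv module strips the enclosing quotes for you. A quick sketch (the file name and layout are made up):

```python
import csv

# Hypothetical quoted flat file, e.g. "1","Smith","2020-01-06" per line
with open("people.csv", newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    for row in reader:
        # Fields come back with the enclosing double quotes already removed
        print(row)
```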

Finding Data Which Breaks Constraints

Phil Factor has a procedure to test disabled check constraints and find data which would cause an error:

However often I go on about CHECK constraints, there will always be a developer who will leave them out or mutter in a dignified manner about how all checks need to be done only at the application level. This attitude soon gets divine retribution. Bad data springs up like a rotting fungus over your database unless you add CHECK constraints to all your tables. This is fine, but then how do you prevent the excellent and estimable habit of adding them from then interfering with a release? The constraints will stop the build if they meet bad data: it is what they are trained to do. If you don’t like that, then you must fix the bad data first.

Click through for the process.
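Phil's procedure is the point of the post; as a rough point of comparison (and not what Phil uses), SQL Server's built-in DBCC CHECKCONSTRAINTS will also report rows that violate disabled or untrusted constraints. A pyodbc sketch with placeholder connection details:

```python
import pyodbc

# Placeholder connection; run against a database with disabled constraints
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=TestDB;Trusted_Connection=yes"
)
cur = conn.cursor()

# Checks every constraint, enabled or not, and returns the table name,
# constraint name, and a WHERE clause identifying each offending row
cur.execute("DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;")
if cur.description:
    for table_name, constraint_name, where_clause in cur.fetchall():
        print(table_name, constraint_name, where_clause)
```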

Tips for Using Azure Storage

James Serra takes us through Azure Data Lake Store Gen2 and Azure Blob Storage:

Azure Data Lake Store (ADLS) Gen2 should be used instead of Azure Blob Storage unless there is a needed feature that is not yet GA’d in ADLS Gen2.

The major features that are missing from ADLS Gen2 are premium tier, soft delete, page blobs, append blobs, and snapshots. The major features that are in preview are archive tier, lifecycle management, and diagnostic logs. Check out all the missing features at Known issues with Azure Data Lake Storage Gen2.

Note that underneath the covers, ADLS Gen2 uses Azure Blob Storage and is simply a layer over blob storage providing additional features (i.e. hierarchical file system, better performance, enhanced security, Hadoop compatible access).

Click through for a bullet point list of useful information.
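As a small illustration of the hierarchical namespace James mentions, the azure-storage-file-datalake SDK works with real directories and paths rather than flat blob names. The account URL, credential, file system, and paths below are all placeholders:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and credential
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<account-key>",
)

# A file system (container) holds true directories when the
# hierarchical namespace is enabled
fs = service.get_file_system_client(file_system="raw")
directory = fs.get_directory_client("sales/2020/01")
directory.create_directory()

# Upload a small file into the nested directory
file_client = directory.get_file_client("day06.csv")
file_client.upload_data(b"id,amount\n1,9.99\n", overwrite=True)
```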

Databricks Automated Deployment and Testing

Li Yu, et al, explain how to use Databricks notebooks and MLflow to automate deployment and testing of Spark solutions:

Today many data science (DS) organizations are accelerating the agile analytics development process using Databricks notebooks. Fully leveraging the distributed computing power of Apache Spark™, these organizations are able to interact easily with data at multi-terabyte scale, from exploration to fast prototyping and all the way to productionizing sophisticated machine learning (ML) models. As fast iteration is achieved at high velocity, what has become increasingly evident is that it is non-trivial to manage the DS life cycle for efficiency, reproducibility, and high quality. The challenge multiplies in large enterprises where data volume grows exponentially, the expectation of ROI is high on getting business value from data, and cross-functional collaborations are common.

In this blog, we introduce a joint work with Iterable that hardens the DS process with best practices from software development.  This approach automates building, testing, and deployment of DS workflow from inside Databricks notebooks and integrates fully with MLflow and Databricks CLI. It enables proper version control and comprehensive logging of important metrics, including functional and integration tests, model performance metrics, and data lineage. All of these are achieved without the need to maintain a separate build server.

Read on to see how.
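The MLflow side of that workflow is easy to picture: each notebook run logs its parameters, test outcomes, metrics, and supporting artifacts so the automated build has something to gate on. A minimal sketch with invented names and values:

```python
import mlflow

# One tracked run per notebook execution in the automated build
with mlflow.start_run(run_name="nightly-build"):
    mlflow.log_param("model_type", "gradient_boosting")

    # Functional / integration test outcomes logged alongside model metrics
    mlflow.log_metric("integration_tests_passed", 1)
    mlflow.log_metric("auc", 0.87)

    # Artifacts (e.g. a data lineage manifest) attach to the same run
    mlflow.log_dict({"source_table": "events_raw", "rows": 1200000},
                    "lineage.json")
```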

Querying SQL Server Replicas Under Load

Taryn Pratt points out that read replicas can contain stale data:

Last week at Stack Overflow we had an internal hack-a-thon, or as we call it, a make-a-thon. I was on the bug-bashing team, which is the team that attempts to fix smallish bugs we haven’t gotten around to fixing, due to other time-constraints. I was tagged to investigate a bug about duplicate badges being awarded because it looked to possibly be an easy fix in SQL. At first glance it looked simple enough, but once I started digging in, I figured out very quickly it wouldn’t be.

It’s an interesting problem, but there are no solutions in the post. It’s a hard one.
