The Microsoft Team Data Science Process Lifecycle Versus CRISP-DM

Melody Zacharias compares Microsoft’s Team Data Science Process lifecycle with the CRISP-DM process:

As I pointed out in my previous blog, the TDSP lifecycle is made up of five iterative stages:

  1. Business Understanding
  2. Data Acquisition and Understanding
  3. Modeling
  4. Deployment
  5. Customer Acceptance

This is not very different from the six major phases used by the Cross Industry Standard Process for Data Mining (“CRISP-DM”).

This is part of a series on data science that Melody is putting together, so check it out.

Corrupting Managed Instances

Brent Ozar has found a bug with Azure SQL Database Managed Instances:

Corruption happens. It’s just a fact of life – storage is gonna fail. Microsoft’s SLAs for storage only give you 3-4 9’s, and there’s nothing in there about never losing your data. Nothing against Azure, either – I’ve lost entire VMs in AWS due to storage corruption.

So let’s demo it. Normally, this kind of thing might be hard to do, but at the moment, DBCC WRITEPAGE is enabled (although I expect that to change before MIs hit General Availability.) I used Erik’s notorious sp_GoAheadAndFireMe to purposely corrupt the master database (not TempDB. I modified it to work with a user database instead, ran it, and in less than ten seconds, the entire instance went unresponsive.

It’s a good post, so check it out.

Collecting PRINT Outputs From Powershell

Jana Sattainathan shows how to query a number of SQL Server instances in parallel using Powershell and collecting the PRINT outputs from each:

As an example, you may have a block of SQL that PRINTs out the current privileges in the databasethat can then be saved off and used as an independent script.

In my case today, I need to collect information on the Service Pack and Cumulative Updates that need to be installed/applied to 200+ SQL Server instances. MS provides the SQL script to identify the right SP and CU to install. Please make sure you download the zip file and unzip the long SQL. I need to run this SQL against the instances to get the info. However, the information returned is completely with PRINT statements!

Click through to see how Jana did it.

For GDPR, Don’t Forget Query Monitoring Tools

Grant Fritchey points out another spot that might store personal information:

When you capture query metrics through trace events or extended events, either using rpc_completed or sql_batch_completed, you not only get the query. You also get any parameter values associated with that query. Article 17 of the GDPR is extremely clear:

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay…

While there are a list of exceptions to the definitions of Article 17 listed at the link, none of those is because the data isn’t in the database or is stored in some separate information store such as your monitoring of queries. Instead, the GDPR pretty much says that any place (SharePoint, Excel, etc.) that the data resides, must be documented as part of your processing and is subject to control through the Regulation.

Read the whole thing.

Quantifier {x,y} Following Nothing

Shane O’Neill reminds us that reading is fundamental:

Glancing at the error message, the first things that stick out are the bits “{x,y}” so I change my regex to be anywhere from 1 to  digits "*\d{1,6}$"

Why are you glancing, read the error message!

That doesn’t work, so I again quickly scan the error message and see the bit “following nothing”

“Quickly scan”?! No, actually read the error message!!

This wasn’t a great error message, but at least it does make sense after the fact and it’s a case where actually reading the error message does clarify things.  The ones I really hate are “error 0x4849f8f8” types which aren’t even intended to make any sense to us.

Data Lake Permissions

Melissa Coates has started a multi-part series on Azure Data Lake permissions.  She’s put up the first three parts already.  Part 1 covers the types of permissions available as well as some official documentation:

(1) RBAC permissions to the ADLS account itself, for the purpose of managing the resource.
RBAC = Role-based access control. RBAC are the familiar Azure roles such as reader, contributor, or owner. Granting a role on the service allows someone to view or manage the configuration and settings for that particular Azure service (ADLS in this case). See Part 2 for info about setting up RBAC.

Part 2 looks at permissions for the Azure Data Lake Store service itself:

Setting permissions for the service + the data stored in ADLS is always two separate processes, with one exception: when you define an owner for the ADLS service in Azure, that owner is automatically granted ‘superuser’ (full) access to manage the ADLS resource in Azure *AND* full access to the data. Any other RBAC role other than owner needs the data access specifically assigned via ACLs. This is a good thing because not all system administrators need to see the data, and not all data access users/groups/service principals need access to the service itself. This type of separation is true for certain other services too, such as Azure SQL Database.

Try to use groups whenever you can to grant access, rather than individual accounts. This is a consistent best practice for managing security across many types of systems.

Part 3 covers using ACLs to grant rights to specific files or folders in Azure Data Lake Storage:

There are two types of ACLs: Access ACLs and Default ACLs.

An Access ACL is the read/write/execute permissions specified for a folder or file. Every single folder or file has its security explicitly defined — so that means the ADLS security model is not an ‘inheritance’ model. That is an important concept to remember.

Default ACL is like a ‘template’ setting at a folder level (the concept of a default doesn’t apply at the file level). Any new child item placed in that folder will automatically obtain that default security setting. The default ACLs are absolutely critical, given that data permissions aren’t an inheritance model. You want to avoid a situation where a user has permission to read a folder, but is unable to see any of the files within the folder — that situation will happen if a new file gets added to a folder which has an access ACL set at the folder level, but not a default ACL to apply to new child objects.

There’s a lot of good information here and I’m looking forward to parts 4 and 5.

Exploratory Analysis With Hockey Data In Power BI

Stacia Varga digs into her hockey data set a bit more:

Once I know whether a variable is numerical or categorical, I can compute statistics appropriately. I’ll be delving into additional types of statistics later, but the very first, simplest statistics that I want to review are:

  • Counts for a categorical variable
  • Minimum and maximum values in addition to mean and median for a numerical value

To handle my initial analysis of the categorical variables, I can add new measures to the modelto compute the count using a DAX formula like this, since each row in the games table is unique:

Game Count = countrows(games)

It’s interesting seeing Stacia use Power BI for exploratory analysis.  My personal preference would definitely be to dump the data into R, but there’s more than one way to analyze a data set.

Defending Pie Charts

Bobby Johnson makes a valiant effort at defending the indefensible:

In the world of data analysis, there are few things more reviled than the pie chart. Among “serious” data people, it is at best trivial and naive, and at worst downright evil.

I do not agree with this. The pie chart is simple, but that is its beauty. It does exactly one thing and it does it well: it shows you how much different parts contribute to a whole. This isn’t the only question you ever have about your data, but when it’s the question you do have, the pie chart is perfect. That is not evil and it is not naive. It is data visualization doing what it should: taking something large and abstract and saying something simple about it that your brain can easily internalize.

I strongly disagree with arguments in the article, but do respect the attempt.  In each of the cases, at least one of a bar chart, stacked 100% bar chart, or dot plot could give at least the same amount of information with less lower mental overhead.

Cassandra To Kafka Connect

Mike Barlotta shows how to feed data into Kafka from Cassandra via Kafka Connect.  Part one involves basic setup:

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

Part 2 is around tuning the connector:

One of the problems we initially had with the Cassandra Source connector was how much data it tried to process during one polling cycle. In the original versions (0.2.5 and 0.2.6) the connector would retrieve all of the data that was inserted since the last polling cycle. For systems ingesting large amounts of data this can pose a challenge.

Our logs showed that it took 6 hours to retrieve and publish 6.8 million rows of data.

The problem (or one of them) with this slow rate of ingestion was that the table was continuing to have new data inserted into it while the connector was processing the data it had retrieved. With data being added to the table faster than it was being published the connector was getting behind. Worse there was no opportunity for it to ever catch up, until there was a lull in receiving new data.

If you’re using Cassandra, this looks like a rather useful connector.

Use Cases For Apache Kafka

Amy Boyle shows a few scenarios where New Relic uses Apache Kafka:

The Events Pipeline team is responsible for plumbing some of New Relic’s core data streams-specifically, event data. These are fine-grained nuggets of monitoring data that record a single event at a particular moment in time. For example, an event could be an error thrown by an application, a page view on a browser, or an e-commerce shopping cart transaction.

In this post, we show how we built our Kafka pipeline so that it stitches together microservices and serves as a changelog and “durable cache,” all with the idea of processing data streams as smoothly and effectively as possible at our scale. In an upcoming post, we’ll share thoughts on how we manage topic partitions in this pipeline.

If you’re wondering if Kafka might be right for you, check out this post for several scenarios which fit.


March 2018
« Feb