Press "Enter" to skip to content

Month: September 2019

Fun with Residual Plots

Nina Zumel explains why, when plotting residuals, you always put predictions on the X axis and residuals on the Y axis:

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.

This post went in a direction I wasn’t expecting, and it was all the better for it.
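As a quick illustration of the idea, here is a minimal sketch (not Nina's code) that fits a simple linear model to noisy data and plots residuals against predictions. For a well-fit model, the points should scatter evenly around the line y = 0:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, 200)  # a true linear process plus IID, zero-mean noise

# Fit a simple linear model and compute residuals
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept
residuals = y - predictions

# Predictions on the X axis, residuals on the Y axis
plt.scatter(predictions, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Predictions")
plt.ylabel("Residuals")
plt.show()
```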

Comments closed

Topic Modeling

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.
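To get a feel for how little setup topic modeling needs, here is a hedged sketch using scikit-learn's LDA implementation (one option among many; the article covers others). Note that there are no labels or predefined tags anywhere:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the pitcher threw a fastball in the ninth inning",
    "the batter hit a home run to win the game",
    "the senate passed the budget bill after a long debate",
    "voters went to the polls for the national election",
]

# Bag-of-words features; no training data classified by humans required
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top terms the model grouped into each topic
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[t] for t in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {top}")
```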

Comments closed

Using the Power BI API

Jeff Pries takes us through the Power BI API with PowerShell:

The Power BI API is a way of essentially bypassing the web interface of powerbi.com and asking questions directly to the back-end of the service (without bypassing security). Using this allows you to issue a command, such as “Give me a list of all of the workspaces” and receive just the result data in a bulk data format. While outside the scope of this exercise, the API also allows reading actual business data from Power BI assets and even manipulating assets.

For my first attempt at interacting with the API, I decided to try the Microsoft-developed PowerShell cmdlets. These are a set of free PowerShell commands which can be installed on your workstation and then will query the Power BI API and return results in a standard way without the need to write any code (C# or otherwise) or understand how to read a JSON response.

Jeff does more than just hit API endpoints, though, so read the whole thing.
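The cmdlets wrap REST endpoints, so for a sense of what they do under the hood, here is a minimal sketch in Python against the workspace-listing endpoint. It assumes you have already acquired an Azure AD access token; the token-acquisition step is omitted here:

```python
import requests

token = "<your Azure AD access token>"  # assumption: acquired separately

# "Give me a list of all of the workspaces"
resp = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

for workspace in resp.json()["value"]:
    print(workspace["id"], workspace["name"])
```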

Comments closed

Hammering Azure SQL DB

John Morehouse shows how we can configure HammerDB to run against Azure SQL Database:

Benchmarking your environment is an important step when introducing new hardware, and it is accomplished by running a test workload against that hardware. There are multiple ways to accomplish this and get SQL Server performance data. One of these methods is using HammerDB, a free tool that provides TPC standard benchmarking metrics for multiple database systems, including Microsoft SQL Server. These metrics are an industry standard and are defined by the Transaction Processing Performance Council (TPC). The results from benchmarking will help you to ensure that the new infrastructure will be able to support the expected workload.

Azure introduces ways to quickly implement new hardware. However, if the Azure environment isn’t set up correctly, you can introduce issues that could potentially degrade performance. Thankfully, HammerDB is Azure aware, which allows you to easily benchmark your cloud environment.

HammerDB isn’t as good as having your own perfectly created set of scripts which exactly replicates your environment, but I’ve never seen an environment with one of those.
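Whatever tool you benchmark with, the Azure-specific part is mostly the connection: a fully qualified server name, port 1433, and encryption. Here is a hedged sketch of those details using pyodbc; the server, database, and credentials are placeholders:

```python
import pyodbc

# Placeholder connection details for an Azure SQL Database
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:yourserver.database.windows.net,1433;"
    "Database=tpcc;"
    "Uid=youruser;Pwd=yourpassword;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

# Sanity check before pointing a benchmark at the database
print(conn.execute("SELECT @@VERSION;").fetchone()[0])
```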

Comments closed

Configuring Memory Limits for SQL Server in Kubernetes

Anthony Nocentino doesn’t have all the RAM in the world:

With that Pod deployed, I loaded up a HammerDB TPC-C test with about 10GB of data and drove a workload against our SQL Server. Then, while monitoring the workload… boom! HammerDB throws connection errors and crashes. Let’s look at why.

First things first, let’s check the Pods’ status with kubectl get pods. Well, that’s interesting: I have 13 Pods. One has a Status of Running and the remainder are Evicted.

Anthony does a great job of explaining the problem and showing you the solution.
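The fix boils down to telling both Kubernetes and SQL Server how much memory the container may use. Here is a hedged manifest sketch, not necessarily Anthony's exact configuration; the image tag and sizes are illustrative, and MSSQL_MEMORY_LIMIT_MB caps SQL Server's own memory consumption below the Pod limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mssql
spec:
  containers:
  - name: mssql
    image: mcr.microsoft.com/mssql/server:2019-latest   # illustrative tag
    env:
    - name: ACCEPT_EULA
      value: "Y"
    - name: MSSQL_SA_PASSWORD
      value: "<strong password here>"
    - name: MSSQL_MEMORY_LIMIT_MB   # keep SQL Server below the Pod limit
      value: "3072"
    resources:
      requests:
        memory: "2Gi"   # the scheduler reserves this much on the node
      limits:
        memory: "4Gi"   # above this, the container is OOM-killed
```

Without limits, SQL Server happily consumes whatever the node has, and the kubelet starts evicting Pods under memory pressure, which is the behavior Anthony observed.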

Comments closed

Documenting SSIS Catalogs

Dave Mason continues a series on documenting Integration Services:

In the last post, we looked at query options for documenting SSIS packages that are deployed via the legacy package deployment model. What about SSIS packages deployed to a catalog via the project deployment model? Is the package XML accessible so that we can query and shred it? I don’t know with certainty, but I think it is not.

However, there are .NET Framework interfaces we can use to programmatically obtain SSIS catalog metadata. So for this post, I’ll soldier on without my dear friend T-SQL and proceed with PowerShell. Here’s a look at the projects and packages I was working with as I developed and debugged the script code. It’s a short list with just two projects, having one package each. The “DailyETLMain.dtsx” package is a sample from Microsoft’s GitHub repository.

Click through for Dave’s explanation and a link to the script.
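Dave deliberately goes the PowerShell/.NET route, but for the narrow case of listing folders, projects, and packages, the SSISDB catalog views also expose that metadata to plain queries. A sketch via Python and pyodbc; the connection string is a placeholder:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=SSISDB;Trusted_Connection=yes;"
)

# Folder -> project -> package inventory from the SSISDB catalog views
sql = """
SELECT f.name   AS folder_name,
       pr.name  AS project_name,
       pkg.name AS package_name
FROM catalog.folders AS f
JOIN catalog.projects AS pr ON pr.folder_id = f.folder_id
JOIN catalog.packages AS pkg ON pkg.project_id = pr.project_id
ORDER BY f.name, pr.name, pkg.name;
"""

for row in conn.execute(sql):
    print(row.folder_name, row.project_name, row.package_name)
```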

Comments closed

Variable Header Counts and Azure Data Factory

Mark Kromer shows how you can convince Azure Data Factory to skip a variable number of lines before beginning processing:

A common requirement that I see from customers who are processing text files in data lakes with Azure Data Factory, is to read and process files where there are variable numbers of lines that precede both the data headers and the data that needs to be processed. ADF already has facilities that handle the ability to switch headers off or on as well as the ability to specify parameterized skip line counts. However, in many cases, files that are received for processing have variable numbers of superfluous lines that need to be skipped.

In ADF, between pipeline activities and data flows, there are a number of ways to handle this scenario. In this post, I am going to demonstrate one such technique. 

Read on to see which technique Mark chose and how to get it working.
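To make the requirement concrete, here is the core logic as a local Python sketch, not ADF itself: scan for the real header row, then parse from there. The header's first column name, "OrderID", is an invented example of a known marker:

```python
import io
import pandas as pd

with open("input.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

# The preamble length varies per file, so locate the real header row
# by a known marker (here, a hypothetical first column name).
header_idx = next(i for i, line in enumerate(lines)
                  if line.startswith("OrderID"))

# Parse only from the header row onward, skipping the variable preamble
df = pd.read_csv(io.StringIO("\n".join(lines[header_idx:])))
print(df.head())
```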

Comments closed

Storytelling with Power BI

Marc Lelijveld wraps up a series on storytelling with Power BI:

Let them ask questions
As a report author, you start building your reports based on the information needs and business requirements you collected before your project. However, every answer to a question triggers a new question. In the end, you end up with more questions to answer than you thought about up front. Maybe even with scope creep in agile projects.

However, it is very unlikely that you answer all the business information needs in your dashboard or report within one iteration. So why not give users the ability to interactively explore the report and ask questions of their dataset in natural language? Power BI lets you ask questions of your data in natural language in just a few clicks.

This is probably one of the most underutilized aspects of Power BI.

Comments closed

HBase and S3

Krishna Maheshwari, et al., explain how we can allow Apache HBase to use S3 for storage:

Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data. Amazon S3 is an object store which offers a high degree of durability with a pay-per-use cost structure. There is no server-side component to run or manage for S3 — all that is needed is the S3 client library and AWS credentials. However, HBase requires a consistent and atomic filesystem which means that it cannot directly use S3 because it is an eventually consistent object store. Both CDH and HDP have only provided HBase solely using HDFS because there have been long-standing impediments that prevented HBase from natively using S3. To address these issues, we’ve built an out-of-the-box solution which we are delivering for the first time via CDP. When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3 and HBase write-ahead-logs (WAL) are stored in an HDFS instance run alongside HBase per usual.

I hadn’t thought of using S3, but it’s an interesting post.
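Conceptually, the storage split the post describes maps to configuration along these lines in hbase-site.xml: table data rooted in S3, write-ahead logs kept on HDFS. This is a hedged sketch of the idea rather than CDP's exact setup; the bucket and namenode names are placeholders, and CDP layers its consistency solution on top:

```xml
<!-- StoreFiles (table data) persisted to S3 via the s3a connector -->
<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-bucket/hbase</value>
</property>

<!-- Write-ahead logs stay on a consistent, atomic filesystem: HDFS -->
<property>
  <name>hbase.wal.dir</name>
  <value>hdfs://my-namenode:8020/hbase-wal</value>
</property>
```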

Comments closed

Data Warehouse Concepts

Katrine Spirina takes us through several basic data warehousing concepts:

There are three basic types of modeling.

A Conceptual Data Model describes all entities a business needs information about. It provides facts about real-world things, customers, and other business-related objects and relations. The goal of creating this data model is to synthesize and store all the data needed to gain an understanding of the whole business. The model is designed for the business audience.

A Logical Data Model deals with more in-depth data. It describes the structure of data elements, their attributes, and the ways these elements interrelate. For instance, the model can be used to identify relationships between customers and the products of interest to them. This model is characterized by a high level of clarity and accuracy.

A Physical Data Model describes the specific data and relationships needed for a particular case, as well as the way the data model is used in database implementation. It provides a wealth of metadata and facilitates visualizing the structure of a database. Metadata can involve access rules, constraints, indexes, and other features.

Click through for the whole story.
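As a tiny illustration of the jump from logical to physical: the logical model might say "a customer places orders," while the physical model pins down types, keys, and indexes. A hedged T-SQL sketch with invented table and column names:

```sql
-- Physical realization of the logical relationship "Customer places Order"
CREATE TABLE dbo.Customer (
    CustomerID   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.[Order] (
    OrderID    INT  NOT NULL PRIMARY KEY,
    CustomerID INT  NOT NULL
        REFERENCES dbo.Customer (CustomerID),  -- the relationship, made physical
    OrderDate  DATE NOT NULL
);

-- A physical-only detail: an index to support lookups by customer
CREATE INDEX IX_Order_CustomerID ON dbo.[Order] (CustomerID);
```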

Comments closed