Modern Data Warehouse Dictionary

Melissa Coates has put together a glossary of terms for modern data warehousing:

Logical Data Warehouse

A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access across multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing, which can be thought of as enabling technologies. According to Gartner, a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, and service level agreement management.

If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.
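If you want to feel the "virtual layer" idea in miniature, here is a rough R sketch using dplyr/dbplyr (my illustration, not Melissa's): a lazy tbl looks like a local data frame, but the query is translated to SQL and pushed down to the source system, loosely the way an LDW federates queries across platforms.

  # A toy illustration of query virtualization (assumes the DBI, RSQLite,
  # and dplyr/dbplyr packages are installed).
  library(DBI)
  library(dplyr)

  con <- dbConnect(RSQLite::SQLite(), ":memory:")
  dbWriteTable(con, "mtcars", mtcars)

  cars <- tbl(con, "mtcars")          # a "virtual" table; no data pulled yet
  fast <- cars %>%
    filter(hp > 150) %>%
    summarise(avg_mpg = mean(mpg, na.rm = TRUE))

  show_query(fast)   # the work is translated to SQL and pushed to the source
  collect(fast)      # only now are results materialized locally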

T-SQL Tuesday Roundup, Bugs Edition

Brent Ozar rounds up T-SQL Tuesday 86:

I dunno about you, but I got a big stocking full of coal. Next year, I’m gonna be better, and I plan on asking Santa for a whole bunch of Connect requests. For T-SQL Tuesday, I asked you to name your favorite SQL Server bugs & enhancement requests, and here’s what you want in your stocking next year.

If you agree with a feature, click on it, and upvote its Connect request. These bloggers took the time to make their case – now it’s time for you to vote.

29 separate links, so there’s a lot of reading here.

Shredding Excel With R

John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead, it was laid out in a format suited to viewing the data on screen, with the required data scattered in different ranges throughout each sheet (over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment.)

The other challenge was that, over time, changes to the design meant that data moved location: e.g., dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.
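If you are curious what this looks like in practice, here is a minimal sketch using the readxl package (the file paths, cell addresses, and template layout below are hypothetical; John's actual code may differ). The key idea is reading specific ranges per template rather than whole sheets:

  # Pull individual cells/ranges from non-tabular workbooks with readxl.
  library(readxl)

  # Read a single cell or range and return its value(s).
  read_cell <- function(path, range) {
    read_excel(path, range = range, col_names = FALSE)[[1]]
  }

  # One extractor per template version; in this made-up v1 layout,
  # the report date lives in C2 and the total in F10.
  extract_template_v1 <- function(path) {
    data.frame(
      source      = basename(path),
      report_date = read_cell(path, "Sheet1!C2"),
      total       = read_cell(path, "Sheet1!F10")
    )
  }

  files   <- list.files("//server/share", pattern = "\\.xlsx$", full.names = TRUE)
  results <- do.call(rbind, lapply(files, extract_template_v1))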

Getting Started With Functional Programming

Jose Gonzalez links to a set of blog posts and videos introducing functional programming:

Since I’ve started to play with (and rave about) functional programming (FP), a lot of people have asked me how to get started.

Instead of writing the same email multiple times, I decided to create a blog post I can refer them to. Also, it’s a central place to put all my notes about the topic.

Here’s a small collection of all the resources I’ve accumulated on my adventure in learning functional programming.

I think the functional paradigm fits relational database development extremely well, better than the object-oriented paradigm.
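To make that claim a little more concrete, here is a tiny sketch (mine, not Jose's) of R's built-in higher-order functions, which line up naturally with set-based SQL thinking:

  # Filter / Map / Reduce are higher-order functions in base R;
  # they mirror SQL's WHERE / SELECT expression / aggregate pattern.
  heavy   <- Filter(function(w) w > 3.5, mtcars$wt)   # like WHERE wt > 3.5
  doubled <- Map(function(w) w * 2, heavy)            # like SELECT wt * 2
  total   <- Reduce(`+`, doubled)                     # like SUM(...)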

Support Vector Machines In R

Deepanshu Bhalla explains what support vector machines are:

The main idea of a support vector machine is to find the optimal hyperplane (a line in 2D, a plane in 3D, and a hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, the two classes are red and blue balls. In layman’s terms, it is finding the optimal boundary to separate two classes (events and non-events).

Deepanshu then goes on to implement this in R.
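If you would like to try it before reading the full walkthrough, here is a minimal sketch using the e1071 package (a common choice for SVMs in R; Deepanshu's implementation may use different data or parameters):

  # Fit a linear SVM on a built-in two-class problem.
  library(e1071)

  # Keep two iris species so the response has exactly two classes.
  df <- subset(iris, Species != "setosa")
  df$Species <- droplevels(df$Species)

  fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = df,
             kernel = "linear", cost = 1)

  summary(fit)                        # kernel, cost, number of support vectors
  table(predict(fit, df), df$Species) # confusion matrix on the training data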

Disabling Office 365 Groups

John White says, spoilers, you can’t:

The bottom line of all this is that even if you use Office 365, and you think that you have disabled Groups in your tenant, the chances are that you could be in for a surprise. If any of these dependent services are in use, the chances are that you already have several created.

Groups are the bedrock of all new features in Office 365 moving forward – it is therefore a good idea for your organization to understand them as soon as possible. Their inevitability is also another strong argument for paying close attention to them. If you are currently discussing whether or not they should be used, I would strongly encourage you to shift that discussion to how they should best be used.

Read the whole thing if you’re getting into Office 365.

Update: John has a clarification showing that it is possible, but not trivial.

Data Analysis Basics In R

Sibanjan Das provides some of the basics of data analysis using R:

Let’s start thinking, in a logical way, about the steps that one should perform once we have the data imported into R.

  • The first step would be to discover what’s in the data file that was exported. To do this, we can:
    • Use the head function to view the first few rows of the data set. By default, head shows the first 6 rows. Ex: head(mtcars)
    • Use str to view the structure of the imported data. Ex: str(mtcars)
    • Use summary to view the data summary. Ex: summary(mtcars)

There’s a lot to data analysis, but this is a good start.
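For reference, a runnable version of those three steps against the built-in mtcars data set:

  # First-look functions for any freshly imported data frame.
  head(mtcars)      # first 6 rows by default; head(mtcars, n = 10) for more
  str(mtcars)       # structure: dimensions and each column's type
  summary(mtcars)   # per-column min / quartiles / mean / max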

Suse On Windows 10

Brad Sams reports that Ubuntu isn’t the only flavor of Linux available on Windows 10 anymore:

If you do go down this route, you have the option of installing either openSUSE Leap 42.2 or SUSE Linux Enterprise Server 12 SP2.

The benefits here are obvious: with Microsoft enabling the Windows Subsystem for Linux, they are opening the door to more than simply running Bash inside of Windows 10. While that is a good feature and likely the most common use of this subsystem, what Microsoft has actually done is open the door for more vendors to bring their Linux tools to the Windows platform.

I’d expect Red Hat to follow suit.

Detail Rows Expression In SSAS Tabular

Chris Webb shows a new feature in SSAS Tabular vNext:

What drillthrough does in SSAS Multidimensional, and what the new Detail Rows Expression property in SSAS Tabular v.next does, is allow an end user to see the detail-level data (usually the rows in the fact table) that was aggregated to give the value the user clicked on in the original PivotTable.

Read through for an example, as well as how it’s already an improvement over Multidimensional’s drillthrough.

Thinking About Real-Time Analytics

Martin Willcox offers some advice for people getting into the real-time analytics game:

  1. Clarify who will be making the decision – man, or machine? Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system and only able to make decisions one at a time, one after another. If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.

  2. It is important to be clear about decision-latency. Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data. But sometimes you need the latest, greatest and most up-to-date information to make the right choices.

There are some good insights here.
