It’s another terminology post! Earlier this week I was having a delightful lunch with Angela Henry, Kevin Feasel, Javier Guillen, and Jason Thomas. We were chatting about various new things. Partway thru our conversation Jason stops me because he thought I was talking about Power BI Dataflows when I was really talking about Azure Data Factory Data Flows. It was kind of a funny moment actually but it did illustrate that we have some overlapping terminology coming into our world.
So, with that inspiration, let’s have a chat about some of the new data flow capabilities in the Microsoft world, shall we?
Melissa clarifies the term “data flow” (or “dataflow” as the case may be) across several products in Microsoft’s BI stack. Worth the read.
Notably, the three Magic Quadrant Leaders except Tableau were ranked near the middle in all use cases. MicroStrategy, Birst, Sisense, TIBCO, YellowFin, Salesforce, SAS and a few other players excelled above the rest with high scores on this report. These results are a bit refreshing to see. Gartner Critical Capabilities scores seem to better align with Forrester’s rankings of Analytics and Business Intelligence Platforms and also my own understanding of several top offerings. I admit that I was surprised by these results. I was rarely – if ever – asked about several of the top scoring vendors over the past three years.
Read the whole thing, and then read the report.
For the first time in 3 years, Gartner dropped a significant amount of vendors off its quadrant. There were 24 vendors in the firm’s quadrant in 2016 and 2017. This year, the Magic Quadrant only lists 20 vendors…that’s a 16% quadrant reduction. Has the market shrunk?!
Not exactly: the market has evolved….and in a pretty predictable way actually. Take a look at our 3-year-movement analysis table below: we see a pretty consistent story, e.g. the big are getting bigger, some of the visionaries got absorbed (or disappeared) and few ‘trend-setters’ graduated up.
Read on for more. The leader quadrant pretty much fits my expectations in terms of the major vendors and their rank ordering.
In previous articles, we mentioned that most reporting performance issues need to be addressed during the data preparation stage, but many scenarios can’t be handled within the data source. For example, parallel data retrieval should be performed outside of the data source because its purpose is to increase I/O performance. To achieve the controllable buffer, the buffer information needs to be written to an external storage device, which can’t be handled within a data source. The asynchronous data buffering and loading data by random page number in building a list report can’t be handled by a data source. Even for an associative query over multiple datasets that a data source can deal with, it would be necessary to get it done outside the data source when multiple databases or a non-database source is involved and when the database load needs to be reduced. Obviously, these scenarios that are not able to be handled within a data source also can’t be handled by a reporting tool.
I would be concerned about implementation details overwhelming the general value of a data computing layer.
Currently, I think there are two main approaches to Data Warehouse Automation
- Data Warehouse Generation: You provide sources, mappings, datatype mappings etc.. The tool generates code (or artifacts).
- Data Warehouse Automation (DWA): The tool not only generates code / artifacts, but also manages the existing Data Warehouse, by offering continuous insight in data flows, actual lineage, row numbers, etc..
The difference might seem small, but IMHO is visible most clearly whenever changes occur in the Data Warehouse – the second class of tools can handle those changes (while preserving history). With the first class of tools provide you with the new structures, but you need to handle the preservation of history yourself (as you would’ve without DWA).
Read on for a contrast of these two approaches.
In order to set up and run an effective Big Data Hadoop project that provides reliable BI, your organization will need to adopt a new mindset that addresses not only the technology, but also the organizational EIM. You will need to conduct a comprehensive analysis of your business with the help of analysts, internal domain experts, and strategists to come up with robust and relevant business use cases. You will also need buy-in from management, and take company politics into consideration.
Your Big Data project needs to work with your existing BI tools, along with your security and monitoring systems. Data security needs to be addressed because standard Hadoop implementations have relatively poor security, and many organizations are wary of keeping all their data in one location.
I do agree with these reasons, though I’m a bit surprised that I didn’t see much about “classic” BI problems like the inability of the company to standardize on terminology or definitions (e.g., what the Kimball method describes as conformed dimensions), the desire to tackle too much of the problem at once, rapidly-changing source systems (and how BI team members tend to be the last to know that something has changed), etc.
The earlier you address data quality the better; the less time your end users spend on data wrangling, and the more they can focus on high value analytics. As your organization’s data infrastructure matures, migrating from spreadsheets to databases and data warehouses, data quality checks should be formally defined, documented, and automated. Exceptions should either be handled automatically during data intake using predefined business rules logic or require immediate user intervention to correct any errors.
Providing clean, centralized, and analytics-ready data to end users should not be a one-way process. By allowing end users to focus on high-value analytics, like data mining, network graphs, clustering, etc., they can uncover certain outliers and anomalies in the data. Effective data management should include a feedback loop to communicate these findings and, if necessary, incorporate any changes in the ETL processes, making centralized data management more dynamic and flexible.
The big question to ask is, “what problem are we trying to solve?” That will help determine the answer to many of the questions, including how you store the data, how you expose the data, and even which data you collect and keep.
Logical Data Warehouse
A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW, and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing which can be thought of as enabling technologies. According to Gartner a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.
If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.
(2) Data Sources
From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.
Directly from various systems. This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).
If you’re thinking about Azure Analysis Services, this post is a good one.
Before we get into discussing how to create it date dimension and how to use it, first let’s talk about what it is and why do we need it. Depending on who you talk to, people can refer to this concept as “Calendar table” or “Date Dimension,” which is usually found in Data Warehouse. No matter how it is called, at the end of the day, it is a table in SQL Server which is populated with different date/calendar related information to help speed up SQL queries which require specific parts of dates.
In my case, I have created it to be able to aggregate data by quarters, years and month. Depending on how large your requirements are it will add additional complexity to building it. Since I don’t care about holidays (for now at least), I will not be creating holiday schedule which can be complicated to populate.
I love date dimensions, even on non-warehouse databases, because it’s an easy way of providing additional context to time series data. Think about graphing orders per day in an industry with weekday-versus-weekend trends; a date dimension lets you strip out weekends (maybe plotting them separately) or even lets you build day-of-week analysis for each day, or looking at week of the month, etc. You might also be interested in computing holidays.