This brings me to my data question. If an order is processed in a store, but the expected data is not created, did that order ever occur?
Very often, the staff of a business are very focused on pleasing the customer, making sure they get their product, but due to software limitations, may not end up not capturing information about every sale in a satisfactory manner. Most of the blame I have seen lies in software that doesn’t meet the requirements of a customer, making capturing desired details tedious to achieve when the process is in the norm. Very often the excuse programmers give is that too much work of the work to build a system would need to be done for the atypical cases, but requirements are requirements, and it is generally essential that every action that occurs in a business is captured as data.
Read on for more. My conjoined twin case is, how much information do we have about why users give up? For example, if you have a three-part form, how many users get through part one, part two, and part three? There’s some natural level of attrition, but if you see an abnormally low follow-through rate, that might indicate a bug or major issue. Auditing is hard work, as you have to hit both sides of the problem at the same time.
(2) Data Sources
From a single source such as a data warehouse. This is the most traditional path for BI development, and still has a very valid place in many BI/analytics deployments. This scenario puts the work of data integration on the ETL process into the data warehouse, which is the most appropriate place.
Directly from various systems. This can be done, but works well only in specific cases – it definitely won’t work well if there are a lot of highly normalized tables, or if there’s not a straightforward way to relate the disparate data together. Trying to go directly to the source systems & skip an intermediary data warehouse puts the “integration” burden on the data source view in Analysis Services, so plan for plenty of time testing if you’re going to try this route (i.e., it can be much harder, not easier). Note that this option only makes sense if the data is stored in Analysis Services because it needs to be related together somehow (i.e., DirectQuery mode, discussed next in #3, with > 1 data source won’t work if a user tries to combine data sources because the data is not inherently related).
If you’re thinking about Azure Analysis Services, this post is a good one.
HTAP is used to describe the capability of a single database that can perform both online transaction processing (OLTP) and online analytical processing (OLAP) for the purpose of real-time operational intelligence processing. The term was created by Gartner in 2014.
In the SQL Server world you can think of it as: In-memory analytics (columnstore) + in-memory OLTP = real-time operational analytics. Microsoft supports this in SQL Server 2016 (see SQL Server 2016 real-time operational analytics).
I’m not completely sold on HTAP yet, particularly once you get to high-scale OLTP systems doing hundreds of thousands of transactions per second. That said, there’s always more and more pressure to get data available for analytics faster and faster.
It’s simple. Both the set-builder notation, and the SQL language (and in principle, other languages’ for comprehensions) are declarative. They are expressions, which can be composed to other, more complex expressions, without necessarily executing them.
Remember the imperative approach? We tell the machine exactly what to do:
- Start counting from this particular minimal integer value
- Stop counting at this particular maximal integer value
- Store all even integers in between in this particular intermediate collection
What if we don’t actually need negative integers? What if we just wanted to have a utility that calculates even integers and then reuse that to list all positive integers? Or, all positive integers less than 100? Etc.
It may be my innate contrarian curmudgeonliness, but I am moving more and more toward the idea that the easiest way to deal with data is a combination of SQL and functional programming languages, leaving OO out of the picture.
Although I dearly love using SQL Server, I wouldn’t use it in every circumstance; there are times, for example, when just isn’t necessary to use a Server-based RDBMS for a data-driven application. The open-source SQLite is arguably the most popular and well-tried-and-tested database ever. It is probably in your phone, and used by your browser. Your iTunes will use it. Most single-user applications that need to handle data will use SQLite because it is so reliable and easy to install.
It is specifically designed as a zero-configuration, embedded, relational database with full ACID compliance, and a good simple dialect of SQL92. The SQLite library accesses its storage files directly, using a single library, written in C, which contains the entire database system. Creating a SQLite database instance is as easy as opening a simple cross-platform file that contains the entire database instance. It requires no administration.
There’s a lot going on in this interesting article; I recommend giving it a read.
So at work, I’d say space matters, and in order to optimize our storage requirements it’s very important to know about SQL Server internals, specially the Storage Engine, which happens to be one of my favorite topics of study.
In my quest to release some space I got to this database, just one table which is 165M of XML documents stored as NVARCHAR(MAX).
It was interesting walking through the process. Some part of me wonders if it’s a bit complex for the next maintainer to handle, but saving a terabyte of disk space is worth a few extra pages of documentation…
For an introduction to this interesting Hadoop project, check out this article. Apache Kylin originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets utilizing Hive and HBase. It can use called through SparkSQL as well making for a very useful project. This project let’s you work with PowerBI, Tableau and Excel with more tool support coming soon. You can doMOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.
There are a few interesting options here.
Synchronous – Code that runs one one line at a time. Each line of code is completed before the next one starts. If an external call is made then it is completed before the next line of code runs.
Asynchronous – Code that is launched and runs separately from the initial code. If a SQL job is launched from inside a batch of code (using sp_start_job for example) then the job is running in parallel (at the same time as) to the remainder of the batch of code.
Understanding which operations are synchronous versus asynchronous, and which operations are blocking versus non-blocking versus semi-blocking, will do wonders for improving application performance.
ETL. Spec said only Address Line 1 is needed to be loaded, so the developers only bring that line in (plus name, city etc.). Fast forward 8 years, I get a request on my desk: “Please add Address Line 2 to import, and all tables. Oh, and we need historical data for previously loaded files. And for all address types”.
No normalization in this database (which is just one of about 40 databases with SIMILAR structure, but not identical).
Read on for the damage done, as well as another example of foresight saving the day.
The primary system I deal with on a daily basis was originally developed as a DOS application and several of the above examples are drawn from it. Looking at the core tables and columns, it’s easy to identify those that began life in those early days – they all have 8-character names. Time moved on and the system grew and evolved. DOS to Windows. Windows to the web. But the database, and the practices and patterns used in the database, haven’t come along for the ride.
Data schema conversions can be hard and disruptive – you need to update your application, your stored procedures, and provide customers/users with a clean migration path. Code changes require testing. Complexity and cost grows every time you introduce changes. I get that.
There’s a lot of effort in Andy’s advice, but it’s well worth it.