With the changes in the data paradigm, a new architectural pattern has emerged. It’s called as the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Like the lake, it caters to need to different people, those who want to fish or those who want to take a boat ride or those who want to get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.
The way I compare a data lake to a data warehouse or a mart is like this:
Data Lake stores data in the purest form caters to multiple stakeholders and can also be used to package data in a form that can be consumed by end-users. On the other hand, Data Warehouse is already distilled and packaged for defined purposes.
One way of thinking about this is that data warehouses are great for solving known business questions: generating 10K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs. By contrast, the data lake is (among other things) for spelunking, trying to answer those one-off questions people seem to have but which the warehouse never seems to have quite the right set of information.
The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it’s also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.
Sounds like odbc lets you run ad hoc queries and also lets you use dplyr as an ORM, similar to using Linq in C#.
Since I might be doing a similar thing in the near future, I decided to see if I could push 4-5 small test databases to Azure SQLDB at the same time (SQL on Azure VM test may come later). If this happens with my client it will be hundreds of databases, with the same schema but wildly varying sizes.
Create 5 sample databases
Create .bacpacs on local drive
Create Azure Logical Server if not already done (see step 4)
Create a Notepad of SQLPackage.exe commands, one for each .bacpac
Run with “start” in CMD to fire off separate windows
Worth reading the whole thing.
While working with some SQL Server 2017 tests, I ran out of disk space. I tend to size my VMs around 40GB, and that works for some things, but I’ll run out of space.
I needed to expand the VMWare disk. That doesn’t mean Linux sees the space directly, and I had to figure out how to make the partition bigger. I could have added another disk, but I wanted to work through this process. I learned I needed to have an inactive partition, so I download gparted on a live cd and booted to that.
Steve uses the GUI approach; in the comments, David Klee links to his CLI approach.
One cool little feature in SQL Server 2016 is COMPRESS(). It’s a TSQL function available in all editions that shrinks down data using the GZIP algorithm (documentation).
Things to know about COMPRESS():
Compressed data is in the VARBINARY(max) data type
You get the data “back to normal” by using the DECOMPRESS function – which also outputs VARBINARY(max)
You can’t use columns of the VARBINARY(max) type in an index key column– but it may be useful to use the column as a filter in a filtered index, in some cases
COMPRESS() uses standard GZip compression, so you could use methods other than DECOMPRESS() to inflate the data—for example, bring the compressed data out to your application and use language-specific GZip libraries to decompress the data. Read the whole thing.
You can push data to the Power BI streaming dataset API in a few ways… but they generally boil down to these 3 options…
- Directly call the API from code
- You could use something like Azure Function Apps to iteratively pull NEW rows that land in a SQL table, create the API Call, and push the new data directly to the API
- See here info on Azure Functions – https://azure.microsoft.com/en-us/services/functions/
- Directly call the API from an Azure Logic App
- Azure Logic Apps are cool as for simple functions like this you can do pretty much the same as in the code option above but just using drag/drop and WITHOUT writing any code – https://azure.microsoft.com/en-us/services/logic-apps/
- Use Azure Stream Analytics to push data into the API
- This is leveraging the solution in my previous post to push data from SQL CDC and into an Azure Event Hub, then via Azure Stream Analytics to Power BI
- Azure Stream Analytics – https://azure.microsoft.com/en-us/services/stream-analytics/
- Azure Event Hubs – https://azure.microsoft.com/en-us/services/event-hubs/
This blog post extends on my previous post – and thus I will be leveraging Option #3 above.
Definitely worth checking out if you are interested in real-time Power BI dashboards.
As it turns out SQL Server Management Studio (SSMS) can display many types of emoji and this is the key.
Keeping in mind that the whole thing was not written with performance or best practice in mind, I’d like to introduce the world to the very first action adventure game 100% written and played in SQL Server!
The goal here is to have a game which helps teach some basics of development practices. Interesting concept.
Startup procedures automatically execute whenever SQL Server is started. Where would you use this? One example is if you had an application that required the use of a global temporary table, you could use a startup procedure to create it. This would allow the table to be immediately accessible to any application that requires it.
Another possibility is that you could “warm up” the cache so that data is already cached when an application needs it. This would help prevent that initial I/O hit the first time the query is called, thus potentially improving initial performance.
Click through for details. I’ve created a couple of these, but they rarely come to mind and that’s one of the big problems: if there is an issue, there’s probably quite a bit of investigative work that would take place before somebody remembers that hey, these things exist.