Azure warns you not to store data on the D drive in Azure VMs, but following this advice could mean you are missing out on some very fast local storage. It’s good general advice because this local storage is not permanently attached to your instance, meaning you could lose data or log files if your VM is stopped and restarted. But what if you could afford to lose certain files? Say, files that are recreated during startup anyway.
TempDB is the ideal candidate for this. No other database is suitable! Putting the tempdb data and log files onto the D drive takes only a little effort, and you will most likely see a big improvement in tempdb read/write latency.
John ended up seeing much bigger gains than I did when I tried this, but with a difference that big, it’s definitely worth using the temporary drive for tempdb.
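As a rough illustration of the move itself, the sketch below builds the T-SQL statements that repoint tempdb's files at a folder on the temporary drive. It assumes the default logical file names (tempdev and templog) and a hypothetical folder D:\SQLTemp; neither comes from the post. The new locations only take effect after SQL Server restarts, and because the temporary drive can be wiped, the folder has to be recreated before the service starts.

```python
from pathlib import PureWindowsPath

# Assumed default logical file names for tempdb on a fresh install.
# Extra data files (temp2, temp3, ...) would need their own entries.
DEFAULT_TEMPDB_FILES = {"tempdev": "tempdb.mdf", "templog": "templog.ldf"}


def tempdb_move_statements(target_dir, files=DEFAULT_TEMPDB_FILES):
    """Build one ALTER DATABASE statement per tempdb file."""
    folder = PureWindowsPath(target_dir)
    statements = []
    for logical_name, file_name in files.items():
        statements.append(
            f"ALTER DATABASE tempdb MODIFY FILE "
            f"(NAME = {logical_name}, FILENAME = '{folder / file_name}');"
        )
    return statements


if __name__ == "__main__":
    # D:\SQLTemp is a hypothetical folder name used for illustration.
    for statement in tempdb_move_statements(r"D:\SQLTemp"):
        print(statement)
```

Generating the statements rather than hard-coding them makes it easy to cover additional tempdb data files by passing a larger mapping.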
Do you want to identify the correct Service Tier and Compute Size (once known as performance level) for your Azure SQL Database? How would you go about it? Would you use the DTU (Database Transaction Unit) calculator? What about the new vCore pricing model? How would you translate your current on-premises workload to the cloud?
It can be a form of trial and error, especially if you are new to this, but I really do recommend trying out the PowerShell script that you can access once you have installed DMA (Database Migration Assistant).
Read on to see how to run this tool and potentially save some money.
File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about the source data systems and the target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then a historical file management process becomes a necessity, or I need to log and remove some incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful.
Click through to see how to do this.
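To make the idea of a clean-up step concrete, here is a minimal sketch in plain Python rather than in whatever tool the linked post uses. The function name, the age-based rule, and the dry-run default are all assumptions made for illustration.

```python
import time
from pathlib import Path


def remove_old_files(folder, max_age_days, pattern="*", dry_run=True):
    """Delete files in `folder` older than `max_age_days`.

    Returns the matching paths. With dry_run=True nothing is deleted,
    so the result can be logged and reviewed first.
    """
    cutoff = time.time() - max_age_days * 86400
    matched = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched
```

Running it once with the default dry run and logging the returned paths gives a record of what would be removed before anything is actually deleted.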
In particular, we’ll look at an example scenario that addresses Data Drift, where new information is added mid-stream; when that occurs, the new table structure and new column values are created in Snowflake automatically.
To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
126.96.36.199 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
Click through for the demonstration.
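For readers who want to see what fields sit inside a log line like the one above, the following sketch parses it with a regular expression. It is independent of the pipeline in the linked post; the field names and the pattern are assumptions based on the common Apache "combined" log layout.

```python
import re
from datetime import datetime
from urllib.parse import unquote

# Pattern for the Apache "combined" log layout (assumed, not from the post).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The sample line from the excerpt, split only for readability.
SAMPLE_LINE = (
    '126.96.36.199 - - [14/Jun/2014:10:30:19 -0400] '
    '"GET /department/outdoors/category/kids\'%20golf%20clubs/product/'
    'Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'
)


def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["path"] = unquote(record["path"])  # decode %20 and similar
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    for key, value in parse_log_line(SAMPLE_LINE).items():
        print(f"{key}: {value}")
```

Returning None for lines that do not match keeps malformed entries out of the typed records, so they can be counted or routed elsewhere.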
I’ve supplied these two Python scripts in my GitHub repo at the following link. First we need to install the relevant Python libraries, so you’ll need to issue the pip command below in whatever command tool you use, bash or Command Prompt:
pip install azure-eventhub
Check it out if you need pub-sub in Azure.
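As a hedged sketch of the sending side, the code below publishes a few JSON readings to an event hub. It uses the producer client from current releases of azure-eventhub, which may differ from the client API in the post's own scripts; the function names, the reading format, and the connection details are placeholders.

```python
import json


def encode_readings(readings):
    """Serialize each reading (a dict) to a JSON string for the event body."""
    return [json.dumps(reading, sort_keys=True) for reading in readings]


def send_readings(connection_string, eventhub_name, readings):
    """Send the readings to an event hub as a single batch."""
    # Imported here so encode_readings works without the package installed.
    from azure.eventhub import EventData, EventHubProducerClient

    producer = EventHubProducerClient.from_connection_string(
        connection_string, eventhub_name=eventhub_name
    )
    with producer:  # closes the connection on exit
        batch = producer.create_batch()
        for body in encode_readings(readings):
            batch.add(EventData(body))
        producer.send_batch(batch)
```

Calling send_readings requires a real connection string and hub name from your own Event Hubs namespace; the encoding helper can be tried on its own without any Azure resources.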
Azure Data Explorer (ADX) was announced as generally available on Feb 7th. In short, ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices. ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (currently up to 3 nodes) and queries across a billion records take less than a second. A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.
It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform. It not only works on numbers but also does full-text search on semi-structured or unstructured data. One of my favorite demos was watching a query over 6 trillion log records, counting the number of critical errors by doing a full-text search for the word ‘alert’ in the event text, that took just 2.7 seconds. Because of this speed, ADX can be a replacement for search and log analytics engines such as Elasticsearch or Splunk. One description I heard and liked was to think of it as an optimized cache on top of a data lake.
Click through for James’s explanation and where you might want to use ADX.
AWS provides a lot of services, and these services are sufficient to run your architecture. The backbone for the security of this architecture is VPC (Virtual Private Cloud). VPC is basically a private cloud in the AWS environment that lets you use all the AWS services in your own defined private space. You have control over the virtual network, and you can also restrict the incoming traffic using security groups.
Overall, VPC helps you secure your environment and gives you complete authority over incoming traffic. There are two types of VPCs: the Default VPC, which Amazon creates for you, and the Non-Default VPC, which you create to meet your security needs.
Now that you have an idea of how VPC works, I will take you through the different services offered by Amazon VPC.
Read on to see how to set one up.
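To give a feel for how security groups restrict incoming traffic, here is a small conceptual model in Python. It does not call AWS; it only mimics the rule that inbound traffic is denied unless some rule allows it. The class name, rule fields, and example addresses are assumptions for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    protocol: str   # e.g. "tcp"
    from_port: int  # first port in the allowed range
    to_port: int    # last port in the allowed range
    cidr: str       # allowed source range, e.g. "203.0.113.0/24"


def is_allowed(rules, protocol, port, source_ip):
    """Return True if any rule allows the traffic; otherwise it is denied."""
    source = ip_address(source_ip)
    for rule in rules:
        if (
            rule.protocol == protocol
            and rule.from_port <= port <= rule.to_port
            and source in ip_network(rule.cidr)
        ):
            return True
    return False


# Hypothetical rule set: HTTPS from anywhere, SSH from one office range.
EXAMPLE_RULES = [
    IngressRule("tcp", 443, 443, "0.0.0.0/0"),
    IngressRule("tcp", 22, 22, "203.0.113.0/24"),
]
```

With these rules, an SSH connection from outside 203.0.113.0/24 is rejected simply because no rule matches it; there is no explicit deny entry.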
For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer’s account. To run their data workloads, customers need secure connectivity between the Databricks Spark clusters and the above data sources.
It is straightforward for Databricks clusters located within the Databricks VPC to access data from AWS S3, which is not a VPC-specific service. However, we need a different solution to access data from sources deployed in other VPCs, such as AWS Redshift, RDS databases, or streaming data from Kinesis or Kafka. This blog will walk you through some of the options available for accessing data from these sources securely, along with their cost considerations for deployments on AWS. To establish a secure connection to these data sources, we will have to configure the Databricks VPC with one of the following two options:
Read on for those two options.
When I first started with VSTS and ultimately Azure DevOps, I went through many failed builds because the jobs in a pipeline don’t necessarily run in the order in which you built them, or in the order you would logically expect. The image below shows two Build Pipeline jobs, but when the build is queued, whether manually or via CI, the second job runs before Job #1. In this example the build will fail because Job #2 deploys a dacpac to a SQL Server on Linux Docker container (using an Ubuntu agent host), and obviously this cannot be done until the dacpac has been created in Job #1, which runs on a VS2017 agent host:
Click through to see how it’s done.
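The ordering problem can be modelled in a few lines of Python. This is a sketch of the general idea, assuming the fix is to declare an explicit dependency between the jobs (in YAML pipelines that is the dependsOn key); it is not the Azure DevOps scheduler, and the job names are made up.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def first_wave(jobs):
    """Jobs that may start immediately.

    `jobs` maps each job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(jobs)
    sorter.prepare()
    return set(sorter.get_ready())


def run_order(jobs):
    """One valid execution order that respects every declared dependency."""
    return list(TopologicalSorter(jobs).static_order())


# No dependency declared: both jobs are eligible at once, so either may go first.
undeclared = {"Build dacpac": set(), "Deploy dacpac": set()}

# Dependency declared: the deploy job waits for the build job.
declared = {"Build dacpac": set(), "Deploy dacpac": {"Build dacpac"}}
```

Comparing first_wave(undeclared) with first_wave(declared) shows the difference: without the dependency both jobs are ready to start together, while with it only the build job is.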
Hive is a “SQL on Hadoop” technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL. Hive is very useful for performant batch processing on relational data, as it leverages skills that most organizations already possess. Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data. One of the coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format. That’s exactly what we’re going to do in this post. You can read more about Hive here and here and Hive LLAP here.
We understand that SQL queries don’t typically constitute traditional data science functionality. However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore. Hive happens to be one of the best starting points on that journey.
Click through for the screenshot-laden demonstration.
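As a small illustration of creating a table directly on top of existing files, the sketch below builds a HiveQL CREATE EXTERNAL TABLE statement for delimited text. The helper name, the table and column names, and the storage path are hypothetical and do not come from the linked post.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build HiveQL for an external table over delimited text files.

    `columns` is a list of (name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table over CSV files already sitting in storage.
    print(
        external_table_ddl(
            "web_logs",
            [("host", "STRING"), ("status", "INT"), ("bytes_sent", "BIGINT")],
            "/data/web_logs",
        )
    )
```

Because the table is external, dropping it removes only the table definition; the files at the given location stay where they are.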