
Category: Cloud

Generating TPC-DS Data Sets with HDInsight

Chris Koester shows how you can generate artificial data sets in the TPC-DS format using HDInsight:

This post describes how to generate big datasets with Hive in HDInsight, specifically TPC-DS benchmarking datasets. There are many tools for generating sample data, and this one is particularly nice due to its familiarity and ability to generate massive datasets up to 100 terabytes in size. TPC data is intended for benchmarking, but big sample datasets are also very useful for learning big data tools, proofs of concept, testing, etc.

The TPC (Transaction Processing Performance Council) provides tools for generating the benchmarking data, but using them to generate big data is not trivial, and would take a very long time on modest hardware. Thankfully, someone has written a nice utility that uses Hive and Python to run the generator on a Hadoop cluster. While Hadoop clusters are not easy to set up, using a Hadoop cloud service like Azure HDInsight is remarkably easy. With HDInsight, you can use a powerful cluster of machines to generate the data quickly, and when you’re done you can delete the cluster, leaving the data in place.

Most of the instructions should also work with on-prem or non-HDInsight Hadoop clusters, though some changes will be needed to accommodate the differences from HDInsight.


Databricks Dashboards

Megan Quinn takes us through building dashboards with Apache Zeppelin on Databricks:

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

This isn’t quite a step-by-step guide but does spur on ideas.


Putting TempDB Files On Azure IaaS D Drive

John McCormack tries out using the temporary drive on Azure VMs for tempdb:

Azure warns you not to store data on the D drive in Azure VMs, but following this advice could mean you are missing out on some very fast local storage. It’s good general advice because this local storage is not permanently attached to your instance, meaning you could lose data or log files if your VM is stopped and restarted. But what if you could afford to lose certain files? Say, files that are recreated during startup anyway.

TempDB is the ideal candidate for this. No other database is suitable! Putting the tempdb data and log files onto D drive can be achieved quite easily with a little bit of effort. And you will most likely see a big improvement in tempdb read/write latency.
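As a rough illustration of the move described above, here is a minimal Python sketch that builds the `ALTER DATABASE tempdb MODIFY FILE` statements for relocating tempdb files. The logical names `tempdev` and `templog` are the SQL Server defaults; the target folder `D:\TempDB`, the physical file names, and the helper `tempdb_move_statements` are assumptions made for this example and are not taken from John's post. The statements only take effect after the SQL Server service restarts, and because the temporary drive can be wiped, the target folder would need to be recreated before the service starts.

```python
from pathlib import PureWindowsPath


def tempdb_move_statements(files, target_dir):
    """Build T-SQL statements that point tempdb files at a new folder.

    files: mapping of logical file name -> physical file name.
    target_dir: Windows folder on the temporary drive (hypothetical path).
    """
    target = PureWindowsPath(target_dir)
    statements = []
    for logical_name, file_name in files.items():
        new_path = target / file_name
        statements.append(
            "ALTER DATABASE tempdb MODIFY FILE "
            f"(NAME = {logical_name}, FILENAME = '{new_path}');"
        )
    return statements


if __name__ == "__main__":
    # tempdev/templog are the default logical names; check sys.master_files
    # on your own instance, since extra data files will have other names.
    default_files = {"tempdev": "tempdb.mdf", "templog": "templog.ldf"}
    for statement in tempdb_move_statements(default_files, "D:\\TempDB"):
        print(statement)
```

The generated statements can be reviewed and run in a query window; the sketch deliberately does not connect to SQL Server itself.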

John ended up seeing much bigger gains than I did when I tried this, but with a difference that big, it’s definitely worth using the temporary drive for tempdb.


Sizing Azure SQL Database

Arun Sirpal takes us through finding the right size for Azure SQL Database:

Do you want to identify the correct Service Tier and Compute Size (once known as performance level) for your Azure SQL Database? How would you go about it? Would you use the DTU (Database Transaction Unit) calculator? What about the new pricing model, vCore? How would you translate your current on-premises workload to the cloud?

It can be a form of trial and error, especially if you are new to this, but I really do recommend trying out the PowerShell script that you can access once you have installed DMA (Database Migration Assistant).

Read on to see how to run this tool and potentially save some money.


Cleaning Up After Yourself in Azure Data Factory

Rayis Imayev shows how you can automatically delete old files in Azure Data Factory:

File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about the source data systems and the target destination platform, I’m ready to design and build a data integration solution between two or more connecting points. Then a historical file management process becomes a necessity, or a need arises to log and remove some of the incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean) such files would be helpful.

Click through to see how to do this.


Using the StreamSets Snowflake Destination

Dash Desai shows how you can use StreamSets to write data into SnowflakeDB:

In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically.

To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
150.47.54.136 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
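To make the structure of that log line concrete, here is a minimal Python sketch that splits it into named fields. It assumes the line follows the common Apache "combined" log layout; the regular expression, the field names, and the helper `parse_log_line` are illustrative choices for this example, not part of StreamSets or of the original post.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that have an obvious richer type.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


SAMPLE = (
    "150.47.54.136 - - [14/Jun/2014:10:30:19 -0400] "
    "\"GET /department/outdoors/category/kids'%20golf%20clubs/product/"
    "Polar%20Loop%20Activity%20Tracker HTTP/1.1\" 200 1026 \"-\" "
    "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\""
)

if __name__ == "__main__":
    fields = parse_log_line(SAMPLE)
    print(fields["client_ip"], fields["method"], fields["status"], fields["size"])
```

Each named group would correspond to a column in the destination table; a field added to the log format later is the kind of change the data drift scenario is about.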

Click through for the demonstration.


Investigating Azure Data Explorer

James Serra digs into how you can use Azure Data Explorer:

Azure Data Explorer (ADX) was announced as generally available on Feb 7th. In short, ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices. ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (currently up to 3 nodes) and queries across a billion records take less than a second. A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.

It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform. It can not only work on numbers but also does full-text search on semi-structured or unstructured data. One of my favorite demos was watching a query over 6 trillion log records, counting the number of critical errors by doing a full-text search for the word ‘alert’ in the event text, that took just 2.7 seconds. Because of this speed, ADX can be a replacement for search and log analytics engines such as Elasticsearch or Splunk. One way I heard it described that I liked was to think of it as an optimized cache on top of a data lake.

Click through for James’s explanation and where you might want to use ADX.


Building a VPC with AWS

Priyaj Kumar takes us through the process of building a Virtual Private Cloud in AWS:

AWS provides a lot of services, and these services are sufficient to run your architecture. The backbone for the security of this architecture is VPC (Virtual Private Cloud). VPC is basically a private cloud in the AWS environment that helps you to use all the services offered by AWS in your defined private space. You have control over the virtual network and you can also restrict the incoming traffic using security groups.

Overall, VPC helps you to secure your environment and gives you complete authority over incoming traffic. There are two types of VPCs: the Default VPC, which is created by Amazon by default, and the Non-Default VPC, which is created by you to meet your security needs.

Now that you have an idea of how VPC works, I will take you through the different services offered by Amazon VPC.

Read on to see how to set one up.


Securely Accessing External Resources From Databricks AWS

Itai Weiss shows how you can securely hit external data sources when using Databricks for AWS:

For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer’s account. To run their data workloads, customers need secure connectivity between the Databricks Spark clusters and the above data sources.

It is straightforward for Databricks clusters located within the Databricks VPC to access data from AWS S3, which is not a VPC-specific service. However, we need a different solution to access data from sources deployed in other VPCs, such as AWS Redshift, RDS databases, and streaming data from Kinesis or Kafka. This blog will walk you through some of the options you have available to access data from these sources securely and their cost considerations for deployments on AWS. In order to establish a secure connection to these data sources, we will have to configure the Databricks VPC with one of the following two available options:

Read on for those two options.
