Press "Enter" to skip to content

Category: Cloud

S3 And HDFS Data Migration

Ilya Yalovyy looks at S3DistCp, which allows you efficiently to migrate data back and forth between HDFS and S3:

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:

  • none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.

  • keep” – Don’t change the compression of the files but copy them as-is.

This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.

Comments closed

Jupyter And Kubernetes

David Crook shows how to use Jupyter notebooks inside Kubernetes:

We start with a 16.04 image, we run some upgrades, install python, upgrade pip, install our requirements and expose port 8888 (jupyter’s default port).

Here is our requirements.txt file

1
2
3
4
5
6
7
8
9
numpy
pandas
scipy
jupyter
azure_common
azure-storage
scikit-learn
nltk
plotly

Notice how Jupyter is in there, I also added a few other things that I very commonly use including numpy, pandas, plotly, scikit-learn and some azure stuff.

The big benefit to doing this is that your installation of Jupyter can exist independently from your notebooks, so if you accidentally mess up Jupyter, you kill and reload from the image in a couple commands.

Comments closed

Using Hive As A Power BI Data Source

Ust Oldfield shows how to use Hive via Azure HDInsight as a data source for Power BI:

As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note HDFS in brackets beside the Azure HDInsight and Hadoop File options as this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures.

But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in PowerBI.

Read the whole thing.  Connecting to Hive is pretty easy.

Comments closed

Automating Azure SQL DB Maintenance

Tim Radney shows several methods for performing automated Azure SQL Database maintenance, including runbooks:

Once you create your account, you can then start creating runbooks. You can do just about anything with the runbooks. There are numerous existing run books that you can browse through and modify for your own use, including provisioning, monitoring, life cycle management, and more.

You can create the runbooks offline, or using the Azure Portal, and they’re built using PowerShell. In this example, we will reuse the code from the PowerShell demo and also demonstrate how we can use the built in Azure Service scheduler to run our existing PowerShell code and not have to rely on an on-premises scheduler, task scheduler, or Azure VM to schedule a job.

Read the whole thing if you have Azure SQL Database instances in your environment.

Comments closed

Cross-Database Queries With Azure SQL DB

Dustin Ryan shows how to set up cross-database queries within Azure SQL Database:

2. Vertical queries (in preview): A vertical elastic query is a query that is executed across databases that contain different schemas and different data sets. An elastic query can be executed across any two Azure SQL Database instances. This is actually really easy to set up and that what this blog post is about! The diagram below represents a query being issued against tables that exist in separate Azure SQL Database instances that contain different schemas.

Read on to learn how to implement vertical elastic queries today.

Comments closed

Making Calls With IoT Hub

Rolf Tesmer combines Azure IoT Hub with Twilio to make phone calls based on incoming messages:

When the IoT Hub is created you will get an endpoint hosted in Azure.  This is the target for the JSON events being generated from the mobile device.

Azure IoT Hubs are more complex than an Azure Event Hub, perform a lot more device based functions and also have stronger security capability.  However, operationally they work pretty much the same.  

If you want to learn more about the differences between the two Hubs then this is a great article – https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-compare-event-hubs

It’s a neat tutorial for a fun weekend project.

Comments closed

Azure AD On Azure SQL DB

Arun Sirpal shows how to set up Azure SQL Database to use Azure Active Directory accounts:

I think it is important to highlight a couple of points, more specifically around the requirement of ADALSQL.DLL and proper setup of AD which I will highlight below and reference some links, please do this as it lays the foundation for you.

ADALSQL.DLL

You need ADALSQL.DLL which is part of the latest SQL Server Management Studio (SSMS) to test access. This stands for Active Directory Authentication Library for SQL Server.

This goes through some of the issues Arun had setting everything up and provides workarounds and explanations.

Comments closed

Presto On HDInsight

Ashish Thapliyal shows how to install Presto on an HDInsight cluster:

What is Presto?

Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto is becoming popular SQL interactive query engine that has grabbed the attention and mind-share in Big data communities.

What are the key advantages of Presto?

1- It’s very fast – Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses.

2- Presto can query data where it lives – Presto supports many data sources via the number of connectors that community has built. You can query HDFS , Hive, Azure Storage or data stored in SQL Server , My SQL , CosmosDB or Cassandra etc.

You can install Presto in one simple step with HDInsight Script Action feature

Read on for instructions and showing how to connect this to other Azure products like CosmosDB and Azure SQL Database.

Comments closed

SQL In Spaaaaaaacccce!

Drew Furgiuele knows that in space, no-one can hear your Sev 18 alerts:

Over the last few months, I’ve had a new itch: I wanted to get into the world of high altitude ballooning. The concept is pretty simple: get a balloon and some helium, tie it to a payload, and let it go. The balloon travels a certain height and distance, then bursts, and your payload falls back to earth. That in itself is pretty interesting to me, and it’s not prohibitively expensive: students have done it for a couple hundred dollars. For a few dollars more, you can put a camera on it and take pictures as it travels.

The thing is, I wanted to do more than that. The maker in me wanted to do something special, something no one (to my knowledge) has done before. I not only wanted to launch a balloon and a camera, I wanted to put SQL Server up there, too. So that’s why we’re announcing the High Altitude SQL Server Project (HASSP).

I love it.

Comments closed