Now you can start poking around and seeing what’s in the Dashboard. Since I opted to not put any handles in for analysis of FROM and TO, the first two tabs in the workbook (Outbound Tweets and Inbound Tweets) will not have any information; this is normal.
But then we get to tab #3 – Author Hashtag Graph. The gray dots are hashtags and the green dots are accounts that have tweeted. You can see that I made a tweet that had 2 hashtags – #osmf2017 and #mvpbuzz. And boy was @TexasMusicDude busy tweeting up a storm – and using lots of other hashtags in conjunction with his tweets. Other hashtags that were popular appear to be #CampGround, #ShinyRibs, #TexasMusic, #DreamFolk and #Strings. Along the bottom you can see the day/timeline and the quantity of tweets at what time of day. If you click on any of the nodes, the information about what time the tweet(s) took place is highlighted in the timeline. It’s very interactive.
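The author-hashtag graph described above is essentially a bipartite graph linking accounts to the hashtags they used. As a minimal sketch of that idea (not the dashboard's actual implementation), the Python below builds such a mapping from a handful of tweets. The sample tweets, the account names and the function name `author_hashtag_edges` are all hypothetical and only for illustration.

```python
import re
from collections import defaultdict

# Matches a '#' followed by word characters, e.g. "#osmf2017"
HASHTAG = re.compile(r"#(\w+)")

def author_hashtag_edges(tweets):
    """Map each author to the set of (lowercased) hashtags they used."""
    graph = defaultdict(set)
    for author, text in tweets:
        for tag in HASHTAG.findall(text):
            graph[author].add(tag.lower())
    return dict(graph)

# Hypothetical sample data, not taken from the real dashboard
tweets = [
    ("@author_a", "Heading to the festival #osmf2017 #mvpbuzz"),
    ("@author_b", "Great set tonight #TexasMusic #osmf2017"),
    ("@author_b", "More picking #Strings"),
]
graph = author_hashtag_edges(tweets)
```

Each author key corresponds to a green dot, each distinct hashtag to a gray dot, and each author-hashtag pair to a line between them.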
It does require an Azure subscription, but it looks very useful as a model for an advanced set of dashboards as well as a campaign management tool.
On a recent project I used Azure Data Factory (ADF) to retrieve data from an on-premises SQL Server 2014 instance and land it in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the DMG for development on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production, and made sure the settings for development and production were the same. Then we set up a separate server for the production DMG.
Deployment and execution went well in the dev environment. Testing was completed, so we deployed to prod. Deployment went fine, but the pipelines failed execution and returned the following error on the output data sets.
Weird solution, but I’m going to guess that it makes perfect sense if you are able to look at the code.
Each record that is inserted goes onto the next available distribution. This guarantees that you will have a smooth, even distribution of data, but it means you have no way of telling which data is on which distribution. This isn’t always a problem!
If I want to perform a count of records, grouped by a particular field, I can do this on a round-robin table. Each distribution will run the query in parallel and return its grouped results. The results can simply be added together as a second part of the query, and adding together 60 smaller datasets shouldn’t be a large overhead. For this kind of single-table aggregation, round-robin distribution is perfectly adequate!
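To make the two-step aggregation concrete, here is a toy simulation in Python. It is only a sketch of the idea, not how the engine is implemented: rows are dealt out to 60 lists in turn, each list produces its own grouped count, and the partial counts are then added together. The table contents and the function names are made up for the example.

```python
from collections import Counter
from itertools import cycle

NUM_DISTRIBUTIONS = 60

def round_robin(rows, n=NUM_DISTRIBUTIONS):
    """Place each row on the next distribution in turn."""
    distributions = [[] for _ in range(n)]
    for slot, row in zip(cycle(range(n)), rows):
        distributions[slot].append(row)
    return distributions

def grouped_count(distributions, key):
    """Count per group on each distribution, then add the partial results."""
    # Step 1: each distribution counts its own rows (parallel in a real system)
    partials = [Counter(row[key] for row in dist) for dist in distributions]
    # Step 2: combine the small partial results into the final answer
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical table: 600 rows with a 'colour' column
rows = [{"id": i, "colour": ("red", "green", "blue")[i % 3]} for i in range(600)]
dists = round_robin(rows)
counts = grouped_count(dists, "colour")
```

Every distribution ends up with the same number of rows, and the combined count does not depend on which row landed where, which is why round-robin is adequate for this kind of query.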
However, the issues arise when we have multiple tables in our query. To see what joining two tables involves, let’s take a very simple join between a fact table and a dimension. I’ve shown 6 distributions for simplicity, but this would be happening across all 60.
Figuring out which distribution key to use can make a huge difference in performance.
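A rough way to see why the distribution key matters for joins is to simulate where matching rows end up. The sketch below is a toy model in Python, not the engine's real behavior: `zlib.crc32` stands in for whatever hash function the service actually uses, and the table sizes and key values are invented. It compares how often a fact row sits on the same distribution as its matching dimension row under round-robin placement versus hashing both tables on the join key.

```python
import random
import zlib

NUM_DISTRIBUTIONS = 60

def round_robin_slots(count, n=NUM_DISTRIBUTIONS):
    """Slot for each row when rows are dealt out in insertion order."""
    return [i % n for i in range(count)]

def hash_slots(keys, n=NUM_DISTRIBUTIONS):
    """Slot for each row when placed by a hash of its key (crc32 is a stand-in)."""
    return [zlib.crc32(str(k).encode()) % n for k in keys]

def colocated_fraction(fact_keys, fact_slots, dim_slot_by_key):
    """Share of fact rows stored on the same distribution as their dimension row."""
    hits = sum(1 for k, s in zip(fact_keys, fact_slots) if dim_slot_by_key[k] == s)
    return hits / len(fact_keys)

rng = random.Random(0)  # fixed seed so the run is repeatable
dim_keys = list(range(1000))                              # hypothetical dimension keys
fact_keys = [rng.choice(dim_keys) for _ in range(6000)]   # hypothetical fact rows

# Round-robin: placement ignores the key entirely
rr_dim = dict(zip(dim_keys, round_robin_slots(len(dim_keys))))
rr_fraction = colocated_fraction(fact_keys, round_robin_slots(len(fact_keys)), rr_dim)

# Hash on the join key: equal keys always land on the same distribution
hash_dim = dict(zip(dim_keys, hash_slots(dim_keys)))
hash_fraction = colocated_fraction(fact_keys, hash_slots(fact_keys), hash_dim)
```

With round-robin, only a small share of fact rows happen to share a distribution with their dimension row, so the rest would have to be moved before the join. Hashing both tables on the join key puts every matching pair together.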
Recently a client found this article on “Best Practices for SQL Server in Azure Virtual Machines” and wanted to re-provision his volumes to adhere to them.
Now, my first thought was: wait, I’m a DBA, not a System Admin; that’s not my role! But thinking more about it, I realized that the client views this as a SQL Server issue, that I am the SQL Server Consultant, and that it is my job to remedy this problem.
Not being 100% confident in Azure, I spun up a SQL Server VM and attempted to add some volumes. To my surprise, this was way too easy.
Click through for the steps.
All communication with Azure Storage via connection strings and BLOB URLs enforces the use of HTTPS, which provides Encryption in Transit. You can enforce the use of “Always HTTPS” by setting the connection string like this: “DefaultEndpointsProtocol=https;AccountName=myblob1…” or in SAS signatures, as in the example below:
To protect data at rest, the service provides an option to encrypt the data as it is stored in the account. There’s no additional cost associated with encrypting the data at rest and it’s a good idea to switch it on as soon as the account is created. There is a one-click setting at the Storage Account level to enable it, and the encryption is applied on both new and existing storage accounts. The data is encrypted with the AES-256 cipher and it’s now generally available to all Azure regions and Azure clouds (public, government etc.).
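As a rough illustration of the two settings mentioned above, the Python sketch below checks whether a connection string asks for HTTPS and whether a SAS URL carries the signed-protocol field `spr=https`. It uses only the standard library and does not call Azure. The account name, key, signature and helper function names are placeholders invented for this example, not values from the original article.

```python
from urllib.parse import parse_qs, urlsplit

def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' pairs; values may themselves contain '='."""
    parts = (p for p in conn_str.split(";") if "=" in p)
    return dict(p.split("=", 1) for p in parts)

def connection_uses_https(conn_str):
    """True when DefaultEndpointsProtocol is set to https."""
    settings = parse_connection_string(conn_str)
    return settings.get("DefaultEndpointsProtocol", "").lower() == "https"

def sas_requires_https(sas_url):
    """True when the SAS token's signed protocol (spr) allows HTTPS only."""
    query = parse_qs(urlsplit(sas_url).query)
    return query.get("spr") == ["https"]

# Placeholder values for illustration only
conn_str = "DefaultEndpointsProtocol=https;AccountName=exampleaccount;AccountKey=PLACEHOLDER=="
sas_url = ("https://exampleaccount.blob.core.windows.net/container/blob.txt"
           "?sv=2017-04-17&spr=https&sig=PLACEHOLDER")

https_conn = connection_uses_https(conn_str)
https_sas = sas_requires_https(sas_url)
```

A check like this could run before a connection string or SAS URL is handed to client code, so that anything permitting plain HTTP is rejected early.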
There’s some good information here, making it worth the read.
The Azure Data Lake (ADL) vision from the beginning has been to transform business data into intelligence by providing analytics on any data at cloud scale. ADL enterprise customers gain insights on their business data using a wide range of tools and platforms. Today’s release of Cloudera Enterprise 5.11 brings another very valuable and widely-used Hadoop computation platform to the set of platforms that can leverage ADLS. No matter what big data analytics platform you choose, Azure Data Lake Store provides a single high throughput enterprise-scale hierarchical file system data lake repository for big data.
Anyone with an Azure subscription can now deploy Cloudera clusters with ADLS. To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, see here for more information on how to set up your Cloudera cluster with ADLS today!
That’s an interesting development.
Classify text using BigDL
In this tutorial, we demonstrate how to solve a text classification problem based on the example found here. This example uses a convolutional neural network to classify posts in the 20 Newsgroup dataset into 20 categories.
We’ve provided a companion Jupyter notebook example on GitHub that you can open in the Jupyter dashboard to execute the code sections.
There’s a lot to this tutorial.
We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter them if desired before the provisioning begins.
The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted; however, it would be really nice to have that as a selection to make sure it’s not just luck.
It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.
Power BI Privacy Levels “specify an isolation level that defines the degree that one data source will be isolated from other data sources”. After working through some testing scenarios and trying to discover the real impact to data security, I was unable to effectively show how this might have any bearing on data security in Power BI. During one test, I was shown a warning about using data from a website with data I had marked Organizational and Private. In all cases, I was able to merge the data in the query and in the relationships with no warning or filtering. All of the documentation makes the same statement, and most bloggers are restating what is found in the Power BI documentation and so were not helpful. My takeaway after reviewing this for a significant amount of time is to not consider these settings when evaluating data security in Power BI. I welcome comments or additional references which demonstrate how this isolation actually works in practice. In most cases, we are using organizational data within our Power BI solutions and will not be impacted by this setting, and may find improved performance when disabling it.
As Steve notes, this is not really a security feature. Instead, it’s intended to be more of a warning to users about which data is confidential and which is publicly sharable.
H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.
The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed when the Spark application/Notebook is running.
You can click the available link in the Jupyter Notebook, or you can directly access this URL:
Setup is pretty easy.