Press "Enter" to skip to content

Month: December 2016

Azure Management Using R

Alan Weaver introduces AzureSMR:

The AzureSMR functions currently address the following Azure services:

  • Azure Blob: List, Read and Write to Blob Services

  • Azure Resources: List, Create and Delete Azure Resources. Deploy ARM templates.

  • Azure VM: List, Start and Stop Azure VMs

  • Azure HDI: List and Scale Azure HDInsight Clusters

  • Azure Hive: Run Hive queries against an HDInsight Cluster

  • Azure Spark: List and create Spark jobs/sessions against an HDInsight Cluster (Livy)

This can be useful when, for example, you need to ramp up a Spark cluster before running a particularly compute-intensive process.

Linear Gauge Custom Visual

Devin Knight shows off the linear gauge custom visual in Power BI:

In this module you will learn how to use the Linear Gauge Power BI Custom Visual. The Linear Gauge is often used to visualize a KPI. It gives you the ability to compare an actual value against a target, as well as showing up to two trend lines.

This can be a very useful visual. The tricky part is that the bars aren’t all on the same scale, so when your eyes try to compare bar lengths, it can get a little confusing.

Windows 10 IoT Code To Back Up Databases

Drew Furgiuele writes code to back up your databases using a Raspberry Pi 3 and Windows 10 IoT edition:

The trickiest part of wiring a circuit like this is detecting a button press. Most logic boards don’t know if an input circuit should poll at high or low levels. That’s where pull-ups come in. Above, you can see we set one of the pins for the button to be a pull-up (or an input if we were using another board). That means it will pull the current and look for impedance. The other important thing is our debounce. With circuits, one button press can actually turn into lots because as soon as the switch completes (or interrupts) the circuit, it starts sending signals. A debounce is like a referee saying “only look for a signal for this long” and it will filter out extra “presses” based on current that might linger on a press.

Once we detect our button press, we’re calling the function below. All it does is read the current LED pin values, look to see which one is currently lit, and then light the next one.

Go from understanding general-purpose input/output (GPIO) pins to calling SMO via a web service, all in one post. If you’ve got an itch for a weekend project, have at it.

Understanding Data Gateways

James Serra walks us through the different data gateways available in Azure:

On-premises data gateway: Formerly called the enterprise version.  Multiple users can share and reuse a gateway in this mode.  This gateway can be used by Power BI, PowerApps, Microsoft Flow or Azure Logic Apps.  For Power BI, this includes support for both scheduled refresh and DirectQuery.  To add a data source such as SQL Server that can be used by the gateway, check out Manage your data source – SQL Server.  To connect the gateway to your Power BI, you will sign in to Power BI after you install it (see On-premises data gateway in-depth).

Click through for more details on additional gateways.

Pivoting Data

Jana Sattainathan explains the PIVOT operator:

The results are so much easier to look at and comprehend, aren’t they? All object types for a schema are on a single line and it is easy for us to do impact analysis visually.
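
As a minimal sketch of that kind of query (Jana’s exact source isn’t reproduced here, so sys.objects stands in), here is a PIVOT that puts each schema’s object-type counts on a single row:

    -- Count objects per schema: one row per schema, one column per type
    SELECT SchemaName, [USER_TABLE], [VIEW], [SQL_STORED_PROCEDURE]
    FROM
    (
        SELECT s.name AS SchemaName, o.type_desc, o.object_id
        FROM sys.objects o
        INNER JOIN sys.schemas s
            ON s.schema_id = o.schema_id
    ) src
    PIVOT
    (
        COUNT(object_id)
        FOR type_desc IN ([USER_TABLE], [VIEW], [SQL_STORED_PROCEDURE])
    ) p
    ORDER BY SchemaName;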

Sometimes doing it in T-SQL is the best approach, but pivoting is generally cheaper in the application tier, whether you’re building a report, a dashboard, or a web app.

Checking Last CHECKDB Date Using DBCC PAGE

Wayne Sheffield shows how to get the last time DBCC CHECKDB ran on each database:

The “trick” to making this work is to encapsulate the DBCC command as a string, and to call it with the EXECUTE () function. This is used as part of an INSERT INTO / EXECUTE statement, so that the results from DBCC PAGE are inserted into a table (in this case a temporary table is used, although a table variable or permanent table can also be used). There are three simple steps to this process:

  1. Create a table (permanent / temporary) or table variable to hold the output.

  2. Insert into this table the results of the DBCC PAGE statement by using INSERT INTO / EXECUTE.

  3. Select the data that you are looking for from the table.
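
As a minimal sketch of that pattern (not Wayne’s exact code): the boot page is page 9 of file 1, and its dbi_dbccLastKnownGood field holds the last clean CHECKDB date.

    -- Capture DBCC PAGE output for a database's boot page (file 1, page 9)
    CREATE TABLE #BootPage
    (
        ParentObject VARCHAR(255),
        [Object] VARCHAR(255),
        Field VARCHAR(255),
        Value VARCHAR(255)
    );

    INSERT INTO #BootPage
    EXECUTE ('DBCC PAGE ([YourDatabase], 1, 9, 3) WITH TABLERESULTS;');  -- placeholder database name

    -- dbi_dbccLastKnownGood is the date of the last clean DBCC CHECKDB
    SELECT Value AS LastKnownGoodCheckDb
    FROM #BootPage
    WHERE Field = 'dbi_dbccLastKnownGood';

    DROP TABLE #BootPage;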

Read on for his code as well as important caveats.

Spark Versus Flink

Sibanjan Das compares Apache Flink to Apache Spark:

The core concept of Apache Flink is a high-throughput, low-latency stream processing framework that also supports batch processing. The architecture is a flip of other Big Data processing architectures, where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low-latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospects for Apache Flink seem significant, and it looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams.  We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.

Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and constraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.
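
For reference, the knapsack problem in mathematical form makes the “integer” part concrete: given values $v_i$, weights $w_i$, a weight limit $W$, and binary decision variables $x_i$ (take object $i$ or leave it),

\[
\max \sum_i v_i x_i \quad \text{subject to} \quad \sum_i w_i x_i \le W, \qquad x_i \in \{0, 1\}.
\]

It is the $x_i \in \{0, 1\}$ constraint, rather than allowing fractional $x_i$, that turns an easy linear program into a much harder mixed integer program.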

Read the whole thing.

Understanding HDFS Disk Checks

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of the block scanner is to scan block data to detect possible corruption. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, the block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in a while, at a relatively low frequency, without significant I/O usage.
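
Two hdfs-site.xml properties govern that trade-off (names and defaults as I recall them from recent Hadoop releases; verify against your version):

    <!-- hdfs-site.xml: block scanner knobs -->
    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <!-- rescan each block roughly every three weeks (504 hours) -->
      <value>504</value>
    </property>
    <property>
      <name>dfs.block.scanner.volume.bytes.per.second</name>
      <!-- throttle scanning to about 1 MB/s per volume -->
      <value>1048576</value>
    </property>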

This is a nice article for operations folks who own Hadoop clusters.

Using Polybase To Insert Into HDFS

I have a post on writing to HDFS using Polybase:

What’s interesting is that the error message itself is correct but could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Sometime in the future, I plan to see how it scales:  with multiple files writing to a multi-node Hadoop cluster, do I get better write performance with a Polybase scaleout cluster?  And if so, how close to linear scale can I get?
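
For reference, a minimal sketch of the export pattern (the data source, file format, and object names here are hypothetical and must already exist on your instance):

    -- Polybase export to Hadoop is off by default
    EXEC sp_configure 'allow polybase export', 1;
    RECONFIGURE;
    GO

    -- LOCATION is a directory, not a single file; Polybase writes
    -- one or more files underneath it
    CREATE EXTERNAL TABLE dbo.SalesStaging
    (
        SaleID INT,
        SaleAmount DECIMAL(10, 2)
    )
    WITH
    (
        LOCATION = '/staging/sales/',    -- hypothetical HDFS folder
        DATA_SOURCE = HadoopCluster,     -- hypothetical external data source
        FILE_FORMAT = CsvFileFormat      -- hypothetical delimited file format
    );
    GO

    INSERT INTO dbo.SalesStaging
    SELECT SaleID, SaleAmount
    FROM dbo.Sales;                      -- hypothetical local table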
