Press "Enter" to skip to content

Curated SQL Posts

Polybase Row Size Limits

Manoj Pandey notes that Polybase has a row size limit:

With the error description its quiet evident that the External tables does not support row size more than 32768 bytes. But still I take a look online and found in Azure Documentation that this is a limitation right now with Polybase. The Azure document mentions:

Wide rows support is not supported yet, “If you are using Polybase to load your tables, define your tables so that the maximum possible row size, including the full length of variable length columns, does not exceed 32,767 bytes. While you can define a row with variable length data that can exceed this figure, and load rows with BCP, you will not be be able to use Polybase to load this data quite yet. Polybase support for wide rows will be added soon. Also, try to limit the size of your variable length columns for even better throughput for running queries.”

You can still use varchar(max) and nvarchar(max) for data types (unlike the Hive provider, which has a strict limit of 8000 characters for a single column) but can’t break that 32K mark.

Comments closed

Power BI + Interactive R Charts

Leila Etaati shows us how to integrate R charts in the Power BI experience:

In previous videos you’ve learned that we can demonstrate R visualization in Power BI, In this video you will learn how R visualization is working interactively with other elements in Power BI report. In fact Power BI works with R charts as a regular visualization and highlighting and selecting items in other elements of report will effect on that. Here is a quick video about this functionality

Check out the five-minute video.

Comments closed

Finding Malicious Domains

Rafael San Miguel Carrasco uses dimensionality reduction to figure out if a domain is malicious:

Dimensionality reduction is a common techique to visualize observations in a dataset, by combining all features into two, that can then be used to draw the observation in an scatter plot.

One popular algorithm that implements this technique is PCA (Principal Components Analysis), which is available in R through the prcomp() function.

The algorithm was applied to observations of sthe dataset, and ggplot2’s geom_point() function was used to draw the results in a 2D chart.

I would want to see this done for a couple hundred thousand domains, but I do like the idea of taking advantage of statistical modeling tools to find security threats.

Comments closed

XML Includes Tabs And Spaces

Sander Stad ran into an error creating a Biml script:

Apparently SSIS doesn’t agree with my code. So opening the editor of the raw file connection, changing the access mode to “File name” showed me this:

There are spaces and tabs in front of the path! SSIS doesn’t work well with spaces and that’s one of the reasons why you should not use spaces in file names in the first place.

This is one of the trickier bits of XML-based languages (like Biml):  spacing inside tags can matter…sometimes…

Comments closed

MapR Goes Spark-First

MapR has introduced a new version of their platform which is based on Spark:

With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode either programmatically (using Scala, Java, Python, or R) or with procedural SQL (using Spark SQL or Hive QL).

With MapR converging the data management platform, you can now take a preferential Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark as part of your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).

MapReduce is so 2012…

Comments closed

HDInsight Tool For IntelliJ

Xiaoyong Zhu introduces the new HDInsight Tool for IntelliJ:

This tools extends IntelliJ to support Spark job life cycle from create, author, debug and submit job to Azure cluster and view results.  This IntelliJ HDInsight tool integrates well with Azure to allow user navigate HDInsight Spark clusters and view associated Azure storage account. To further boost productivity, the IntelliJ HDInsight tool also offers the capability to view Spark job history, display detailed job logs, and the job output to boost developer productivity. A few usability improvements have been implemented upon user preview feedback, which includes auto locate artifact, add intelligence to remember assembly location, caches spark logs, etc.

It looks like this is specifically designed for Spark-enabled clusters.

Comments closed

Machine Learning Packages In R

Khushbu Shah discusses good R packages to help with your machine learning projects:

If missing values are something which haunts you then MICE package is the real friend of yours.

When we face an issue of missing values we generally go ahead with basic imputations such as replacing with 0, replacing with mean, replacing with mode etc. but each of these methods are not versatile and could result into a possible data discrepancy.

MICE package helps you to impute missing values by using multiple techniques, depending on the kind of data you are working with.

I’d heard of a couple of these, but most of them are new to me.

Comments closed