
Day: April 13, 2020

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

Dask is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as NumPy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.

Dask uses Python natively, distinguishing it from Spark, which is written in Scala and runs on the JVM, and so has the overhead of running JVMs and context switching between Python and the JVM. It is also much harder to debug Spark errors than to read a Python stack trace that comes from Dask.

We will run XGBoost on Dask to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Comments closed

Saving Graphics in R Across Multiple OSes

Colin Gillespie takes us through exporting graphics in R and some of the cross-platform foibles you’ll find:

One of R’s outstanding features is that it is cross platform. You write R code and it magically works under Linux, Windows and Mac. Indeed, the above code “runs” under all three operating systems. But does it produce the same graphic under each platform? Spoiler! None of the above functions produce identical output across OS’s. So for “same”, I’m going to take a lax view and I just want figures that look the same.

Read on to understand the differences and hopefully limit confusion around them.


Migrating to Azure with SQL Server Management Studio

Magi Naumova walks us through some options for migrating on-prem instances to Azure, all of which are available in SQL Server Management Studio:

Cases of migrating databases to Azure are becoming more common every day. Azure SQL Database is the flagship PaaS service Microsoft provides for hosting a relational database. But even though it is the same engine, there are still many features that are not supported or have limited functionality in Azure SQL DB compared to on-premises SQL Server versions. For example, cross-database references are possible in on-premises SQL Server databases but are not supported in Azure SQL Database.

If we could check in advance and plan our migration based on those checks, it would save time and effort. This is what the new Migrate to Azure features in SSMS are built for.

Click through for the options, some of which are simply informational and some of which actually do the work.


Power BI & Disabling Export to Excel

Marc Lelijveld explains why you might not want to let users export to Excel:

Export to Excel is a feature which has been available in Power BI for a very long time. It allows report users to export the data from a specific visual in the report to an editable Excel file. After exporting, they can do whatever they want: for example, sending the data to others via mail, transforming or manipulating the data, building new reports based on the Excel file, and many other things. The export option can be used by clicking the ellipsis at the top right of a visual (if the visual header is enabled).

If you have all export functionalities enabled, users can export both underlying data and summarized data. The difference is mainly between the raw data and only the data as visible in the chart where you clicked the export button.

Read on to understand why this might not be an unalloyed good.


Hyperthreading and VMs

David Klee shares some thoughts on hyperthreading in virtual environments:

I recommend leaving the hyper-threaded logical cores enabled in the host BIOS, but not depending on them for performance gains. Hyperthreaded CPU cores, or logical cores, should not be factored into CPU overcommitment ratios as if they are full processor cores.

Every task that is triggered inside a virtual machine must be scheduled to run on a physical compute resource. These scheduled tasks must be placed into a scheduling queue inside the hypervisor layer before they get their time on the physical compute resource. If the hypervisor is overloaded, or if the vCPU scheduling queues are imbalanced from an incorrect vCPU configuration, these queues can grow, and vCPU performance can suffer.

Click through for an explanation of hyperthreading and David’s guidance on the topic.
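
As a rough illustration of the overcommitment guidance above, the following Python sketch computes a vCPU-to-core ratio against physical cores rather than logical ones. The function name, the host size, and the vCPU count are all made-up assumptions for the example; none of them come from David's post.

```python
def overcommit_ratio(total_vcpus: int, sockets: int, cores_per_socket: int) -> float:
    """vCPUs allocated per physical core; hyperthreaded logical cores are deliberately not counted."""
    physical_cores = sockets * cores_per_socket
    return total_vcpus / physical_cores


# Hypothetical host: 2 sockets x 16 cores, hyperthreading on (64 logical cores).
# Hypothetical workload: 96 vCPUs allocated across all VMs on that host.
ratio_physical = overcommit_ratio(96, sockets=2, cores_per_socket=16)
ratio_logical = 96 / (2 * 16 * 2)  # the flattering figure you get by counting logical cores

print(f"vCPU : physical core = {ratio_physical:.1f} : 1")
print(f"vCPU : logical core  = {ratio_logical:.1f} : 1")
```

Counting logical cores halves the apparent ratio in this example, which is the sort of optimistic figure the quoted advice warns against planning around.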


Power BI Warning Regarding “Store datasets in enhanced metadata format”

Imke Feldmann does not recommend turning on the “Store datasets in enhanced metadata format” setting in Power BI all willy-nilly:

Background

With the March release came the function “Store datasets in enhanced metadata format”. With this feature turned on, Power BI data models will be stored in the same format as Analysis Services Tabular models. This means that they inherit the same amazing options that this open-platform connectivity enables.

Limitations and their consequences

But with the current setup, you could end up with a non-working file, many parts of which you would have to rebuild from scratch. So make sure to fully read the documentation. Now!

Read on to see what has Imke concerned.


PowerShell 7 Pipeline Chain Operators

Patrick Gruenauer shows off a pair of new operators in PowerShell 7:

With PowerShell 7, new operators were introduced. We call them chain operators. Chain operators enable you to do something after doing something. They use the $? and $LASTEXITCODE variables to determine whether a command on the left-hand side of the pipeline failed or succeeded.

Let’s cover this topic by demonstrating some examples to fully understand the new pipeline technology.

This is definitely Bash-inspired and I’m happy they made this move.
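
For readers who have not used the Bash equivalents, here is a minimal Python sketch of the semantics of the two operators (`&&` and `||`): the right-hand command runs only if the left-hand one succeeded, or only if it failed. It is not PowerShell code and not from Patrick's post. The helper names `chain_and` and `chain_or` and the stand-in commands are hypothetical, and it treats exit code 0 as success.

```python
import subprocess
import sys


def chain_and(left, right):
    """Mimic `left && right`: run right only if left exited with code 0."""
    codes = [subprocess.run(left).returncode]
    if codes[0] == 0:
        codes.append(subprocess.run(right).returncode)
    return codes  # exit codes of the commands that actually ran


def chain_or(left, right):
    """Mimic `left || right`: run right only if left exited with a non-zero code."""
    codes = [subprocess.run(left).returncode]
    if codes[0] != 0:
        codes.append(subprocess.run(right).returncode)
    return codes


# Hypothetical stand-in commands: Python one-liners that exit with a chosen code.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]

print(chain_and(ok, fail))   # both commands run
print(chain_and(fail, ok))   # right-hand command is skipped
print(chain_or(fail, ok))    # right-hand command runs as a fallback
```

The returned list shows which commands ran, so a one-element list means the right-hand side was skipped.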
