Author: Kevin Feasel

Tidy Simulation of Stochastic Processes in R

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length t is distributed according to Poisson(λt), for some rate parameter λ (for this puzzle, the rate is described as one per day, λ = 1).

How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().

Read on to learn about the Poisson distribution and Yule processes.
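
The connection between exponential waiting times and Poisson counts is easy to check numerically. The linked post works in R with rexp(); what follows is only a rough Python sketch of the same idea, not David's code. The function names simulate_poisson_process and mean_count are made up for illustration, and the rate of 1 with a 10-unit horizon is an arbitrary choice.

```python
import random


def simulate_poisson_process(rate, horizon, rng):
    """Return the event times in [0, horizon) of one simulated Poisson process.

    Hypothetical helper: each gap between events is drawn from an
    exponential distribution with the given rate.
    """
    times = []
    t = rng.expovariate(rate)  # waiting time until the first event
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)  # waiting time until the next event
    return times


def mean_count(rate, horizon, n_runs, seed=0):
    """Average number of events per run over n_runs independent simulations."""
    rng = random.Random(seed)
    total = sum(
        len(simulate_poisson_process(rate, horizon, rng)) for _ in range(n_runs)
    )
    return total / n_runs


if __name__ == "__main__":
    # With rate 1 and horizon 10, the average count should land near 10.
    print(mean_count(rate=1.0, horizon=10.0, n_runs=2000))
```

If the waiting times really are exponential with rate λ, the number of events in a window of length t should average out to about λt, which is what mean_count estimates.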

Changing the Graphics Device in RMarkdown Docs

Colin Gillespie shows us how to change PDF and PNG output settings within knitr:

In many workflows, function calls to graphic devices are not explicit. Instead, the call is made by another package, such as knitr.

When knitting an R Markdown document, the default graphics device when creating PDF documents is grDevices::pdf() and for HTML documents it’s grDevices::png(). As we demonstrated, these are the worst possible choices!

Click through to see what you can do about it.

Clarifying Nomenclature around Azure Synapse Analytics

James Serra clears a few things up:

I see a lot of confusion among many people on what features are available today in Azure Synapse Analytics (formerly called Azure SQL Data Warehouse) and what features are coming in the future. Below is a picture that I describe below that hopefully clears things up:

I tend to just say “Azure Synapse Analytics SQL Pools” for the product formerly known as Azure SQL Data Warehouse and save “Azure Synapse Analytics” to include Spark + hyperscale (James’s v3).

Why Unit Testing in the Database Is Tough

Rob Farley talks about a couple of reasons why database unit testing can be difficult to do:

Hamish wants to develop a conversation about unit testing within databases because he recognises that the lack of unit testing is a significant problem. It’s quite commonplace in the world of iterative code, of C#, Java, and those kinds of languages, but a lot less commonplace in the world of data. I’m going to look at two of the reasons why I think this is.

Read Rob’s thoughts in their entirety. I fully agree that we need to test, but I get wishy-washy on the topic of automated testing. The reason is that tooling is quite limited, and many of those limitations are inherent to the database platform itself. For the types of things you most need to test (like hefty stored procedures), the number of test cases spirals out of control quickly. And unlike functional or structured programming languages, T-SQL performance gets markedly worse as you modularize, which makes it difficult to get down to an easily testable block of code.

SQL Server Backup History

Dave Bland talks about a few useful tables in msdb:

How long a database takes to back up is something that over the years I have been asked to get. These requests come for different reasons: sometimes it could be to find out how much it has increased over time, sometimes it could be to see if the backup job is interfering with other jobs, and sometimes it isn’t about duration at all; it is more about showing the backups were completed. Over the years I have had a number of auditors ask for backup history.

In order to get this information we need to pull data from two tables in the MSDB database, backupset and backupmediafamily.

Read on to learn about these two tables and to get a sample query. On systems with a large number of databases and a DBA who loves frequent transaction log backups (like I do), these tables can get pretty big, so don’t forget to prune that data over time.

Incremental Refresh with Power BI

Chris Webb talks about a special use case for Power BI incremental refresh:

Power BI incremental refresh is a very powerful feature, and now that it’s available in Shared capacity (not just Premium), everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely how to get Power BI to keep the data that’s already in your dataset and add new data to it.

Click through for the details.

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

Dask is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as NumPy, pandas, and scikit-learn. Dask was initially released around 2014 and has since built a significant following and support.

Dask uses Python natively, distinguishing it from Spark, which is written in Scala and has the overhead of running JVMs and context switching between Python and the JVM. It is also much harder to debug Spark errors than to look at a Python stack trace that comes from Dask.

We will run XGBoost on Dask to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Saving Graphics in R Across Multiple OSes

Colin Gillespie takes us through exporting graphics in R and some of the cross-platform foibles you’ll find:

One of R’s outstanding features is that it is cross platform. You write R code and it magically works under Linux, Windows and Mac. Indeed, the code above “runs” under all three operating systems. But does it produce the same graphic under each platform? Spoiler! None of the above functions produce identical output across OSes. So for “same”, I’m going to take a lax view: I just want figures that look the same.

Read on to understand the differences and hopefully limit confusion around them.

Migrating to Azure with SQL Server Management Studio

Magi Naumova walks us through some options for migrating on-prem instances to Azure, all of which are available in SQL Server Management Studio:

More and more databases are being migrated to Azure every day. Azure SQL Database is the flagship PaaS service Microsoft provides for hosting a relational database. But even though it is the same engine, there are still many features that are not supported or have limited functionality in Azure SQL DB compared to on-premises SQL Server versions. For example, all cross-database references are possible in on-premises SQL Server databases but are not supported in Azure SQL Database.

If we could check in advance and plan our migration based on those checks, it would save time and effort. This is what the new Migrate to Azure features in SSMS are built for.

Click through for the options, some of which are simply informational and some of which actually do the work.

Power BI & Disabling Export to Excel

Marc Lelijveld explains why you might not want to let users export to Excel:

Export to Excel is a feature which has been available in Power BI for a very long time. It allows report users to export the data from a specific visual in the report to an editable Excel file. After exporting, they can do whatever they want: for example, sending the data to others via mail, transforming or manipulating the data, building new reports based on the Excel file, and many other things. The export option can be used by clicking the ellipsis at the top right of a visual (if the visual header is enabled).

If you have all export functionalities enabled, users can export both underlying data and summarized data. The difference is mainly between the raw data and only the data visible in the chart where you clicked the export button.

Read on to understand why this might not be an unalloyed good.
