If you’re brand-new to unit testing your R package, I’d recommend reading this chapter from Hadley Wickham’s book about R packages.
There’s an R package called RUnit for unit testing, but throughout this post we’ll mention resources around the testthat package, since it’s the one we use in our packages and arguably the most popular one.
testthat is great! Don’t hesitate to read its docs again if you started using it a while ago, since the latest major release added teardown() functions to run code before and after all tests, which is very handy.
To set up testing in an existing package, i.e. creating the test folder and adding testthat as a dependency, run usethis::use_testthat(). In our WIP pRojects package, we set up the tests directory for you so you don’t forget. Then, in any case, add new tests for each function. The testthis package might help make your testing workflow even smoother. In particular, test_this() “reloads the package and runs tests associated with the currently open R script file”, and there’s also a function for opening the test file associated with the current R script.
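Putting the pieces above together, here is a minimal sketch of what a testthat test looks like; both_na() is a made-up stand-in for one of your package’s functions, not something from the post.

```r
# One-time setup from the package root: creates tests/testthat/
# and adds testthat as a dependency
# usethis::use_testthat()

library(testthat)

# both_na() is a hypothetical example function: it counts the
# positions where both vectors are missing
both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}

# A file under tests/testthat/ would contain blocks like this
test_that("both_na counts positions missing in both vectors", {
  expect_equal(both_na(c(NA, 1, NA), c(NA, 2, 3)), 1)
  expect_identical(both_na(numeric(0), numeric(0)), 0L)
})
```

Each test_that() block groups related expectations under a description, which is what shows up in the test output when something fails.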
This is an area where I know I need to get better, and Maelle gives us a plethora of tooling for tests.
The Hadoop Distributed File System (HDFS) allows you both to federate storage across many computers and to distribute files redundantly across a cluster. HDFS is a key component of many storage clusters that possess more than a petabyte of capacity.
Each computer acting as a storage node in a cluster can contain one or more storage devices. Using several mechanical storage drives per node can store data more reliably than SSDs, keep the cost per gigabyte down, and go some way toward exhausting the SATA bus capacity of a given system.
Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many that interact with HDFS directly it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I’ll walk through a few of these offerings.
Read on for reviews of those offerings.
In this little proof-of-concept, we grab some of these flows, process them in Apache NiFi, and then store them in Apache Hive 3 tables for analytics. We should probably push the data to HBase for aggregates and Druid for time series. We will see as this expands.
There are also other data access options including the NiFi REST API and the NiFi Python APIs.
- Send a notification when NiFi starts, stops, or dies unexpectedly
- Two OOTB notifications
- Email notification service
- HTTP notification service
- It’s easy to write a custom notification service
- AmbariReportingTask (global, per process group)
- MonitorDiskUsage (FlowFile, content, provenance repositories)
Much of this is an overview of the tools and measures available.
As shown above, the variable agegp has 6 groups (e.g., 25-34, 35-44), each with different combinations of alcohol intake and smoking use. I think it would be interesting to transform this dataset from long to wide, creating a column for each age group that shows the respective cases. Let’s see what the dataset will look like.
dt %>% spread(agegp, ncases) %>% slice(1:5)
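For a self-contained sketch of that reshaping, the built-in esoph dataset has the same agegp/ncases columns the excerpt describes (I’m assuming dt is derived from it); note that tidyr now recommends pivot_wider() over spread() for new code.

```r
library(dplyr)
library(tidyr)

# esoph ships with base R: one row per agegp/alcgp/tobgp combination
dt <- esoph

# Drop ncontrols so alcgp + tobgp uniquely identify each row, then
# turn the six age groups into columns holding the ncases values
wide <- dt %>%
  select(agegp, alcgp, tobgp, ncases) %>%
  spread(agegp, ncases)

names(wide)  # alcgp, tobgp, then one column per age group
```

Combinations of alcgp and tobgp that are absent for a given age group come through as NA in the widened table.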
Click through for a few additional transformations.
Having recently worked through numerous issues with API data feeds and deployed report configurations, I’ve learned a few important best practices and caveats, at least for some common use cases. In one example, we have a client who exposes their software-as-a-service (SaaS) customer data through several web API endpoints. Each SaaS customer has a unique security key which they can use with Power BI, Power Query or Excel, and other tools to create reporting solutions. If we need a list of available products, it is a simple matter to create a long URL string consisting of the web address for the endpoint, the security key, and other parameters, and then just pass this to Power Query as a web data source. However, it’s not quite that easy for non-trivial reporting scenarios.
Thanks to Jamie Mikami from CSG Pro for helping me with the Azure function code for demonstrating this with demo data. Thanks also to Chris Webb who has meticulously covered several facets of API data sources in great detail on his blog, making this process much easier.
Click through for the instructions.
-- note that this creates a single column of values
-- which could be used in something like IN
-- for example
SELECT 1 WHERE 12 IN (
    SELECT tempField
    FROM (VALUES(11),(12),(7)) tempTable(tempField)
)
-- I could rephrase this as:
SELECT 1 WHERE 12 = ANY (
    SELECT tempField
    FROM (VALUES(11),(12),(7)) tempTable(tempField)
)
I rarely see these operators in the wild, and might have used them in production code a couple of times, if that.
While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer. The virtual data layer allows users to query data from many sources through a single, unified interface. Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date. Storage costs and data governance complexity are minimized. See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.
SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
James covers some of the reasoning behind this and the shift from using Polybase to integrate data with Hadoop + Azure Blob Storage to using SQL Server as a data virtualization engine.
This might be kind of basic, but since I am working on a comprehensive script to discover things that a DBA really needs to know about, I made a couple of queries that will produce a list of the databases that have multiple files, along with the locations of the physical files. One query finds multiple data files (mdf) and the other looks for multiple transaction log files (ldf). This will also find Filestream file locations. Since I often have to take on instances without ever having seen them, it is good to know about little things like this.
This script might be helpful in finding minor performance gains by looking for places to add data files or remove log files.
The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.
I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares, with pickup/drop-off and distance information, between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for-hire vehicles. Just the Yellow cab data from 01/2016 to 06/2018 is over 112,000,000 records (24 GB), and it downloads as easy-to-import comma-separated values (CSV) files.
Read on to see how you can set it up yourself. As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.
You’ll note that the second statement in the sequence is “CREATE OR AL…” in the batch_text. That’s me modifying the procedure. The very next event is sp_cache_remove. It shows the remove_method as “Compplan Remove”. This is the plan being removed from cache in an automated way. The next three events are all for query_cache_removal_statistics.
What are they?
These are the statement-level statistics being removed from the DMVs. That’s right: we can observe that information getting removed from the system along with the plan being removed from cache.
Unless I’m missing something, it seems like this is more helpful for pedagogical reasons rather than auditing reasons—I’d be concerned that on a busy production system, we’d see too many messages to correlate things all that well.