Press "Enter" to skip to content

Author: Kevin Feasel

Identifying R Package and Function Use in GitHub Repos

Bryan Shalloway does a search:

TLDR: funspotr provides helpers for spotting the functions and packages in R and Rmarkdown files and associated github repositories. See Examples for catalogues of the functions/packages used in posts by Julia Silge, David Robinson, and others.

This is an interesting project. I’d imagine that with enough different code bases, you could develop a programming profile and possibly understand people’s strengths on a variety of characteristics like which functions they use, what they use given alternatives (e.g., “functional-friendly” map versus the *apply series versus loops), and how familiar they are with certain packages. I could see this being an advanced technique for learning what you should learn next: you obviously have familiarity with packages A, B, and C but it appears you don’t know about E or K and you might learn them to replace some of the work you’re doing with C.

Leave a Comment

Azure Synapse Analytics Integration Points

Warner Chaves takes us through several integration points with Azure Synapse Analytics:

Azure Stream Analytics allows for in-flight querying of streaming data from Blog storage, Data Lake Storage, IoT Hub or Event Hubs. The querying is done through an easily adoptable SQL language and it really speeds up the development of a streaming solution.

The nice thing here is that Stream Analytics allows the use of a Synapse SQL Pool table as the target for the results of the streaming query. So, this is another way to do near real-time analytics by passing data from a streaming source through a Stream Analytics job and into a Synapse table. You could do this to pre-aggregate data on the fly, score data in real-time, perform real-time calculations over specific time or event windows, etc.

Click through for several examples of this.

Leave a Comment

Adding Methods to a PSCustomObject

Robert Cain builds on a prior post:

In the previous installment of this series, I covered the various ways to create objects using the PSCustomObject. We saw how to create it using the New-Object cmdlet, then how to add your custom properties to it using the Add-Member cmdlet.

In this post we’ll learn how to add our own methods to our objects using script blocks. Before we go on, just a quick reminder on vocabulary.

Click through for that reminder, as well as implementation details.

Leave a Comment

CIS Security Checks with dbachecks

Tracy Boggiano shows how to perform a security check based on CIS requirements:

Well back at the end of 2019 I finished writing most of the checks related to the CIS Center for Internet Security requirements.  I have yet to write a blog post on how to use them.  So, well here is how to go about using them, it’s mostly code so should be pretty simple to implement.  I’ve mentioned this several times over the past year in presenting on dbatools.

So first you need to have dbachecks.  So, let’s start with the basics just in case you haven’t heard of dbachecks.  dbachecks is PowerShell module that checks the configuration of your SQL Server against various test have been predefined.  By default, it exports the data to JSON, and we will be opening PowerBI to display the data because why that is pretty.  So, go download you a copy of Power BI from the Microsoft website and let’s install dbachecks first.

Read on to see what you need, the steps for this process, and what the results look like.

Leave a Comment

Understanding the Transaction Log

Paul Randal has a new series:

With this post, I’m starting an occasional series on the transaction log and how it works and should be managed, and I’ll touch on all the problems above over its course. In this post, I’ll explain what logging is and why it’s required.

Basic Terminology Around Logging

When I’m talking about any mechanism in SQL Server, I find there’s a chicken-and-egg problem where I need to use a word or phrase before I’ve explained it. To avoid that problem in this series, I’m going to start by explaining some terminology that needs to be used when discussing logging, and I’ll expand on many of these terms as the series progresses.

This post starts off with some of the basics and it’s always good to get the occasional refresher on the basics.

Leave a Comment

Choosing between String Data Types in SQL

Greg Larsen compares CHAR, VARCHAR, and VARCHAR(MAX):

In every database, there are different kinds of data that need to be stored. Some data is strictly numeric, while other data consists of only letters or a combination of letters, numbers, and even special symbols. Whether it is just stored in memory or on disk, each piece of data requires a data type. Picking the correct data type depends on the characteristics of the data being stored. This article explains the differences between CHARVARCHAR, and VARCHAR(MAX).

Click through for Greg’s explanation. My official rule of thumb is as follows:

  • If you have a fixed-length code which you display to customers, use NCHAR.
  • If you have a fixed-length code which you only use internally and you know that the code will never include characters outside of your SQL Server installation’s code page and you know that the code page will never change…probably still use NCHAR, though if you twist my arm enough I’d say fine, CHAR.
  • Otherwise, use NVARCHAR.

Three decades ago, the choice was a lot trickier given performance differences between the two. Today? Unless you’re hunting for microseconds I don’t think you’ll see a practical difference. And if you are hunting for microseconds, you probably want more than rules of thumb.

Leave a Comment

Bucketing Data in Hive

Chitra Sapkal explains why bucketing in Hive can be so powerful:

When a column has a high cardinality, we can’t perform partitioning on it. A very high number of partitions will generate too many Hadoop files which would increase the load on the node. That’s because the node will have to keep the metadata of every partition, and that would affect the performance of that node

In simple words, You can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

Click through to see how bucketing works and examples of how you can use it to make queries faster.

Leave a Comment

Graph Analysis with NetworkX

Tori Tompkins introduces us to a Python package:

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex graphs. It’s a really cool package that contains heaps of graph algorithms for all different uses. In this tutorial, I will cover how to create a graph from an edge list and different ways we can query it.

Unsure what a graph is exactly? Check out my Data Science Moments video which introduces graphs and their uses in 5 minutes:

Click through for that video, as well as a way to load, process, and display graph data.

Leave a Comment