Press "Enter" to skip to content

Month: November 2018

Detecting Redirects With httr

Peter Meissner shows us how we can find redirects when using the httr package:

I am the creator and maintainer of the robotstxt package an R package that enables users to retrieve and parse robots.txt files and ultimately is designed to do access permission checking for web resources.

Recently a discussion came up about how to interpret permissions in case of sub-domains and HTTP redirects. Long story short: In case of robots.txt files redirects are suspicious and users should at least be informed about it happening so they might take appropriate action.

So, I set out to find a way to check whether or not a robots.txt files requested via the httr package has gone through one or more redirects prior to its retrieval.

Click through for the solution.

Comments closed

Premium Blob Storage In Azure

James Serra describes a new tier of Azure Blob Storage:

As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement).  It is in private preview in US East 2, US Central and US West regions.

This is a performance tier in Azure Blob Storage, complimenting the existing Hot, Cool, and Archive tiers.  Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.

It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions.  It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.

It’s in private preview for now, but my guess is that it’ll be available to the general public soon enough.

Comments closed

Polybase And Azure Data Studio

Rajendra Gupta continues his series on Polybase in SQL Server 2019 with a look at Polybase integration in Azure Data Studio:

We have learned earlier that PolyBase in SQL Server 2019 Preview allows access to various data sources such as SQL Server, Oracle, MongoDB, Teradata, and ODBC based sources etc. Azure Data Studio SQL Server 2019 preview extension currently supports for SQL Server and Oracle data sources only from the External table wizard.

In this series, we will create an external table for SQL Server and explore some more features around it.

Launch Azure Data Studio and connect to the SQL Server 2019 preview instance. Right click on the database and launch ‘Create External Table’.

Rajendra also looks at some of the Polybase DMVs and the notion of predicate pushdown, which is critical to understand for writing Polybase queries which perform well.

Comments closed

The Performance Impacts Of Query Store

Erin Stellato explains the performance impacts of enabling Query Store in various types of environments:

The short answer:

  • The majority of workloads won’t see an impact on system performance

    • Will there be an increase in resource use (CPU, memory)?  Yes.
    • Is there a “magic number” to use to figure out Query Store performance and the increase in resource use?  No, it will depend on the type of workload.  Keep reading.
  • An impact on system performance can be seen with ad-hoc workloads (think Entity Framework, NHibernate), but I still think it’s worth enabling. With an ad-hoc workload there are additional factors to consider when using Query Store.

  • You should be running the latest version CU for SQL Server 2017 and latest CU for SQL Server 2016 SP2 to get all performance-related improvements Microsoft has implemented specific to Query Store

Definitely read the long answer.  There are also settings to reduce the load that Query Store puts on a system, and being up to date is critical.

Comments closed

SQL Server IaaS Versus PaaS On AWS

John McCormack identifies some differences between running SQL Server in EC2 versus RDS on Amazon Web Services:

How do I run SQL Server on AWS?

Running SQL Server on AWS can be done in 2 ways.

  • Relation Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.

  • EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.

Click through for the comparison.

Comments closed

Exploratory Data Analysis In R

Laura Ellis walks us through some easy techniques for learning about our data using R:

DIM AND GLIMPSE

Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.

And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.

Spending some quality time doing EDA can save you in the long run, as it can help you get a feel for things like data quality, the distributions of variables, and completeness of data.

Comments closed

Using Datadog To Monitor Spark Clusters On EMR

Priya Matpadi walks us through one way to monitor Spark clusters on Amazon ElasticMapReduce:

We recently implemented a Spark streaming application, which consumes data from from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.

As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.

In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.

Check it out.  If this is interesting, Priya’s blog has the full series.

Comments closed

Pivoting With Spark SQL

MaryAnn Xue shows us how to use the PIVOT operator in Spark SQL:

Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.

The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperatures recordings in Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.

The syntax is quite similar to the PIVOT syntax that SQL Server uses.

Comments closed

Showing Forecasts With Actuals In Power BI

Alberto Ferrari shows us how we can incorporate actuals and forecasted values in the same Power BI visuals:

The Forecast measure in the demo model is quite an advanced piece of DAX code that would require a full article by itself. The curious reader will find more information on how to reallocate budget at different granularities in the video Budgeting with Power BI. In this article, we use the Forecast measure without detailed explanations; our goal is to explain how to compute the next measure: Remaining Forecast.

The Remaining Forecast measure must analyze the Sales table, finding the last day for which there are sales, and only then computing the forecasts.

Read the whole thing.

Comments closed