Using data.table to Add Aggregate Values to Data Frames

Kevin Feasel

2019-07-03

R

John Mount shows how you can combine := and by in the data.table package to add a new column with the results of an aggregation in R:

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row.

It’s worth reading up on data.table if you aren’t familiar with the great things it can do.

Troubleshooting Kafka Listeners

Kevin Feasel

2019-07-03

Hadoop

Robin Moffatt has some tips for configuring listeners in Kafka:

Apache Kafka® is a distributed system. Data is read from and written to the leader for a given partition, which could be on any of the brokers in a cluster. When a client (producer/consumer) starts, it will request metadata about which broker is the leader for a partition—and it can do this from anybroker. The metadata returned will include the endpoints available for the Leader broker for that partition, and the client will then use those endpoints to connect to the broker to read/write data as required.

It’s these endpoints that cause people trouble. On a single machine, running bare metal (no VMs, no Docker), everything might be the hostname (or just localhost), and it’s easy. But once you move into more complex networking setups and multiple nodes, you have to pay more attention to it.

Click through for more tips.

Performing Row-Wise Operations with pmap

Kevin Feasel

2019-07-03

R

Sebastian Sauer shows how you can use pmap in the purrr library to perform row-wise aggregations:

Rowwwise operations are a quite frequent operations in data analysis. The R language environment is particularly strong in column wise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence column wise operations are simple, rowwise more difficult.

This post looks at some rather general way to comput rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryant, and by Emil Hvitfeldt to name a few.

The ideal solution is to have your data be properly columnar, but if you’re in a pinch, it’s good to know that you can do this.

Unexpected Results with ANY Aggregate

Paul White points out a couple odd scenarios with the ANY aggregate in SQL Server:

The execution plan erroneously computes separate ANY aggregates for the c2 and c3 columns, ignoring nulls. Each aggregate independently returns the first non-null value it encounters, giving a result where the values for c2 and c3 come from different source rows. This is not what the original SQL query specification requested.

The same wrong result can be produced with or without the clustered index by adding an OPTION (HASH GROUP) hint to produce a plan with an Eager Hash Aggregate instead of a Stream Aggregate.

Click through for the scenarios. Paul has also reported the second scenario as a bug.

Joining Lists of Values in T-SQL

Kevin Feasel

2019-07-03

T-SQL

Jason Brimhall shows how you can build a list of values using the table value constructor and join to it:

The table value constructor is basically like a virtual table not too different from a CTE or a subquery (in that they are all virtual tables of sorts). The table value constructor however can be combined with either of those other types and is a set of row expressions that get put into this virtual table in a single DML statement.

It’s one of the nicer things SQL Server 2008 gave us.

Date Columns and Query Folding in Power BI

Marc Lelijveld discovered a case where transforming a date+time to a date could potentially break query folding:

While working on projects, It is always my goal to keep query folding active as long as possible. Simply because it improves my dataset refresh timings and reduces the load executed on Power BI side.

While working on a project last week, I found something interesting in default Power BI functionality which breaks query folding, but can be avoided by slightly changing the Power Query code yourself or picking another option which results in exactly the same. So, this week a short blog post to make you aware of things which might brake query folding.

Click through for two scenarios, one where a transformation can break query folding and one where query folding is still fine.

Installing SQL Server on Windows Containers

Mohammad Darab shows how you can install SQL Server on a Windows Container:

Today, Microsoft released Windows Containers, currently in the Early Adoption Program here. You can use the same Docker Desktop for Windows but now you can use Windows Containers to deploy SQL Server (as you will see below).

This is primarily for dev/test environments. This is in no way production ready. Think about a developer asking for a SQL instance so that they can run their code against, etc. You can now easily, and quickly, spin up a container running SQL Server. Once they are done, you can remove it or stop it.

The downside is that your Docker Desktop for Windows is now stuck running Windows containers…

Collecting the Last Actual Plan for a Query

Gail Shaw explains a new Dynamic Management Function which works to get the latest execution plan for a particular query:

Getting the actual execution plan, that is the plan with run-time statistics for a query from an application has always been a little difficult. It’s fine if you can get the query running in Management Studio and reproducing the behaviour from the app, but that can be difficult.

There’s the query_post_execution_showplan event in Extended Events, but that’s a pretty heavy event and not something that I’d like to run on a busy server.

No more! SQL 2019 adds a new plan-related function to get the last actual plan for a query: sys.dm_exec_query_plan_stats.

Read on to see how to configure this, as well as a demo.

Querying Essbase from Power BI

Kellyn Pot’vin-Gorman shows how to query data from an Oracle Essbase cube in the Oracle Applications Cloud from Power BI:

The OAC environment that Opal gave me access possessed an example schema/data based on an Audio-Video store revenue for multiple years. 

I’d never worked with the OAC before, but I was quickly able to find five methods to connect Power BI to it, either to import or to direct connect to the data:

1. Data Direct makes a Rest API to connect to the Oracle Cloud.
2. Use the web interface with Power BI and “scrape the website”, allowing PBI to build the table from example.
3. Connect with an ODBC driver
4. Via the OAC interface, export to Excel and then import into Power BI
5. Via the OAC interface, export to a table and then import into Power BI as a CSV file.

So, uh, yeah, you can do it. At least five ways.

Categories

July 2019
MTWTFSS
« Jun Aug »
1234567
891011121314
15161718192021
22232425262728
293031