Apache Pulsar 2.0 Released

Kevin Feasel

2018-06-08

Hadoop

George Leopold reports on a new version of Apache Pulsar:

The startup’s Apache Pulsar 2.0 released on Wednesday (June 6) adds new functionality designed to move data users “beyond batch” processing. Among them is a “stream-native” processing capability called Pulsar Functions designed to apply analytics to data as it flows through the Pulsar platform. Processing functions can be written in either Java or Python, the company said.

Debuted earlier this year as a preview feature, Streamlio announced general availability of Functions this week as part of its 2.0 release.

Another is a Pulsar enhancement developed in conjunction with Apache Bookkeeper, a scalable storage system. Streamlio said the new feature, called Topic Compaction, delivers streaming data storage designed to improve the performance of applications consuming data from Pulsar. It serves as a “broker” that builds a snapshot of the latest value for each topic key, the startup said.

Read the whole thing.
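
As a rough illustration of the "latest value for each topic key" idea described above, here is a minimal Python sketch. It is not Pulsar's implementation or API: the `Message` type, the `compact` function, and the sample keys and values are all hypothetical, and the log is modeled as a plain in-memory list ordered oldest to newest.

```python
from typing import Dict, Iterable, NamedTuple


class Message(NamedTuple):
    """Hypothetical stand-in for a keyed message on a topic."""
    key: str
    value: str


def compact(log: Iterable[Message]) -> Dict[str, str]:
    """Scan the log oldest to newest and keep only the last value per key."""
    snapshot: Dict[str, str] = {}
    for msg in log:
        # A later message with the same key overwrites the earlier one.
        snapshot[msg.key] = msg.value
    return snapshot


# Made-up sample data: three updates for one key, one for another.
log = [
    Message("sensor-a", "20.1"),
    Message("sensor-b", "18.7"),
    Message("sensor-a", "20.4"),
    Message("sensor-a", "21.0"),
]

snapshot = compact(log)
print(snapshot)
```

A consumer that only cares about current state could read the snapshot rather than replaying every message in the log, which is the performance benefit the excerpt alludes to.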

Removing Time From A DateTime

Wayne Sheffield compares the performance of four methods for removing time from a DateTime data type:

Today, we’ll compare 3 other methods to the DATEADD/DATEDIFF method:

  1. Taking advantage of the fact that a datetime datatype is stored as a float, with the decimal being fractions of a day and the whole numbers being days, we will convert the datetime to float, taking the floor (just the whole numbers), and converting back to datetime.
  2. Using the DATEADD/DATEDIFF routine.
  3. Converting the datetime to DATE and back to datetime.
  4. Converting the datetime to varbinary (which returns just the time), and subtracting that from the datetime value.

While there are other ways of stripping the time (DATETIMEFROMPARTS, string manipulation), those ways are already known as poorly performing. Let’s just concentrate on these four.

Click through for the methods, as well as a performance test to see which is fastest.
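
The four approaches listed above are T-SQL techniques; the following is only a loose Python analog of each idea, not Wayne's code and not a performance comparison. The function names are hypothetical, and the sketch assumes a value can be modeled as fractional days since 1900-01-01 (SQL Server's day zero for datetime).

```python
import math
from datetime import datetime, timedelta

EPOCH = datetime(1900, 1, 1)  # day zero for SQL Server's datetime type
ONE_DAY = timedelta(days=1)


def strip_time_floor(dt: datetime) -> datetime:
    """Analog of method 1: express as fractional days, floor, convert back."""
    fractional_days = (dt - EPOCH) / ONE_DAY  # float: whole days + fraction
    return EPOCH + timedelta(days=math.floor(fractional_days))


def strip_time_datediff(dt: datetime) -> datetime:
    """Analog of method 2: count whole days from day zero, then add them back."""
    whole_days = (dt - EPOCH).days       # plays the role of DATEDIFF
    return EPOCH + whole_days * ONE_DAY  # plays the role of DATEADD


def strip_time_via_date(dt: datetime) -> datetime:
    """Analog of method 3: convert to a date and back to a datetime."""
    return datetime.combine(dt.date(), datetime.min.time())


def strip_time_subtract(dt: datetime) -> datetime:
    """Analog of method 4: isolate the time portion and subtract it."""
    time_part = timedelta(
        hours=dt.hour,
        minutes=dt.minute,
        seconds=dt.second,
        microseconds=dt.microsecond,
    )
    return dt - time_part


sample = datetime(2018, 6, 8, 14, 35, 27, 123000)
results = [
    strip_time_floor(sample),
    strip_time_datediff(sample),
    strip_time_via_date(sample),
    strip_time_subtract(sample),
]
print(results)
```

All four functions should agree on midnight of the same day; which T-SQL form runs fastest is the question the linked post answers.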

Scatterplot Matrices

The Plotly folks show off scatterplot matrices in Python:

The scatterplot matrix, known acronymically as SPLOM, is a relatively uncommon graphical tool that uses multiple scatterplots to determine the correlation (if any) between a series of variables.

These scatterplots are then organized into a matrix, making it easy to look at all the potential correlations in one place.

SPLOMs, invented by John Hartigan in 1975, allow data aficionados to quickly realize any interesting correlations between parameters in the data set.

In this post, we’ll go over how to make SPLOMs in Plotly with Python. For extra insights, check out our SPLOM tutorial in Python and R.
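
To make the matrix layout concrete, here is a small standard-library Python sketch of the structure behind a SPLOM: one panel per ordered pair of variables, plus the Pearson correlation each panel would hint at visually. It does not use Plotly and draws nothing; the function names and the sample measurements are made up for illustration.

```python
from itertools import product
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def splom_panels(
    data: Dict[str, List[float]]
) -> Dict[Tuple[str, str], List[Tuple[float, float]]]:
    """Build the k x k grid: panel (row, col) plots col on x and row on y."""
    names = list(data)
    return {
        (row, col): list(zip(data[col], data[row]))
        for row, col in product(names, repeat=2)
    }


def correlation_matrix(data: Dict[str, List[float]]) -> List[List[float]]:
    """Correlation for every panel, in the same row/column order as the grid."""
    names = list(data)
    return [[pearson(data[r], data[c]) for c in names] for r in names]


# Made-up sample data with three variables and five observations.
data = {
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 58.0, 69.0, 80.0, 88.0],
    "age": [40.0, 22.0, 35.0, 29.0, 51.0],
}

panels = splom_panels(data)
matrix = correlation_matrix(data)
for name, row in zip(data, matrix):
    print(name, [round(r, 2) for r in row])
```

With three variables the grid has nine panels; the diagonal pairs each variable with itself, and the panels above and below the diagonal mirror each other with the axes swapped.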

Missing @@SERVERNAME On Linux

Steve Jones fixes a naming issue on his SQL Server on Linux installation:

I set up a new instance of SQL Server on Linux some time ago. At the time, the Linux machine didn’t have any Samba running, and no real “name” on the network. As a result, after installing SQL Server I got a NULL when running SELECT @@SERVERNAME.

The fix is easy. It’s what you’d do if you had the wrong name.

Read on for the command, and don’t forget to restart the database engine afterward.

Restoring Point-In-Time To Another Azure SQL Managed Instance

Jovan Popovic announces an improvement to Azure SQL Database Managed Instances:

Azure SQL Database Managed Instance enables you to create a database as a copy of another database at some point in time in the past. This is known as the point-in-time restore feature, and up till now you could perform point-in-time restore only within the same instance.

The latest release of Azure SQL Database Managed Instance enables you to perform point-in-time restore of a database from one instance to another. This might be useful if you need to be sure that you could easily restore a database to another instance if there is some issue on the original instance, or if you need a database for testing or auditing purposes on the test instance and you want to use a copy of an existing database from another server.

Click through for the current requirements and limitations, as well as a sample.

PolyBase Rejected Row Location

Casey Karst announces a nice improvement to PolyBase on Azure SQL Data Warehouse:

Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality, clean data is a luxury that is not always available. In those cases, you need to know which rows failed to load and why.

In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.

This is a big improvement, one that I hope to see on the on-prem product.

Table Variables And Parallelism

Erik Darling shows your brain on table variables:

Inserts and other modifications to table variables can’t be parallelized. This is a product limitation, and the XML warns us about it.

The select could go parallel if the cardinality estimate were more accurate. This could potentially be addressed with a recompile hint, or with Trace Flag 2453.

Click through to see an example of what Erik means.
