Time-series databases have emerged as a best-in-class approach for storing and analyzing huge amounts of data generated by users and IoT devices. While relational and NoSQL databases are sometimes used for time-stamped and time-series data (such as clickstream data from Web and mobile devices, log data from IT gear, and data generated by industrial machinery), today’s massive data volumes from the IoT have outstripped the ability of those databases to keep up.
As the high-end time-series use cases piled up, AWS decided it was time to take action and make its entry into the still-specialized field, much as it did with last year’s launch of Neptune, a graph database, which is another specialized database field that’s emerging.
Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originates from Databricks’ open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on). In addition, it provides:
- New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
- Avro logical types support, including Decimal, Timestamp, and Date types. See the related schema conversions for details.
- 2X read throughput improvement and 10% write throughput improvement.
In this blog, we examine each of the above features through examples, giving you a flavor of their easy API usage, performance improvements, and merits.
Avro is one of the better rowstore data formats in the Hadoop world, so it’s good to see built-in support here.
Archive the data for historical analysis
One of the production DBAs pointed out that having gathered the information, it would be useful to hold it for better analysis of repeated issues. I have added an archiving step so that when the tool runs, if there is already data in the data gathering folder, it will copy that to an archive folder and name it with the date and time that the cluster log was created, as this is a good estimate of when the analysis was performed. If an archive folder location is not provided, it will create an archive folder in the data folder. This is not an ideal solution though, as the utility will copy all of the files and folders from there to its own location, so it is better to define an archive folder in the parameters.
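The archiving step described above can be sketched roughly as follows. This is only an illustration in Python, not the tool's actual code: the function name `archive_existing_data`, the default folder name `Archive`, and the timestamp format are all hypothetical, and the cluster log's modification time stands in for its creation time. Unlike the behaviour described above, this sketch skips the archive folder when it lives inside the data folder, so earlier archives are not copied into new ones.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_existing_data(data_dir: Path, cluster_log: Path,
                          archive_dir: Optional[Path] = None) -> Optional[Path]:
    """Copy previously gathered data into a timestamped archive folder.

    Returns the archive path, or None when there was nothing to archive.
    """
    if not data_dir.is_dir():
        return None

    # No archive location given: fall back to a folder inside the data folder
    # (the name "Archive" is an assumption made for this sketch).
    archive_root = archive_dir if archive_dir is not None else data_dir / "Archive"

    # Everything in the data folder except the archive folder itself.
    items = [p for p in data_dir.iterdir()
             if p.resolve() != archive_root.resolve()]
    if not items:
        return None

    # Assumption: the log's modification time approximates when the previous
    # analysis ran; fall back to the current time if the log is missing.
    if cluster_log.exists():
        stamp = datetime.fromtimestamp(cluster_log.stat().st_mtime)
    else:
        stamp = datetime.now()

    target = archive_root / stamp.strftime("%Y%m%d_%H%M%S")
    target.mkdir(parents=True, exist_ok=True)

    for item in items:
        if item.is_dir():
            shutil.copytree(item, target / item.name, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target / item.name)
    return target
```

Passing an explicit `archive_dir` outside the data folder avoids the exclusion check entirely, which matches the recommendation to define an archive folder in the parameters.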
There are several improvements in here, so check them out.
In SQL Server 2019, Microsoft added the ability to execute custom Java code along the same lines we execute R and Python, and this blog post intends to give an introduction to how to install and enable the Java extension, as well as execute some very basic Java code. In future posts, I will drill down into how to pass data back and forth between SQL Server and Java.
There may very well be future posts discussing how the internals differ between Java and R/Python, but I want to talk about that a little bit in this post as well, as it has an impact on how we write and call Java code.
The not-so-secret secret here is that Java itself is a less interesting language than, say, Scala. And the reason you’d support Scala? To interact with an Apache Spark cluster. I think that’s a big part of why you’d want the installer to load Java 1.8 instead of 1.9 or later (which contains API changes that break Spark). Definitely give this a careful read, as there are more working parts and more gotchas than with R or Python support.
Now, in the next step we shall create a derived view, which shall contain the list with all Persons and Businesses, joining them together:
CREATE OR ALTER VIEW dbo.Followers AS
SELECT PersonId as Id, FullName FROM dbo.Person
UNION ALL
SELECT BusinessId, BusinessName FROM dbo.Business;
Now, the real new thing is that we can use such derived tables in SQL Server 2019 CTP 2.1 and Azure SQL Database together with the MATCH clause, in the statements such as the one below where we list all the followers of the “Real Stuff” company:
SELECT Followers.ID, Followers.FullName
FROM Followers, Follows, Company
WHERE MATCH(Followers-(Follows)->Company)
AND CompanyName = 'Real Stuff'
This query works fine, delivering us the expected results while generating a pretty complex execution plan in the background.
Niko focuses on heterogeneous nodes and edges, as well as derived views.
In this post, I share my approach and code snippets for:
- Getting SQL Server 2019
- Running SQL Server 2019 in a Docker Container
- Restoring Demo Databases (AdventureWorks and WideWorldImporters)
If your hardware supports Docker, this is a great way of getting some experience with a new version of SQL Server without the mess of cleaning up after a CTP or affecting your current dev environment.
Based on the tests I’ve been doing, I’ve observed that two sets of permissions apply to users who are members of the Power BI administrator role:
- Activities which are scoped across the entire organization
- Activities for which normal user permissions apply
Within the above 2 categories, I’m thinking there are 4 main types of activities:
- Manage tenant settings (always scoped to the organization)
- Compile inventory and metadata (can be scoped to the organization)
- Manage workspace users (can be scoped to the organization)
- Export content from a workspace (relies on user permissions)
There’s a fair amount to digest, but Melissa does a good job explaining the implications of specific permissions.