Press "Enter" to skip to content

Author: Kevin Feasel

JSON Enhancements in Azure SQL DB and SQL Server 2022

Umachandar Jayachandran has an announcement:

Today, we are announcing the public preview of JSON enhancements in Azure SQL Database and SQL Server 2022 CTP 2.0. This preview contains an enhancement to the ISJSON function and three new JSON functions – JSON_PATH_EXISTS, JSON_OBJECT and JSON_ARRAY. Currently, the ISJSON function allows you to test if a string value contains a valid JSON object or array. The new optional json_type_constraint parameter in the ISJSON function can now be used to test conformance of JSON documents to the IETF RFC 8259 specification. This capability allows you to test for strings that contain a JSON value, scalar, object, or array. This functionality is like the IS JSON predicate in the ANSI SQL standard. The new JSON_PATH_EXISTS function allows you to test for the existence of a specific SQL/JSON path expression in a JSON document. This functionality is like the JSON_EXISTS predicate in the ANSI SQL standard. The new ANSI SQL compatible JSON value constructors – the JSON_OBJECT and JSON_ARRAY functions – allow you to construct a JSON object or array from SQL data.

Even if you don’t store data as JSON, there are good reasons why you might need to accept or emit it in that format, especially when working with languages like R and Python.
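
To make that concrete, here is a minimal, hedged sketch from the client side: made-up connection string and database, assuming a SQL Server 2022 or Azure SQL Database instance with the preview functions available and the mssql-jdbc driver on the classpath. It exercises the new functions and reads the emitted JSON into a JVM application:

```scala
import java.sql.DriverManager

object JsonFunctionsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string and credentials; adjust to your own instance.
    val url = "jdbc:sqlserver://localhost;databaseName=Sandbox;encrypt=false;user=sa;password=<password>"
    val conn = DriverManager.getConnection(url)
    try {
      // Exercises ISJSON with a type constraint, JSON_PATH_EXISTS, and the
      // JSON_OBJECT / JSON_ARRAY constructors from the preview.
      val sql =
        """SELECT
          |  ISJSON('{"price": 10}', OBJECT)                    AS is_object,
          |  JSON_PATH_EXISTS('{"price": 10}', '$.price')       AS has_price,
          |  JSON_OBJECT('id': 1, 'tags': JSON_ARRAY('a', 'b')) AS doc""".stripMargin
      val rs = conn.createStatement().executeQuery(sql)
      while (rs.next()) {
        println(s"is_object=${rs.getInt("is_object")}, " +
          s"has_price=${rs.getInt("has_price")}, doc=${rs.getString("doc")}")
      }
    } finally conn.close()
  }
}
```

The same functions help in the other direction as well: a document handed to the database from application code can be checked with ISJSON (now with a type constraint) before you shred or store it.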


Low-Latency Flink

Jun Qin and Nico Kruber have started a series on low-latency streaming in Apache Flink. The first two posts of the series are up, starting with the overview:

Latency can refer to different things. LatencyMarkers in Flink measure the time it takes for the markers to travel from each source operator to each downstream operator. As LatencyMarkers bypass user functions in operators, the measured latencies do not reflect the entire end-to-end latency but only a part of it. Flink also supports tracking the state access latency, which measures the response latency when state is read/written. One can also manually measure the time taken by some operators, or get this data with profilers. However, what users usually care about is the end-to-end latency, including the time spent in user-defined functions, in the stream processing framework, and when state is accessed. End-to-end latency is what we will focus on.
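
As a rough sketch of the “measure it yourself” option, the snippet below (hypothetical event type and field names, not taken from the series) stamps each event with a creation time and computes elapsed wall-clock time near the end of a Flink pipeline. In a real job the two timestamps come from different machines, so clock skew makes this an approximation:

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical event type carrying the wall-clock time at which the event was produced.
case class Event(id: Long, createdAtMillis: Long)

object EndToEndLatencySketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env
      .fromElements(Event(1L, System.currentTimeMillis()), Event(2L, System.currentTimeMillis()))
      .map(e => e.copy(id = e.id * 2)) // stand-in for real user-defined processing
      .map { e =>
        // Approximate end-to-end latency: event creation to this point in the pipeline.
        (e.id, System.currentTimeMillis() - e.createdAtMillis)
      }
      .print()

    env.execute("end-to-end latency sketch")
  }
}
```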

Part 2 discusses direct latency optimization techniques:

When interacting with external systems (e.g., RDBMS, object stores, web services) in a Flink job for data enrichment, the latency in getting responses from external systems often dominates the overall latency of the job. With Flink’s Async I/O API (e.g., AsyncDataStream.unorderedWait() or AsyncDataStream.orderedWait()), a single parallel function instance can handle many requests concurrently and receive responses asynchronously. This reduces latencies because the waiting time for responses is amortized over multiple requests.
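
For a sense of what that looks like in practice, here is a minimal sketch against Flink’s Scala Async I/O API, with a made-up enrichment lookup standing in for the external system:

```scala
import java.util.concurrent.TimeUnit

import scala.concurrent.{ExecutionContext, Future}

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

// Hypothetical enrichment: look up a description for each id from an external system.
class EnrichmentFunction extends AsyncFunction[Long, (Long, String)] with Serializable {

  // Thread pool used to issue the (simulated) asynchronous calls.
  @transient implicit lazy val ec: ExecutionContext = ExecutionContext.global

  override def asyncInvoke(id: Long, resultFuture: ResultFuture[(Long, String)]): Unit = {
    Future {
      // Placeholder for a non-blocking call to an external service.
      (id, s"description-for-$id")
    }.foreach(result => resultFuture.complete(Iterable(result)))
    // A real job should also handle failures, e.g. via resultFuture.completeExceptionally.
  }
}

object AsyncEnrichmentSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val ids: DataStream[Long] = env.fromElements(1L, 2L, 3L)

    // Up to 100 requests in flight per parallel instance; results may arrive out of order.
    val enriched = AsyncDataStream.unorderedWait(ids, new EnrichmentFunction, 5, TimeUnit.SECONDS, 100)

    enriched.print()
    env.execute("async enrichment sketch")
  }
}
```

Swapping unorderedWait for orderedWait preserves input order at the cost of some extra latency for records stuck behind a slow request.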

Stay tuned for more posts in the series.


The Future Object in Scala

Gulshan Singh visits from the future:

You have units of work that you want to run asynchronously, so you don’t block while they’re running. A future gives you a simple way to run an algorithm concurrently. A future starts running when you create it and returns a result at some point, well, in the future. In Scala, we say that a future returns eventually.

The Future instance is a handle to an eventually available result. You can continue doing other work until the future completes, either successfully or unsuccessfully.

You may also know of Futures as Promises. It’s quite similar to async calls in .NET as well.
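
As a quick illustration of that behaviour, here is a small, self-contained sketch (not from the article) showing a future kicked off, a completion callback, and the main thread carrying on in the meantime:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object FutureSketch {
  def main(args: Array[String]): Unit = {
    // The execution context supplies the threads the future's body runs on.
    implicit val ec: ExecutionContext = ExecutionContext.global

    // The body starts running concurrently as soon as the Future is created.
    val answer: Future[Int] = Future {
      Thread.sleep(500) // stand-in for slow work
      21 * 2
    }

    // Callback fires whenever the future completes, successfully or not
    // (it may print before or after the lines below).
    answer.onComplete {
      case Success(value) => println(s"Callback saw: $value")
      case Failure(ex)    => println(s"Callback saw failure: ${ex.getMessage}")
    }

    // The main thread is free to do other work while the future runs.
    println("Doing other work while the future is pending")

    // Demo only: block for the result so the JVM does not exit before it arrives.
    println(s"Eventually: ${Await.result(answer, 2.seconds)}")
  }
}
```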


Minimum Viable Data Mesh in Azure

Paul Andrew was on a podcast:

For Paul, delivering a single data mesh data product on its own is not all that valuable – if you are going to go to the expense of implementing data mesh, you need to be able to satisfy use cases that cross domains. The greater value is in cross-domain interoperability, getting to a data product that wasn’t possible before. And you need to deliver the data platform alongside those first 2-3 data products; otherwise you create a very hard-to-support data asset, not really a data product.

When thinking about minimum viable data mesh, Paul views an approach leveraging DevOps and generally CI/CD – or Continuous Integration/Continuous Delivery – as crucial. You need repeatability/reproducibility to really call something a data product.

Click through for the interview as well as Scott Hirleman’s summary.


Allowing for Cross-Database Access via Module Signing

Tom Zika enables cross-database access:

I’ve recently had to revisit this topic and spent a lot of time recalling the details. So I’m writing this blog post mainly as a reminder for myself.

The most helpful part will be the diagram detailing all the components and how they relate, along with a comprehensive example anyone can follow.

I’m not going to cover Module Signing in general (I’ll leave that to Solomon Rutzky).
Nor will I cover other ways to achieve Cross DB access (like Cross DB Ownership chaining) because this is superior from the security standpoint.

Click through to see a good way of providing access to cross-database resources without explicitly granting rights to users.


Counting Employees by Period with DAX

Matt Allington solves a common problem:

I’m calling this article, “How many employees by period”. Staff come and go for different reasons. In some companies, the number of staff can change over time. The principles used in this article can also be used in other instances.  There can be staff moving in and out of departments, on and off of projects, etc. The technique can also be used to work out how many staff were on leave, how many off sick, how many tickets were open in a support queue, or any other concept that has a start and end date in a transactional table.

Read on for Matt’s answer, but be sure to check out the comments, as there are some other good solutions in there.
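
Matt’s solution is in DAX, but the heart of the technique is an interval-overlap test against the start and end dates. Purely as an illustration of that logic (hypothetical record type and data), here it is in a few lines of Scala:

```scala
import java.time.LocalDate

// Hypothetical staffing records: an empty end date means the person is still employed.
case class Employment(employeeId: Int, start: LocalDate, end: Option[LocalDate])

object HeadcountByPeriod {
  // An employee counts toward a period if their employment overlaps it:
  // they started on or before the period's end and had not left before it began.
  def headcount(records: Seq[Employment], periodStart: LocalDate, periodEnd: LocalDate): Int =
    records.count { r =>
      !r.start.isAfter(periodEnd) && r.end.forall(e => !e.isBefore(periodStart))
    }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Employment(1, LocalDate.of(2020, 1, 15), None),
      Employment(2, LocalDate.of(2021, 3, 1), Some(LocalDate.of(2022, 2, 28))),
      Employment(3, LocalDate.of(2022, 4, 10), None)
    )

    // Headcount for March 2022: only employee 1 (2 left in February; 3 starts in April).
    println(headcount(records, LocalDate.of(2022, 3, 1), LocalDate.of(2022, 3, 31)))
  }
}
```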


Object-Level Security in Power BI

Chris Webb checks out Object-Level Security:

If you have sensitive data in your Power BI dataset you may need to stop some users seeing the data in certain columns or measures. There is only one way to achieve this: you have to use Object Level Security (OLS) in your dataset. It’s not enough to exclude those measures or columns from your reports or to hide them, because there will always be ways for enterprising users to see data they shouldn’t be allowed to see. However the problem with OLS up to now is that it didn’t play nicely with Power BI reports and so you had to create multiple versions of the same report for different security roles. The good news is that there’s now a way to create one report connected to a dataset with OLS and have it display different columns and measures to users with different permissions.

And then watch as Chris combines Row-Level Security with Object-Level Security to make it nicer for users but probably a mess for maintainers.


The Move to General Database Platforms

Steve Jones muses on specialization in data platforms:

It’s been a decade-plus of the Not-Only-SQL (NoSQL) movement where a large variety of specialized database platforms have been developed and sold. It seems that there are so many different platforms for data stores that you can find one for whatever specialized type of data you are working with. However, is that what people are doing to store data in their applications?

I saw this piece on the return to the general-purpose database, postulating that a lot of the NoSQL database platforms have added additional capabilities that make them less specialized and more generalized. I’ve seen some of this, just as many relational platforms have added features that compete with one of the NoSQL classes of databases. The NoSQL datastores might be adding SQL-like features because some of these platforms are too specialized, and the vendors have decided they need to cover a slightly wider set of use cases.

I see three overlapping forces in play here. First, you have vendors looking at the Total Addressable Market for their specific technology (document, key-value, graph, whatever) versus the size of the relational database market, and they salivate about getting those general-purpose fat stacks of cash. That’s what Steve is getting at in the graf above.

I think the second force is that specialization is ultimately a sucker’s game when it comes to databases. By specializing in one area, you sacrifice others. A tool like Elasticsearch is outstanding as a document search engine but miserable as an aggregation engine (ask me how I know: roughly every six months, another product team decides that this time, they’ll get stats aggregation with Elasticsearch to work well…and six months after people actually start to use the thing, they move all the data someplace else that is adequately queryable). Similarly, document databases are excellent for populating details in an application but not at all good at aggregation or arbitrary queries connecting data together. Specialization seems like a great idea until new requirements come in that demand advanced reporting.

The third force is that these systems are independent, and getting them to talk to each other typically involves writing a lot of ETL/ELT code or using additional third-party tools. To the extent that there are data virtualization platforms, they’re either excruciatingly slow (e.g., PolyBase) or expensive and out of date because they cache the data periodically. A corollary of the third force is that different platforms tend to use different languages, and trying to remember which of the three or four languages you need to access a particular bit of data can be a bit painful. This is part of the reason Feasel’s Law exists.

The net result of all of this is that it seems you end up with the same piece of information in several separate places and build complicated systems to keep these separate systems aligned. Each system is (theoretically) optimized for a given use case but you end up with more and more people spending their time gluing together data from disparate systems, ensuring that data in disparate systems matches up, or moving data between disparate systems. If you need to do all of this, then sure, do it. But if there’s a single general-purpose platform which does all of this stuff 90% as well, a large number of companies and use cases will do just fine with the single tool. And that’s why general-purpose database platforms are still so popular and why I believe they will remain popular indefinitely.

The biggest exception I see is caching, but that’s because it’s more of a “fire-and-forget” data storage system. If you do it right, you don’t have any ETL/ELT to or from the cache, and if the cache dies, your system continues to work (albeit slower than with the cache). It’s also tied to a specific application and only exists temporarily, so data mismatches are (hopefully) transitory enough not to matter.
