Press "Enter" to skip to content

Month: July 2020

The Basics of Gremlin

Raul Gonzalez introduces us to Gremlin:

Graph databases in Cosmos DB benefit from the same features, like the SQL API, it is globally distributed, scales independently throughput and storage, provides guaranteed latency, automatic indexing and more. So when relational databases choke with certain queries, No-SQL databases come to play.

Gremlin is the query language used by Apache Tinkerpop and it is implemented in Azure Cosmos DB. This language enables us to transverse graphs and answer complex queries that would be otherwise very expensive to run in traditional relational database engines.

Read on for a detailed example.

Comments closed

Securing S3 Buckets

Adam Youngberg relates an experience with securing public S3 buckets:

As a response to our initial alert, we took action to identify all of our S3 buckets and the public / non-public status. Since Databricks is a cloud-native company, we had already deployed JupiterOne, a commercial cloud asset management solution that allowed us to quickly query and determine which assets were public. Open-source tools are available, such as Cartography from Lyft, which allow for similar query capabilities. With the outputs of our queries, we were able to quickly identify and triage our public buckets and determine whether they should remain public.

Read on for the process, as well as some issues they experienced in rollout.

Comments closed

All About The Compute Scalar Operator

Hugo Kornelis continues a quest to document query plan operators:

The Compute Scalar operator is used to compute new values based on other columns in the same row. These new values are then added as extra column(s) in the output rows.

The expressions used to compute the new values can only refer to constant values and to columns in the input rows of the Compute Scalar operator. Other than that, there are, to my knowledge, no restrictions. The expressions can vary from very simple to extremely complex. The expressions can even include references to scalar user-defined functions, to CLR user-defined functions, and to built-in CLR functions.

Read on for a good deal of information about the operator.

Comments closed

Monitoring Identity Columns

Michael J. Swart has a script to monitor identity columns:

My friend pointed out an interesting RCA by Github where a

database table’s auto-incrementing ID column exceeded [maxint]. When we attempted to insert larger integers into the column, the database rejected the value

This led to a discussion about setting up monitoring for this kind of problem in our software. We have a place for monitoring and health-checks for all our databases. We just need to know how to define them.

So how do I create a script that reports any tables whose current identity values are above a certain threshold? This is what I came up with. Maybe you’ll find it useful too.

Click through for the script, but also read the comments for a simplification. The next step in the process is to figure out which tables can handle it and auto-reseed when you get to a certain threshold. Most commonly, this would be queue tables, where the active set of IDs in use is a relatively small number and there are no ID stragglers hanging around, so resetting is safe.

Comments closed

Getting the Last Query Plan Stats in SQL Server 2019

John Morehouse walks us through retrieving the actual query plan stats of the last run of an execution plan:

Currently, if you are not on SQL Server 2019 and wanted to see an execution plan, you would attempt to dive into the execution plan cache to retrieve an estimated plan.  Keep in mind that this just an estimated plan and the actual plan, while the shape should be the same, will have runtime metrics.  These  Actual runtime metrics could be, different than what is shown in the estimated plan, so it is important to get the actual whenever possible.

With the introduction of lightweight statistics, SQL Server can retain the metrics of the actual execution plan if enabled.  Note that this could introduce a slight increase in overhead, however, I haven’t yet seen it be determinantal. These metrics are vastly important when performing query tuning.

Read on to see the specific set of metrics you can pull and how to do it. This does require SQL Server 2019.

Comments closed

Azure Data Factory and Key Vault References

Gerhard Brueckl shows how we can get around a limitation in the Azure Data Factory user interface:

As You can see, the setting “AccessToken” can use a Key Vault reference whereas settings like “Databricks Workspace URL” and “Cluster” do not support them. This is usually fine because the guys at Microsoft also thought about this and support Key Vault references for the settings that are actually security relevant or sensitive. Also, providing the option to use Key Vault references everywhere would flood the GUI. So this is just fine.

But there can be good reasons where you want to get values from the Key Vault also for non-sensitive settings, especially when it comes to CI/CD and multiple environments. From my experience, when you implement a bigger ADF project, you will probably have a Key Vault for your sensitive settings and all other values are provided during the deployment via ARM parameters.

So you will end up with a mix of Key Vault references and ARM template parameters which very likely will be derived from the Key Vault at some point anyway. To solve this, you can modify the JSON of an ADF linked service directly and inject KeyVault references into almost every property!

Click through to see how that works, as well as the ramifications.

Comments closed

Adding Tooltips to Column Headers in Power Query

Chris Webb has some fun with M:

How does this work? Here’s what each of the steps do:

– The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
– The tabletype step declares a new table type with two columns (the same two columns in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
– The replacetype step replaces the type of the table returned by source with the type declared in tabletype.

Click through for the code.

Comments closed

Docker Compose and SQL Server

Andrew Pruski makes it easy to launch a fully-featured Docker container running SQL Server:

The solution here is to create a custom image with the volume created and permissions set.

But wouldn’t it be easier to just have to run one command to spin up a custom 2019 image, with volumes created and permissions set?

Enter Docker Compose.

Andrew has a GitHub repo with everything set up and includes plenty of screenshots to demonstrate.

Comments closed

Sparklyr 1.3 Released

Yitao Li announces sparklyr 1.3:

sparklyr 1.3 is now available on CRAN, with the following major new features:

Higher-order Functions to easily manipulate arrays and structs
– Support for Apache Avro, a row-oriented data serialization framework
Custom Serialization using R functions to read and write any data format
Other Improvements such as compatibility with EMR 6.0 & Spark 3.0, and initial support for Flint time series library

Between this and the work from the Spark side, we are seeing some nice quality of life improvements for Spark and R.

Comments closed

Case-Insensitive Searches in Cosmos DB

Hasan Savran gives us a few techniques for case-insensitive searching in Azure Cosmos DB:

Data gets saved as JSON in Azure Cosmos DB. JSON documents supports string, numeric and Boolean data types. Because of limited data types, you might end up picking string data type for most of your data. You may need to use Cosmos DB’s string functions to help you in many situations. For example; you might need to find names start with bunch of characters or you might need to find data ends like a word or bunch of characters. You may need to find all data that has address contains a street name. What do you do in these cases? Azure Cosmos DB has couple of system functions to help you, I will focus on the following functions in this post. These are the functions just got updated in June 2020.

Click through to learn a bit about those money-saving updates.

Comments closed