Stop me if you’ve heard this one before.
The company is growing like crazy, your engineering team keeps rising to the challenge, and you are ferociously proud of them. But some cracks are beginning to show, and frankly you’re a little worried. You have always advocated for engineers to have broad latitude in technical decisions, including choosing languages and tools. This autonomy and culture of ownership is part of how you have successfully hired and retained top talent despite the siren song of the Faceboogles.
But recently you saw something terrifying that you cannot unsee: your company is using all the languages, all the environments, all the databases, all the build tools. Shit!!! Your ops team is in full revolt and you can’t really blame them. It’s grown into an unsupportable nightmare and something MUST be done, but you don’t know what or how — let alone how to solve it while retaining the autonomy and personal agency that you all value so highly.
I hear a version of this everywhere I’ve gone for the past year or two. It’s crazy how often. I’ve been meaning to write my answer up for ages, and here it (finally) is.
I like the solution: embrace the sprawl, but make the default a stable set of well-supported systems and give people reasons to want to start there. Read the whole thing.
I am pretty sure that (almost) everyone reading this blog knows that a CTE (Common Table Expression) is an independent subquery that can be named and then referenced (multiple times if needed) in the main query. This makes CTEs an invaluable tool to increase the readability of complex queries. Almost everything we can do with a CTE can also be done by using subqueries directly in the query, but at the cost of readability.
However, there is also one feature that is actually unique to CTEs: recursion. This blog post is not about what this feature is, what it does, or how to write the code. A brief description is given as part of the complete explanation of CTEs in Books Online, and many bloggers have written about recursive CTEs as well.
For this post, I’ll assume the reader knows what a recursive CTE is, and jump into the execution plan to learn how this recursion is actually implemented.
This is (as usual) a great article, and helps explain why recursive CTEs can be slow.
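As a quick refresher (my example, not from the linked post), here is a minimal recursive CTE generating the numbers 1 through 10: the anchor member seeds the result, and the recursive member keeps referencing the CTE until it returns no rows.

```sql
-- Minimal recursive CTE sketch (illustrative; not taken from the article).
WITH Numbers AS
(
    SELECT 1 AS n                  -- anchor member: seeds the result set
    UNION ALL
    SELECT n + 1                   -- recursive member: references the CTE itself
    FROM Numbers
    WHERE n < 10
)
SELECT n
FROM Numbers
OPTION (MAXRECURSION 100);         -- 100 is the default recursion limit; 0 removes the cap
```

Even this trivial query, run with the actual execution plan enabled, shows the spool-based machinery the article digs into.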
And here we have a T-SQL solution for Day 1 of the Advent of Code challenge. The key tasks that we can learn from today are:
Loading a file.
Splitting a string on a delimiter.
Including additional rows into a result set (adding the first zero with a UNION ALL).
Multiplying (duplicating) a result set multiple times.
Performing a running total calculation.
Assigning a sequential number to a set of rows in a specific order.
Using the GROUP BY and HAVING clauses while performing an aggregation.
Read the whole thing.
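As an illustration of a couple of the items above (a hypothetical sketch, not the author's solution, with a made-up @input standing in for loading the real puzzle file), splitting a delimited string and computing a running total might look like this:

```sql
-- Hypothetical sketch: split a delimited string and compute a running total.
-- @input stands in for the real puzzle file, which the article loads from disk.
DECLARE @input nvarchar(max) = N'+1,-2,+3,+1';

WITH changes AS
(
    SELECT CAST(value AS int) AS delta,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
    FROM STRING_SPLIT(@input, N',')    -- note: STRING_SPLIT does not guarantee output order
)
SELECT rn,
       delta,
       SUM(delta) OVER (ORDER BY rn ROWS UNBOUNDED PRECEDING) AS running_total
FROM changes
ORDER BY rn;
```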
Any DAX formula involving arithmetical operators ( + - * / ) might produce a result in a different data type. While this is obvious when the arguments have different data types, it can be less intuitive when the arguments share the same data type: the result might still be of a different data type. This matters because a complex expression can contain many operators, and every operator defines a single expression that produces a new data type, which in turn becomes the argument of the next operator. We will start by looking at the resulting data type of the standard operators, showing a few examples later of how they can affect the result in a more complex expression.
Marco shows some relatively drastic differences: hundreds of dollars when dealing with millions (and any company okay with being off by hundreds of dollars when dealing with millions, please mail me a check for hundreds of dollars).
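Marco's examples are in DAX, but the core idea, that each operator (not the overall expression) decides the data type of its result, is the same one that trips people up in T-SQL. A quick T-SQL illustration of the concept (mine, not from the article):

```sql
-- Each operator produces a result whose data type is determined by its operands.
SELECT 7 / 2        AS int_division,      -- 3: int / int stays int, so the fraction is dropped
       7 / 2.0      AS decimal_division,  -- 3.500000: one decimal operand promotes the result
       (7 / 2) * 2.0 AS mixed;            -- 6.0: the inner division already truncated to 3
```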
Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.
I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.
Click through for a list of some of the technologies in play.
For this post I want to show you the T-SQL code to do this; hopefully it will become a good reference point for the future. Before we step into the code, let's understand the differences between database-level and server-level rules.
Server-level rules enable access to your entire Azure SQL server, that is, all the databases within the same logical server. These rules are stored in the master database. Database-level rules enable access to specific databases within the same logical server (yes, you could also create them within master); think of these as a more granular form of access, created within the user database in question.
Personally, I try to always use database-level rules; this is especially true when I work with failover groups.
Click through for instructions on how to work with both server and database level rules.
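For reference, the calling pattern looks something like this (the rule names and IP ranges below are placeholders; click through for the full walkthrough):

```sql
-- Server-level rule: run in the master database; applies to every database on the logical server.
EXECUTE sp_set_firewall_rule
    @name             = N'ExampleServerRule',
    @start_ip_address = '10.0.0.1',
    @end_ip_address   = '10.0.0.10';

-- Database-level rule: run inside the user database you want to grant access to.
EXECUTE sp_set_database_firewall_rule
    @name             = N'ExampleDatabaseRule',
    @start_ip_address = '10.0.0.1',
    @end_ip_address   = '10.0.0.10';

-- Check what exists.
SELECT * FROM sys.firewall_rules;            -- server-level rules (query from master)
SELECT * FROM sys.database_firewall_rules;   -- database-level rules (query from the database)
```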
Time-series databases have emerged as a best-in-class approach for storing and analyzing huge amounts of data generated by users and IoT devices. While relational and NoSQL databases are sometimes used for time-stamped and time-series data (such as clickstream data from Web and mobile devices, log data from IT gear, and data generated by industrial machinery), today's massive data volumes from the IoT have outstripped the capability of those databases to keep up.
As the high-end time-series use cases piled up, AWS decided it was time to take action and make its entry into the still-specialized field, much as it did with last year's launch of Neptune, its graph database, another specialized database category that is emerging.
Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting from the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originated from Databricks' open source project Avro Data Source for Apache Spark (referred to as spark-avro from now on). In addition, it provides:
- New functions from_avro() and to_avro() to read and write Avro data within a DataFrame instead of just files.
- Avro logical types support, including Decimal, Timestamp, and Date types. See the related schema conversions for details.
- 2X read throughput improvement and 10% write throughput improvement.
In this blog, we examine each of the above features through examples, giving you a flavor of its easy API usage, performance improvements, and merits.
Avro is one of the better rowstore data formats in the Hadoop world, so it’s good to see built-in support here.
Archive the data for historical analysis
One of the production DBAs pointed out that, having gathered the information, it would be useful to hold it for better analysis of repeated issues. I have added an archiving step: when the tool runs, if there is already data in the data gathering folder, it will copy that to an archive folder and name it with the date and time that the cluster log was created, as this is a good estimate of when the analysis was performed. If an archive folder location is not provided, it will create an archive folder in the data folder. This is not ideal, though, as the utility will copy all of the files and folders from there to its own location, so it is better to define an archive folder in the parameters.
There are several improvements in here, so check them out.
In SQL Server 2019, Microsoft added the ability to execute custom Java code along the same lines as we execute R and Python, and this blog post intends to give an introduction to how to install and enable the Java extension, as well as execute some very basic Java code. In future posts, I will drill down into how to pass data back and forth between SQL Server and Java.
There may very well be future posts discussing how the internals differ between Java and R/Python, but I want to talk about that a little bit in this post as well, as it has an impact on how we write and call Java code.
The not-so-secret here is that Java itself is less interesting a language than, say, Scala. And the reason you'd support Scala? To interact with an Apache Spark cluster. I think that's a big part of why you'd want the installer to load Java 1.8 instead of 1.9 or later (which contain API changes that break Spark). Definitely give this a careful read, as there are more moving parts and more gotchas than with R or Python support.
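For a sense of the calling pattern (a hedged sketch: the class name here is hypothetical, and the Java extension and compiled code must already be installed and registered), invoking Java looks much like invoking R or Python, except that @script names a class rather than containing the code:

```sql
-- Sketch only: assumes the Java language extension is enabled and a compiled class
-- (the hypothetical pkg.HelloSql) has been deployed where SQL Server can load it.
EXEC sp_execute_external_script
    @language     = N'Java',
    @script       = N'pkg.HelloSql',        -- fully qualified class name to execute
    @input_data_1 = N'SELECT 1 AS col1'
WITH RESULT SETS ((col1 int NOT NULL));
```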