Press "Enter" to skip to content

Curated SQL Posts

Overview: U-SQL Database Projects

Zach Stagers gives us an overview of the new U-SQL Database Project structure:

Source Control

The project integrates much more nicely with TFS than the older “U-SQL Project” does.

It actually gives you the icons (padlock, check mark, etc.) in the solution explorer, so it looks like it’s under source control!

Something that I’d really hoped had been fixed, but hasn’t: when copying and renaming an existing item, it doesn’t recognize the rename. You have to undo the checkout of the non-existent object (the copy, before being renamed).

Read on for more improvements.


The Intersection Of Multiple Averages Is The Empty Set

Adi Gaskell argues that we shouldn’t get too wrapped up in “average” behaviors:

I’ve written extensively about the tremendous potential for big data in healthcare to drive enormous changes in how we keep people healthy for longer. It goes without saying however that all data is not created equal, and just having a large sample is not always sufficient to get the best insights.

If we needed reminding, a reminder comes via a recent study from the University of California, Berkeley. It suggests that things like emotion, behavior, and physiology vary hugely between individuals, therefore having an average over a large dataset can still produce a ‘norm’ that is wide of the mark for individuals.

“If you want to know what individuals feel or how they become sick, you have to conduct research on individuals, not on groups,” the researchers say. “Diseases, mental disorders, emotions, and behaviors are expressed within individual people, over time. A snapshot of many people at one moment in time can’t capture these phenomena.”

Variance is important.


Building An Extended Events Session

Aamir Syed gives us a simple example of using the Extended Events UI to create a new session:

Many of us have not made the effort to switch from Profiler to Extended Events. It’s 2018; if you haven’t found a few hours to learn about this incredibly powerful tool, I urge you to do so now.

I’m going to provide a quick means of tracking queries with Extended Events. This is not an example of how comprehensive the tool is, but I hope that it at least spurs some interest.

One of the main reasons we use Profiler is to quickly capture some real-time data. I’m going to not only show you how to do that with Extended Events, but also how this same session can serve as a historical view, since it’s so easy to sift through and filter the data. (No, you don’t have to create a table for the result sets à la Profiler.)
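
If you’d rather script the session than build it through the UI, a minimal sketch of a query-tracking session might look like the following. The event, action, and target names are real Extended Events objects; the session name, duration filter, and file settings are illustrative assumptions, not Aamir’s exact setup:

    -- Capture completed batches and RPC calls, with enough context to be useful.
    CREATE EVENT SESSION [TrackQueries] ON SERVER
    ADD EVENT sqlserver.rpc_completed
    (
        ACTION (sqlserver.database_name, sqlserver.sql_text, sqlserver.username)
        WHERE duration > 1000000   -- microseconds: only calls over one second
    ),
    ADD EVENT sqlserver.sql_batch_completed
    (
        ACTION (sqlserver.database_name, sqlserver.sql_text, sqlserver.username)
        WHERE duration > 1000000
    )
    ADD TARGET package0.event_file   -- rolling history on disk; no table required
    (
        SET filename = N'TrackQueries.xel', max_file_size = 50
    )
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

    ALTER EVENT SESSION [TrackQueries] ON SERVER STATE = START;

Watching the live data feed in Management Studio gives you the Profiler-style real-time view, while the event_file target quietly accumulates the historical record.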

Click through for step-by-step instructions.


Azure SQL Database Service By Purchase Model

Glenn Berry explains the two purchase models available with Azure SQL Database, as well as the various service tiers within each model:

The older pricing option is the DTU-based SQL purchase model, where a fixed set of resources is assigned to the database from three service tiers, which are Basic, Standard, and Premium.

For Standard and Premium, there are multiple performance levels, which are classified according to how many Database Transaction Units (DTUs) they provide (along with their included storage and maximum available storage). The Premium tier is designed for I/O-intensive workloads, and is fault-tolerant.

The Database Transaction Unit (DTU) is based on a blended measure of CPU and memory, along with storage reads and writes. The DTU-based performance levels represent preconfigured bundles of compute, memory, and storage resources designed to drive different levels of application performance. If you do not want to worry about the underlying resources and prefer the simplicity of a preconfigured resource bundle while paying a fixed amount each month, you may find the DTU-based model more suitable for your needs and easier to understand.
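
As a practical aside (not from Glenn’s post): if you’re trying to judge whether a DTU tier fits an existing workload, the sys.dm_db_resource_stats DMV in Azure SQL Database reports recent resource consumption as a percentage of your current service objective’s limits:

    -- One row per 15-second interval, retained for roughly an hour.
    -- Values are percentages of the current service tier's limits.
    SELECT TOP (30)
        end_time,
        avg_cpu_percent,
        avg_data_io_percent,
        avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC;

If those percentages hover near 100, the workload has likely outgrown the tier; if they idle in the single digits, you may be paying for more than you need.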

Glenn does a good job clearing up some of the complications around pricing for Azure SQL Database.


Configuring An Azure Runbook For Index Maintenance

Jim Donahoe explains how to perform index and statistics maintenance for Azure SQL Database, where you don’t have SQL Agent available:

I had a lot of issues when I created my first one, and after discussing with some folks, they had the same issues. I searched for the best blog posts that I could find on the subject, and the one I LOVED the most was here: Arctic DBA. He broke it down so simply that I finally created my own pseudo-installer, and I wanted to share it with all of you. Please bear in mind that these code snippets may fail at any time due to changes in Azure.

**IMPORTANT**

These next steps assume the following:

  • You have created/configured your Azure Automation Account and the credential used to execute this runbook.

Read on for a reasonably short PowerShell script and a modified version of Ola Hallengren’s index maintenance procedures.
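
For context, the heart of such a runbook is ultimately a call to Ola Hallengren’s IndexOptimize procedure, along these lines. The parameters are real, but the values below are illustrative defaults rather than Jim’s modified version:

    -- Illustrative IndexOptimize call; thresholds are examples only.
    EXECUTE dbo.IndexOptimize
        @Databases = 'USER_DATABASES',
        @FragmentationLow = NULL,            -- below 5% fragmentation: leave alone
        @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE',
        @FragmentationHigh = 'INDEX_REBUILD_ONLINE',
        @FragmentationLevel1 = 5,
        @FragmentationLevel2 = 30,
        @UpdateStatistics = 'ALL';           -- refresh statistics as well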


Using The system_health Extended Event Session

Matthew McGiffen walks us through what the system_health Extended Events session gives us:

When Microsoft introduced Extended Events (XE) in 2008, they also gave us a built-in XE session called system_health.

This is a great little tool. I mainly use it for troubleshooting deadlocks as it logs all the information for any deadlocks that occur. No more having to mess about making sure specific trace flags are enabled to ensure deadlock information is captured in the error log.

It also captures the SQL text and Session Id (along with other relevant data) in a number of other scenarios you may need to troubleshoot:

  • Where an error over severity 20 is encountered

  • Where a session has waited on a latch for over 15 seconds

  • Where a session has waited on a lock for over 30 seconds

  • Sessions that have encountered other long waits (the threshold varies by wait type)
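
As an example of cashing in on that deadlock capture, a query along these lines (a common pattern, not taken from Matthew’s post) pulls the deadlock graphs back out of the session’s ring buffer target:

    -- Extract deadlock reports from the system_health ring buffer.
    SELECT
        xed.value('@timestamp', 'datetime2') AS event_time,
        xed.query('(data[@name="xml_report"]/value/deadlock)[1]') AS deadlock_graph
    FROM
    (
        SELECT CAST(st.target_data AS xml) AS target_data
        FROM sys.dm_xe_session_targets AS st
        INNER JOIN sys.dm_xe_sessions AS s
            ON s.address = st.event_session_address
        WHERE s.name = N'system_health'
          AND st.target_name = N'ring_buffer'
    ) AS src
    CROSS APPLY src.target_data.nodes
        ('RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);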

Simply knowing what this session includes can give you a leg up on troubleshooting, especially when it’s a machine you haven’t seen before.


Wanted: Per-Database Wait Stat Collection Built In

Erik Darling wants configurable wait stat collections on a database level built into SQL Server:

I’m hoping that a feature like this could solve some intermediate problems that Query Store doesn’t.

Namely, being lower overhead, not collecting any PII, and not taking up a lot of disk space — after all, we’re not storing any massive stored proc text or query plans, here, just snapshots of wait stats.

This will help even if you’re already logging wait stats on your own. You still don’t have a clear picture of which database the problem is coming from. If you’ve got a server with lots of databases on it, figuring that out can be tough.

Understanding what waits (and perhaps bottlenecks) a single database is experiencing can also help admins figure out what kind of instance size they’d need as part of a migration.
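
Until something like that ships, the roll-your-own logging Erik mentions usually amounts to snapshotting the instance-wide DMV on a schedule and diffing the snapshots. A hedged sketch (the table and column names here are mine):

    -- Snapshot instance-wide waits so they can be compared over time.
    -- Note the limitation Erik calls out: sys.dm_os_wait_stats cannot
    -- attribute any of these waits to a particular database.
    CREATE TABLE dbo.WaitStatsSnapshot
    (
        capture_time   datetime2    NOT NULL DEFAULT SYSUTCDATETIME(),
        wait_type      nvarchar(60) NOT NULL,
        waiting_tasks  bigint       NOT NULL,
        wait_time_ms   bigint       NOT NULL,
        signal_wait_ms bigint       NOT NULL
    );

    INSERT dbo.WaitStatsSnapshot (wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_time_ms > 0;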

It’s an interesting approach.  If you agree with Erik, go vote.


Performance Test: Loading CSV Versus Loading Excel In Power Query

Chris Webb lays out a performance test which shows how quickly Power Query can read data from a CSV versus from an Excel spreadsheet:

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

  • Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower

  • The initial values of 80 million bytes seem to be where data is read from the end of the Excel file. Maybe this is Power Query reading some file metadata? Anyway, it seems as though it takes 5 seconds before it starts to read the data needed by the query.

  • There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

The results were remarkable; check them out.


Highlighting Data With gghighlight

Laura Ellis shows off the gghighlight package, which allows you to selectively highlight certain sets of data in ggplot:

While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame.  Further, you have to tinker more with the labelling to really call out the highlighted data points.

Thanks to Hiroaki Yutani, we now have the gghighlight package, which does most of the work for us with a small function call! Please note that a lot of this code was created by looking at examples in her introduction document.

The new school way is even simpler:

  1. Using ggplot2, create a plot with your full data set.

  2. Add the gghighlight() function to your plot with the conditions set to identify your subset.

  3. Celebrate! This was one less step AND we got labels!

That’s a very cool package.  H/T R-Bloggers


Stoppable, Async Shiny Interfaces

Ian at Fells Stats wants to make a long-running Shiny app a bit more user-friendly:

Shiny operates in a reactive programming framework. Fundamentally, this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases this is exactly what you want, and it makes Shiny programs concise and easy to make; however, in the case of long-running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the action button is clicked. Another important component is to provide your user with feedback as to how long the analysis is going to take. Shiny has nice built-in progress indicators that allow you to do this.

There are a couple of false starts in there but by the time you reach the third act, the story makes sense.  H/T R-Bloggers
