
Day: October 6, 2021

pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by the SPIP "Support pandas API layer on PySpark" as part of Project Zen (see also "Project Zen: Making Data Science Easier in PySpark" from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:

Click through to see more details on the change.

Comments closed

Write-Debug in PowerShell

Robert Cain goes from verbose to debug mode:

In my previous post, Fun With PowerShell Write-Verbose, I introduced the use of the built-in -Verbose switch. In this post we’ll dive into its counterpart, the -Debug switch, and its companion Write-Debug cmdlet.

In covering Write-Verbose, I mentioned verbose messages are typically targeted at the average user. The -Debug messages are meant to target the developer. These messages are meant to assist the PowerShell developer in troubleshooting and debugging their code.

Click through for examples of it in action.


SQL Server Express Memory Limitations

Steve Stedman notes that the memory limitations on SQL Server Express Edition are not quite as stringent as you may first believe:

Looking at the memory limits and other limits on the SQL Server versions over time, we have seen things increase, but one limit that is still very low is the memory limit for SQL Express. Specifically, the maximum memory for the buffer pool per instance of the SQL Server Database Engine in SQL 2019 is 1410 MB.

At first glance you may think that this limit is the total amount of memory that SQL Server will use, but let me show you a couple of screenshots from Database Health Monitor showing the memory utilization on two different SQL 2019 Express servers.

Read on to see what, exactly, the memory limitation is. Also, there are separate limits for things like In-Memory OLTP table sizes.


Top with Percent

Kevin Wilkie is on the top shelf:

In the last blog post, we went over the extreme basics of using the TOP operator in SQL. We showed how to grab things like the TOP 10 of a certain item.

That ability will get you through a number of criteria that you will be asked to perform. But what if you’re asked to grab the top five percent of performers in your company? Or in a region? It’s kinda hard to do that if you only have what we know so far, right?
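To make the question concrete, here is a minimal Python sketch of what a percentage-based TOP has to compute. It is not the query from Kevin's post. It assumes the SQL Server behavior that a fractional row count under TOP ... PERCENT is rounded up to the next whole row, and the function name `top_percent` and the sample data are hypothetical.

```python
import math


def top_percent(rows, percent, key):
    """Return the top `percent` percent of rows, ordered by `key` descending.

    Roughly mimics TOP (n) PERCENT ... ORDER BY key DESC, assuming a
    fractional row count is rounded up to the next whole row.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    count = math.ceil(len(rows) * percent / 100)
    return sorted(rows, key=key, reverse=True)[:count]


# Hypothetical data: 50 performers with made-up, unique scores.
performers = [{"name": f"emp{i:02d}", "score": i * 3 % 101} for i in range(50)]

top_five_percent = top_percent(performers, 5, key=lambda r: r["score"])
print([p["name"] for p in top_five_percent])
```

With 50 rows, five percent is 2.5 rows, so under the round-up assumption the sketch returns three rows rather than two.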

Read on for the answer.


Operating System Error 995 on Adding a Database to an AG

Andrew Pruski troubleshoots a problem:

I was adding databases to an availability group (SQL Server 2017 CU20 instance) the other day and one database failed to automatically seed to the secondary.

When I looked in the SQL Server error log I saw this error message:

BackupIoRequest::ReportIoError: write failure on backup device ‘{GUID}’. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

Read on to see how Andrew solved the problem.


Combining Windows and SQL Server Time Zone Info

Andy Mallon splices together time zone information like some kind of time zone Dr. Frankenstein:

There was a recent post on Database Administrators, where there was some confusion over Daylight Saving Time (DST) in Australian timezones. Let’s be honest, though: timezones are confusing, especially when DST comes into play. Arizona Time doesn’t observe Daylight Saving Time, so sometimes it’s the same as Mountain Time, and sometimes it’s Pacific Time.
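The Arizona example can be checked with a short Python sketch. This is only an illustration and not the approach from Andy's post: it assumes Python 3.9 or later with the standard `zoneinfo` module and uses IANA zone names (`America/Phoenix`, `America/Denver`, `America/Los_Angeles`) rather than the Windows time zone names that SQL Server exposes. The helper name `utc_offset_hours` is hypothetical.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def utc_offset_hours(zone_name, when):
    """Return the UTC offset, in hours, of `zone_name` at local time `when`."""
    local = when.replace(tzinfo=ZoneInfo(zone_name))
    return local.utcoffset().total_seconds() / 3600


# One date outside US daylight saving time and one date inside it.
winter = datetime(2021, 1, 15, 12, 0)
summer = datetime(2021, 7, 15, 12, 0)

zones = ("America/Phoenix", "America/Denver", "America/Los_Angeles")
for label, when in (("winter", winter), ("summer", summer)):
    offsets = {zone: utc_offset_hours(zone, when) for zone in zones}
    print(label, offsets)
```

Phoenix stays at the same offset all year, so it lines up with Denver in winter and with Los Angeles in summer.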

Finding the right time zone can be hard. As of this writing, Windows recognizes 141 different timezones, many of which overlap.

In this case, the asker was running this query, and was confused over why it wasn’t properly reflecting DST, and the other columns in the DMV weren’t helpful either:

This is why everybody should just use Eastern Standard Time: it’s the standard—says so in the name, even.


Replication Updates in SQL Server 2019 CU13

Kevin Farlee surprises us all with some changes to replication:

Starting with CU13 of SQL Server 2019, we are including new functionality which facilitates peer-to-peer replication with conflict detection & resolution using a last-writer-wins strategy. This enables many scenarios for multi-write configurations which may not have been possible or practical before.

Traditionally, customers configuring peer-to-peer replication are advised to implement a sharded solution which ensures that each row is updated on only one replica server (key values A-M are updated on server 1 while key values N-Z are updated on server 2, etc.). If it should happen that the same row is updated within a small time window by two servers in a replication environment, a conflict will be detected, which will either stop replication after raising an error, or it may optionally be resolved using the originator ID of the replica sending the update (i.e., if you have replicas with originator IDs 1 and 2, then updates coming from the replica with originator ID 2 would always win and be persisted). See "Conflict Detection in peer-to-peer replication – SQL Server | Microsoft Docs" for a detailed explanation.
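The difference between the two resolution strategies can be sketched in a few lines of Python. This is a conceptual model only, not how SQL Server implements replication: the `Update` class, both function names, and the sample timestamps are hypothetical, and the tie-break on equal timestamps is an assumption made for the sake of a complete example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Update:
    originator_id: int     # ID of the replica that made the change
    written_at: datetime   # when the change was written
    value: str             # new value for the conflicting row


def resolve_by_originator(a, b):
    """Originator-based resolution: the higher originator ID always wins."""
    return a if a.originator_id > b.originator_id else b


def resolve_last_writer_wins(a, b):
    """Last-writer-wins: the most recent write is kept.

    Assumption: equal timestamps fall back to the originator ID.
    """
    if a.written_at != b.written_at:
        return a if a.written_at > b.written_at else b
    return resolve_by_originator(a, b)


# Server 2 writes first, then server 1 updates the same row four seconds later.
from_server_1 = Update(1, datetime(2021, 10, 6, 12, 0, 5), "written later")
from_server_2 = Update(2, datetime(2021, 10, 6, 12, 0, 1), "written earlier")

print(resolve_by_originator(from_server_1, from_server_2).value)
print(resolve_last_writer_wins(from_server_1, from_server_2).value)
```

In this sketch the originator-based rule keeps the older write from server 2, while last-writer-wins keeps the newer write from server 1.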

Read on for more details.
