MLflow 0.8.1 Released

Aaron Davidson, et al, announce a new version of Databricks MLflow:

When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code below demonstrates result type selection using the result_type parameter. And the short example notebook illustrates Spark Model logged and then loaded as a Spark UDF.

Read on for a pretty long list of updates.

File Formats Supported In HDFS

Kevin Feasel

2018-12-31

Hadoop

Manoj Pandey covers a few of the file types supported by the Hadoop Distributed File System:

HDFS or Hadoop Distributed File System is the distributed file system provided by the Hadoop Big Data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster. This is facilitated with the help of data replication across different racks in the cluster infrastructure. These files stored in HDFS system are used for further data processing by different data processing engines like Hadoop Map-Reduce, Hive, Spark, Impala, Pig etc.

There are a few other formats not included in this list, including RCFile (which has been superseded by both ORC and Parquet), but this hits the highlights.

SQL Server’s Built-In Monitoring

Jason Brimhall has a three-part series on the types of monitoring built into SQL Server. Part one is an overview and includes the Default Trace:

The default trace by itself is something that can be turned off via configuration option. There may be good reason to disable the default trace. Before disabling the default trace, please consider the following that can be captured via the default trace. I will use a query to demonstrate the events and categories that are configured for capture in the default trace.

Part two looks at the system_health Extended Event session:

Beyond being a component of the black box for SQL Server, what exactly is this event session? The system_health is much as the name implies – it is a “trace” that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The event session will trap various events related to deadlocks, waits, clr, memory, schedulers, and reported errors. To get a better grasp of this, let’s take a look at the event session makeup based on the available metadata in the dmvs and catalog views.

Part three is the sp_server_diagnostics stored procedure:

Beyond being a component of the black box for SQL Server, what exactly is this diagnostics process? The sp_server_diagnostics is much as the name implies—it is a “diagnostics” service that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The diagnostics process will trap various server related health (diagnostics) information related to the SQL Server instance in an effort to try and detect potential failures and errors. This diagnostics session/process traps information for five different categories by default. There is a sixth category of information for those special servers that happen to be running an Availability Group.

I’ve used the first two but did not know about the third. Jason goes into good depth on each, showing you the types of information you can get out of these. Read the whole thing.

Optimizing SSIS Catalog Cleanup

Tim Mitchell has a script which replaces [internal].[cleanup_server_retention_window] in the SSISDB database:

Earlier this week, I blogged about the automatic cleanup process that purges old data from the SSIS catalog logging tables. This nightly process removes data for operations that are older than 365 days. While this is useful, many SSIS admins have complained that this process is very slow and contentious on large or busy SSISDB databases.
In this post, I’ll show to you one of the main reasons this purge process is slow, and will share a more efficient way of performing this delete operation.

Click through for the script and explanation.

Preventing Execution With PARSEONLY And NOEXEC

Solomon Rutzky shows us a way to prevent accidental full script execution:

There are times when I am working on a SQL script that really shouldn’t be executed all at once. Sometimes it’s a series of examples / demos for a presentation or forum answer. Other times it’s just a temporary need while I’m in the process of creating a complex script, but once the script is completed and tested then it should run all at once. In either case, I have accidentally hit F5 too many times when I thought that a certain section of code was highlighted (so only that section would execute) but in fact nothing was highlighted so the script started executing from the very top, and either ran until completion or until I was able to cancel it (if it ran long enough for me to have time to understand what was happening and hit the “cancel” button).
So I needed some way of ensuring that a script would not execute if no section was highlighted.

Read on to learn about PARSEONLY and NOEXEC.

Combining M and Python To Export Power BI Data To CSVs

Imke Feldmann combines Python and M to export Power BI table data to CSV format written out to a directory of your choosing:

Why Python?
I prefer it to R mostly because I don’t have to create the csv-file(names) in advance before I import data to it. This is particularly important for scenarios where I want to append data to an existing file. The key for this task is NOT to use the append-option that Python offers, because M-scripts will be executed multiple times and this would create a total mess in my file. Instead I create a new file with the context to append and use the Import-from-folder method instead to stitch all csvs back together. Therefore I have to dynamically create new filenames for each import. So when the M-Python-scripts are executed repetitively here, the newly created file will just be overwritten – which doesn’t do any harm.

Click through for the code as well as a few caveats.

CSV Cleanup With Powershell

Andy Mallon has a quick fix when some rows in a CSV are missing values:

I was just goofing around with the data, so I didn’t really need anything perfect…but I did want something that was good enough to be repeatable, in case I wanted to do it again.
Fixing thousands of rows by hand sounded like torture. Heck. No.
The data was from a publicly available data set, so getting the file format fixed seemed like it would probably be neither quick nor easy. Depending on others could be a dead end, and while this would be the “rightest” solution to ensure a stable future fix, it was overkill for my casual playtime.

Andy has shown the easy way. Now we lock him in a room with sed and a book on regular expressions to learn the other way. The correct answer to that, of course, is to fashion a pick kit out of the book (and whatever else you might be able to acquire) to get out.

Categories

December 2018
MTWTFSS
« Nov Jan »
 12
3456789
10111213141516
17181920212223
24252627282930
31