When scoring Python models as Apache Spark UDFs, users can now filter UDF outputs by selecting from an expanded set of result types. For example, specifying a result type of pyspark.sql.types.DoubleType filters the UDF output and returns the first column that contains double precision scalar values. Specifying a result type of pyspark.sql.types.ArrayType(DoubleType) returns all columns that contain double precision scalar values. The example code below demonstrates result type selection using the result_type parameter. And the short example notebook illustrates a Spark model logged and then loaded as a Spark UDF.
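To make the filtering rule concrete, here is a toy model of it in plain Python. It is not the library's implementation and does not use Spark: the function name filter_output, the string tags "double" and "array<double>", and the sample columns are all assumptions made for illustration.

```python
from typing import Dict, List

Column = List[object]


def is_double_column(values: Column) -> bool:
    """True when every value is a Python float (a stand-in for a double scalar)."""
    return len(values) > 0 and all(isinstance(v, float) for v in values)


def filter_output(columns: Dict[str, Column], result_type: str):
    """Toy model of result-type filtering over a model's output columns.

    result_type is a hypothetical string tag, not a real Spark type:
      "double"        -> the values of the first all-double column
      "array<double>" -> one list per record, spanning every all-double column
    """
    double_cols = [vals for vals in columns.values() if is_double_column(vals)]
    if not double_cols:
        raise ValueError("model output has no double column")
    if result_type == "double":
        return double_cols[0]
    if result_type == "array<double>":
        # zip(*cols) turns column-wise data into row-wise records.
        return [list(row) for row in zip(*double_cols)]
    raise ValueError(f"unsupported result type: {result_type}")


# Hypothetical model output: one string column and two double columns.
predictions = {
    "label": ["a", "b"],
    "score": [0.9, 0.1],
    "margin": [1.5, -2.0],
}

first = filter_output(predictions, "double")
all_doubles = filter_output(predictions, "array<double>")
```

With the scalar tag only the score column survives; with the array tag each record carries both score and margin, and the string column is dropped in either case.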
Read on for a pretty long list of updates.
HDFS, the Hadoop Distributed File System, is the distributed file system provided by the Hadoop Big Data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster. This is facilitated by replicating data across different racks in the cluster infrastructure. Files stored in HDFS are then used for further processing by engines such as Hadoop MapReduce, Hive, Spark, Impala, and Pig.
There are a few other formats not included in this list, including RCFile (which has been superseded by both ORC and Parquet), but this hits the highlights.
Jason Brimhall has a three-part series on the types of monitoring built into SQL Server. Part one is an overview and includes the Default Trace:
The default trace by itself is something that can be turned off via configuration option. There may be good reason to disable the default trace. Before disabling the default trace, please consider the following that can be captured via the default trace. I will use a query to demonstrate the events and categories that are configured for capture in the default trace.
Beyond being a component of the black box for SQL Server, what exactly is this event session? The system_health is much as the name implies – it is a “trace” that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The event session will trap various events related to deadlocks, waits, CLR, memory, schedulers, and reported errors. To get a better grasp of this, let’s take a look at the event session makeup based on the available metadata in the DMVs and catalog views.
Beyond being a component of the black box for SQL Server, what exactly is this diagnostics process? The sp_server_diagnostics is much as the name implies—it is a “diagnostics” service that attempts to gather information about various events that may affect the overall health of the SQL Server instance.
The diagnostics process will trap various server related health (diagnostics) information related to the SQL Server instance in an effort to try and detect potential failures and errors. This diagnostics session/process traps information for five different categories by default. There is a sixth category of information for those special servers that happen to be running an Availability Group.
I’ve used the first two but did not know about the third. Jason goes into good depth on each, showing you the types of information you can get out of these. Read the whole thing.
Earlier this week, I blogged about the automatic cleanup process that purges old data from the SSIS catalog logging tables. This nightly process removes data for operations that are older than 365 days. While this is useful, many SSIS admins have complained that this process is very slow and contentious on large or busy SSISDB databases.
In this post, I’ll show you one of the main reasons this purge process is slow, and will share a more efficient way of performing this delete operation.
Click through for the script and explanation.
There are times when I am working on a SQL script that really shouldn’t be executed all at once. Sometimes it’s a series of examples / demos for a presentation or forum answer. Other times it’s just a temporary need while I’m in the process of creating a complex script, but once the script is completed and tested then it should run all at once. In either case, I have accidentally hit F5 too many times when I thought that a certain section of code was highlighted (so only that section would execute) but in fact nothing was highlighted so the script started executing from the very top, and either ran until completion or until I was able to cancel it (if it ran long enough for me to have time to understand what was happening and hit the “cancel” button).
So I needed some way of ensuring that a script would not execute if no section was highlighted.
Read on to learn about
I prefer it to R mostly because I don’t have to create the csv-file(names) in advance before I import data to it. This is particularly important for scenarios where I want to append data to an existing file. The key for this task is NOT to use the append-option that Python offers, because M-scripts will be executed multiple times and this would create a total mess in my file. Instead I create a new file with the content to append and use the Import-from-folder method to stitch all csvs back together. Therefore I have to dynamically create new filenames for each import. So when the M-Python-scripts are executed repeatedly here, the newly created file will just be overwritten – which doesn’t do any harm.
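The file-per-batch idea can be sketched with the standard library alone. This is a minimal illustration rather than the author's script: the export_ filename prefix, the batch identifiers, and the two-column layout are assumptions, and the csv module stands in for whatever the original code uses.

```python
import csv
import tempfile
from pathlib import Path


def export_batch(folder: Path, batch_id: str, rows) -> Path:
    """Write one batch to its own file; a re-run with the same batch_id overwrites it."""
    path = folder / f"export_{batch_id}.csv"
    # Mode "w" (not "a") is the point: repeated execution cannot duplicate rows.
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows(rows)
    return path


def import_from_folder(folder: Path):
    """Stitch every exported file back together, in filename order."""
    combined = []
    for path in sorted(folder.glob("export_*.csv")):
        with path.open(newline="") as handle:
            combined.extend(csv.DictReader(handle))
    return combined


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    export_batch(folder, "2019-04-01", [(1, "a"), (2, "b")])
    # Simulate the script being executed a second time for the same batch.
    export_batch(folder, "2019-04-01", [(1, "a"), (2, "b")])
    export_batch(folder, "2019-04-02", [(3, "c")])
    stitched = import_from_folder(folder)
    file_count = len(list(folder.glob("export_*.csv")))
```

Running the first batch twice still leaves two files and three rows in total, which is the property that appending to a single file would not give.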
Click through for the code as well as a few caveats.
I was just goofing around with the data, so I didn’t really need anything perfect…but I did want something that was good enough to be repeatable, in case I wanted to do it again.
Fixing thousands of rows by hand sounded like torture. Heck. No.
The data was from a publicly available data set, so getting the file format fixed seemed like it would probably be neither quick nor easy. Depending on others could be a dead end, and while this would be the “rightest” solution to ensure a stable future fix, it was overkill for my casual playtime.
Andy has shown the easy way. Now we lock him in a room with sed and a book on regular expressions to learn the other way. The correct answer to that, of course, is to fashion a pick kit out of the book (and whatever else you might be able to acquire) to get out.
Two weeks ago I was working on a very interesting case in SQL 2016. I received an email from one of my customers saying that they were having intermittent issues within their app, which was executing some sp_execute_external_script calls to the database.
We also restarted the Launchpad service, but with no luck. The biggest challenge was that sometimes the service responded fine and sometimes it showed the issue that I pasted above (and this was absolutely new to me). From the SQL side we ran an extended events session with all R Services counters, but nothing appeared. From sys.dm_exec_session_wait_stats we observed only that the session was waiting on the SATELLITE_SERVICE_SETUP wait, which indicates that SQL Server was waiting for an answer from the R service itself.
Click through for the solution.
Assume you want to execute a Hive query that accesses data from an external RDBMS behind a JDBC connection. A possible naïve way of doing this would treat the JDBC source as a “dumb” storage system, reading all the raw data over JDBC and processing it in Hive. In this case you would ignore the query capabilities of the RDBMS and pull too much data over the JDBC link, thus ending up with poor performance and an overloaded system.
For that reason, Hive implements smart push-down to other systems by relying on its storage handler interfaces and cost-based optimizer (CBO) powered by Apache Calcite. In particular, Calcite provides rules that match a subset of operators in the logical representation of the query and generates a new equivalent representation with more operations executed in the external system. Hive includes those rules that push computation to the external systems in its query planner, and then relies on Calcite to generate a valid query in the language that those systems support. The storage handler implementations are responsible to send the generated query to the external system, retrieve its results, and transform the incoming data into Hive internal representation so it can be processed further if needed.
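The difference between the naive read and a push-down can be shown on a small scale. The sketch below uses an in-memory SQLite database as a stand-in for the external RDBMS; it does not involve Hive, Calcite, or JDBC, and the orders table, its columns, and both function names are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite plays the role of the external RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.0), (4, "US", 7.5)],
)


def naive_total(connection, region):
    """Treat the source as dumb storage: fetch every row, then filter and sum locally."""
    rows = connection.execute("SELECT id, region, amount FROM orders").fetchall()
    total = sum(amount for _, row_region, amount in rows if row_region == region)
    return total, len(rows)  # (result, rows moved over the link)


def pushdown_total(connection, region):
    """Send the filter and the aggregate to the source; only the answer comes back."""
    rows = connection.execute(
        "SELECT SUM(amount) FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return rows[0][0], len(rows)


naive = naive_total(conn, "EU")
pushed = pushdown_total(conn, "EU")
```

Both calls produce the same total, but the naive version moves all four rows across the connection while the pushed-down version moves one. On a real table that gap is what the optimizer rules described above are trying to exploit.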
A lot of platforms are moving toward data virtualization (e.g., SQL Server with its Big Data Clusters). That appears to be the next product battleground.
When a $null value is used in a numeric equation, your results will be invalid if they don’t give an error. Sometimes the $null will evaluate to 0 and other times it will make the whole result $null. Here is an example with multiplication that gives 0 or $null depending on the order of the values.
Nulls are tricky to handle in any language, making their nuances important to understand.