Press "Enter" to skip to content

Author: Kevin Feasel

SSIS — RPC Server is Unavailable

Jon Morisi does some troubleshooting:

I just spent a long slog sorting out why I could not connect to my SSIS instance remotely.  I work in a very secure environment requiring network approval for any and all ports.  According to the following article, I was under the impression that a request to open incoming traffic on port 135, to a specific IP, would allow SQL Server Management Studio, on that specific IP, to connect remotely to SSIS:

https://docs.microsoft.com/en-us/sql/sql-server/install/configure-the-windows-firewall-to-allow-sql-server-access?redirectedfrom=MSDN&view=sql-server-ver16#BKMK_ssis

After opening port 135, I was receiving the error message in the title of this article.

If you find yourself in this situation, read on to see how Jon was able to solve the problem.
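Not Jon's fix, but one quick sanity check while you're troubleshooting this sort of thing: confirm from the client machine that port 135 is actually reachable before digging deeper, keeping in mind that 135 is just the RPC endpoint mapper and DCOM can negotiate additional dynamic ports beyond it. A minimal Python sketch, with a placeholder hostname:

import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    # Returns True if a TCP connection to host:port succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# ssis-server.example.com is a placeholder; substitute your server.
print(port_open("ssis-server.example.com", 135))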


Understanding the Poisson Distribution

Achim Zeileis shows off my favorite statistical distribution:

The Poisson distribution has many distinctive features, e.g., both its expectation and variance are equal and given by the parameter λ. Thus, E(Y) = λ and Var(Y) = λ. Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Or the Poisson distribution can be approximated by a normal distribution when λ is large. See Wikipedia (2022) for further properties and references.

Here, we leverage the distributions3 package (Hayes et al. 2022) to work with the Poisson distribution in R. In distributions3, Poisson distribution objects can be generated with the Poisson() function. Subsequently, methods for generic functions can be used to print the objects; extract the mean and variance; evaluate the density, cumulative distribution, or quantile function; or simulate random samples.

Read on for a detailed tutorial. H/T R-bloggers.
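The post itself works in R with distributions3, but if you want to poke at the same properties from Python, here's a rough analogue using scipy.stats (the function names differ from the R API in the post):

from scipy import stats

Y = stats.poisson(mu=1.5)             # Poisson distribution with lambda = 1.5

print(Y.mean(), Y.var())              # expectation and variance both equal lambda
print(Y.pmf(2))                       # density: P(Y = 2)
print(Y.cdf(2))                       # cumulative distribution: P(Y <= 2)
print(Y.ppf(0.5))                     # quantile function (here, the median)
print(Y.rvs(size=5, random_state=0))  # simulate random samples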


Git Native Support for Databricks Workflows

Vaibhav Sethi and Roland Faeustlin make an announcement:

We are happy to announce native support for Git in Databricks Workflows, which enables our customers to build reliable production data and ML workflows using modern software engineering best practices. Customers can now use a remote Git reference as the source for tasks that make up a Databricks Workflow; for example, a notebook from the main branch of a repository on GitHub can be used in a notebook task. By using Git as the source of truth, customers eliminate the risk of accidental edits to production code. They also remove the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improve reproducibility as each job run is tied to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks-supported Git providers including GitHub, GitLab, Bitbucket, Azure DevOps and AWS CodeCommit.

Read on to see how it works.
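As a hedged sketch of what this looks like in practice, here is roughly how a job definition with a remote Git source goes through the Jobs API 2.1; the workspace host, token, repo URL, and notebook path below are all placeholders:

import requests

job_spec = {
    "name": "nightly-etl",
    # Remote Git reference used as the source for the job's tasks
    "git_source": {
        "git_url": "https://github.com/my-org/my-repo",  # placeholder repo
        "git_provider": "gitHub",
        "git_branch": "main",  # a git_tag or git_commit also works
    },
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {
            # Relative to the repo root, not to /Workspace
            "notebook_path": "notebooks/etl",
        },
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
}

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/create",  # placeholder host
    headers={"Authorization": "Bearer <token>"},     # placeholder token
    json=job_spec,
)
print(resp.json())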


Saving and Loading a Keras Model

Jason Brownlee made it to a savepoint in time:

Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk.

In this post, you will discover how you can save your Keras models to file and load them up again to make predictions.

After reading this tutorial you will know:

– How to save model weights and model architecture in separate files.

– How to save model architecture in both YAML and JSON format.

– How to save model weights and architecture into a single file for later use.

Read on for an updated step-by-step tutorial.
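As a minimal sketch of the two approaches with tf.keras (note that to_yaml was removed in newer TensorFlow releases, so this sticks to JSON for the architecture-only route):

import numpy as np
from tensorflow import keras

# A tiny model purely for illustration
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# 1. Architecture and weights in separate files
with open("model.json", "w") as f:
    f.write(model.to_json())      # architecture only, as JSON
model.save_weights("weights.h5")  # weights only

with open("model.json") as f:
    rebuilt = keras.models.model_from_json(f.read())
rebuilt.load_weights("weights.h5")

# 2. Architecture, weights, and optimizer state in a single file
model.save("model.h5")
restored = keras.models.load_model("model.h5")
print(restored.predict(np.zeros((1, 4))))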


Working with xp_cmdshell

Hadi Fadlallah takes us through xp_cmdshell:

In brief, xp_cmdshell is a system stored procedure in SQL Server. It allows executing Windows shell commands from the SQL Server environment. While commands are passed as an input string, the shell’s output is returned as rows of text.

The xp_cmdshell procedure takes two parameters: one required and one optional.

Hadi does a good job of showing us what security is in place to protect against malicious use of xp_cmdshell, and how you can add a person to the list of users allowed to run it.
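If you want to script this out, here is a hedged sketch via Python and pyodbc (the connection string is a placeholder, and enabling xp_cmdshell requires sysadmin rights, so only do this where policy allows):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=master;"  # placeholder server
    "Trusted_Connection=yes;TrustServerCertificate=yes;",
    autocommit=True,
)
cur = conn.cursor()

# xp_cmdshell is disabled by default and gated behind 'show advanced options'
cur.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
cur.execute("EXEC sp_configure 'xp_cmdshell', 1; RECONFIGURE;")

# First parameter (required): the command string.
# Second parameter (optional): no_output suppresses the result rows.
for row in cur.execute("EXEC xp_cmdshell 'whoami';"):
    print(row[0])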


Setting Breakpoints in Powershell Scripts

Patrick Gruenauer does a bit of debugging:

The Set-PSBreakpoint cmdlet sets a breakpoint in a script. When you are troubleshooting a script, it could be helpful to know what’s going on in a particular step or workflow. In this blog post I will give you an overview and the basics you can build on to troubleshoot and investigate your script. Let’s jump in.

One of these years, I’m finally going to learn command-line debugging. I grew up in the IDE era and so never took the time to learn that skill.
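For what it’s worth, the Python analogue of this workflow is pdb; this isn’t Set-PSBreakpoint, just the same command-line-debugging idea in a different ecosystem:

import pdb

def process(items):
    total = 0
    for item in items:
        total += item  # the line you might want to inspect
    return total

# Drops into the pdb prompt here; from there you can step into
# process() with s, print variables with p, or continue with c.
pdb.set_trace()
print(process([1, 2, 3]))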


Example Data Pre-Processing Activities

Aayush Srivastava takes us through some pre-processing activities in machine learning:

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing converts the selected data into a form we can work with or can feed to ML algorithms. We always need to preprocess our data so that it meets the expectations of the machine learning algorithm.

Read on for examples of pre-processing steps and how pre-processing differs from data cleaning.
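For a concrete (if generic) flavor of what those steps look like, here is a small scikit-learn sketch with made-up columns: imputing missing values, scaling numeric features, and one-hot encoding a categorical one:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 51],
    "income": [40000, 55000, 62000, None],
    "city":   ["Denver", "Austin", "Austin", "Boston"],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill the gaps
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

prep = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = prep.fit_transform(df)  # now in a form an ML algorithm can consume
print(X.shape)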


Improving Join Performance with Skewed Datasets in Spark

Ajay Gupta gets into the topic of join performance:

Performing Joins on Skewed Datasets: A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys. For example, when 80% of the records in the dataset contribute to only 20% of the Join keys.

Implications of Skewed Datasets for Join: Skewed Datasets, if not handled appropriately, can lead to stragglers in the Join stage (Read this linked story to know more about Stragglers). This brings down the overall execution efficiency of the Spark job. Also, skewed datasets can cause memory overruns on certain executors leading to the failure of the Spark job. Therefore, it is important to identify and address Join-based stages where large skewed datasets are involved.

Read on for five techniques which may help you out.
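One classic mitigation (not necessarily one of Ajay’s five) is key salting, sketched here in PySpark; Spark 3.x can also do much of this automatically via adaptive query execution:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("skew-demo")
         # AQE's built-in skew-join handling (Spark 3.0+)
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())

SALT = 8  # number of buckets to spread each hot key across

# Large, skewed side: ~80% of rows share join key 1; add a random salt.
facts = (spark.range(1_000_000)
         .withColumn("key", F.when(F.rand() < 0.8, F.lit(1))
                             .otherwise((F.rand() * 100).cast("int")))
         .withColumn("salt", (F.rand() * SALT).cast("int")))

# Small side: replicate each row once per salt value.
dims = (spark.range(100).withColumnRenamed("id", "key")
        .crossJoin(spark.range(SALT).withColumnRenamed("id", "salt")))

# Joining on (key, salt) spreads the hot key across SALT partitions
# instead of piling it all onto a single straggler task.
joined = facts.join(dims, on=["key", "salt"], how="inner")
print(joined.count())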


Distributed Replay Deprecated in SQL Server 2022

Brent Ozar starts the wake:

For SQL Server 2022, Microsoft deprecated Distributed Replay.

The idea behind the feature was that you’d capture a trace against your production environment, set up another environment for load testing or QA testing, and then replay that exact same workload against it. You’d be able to measure which queries got better or worse, and how.

The reality was a complete mess. It was a giant pain in the rear to set up and use, to the point where I got frustrated with it within a few hours and asked my peers about their experiences with it. I got back a string of four-letter words – everybody really struggled to get it across the finish line. Over subsequent versions, Microsoft made token efforts to improve it, but never really gave it the love it required.

Yep, I can concur. What we wanted was a simple button-click (or easy-to-navigate UI) that let you capture “What does a real production workload look like?” and then the ability to re-run it elsewhere, like on new hardware. What we got was indeed a mess.

I don’t fully agree with Brent’s argument that the right answer is to build app-level testing. If everything were architected and developed for this, then yeah, that might be a better answer. But unless you’ve built all relevant applications around APIs (so they can be programmatically invoked rather than trying to do everything via Selenium) and have put in the legwork necessary to track and re-run calls, I think you end up with an even bigger mess, especially if there are multiple applications working with the same database. I do agree that this is a hard problem regardless of the path you choose.
