Press "Enter" to skip to content

Day: October 25, 2024

Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated having developers use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.

Comments closed

A Survey of Predictive Analytics Techniques

Akmal Chaudhri tries a bunch of things:

In this short article, we’ll explore loan approvals using a variety of tools and techniques. We’ll begin by analyzing loan data and applying Logistic Regression to predict loan outcomes. Building on this, we’ll integrate BERT for Natural Language Processing to enhance prediction accuracy. To interpret the predictions, we’ll use SHAP and LIME explanation frameworks, providing insights into feature importance and model behavior. Finally, we’ll explore the potential of Natural Language Processing through LangChain to automate loan predictions, using the power of conversational AI.

Click through for the notebook, as well as an overview of what the notebook includes. I don’t particularly like word clouds as the “solution” in the BERT example, though without real data to perform any sort of NLP, there’s not much you can meaningfully do.

Comments closed

Source Control Tips

Aamir Khan shares some tips on source control:

In software development, version control is an essential practice that helps manage changes to code and collaborates effectively with team members. A well-organized repository not only streamlines the development process but also enhances productivity and minimizes errors. In this blog post, we’ll explore best practices for maintaining a clean and organized repository, including branch naming conventions and crafting effective commit messages.

The primary audience for this is software developers, but if you create and modify SQL assets (tables, stored procedures, views, functions, etc.), source control is a great thing for many reasons, and this still applies to you. And if you write Powershell scripts because you’re “not a coder,” well, I have some shocking news for you.

Comments closed

Fixing Timeout Issues with Azure SQL Database

Reitse Eskens shares some knowledge:

The customer can connect to the Azure Sql database with Sql Server Management Studio (SSMS) but not with a specific client application.
When digging into the logs (all logs were activated for this database), nothing shows up for the specific login used by the client application. The application itself returns a connection error caused by a time-out.

The application resides outside of Azure and can’t use a VPN connection, the Azure Sql Server has a specific firewall rule to allow incoming traffic from this specific IP address. Not a situation I’m really happy with, but it happens.

Read on for the solution. It was not one I had anticipated. But it did land in my “When in doubt, blame the network” policy.

Comments closed

The Importance of Monitoring in Microsoft Fabric

Marc Lelijveld flips a switch but also watches it:

A long time ago, I blogged about Power BI governance with topics like feature implementation in a phased approach and why you should consider to disable export to Excel. In this blog, I want to continue the governance topic with another blog about why monitoring your tenant is important! This blog will also provide you an overview of the various monitoring options you have out of the box, no matter what your role is. No matter if you are the workspace-, capacity-, domain- or tenant administrator.

I encourage everyone, no matter if you are the service administrator or not, to go through this blog and look from various angles how monitoring can help. I think it can be relevant for any Fabric / Power BI user to see all capabilities it has to offer from a different angle and better understand possible restrictions that are set by your service administrator.

Read on for Marc’s argument, as well as plenty of examples of what you can do as far as monitoring goes.

Comments closed