Press "Enter" to skip to content

Category: Misc Languages

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome with a faster speed, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.

Comments closed

Running Postman Tests in GitLab

Rahul Kumar automates Postman tests:

Hi folks, In this brief blog post, we’ll learn more about Gitlab CI and Postman, the API testing tool we use the most frequently. This article’s goal is to provide a quick process for automatically testing the service API response. The solution makes use of the capabilities provided by the Gitlab-integrated Continuous Integration tool.

Click through for the tutorial.

Comments closed

Raw String Literals in C# 11

Patrick Smacchia shows off raw string literals:

C# 11 introduces Raw String Literals. Undoubtedly this feature will become very popular because it represents an elegant way to solve some issues with actual string literal.

Let’s have a look at such raw string literal with interpolation. Notice that a raw string literal necessarily starts and ends with at least 3x double quote characters """.

This is similar to how several other languages, including F# and Python, do it.

Comments closed

What’s New in SynapseML

Nellie Gustafsson and Mark Hamilton share an update:

SynapseML is a massively scalable (feel free to spin up hundreds of machines!) machine learning library built on Apache Spark. SynapseML makes it easy to train production-ready models to solve problems from simple classification and regression to anomaly detection, translation, image analysis, speech to text, and just about any ML challenge you are facing.  Under the hood, SynapseML integrates a wide array of ML technologies such as LightGBM, Vowpal Wabbit, ONNX, and the Cognitive Services into a single easy to use API compatible with MLFlow. We know, we know, everyone hates when developers invent new APIs, but you can rest easy because SynapseML integrates cleanly into existing Spark ML APIs so you can embed models directly into existing pipelines. We strive to make SynapseML available to developers wherever they work, and the library is available in a variety of languages like Python, Scala, Java, R. As of this release SynapseML is also usable from .NET, C#, F#.

Saving the best language for last, I see. Click through for the list of updates.

Comments closed

Neo4j Imports and Case Sensitivity

Steve Jones is getting me in a ranting mood:

I kept editing the file and trying different things. I compared what I had locally with what was on GitHub. Eventually, I realized this is the issue:

{employeeID:row.EmployeeID}

In the GitHub csv, the first row has headers with EmployeeID. In my local file, the header is “employeeID” (lower case). As soon as I edited this, it worked.

Case sensitivity is a big historical mistake.

Comments closed

SQL Server and Golang

Shane O’Neill goes long:

It’s all very well and good to open up Golang and write a FizzBuzz (note to self: write a FizzBuzz), but I still work with databases.

So before I do anything with Golang, I’d like to know: can it interact with databases. 

Granted, everything can, but how easy is it? And does trying to interact with databases kill any interest I have in the language.

So, let’s give it a go.

Click through to see what Shane came up with.

Comments closed

Recreating a Shiny App with Plumber and React

Liam Kalita continues a series:

We’ll assume you have a basic understanding of HTML and JavaScript, but you should be able to follow along with a basic programming background. Having a little knowledge of Linux shell commands would be beneficial for some of the terminal commands for generating directories, but you can also do most of it in VSCode using the user interface instead.

Let’s attempt an exercise in creating a small React+Plumber app; this will be very similar to a previous blog post recreating this tutorial {shiny} application using Python Flask.

Click through to see how to build the app. The final part of the series will show how to host the app.

Comments closed

Saving DateOnly and TimeOnly to Databases

Hasan Savran gives us some mixed news:

DateOnly and TimeOnly are the new members of .NET family, they are introduced in version 6. Developers have been asking to declare only a date or only a time without using the DateTime object for a long time. These two new functions will make the developer’s life easy for sure. You can declare them easily but saving them to a database can be a challenge. Before we go into those details for saving them, Let’s look at how to declare and use them in the code.

But right now, they aren’t so useful for SQL Server. Which is a shame, as there are obvious mapping candidates : DATE and TIME, respectively. My hope is that the Microsoft.Data.SqlClient library gets an update pretty soon to handle them. But read on to learn how Cosmos DB handles these new types.

Comments closed

Comparing Data Analysis in Java and Python

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, ScyPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.

Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).

In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques on how to do the data analysis on each platform, compare how they scale, and the possibilities to apply parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.

Comments closed