Category: Misc Languages

Comparing Data Analysis in Java and Python

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, SciPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.

Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python, it provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).

In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques for doing the data analysis on each platform, compare how they scale, and explore the possibilities for applying parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.

The Future Object in Scala

Gulshan Singh visits from the future:

You have units of work that you want to run asynchronously, so you don’t block while they’re running. A future gives you a simple way to run an algorithm concurrently. A future starts running concurrently when you create it and returns a result at some point, well, in the future. In Scala, we say that a future returns eventually.

The Future instance is a handle to an eventually available result. You can continue doing other work until the future completes, either successfully or unsuccessfully.

You may also know of Futures as Promises. It’s quite similar to async calls in .NET as well.
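The linked post works in Scala. As a rough, language-neutral sketch of the same idea, here is Python's standard `concurrent.futures` module rather than Scala's `Future`. The `slow_square` and `always_fails` functions are hypothetical stand-ins for real units of work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(n: int) -> int:
    """Hypothetical unit of work standing in for a slow computation."""
    time.sleep(0.1)
    return n * n


def always_fails() -> int:
    """Hypothetical unit of work that ends unsuccessfully."""
    raise ValueError("boom")


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() hands the work to the pool and returns a Future immediately.
    future = pool.submit(slow_square, 7)
    failed = pool.submit(always_fails)

    # The caller is free to do other work while the futures run.
    other_work = sum(range(10))

    # result() blocks until the value is available.
    result = future.result()

    # exception() blocks until completion and returns the raised error, if any.
    error = failed.exception()
```

The future is only a handle: nothing forces the caller to wait until it actually asks for the result, and a failure surfaces only when the handle is inspected.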

Explaining Key Terms in Category Theory

Gulshan Singh describes three tricky terms for newcomers to functional programming:

A monoid is based on an associative function. Formally, a functor is a type F[A] with an operation map with type (A => B) => F[B]. In functional programming, one typically only deals with one category, the category of types. A functor is an interface with one method, i.e., a mapping of one category to another. A monad is basically a mechanism for sequencing computations. A monad is a way to wrap stuff, then operate on the wrapped stuff without unwrapping it.

If that wasn’t too clear, check out the post for more detail. And if you want a whole lot more detail, Bartosz Milewski’s YouTube series (and book) on category theory are great resources for dozens of hours of learning.
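For a concrete feel, here is a minimal Python sketch of the three ideas. This is not the Scala code from the post. The helper names (`combine_all`, `fmap`, `bind`, `parse_int`, `reciprocal`) are made up for illustration, and using `None` as the "empty" case is only monad-like, not a lawful monad.

```python
from functools import reduce
from typing import Callable, List, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")


# Monoid: an associative combine function plus an identity element.
def combine_all(values: List[A], combine: Callable[[A, A], A], identity: A) -> A:
    return reduce(combine, values, identity)


total = combine_all([1, 2, 3, 4], lambda a, b: a + b, 0)
joined = combine_all(["a", "b", "c"], lambda a, b: a + b, "")


# Functor: apply a plain function inside a container (here, a list).
def fmap(f: Callable[[A], B], xs: List[A]) -> List[B]:
    return [f(x) for x in xs]


lengths = fmap(len, ["scala", "fp"])


# Monad-like sequencing: each step may fail (None), and bind() chains
# the steps without the caller unwrapping the value by hand.
def bind(value: Optional[A], f: Callable[[A], Optional[B]]) -> Optional[B]:
    return None if value is None else f(value)


def parse_int(text: str) -> Optional[int]:
    try:
        return int(text)
    except ValueError:
        return None


def reciprocal(n: int) -> Optional[float]:
    return None if n == 0 else 1 / n


ok = bind(parse_int("4"), reciprocal)
bad_input = bind(parse_int("four"), reciprocal)
div_by_zero = bind(parse_int("0"), reciprocal)
```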

Exception Handling in Scala

Pallav Gupta shows several methods for handling errors using Scala:

Error handling is the process of handling the possibility of failure. For example, failing to read a file and then continuing to use that bad input would clearly be problematic. Noticing and explicitly managing these errors saves the rest of the program from various pitfalls.

When an exception occurs, say an ArithmeticException, the current operation is aborted. The runtime system then looks for an exception handler that can accept an ArithmeticException. Control resumes with the innermost such handler. If no such handler exists, the program terminates.

Pallav starts with the most expensive option and ends with the best option, the Either monad.
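To show the shape of the Either approach without the Scala syntax, here is a small Python sketch. Python has no built-in Either, so the `Left` and `Right` classes and the `safe_divide` and `describe` functions below are hypothetical, written only to illustrate returning an error as a value instead of throwing it.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Left:
    """Failure case: carries a description of what went wrong."""
    error: str


@dataclass(frozen=True)
class Right:
    """Success case: carries the computed value."""
    value: float


Either = Union[Left, Right]


def safe_divide(a: float, b: float) -> Either:
    # The failure is returned as data, so no exception handler is needed.
    if b == 0:
        return Left("division by zero")
    return Right(a / b)


def describe(result: Either) -> str:
    # The caller must look at which case it received before using the value.
    if isinstance(result, Left):
        return f"failed: {result.error}"
    return f"ok: {result.value}"


good = describe(safe_divide(6, 3))
bad = describe(safe_divide(1, 0))
```

The design point is that the possibility of failure shows up in the return type, so a caller cannot use the result without deciding what to do about the error case.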

PyODBC vs C# ODBC Performance Differences

Jose Manuel Jurado Diaz explains a performance difference:

A customer asked today why, when using ODBC Driver 17 for SQL Server in Python with pyodbc, we see a slight difference in time taken compared with C# System.Data.Odbc. Below, I would like to share my lessons learned about it.

Read on for Jose’s explanation. My short version is, it seems particularly important when using the Python ODBC driver to write the exact query you want rather than a SELECT * or query which returns rows/columns you don’t need.

Entity Framework and Include Operations

Josh Darnell has a warning:

I can imagine someone reading that and not seeing the gravity of the situation. “Hey, 500 rows isn’t that many – we have modern hardware!”

I thought it was worth writing about a real world situation where this can get seriously out of hand.

Read on for a scenario in which 64 rows turn into 100,000 rows pretty quickly.

Currying and Partial Application

Prakhar explains the difference between currying and partial application:

Currying simply means converting a function that takes more than one parameter into a series of functions, each taking one parameter. Example:

Click through for an example, as well as the difference between currying and partial application. As for why currying is important, this is how we tie together the concept of mathematical functions, which require exactly one parameter (a function being defined as, for every value of the domain, there is one and only one value of the range), with computer science functions, which may have multiple parameters. Currying allows us to bridge that gap without needing to write loads of intermediary functions.
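Here is a quick sketch of the distinction in Python, which is not necessarily the language the linked post uses. The function names `add3`, `curried_add3`, and `add_10_and` are made up for illustration.

```python
from functools import partial


def add3(a: int, b: int, c: int) -> int:
    """An ordinary function of three parameters."""
    return a + b + c


def curried_add3(a: int):
    """Curried form: a chain of functions that each take exactly one parameter."""
    return lambda b: lambda c: a + b + c


# Currying: supply the arguments one at a time.
curried_result = curried_add3(1)(2)(3)

# Partial application: fix some arguments now and get back a function
# that still takes all of the remaining ones in a single call.
add_10_and = partial(add3, 10)
partial_result = add_10_and(2, 3)
```

Currying always produces single-parameter functions, whereas partial application just reduces the number of parameters left to supply.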

Materializing Views on Materialized Views

Drew Furgiuele is asking for it:

Consider this: you’ve developed a data ingestion strategy that is taking in remote thermostat readings. Usually, the devices report in on a set frequency and you’re able to calculate aggregate readings at an hourly interval. A materialized view could be created that does this calculation and stores the results out for querying. But what if something causes some of this data to become duplicated? You’d first have to eliminate these duplicates, re-ingest the data, and then do your calculations again.

This is where we can leverage creating a materialized view over a materialized view. Our first materialized view will handle the deduplication, and our second can handle the aggregation of the deduplicated data.

Yo dawg, I heard you like materialized views, so I put some materialized views in your materialized views so you can materialize views while you materialize views.

Azure Functions and Azure SQL Database

Rajendra Gupta builds a simple Azure Function:

As a Platform as a Service (PaaS) offering, Azure SQL Database enables developers to deploy SQL Database in Azure Cloud without managing the infrastructure. We use SQL Server Agent to schedule jobs to run at a specific schedule in an on-prem SQL instance. However, Azure DB does not have agent functionality.

There are multiple ways to schedule jobs or batch processes in the Cloud. You can explore the Azure automation series for executing scripts using Azure Logic Apps and Automation runbooks.

This article focuses on Azure Functions for scheduling a job for Azure SQL Database.

Read on for the process.