Press "Enter" to skip to content

Category: Misc Languages

Mapping In Scala When Dealing With Futures & Options

Shubham Verma shows us what happens when we use the map() and flatMap() functions in Scala on values typed as Option or Future:

Now let’s move toward the interesting part: flatMap(). What is it supposed to do in the case of Option? flatMap() gives us the liberty to return whatever type of value we want after the transformation. With map(), when the receiver is a Some, the result will be wrapped in Some no matter what; that is not the case with flatMap().

scala> option.flatMap(x => None)
res13: Option[Nothing] = None
scala>
scala> option.map(x => None)
res14: Option[None.type] = Some(None)

The code snippet above clearly shows it. So is that it? Not yet; let’s look at one more feature of flatMap() on Option[+A] that comes in really handy when we need to extract values out of Options. Suppose we have a List[Option[Int]] and we are only interested in the elements that have some value, which seems to be an obvious use case most of the time. We can do it with a single flatMap() operation.
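
As a quick sketch of that last point (my example, not Shubham's): flatMap with the identity function keeps only the values that are defined:

scala> val xs = List(Some(1), None, Some(3))
xs: List[Option[Int]] = List(Some(1), None, Some(3))

scala> xs.flatMap(x => x)
res0: List[Int] = List(1, 3)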

In short, it’s a little more complex, but you can still get useful information.


Calculating TF-IDF Using Apache Spark

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

  • Document search.
  • Document tagging.
  • Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat that information; it just illustrates TF-IDF calculation with the help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD API and the Python language. This article, on the other hand, uses the modern Spark SQL API and the Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient for learning the concept. MLlib tools are intended to generate feature vectors for ML algorithms, and there is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch; this will sharpen our skills.
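
To see the shape of the computation before reading the article, here is a minimal sketch of TF-IDF over a toy corpus using the DataFrame API (my own example, not Arseniy's code), where tf(t, d) is the share of document d's terms that are t, and idf(t) = ln(total docs / docs containing t):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("tf-idf").master("local[*]").getOrCreate()
import spark.implicits._

val docs = Seq((1, "one flesh one bone one true religion"),
               (2, "all flesh is grass")).toDF("docId", "text")

// One row per (document, term) occurrence.
val terms = docs.select($"docId", explode(split(lower($"text"), "\\s+")).as("term"))

// TF: occurrences of a term divided by the document's total term count.
val tf = terms.groupBy("docId", "term").count()
  .withColumn("tf", $"count" / sum("count").over(Window.partitionBy("docId")))

// IDF: natural log of (total documents / documents containing the term).
val docCount = docs.count().toDouble
val idf = terms.distinct().groupBy("term").count()
  .select($"term", log(lit(docCount) / $"count").as("idf"))

// TF-IDF: the product, per (document, term) pair.
val tfIdf = tf.join(idf, "term").withColumn("tf_idf", $"tf" * $"idf")
tfIdf.orderBy($"docId", $"tf_idf".desc).show()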

Read on for the solution.  It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.


Exception Handling In Scala

Shivangi Gupta shows off the Either type in Scala:

How to get values from Either?

There are many ways; we will talk about them all one by one. One way to get values is by doing a left or right projection. We cannot apply operations such as map or filter directly on an Either. Either provides left and right methods to get the left and right projections, and a projection allows us to apply functions like map and filter.

For example,

scala> val div = divide(14, 7)
div: scala.util.Either[String,Int] = Right(2)
scala> div.right
res1: scala.util.Either.RightProjection[String,Int] = RightProjection(Right(2))

When we apply right on an Either, it returns a RightProjection. Now we can extract the value from the right projection using get, but if there is no value, get will blow up at runtime.
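
As a safer sketch (my addition, not from the post), you can fall back to a default with getOrElse or pattern match instead of calling get:

scala> div.right.getOrElse(0)
res2: Int = 2

scala> div match {
     |   case Right(v)  => s"quotient: $v"
     |   case Left(err) => s"failed: $err"
     | }
res3: String = quotient: 2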

There’s more to Scala exception handling than just try-catch.


Serializing Data In Scala

Akhil Vijayan has a two-parter on serializing data in Scala.  In the first post, he looks at uPickle:

uPickle is a lightweight JSON serialization library for Scala. uPickle is built on top of uJson, which is used for easy manipulation of JSON without the need to convert it to a Scala case class; we can even use uJson standalone. In this blog, I will focus only on the uPickle library.

Note: uPickle does not support Scala 2.10; only 2.11 and 2.12 are supported

uPickle (pronounced micro-pickle) is a lightweight JSON serialization library which is faster than many other JSON serializers. I will talk more about the comparison of different serializers in my next blog; this blog will cover all the basic stuff about uPickle.
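
A minimal round trip looks roughly like this (my sketch of uPickle's documented API; the Person case class is just an example):

import upickle.default._

case class Person(name: String, age: Int)

// Derive a serializer/deserializer pair for the case class.
implicit val personRW: ReadWriter[Person] = macroRW

val json = write(Person("Ada", 36)) // {"name":"Ada","age":36}
val back = read[Person](json)       // Person(Ada,36)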

Then, he follows up with a comparison to other serializers:

In my previous blog, I talked about how uPickle works. Now I will be comparing it with many other JSON serializers by serializing and deserializing a Scala case class.

Before that, let me discuss all the JSON serializers that I have used in my comparison. I will compare uPickle with PlayJson, Circe, and Argonaut.

Check it out.


Selecting All Columns But One In Postgres

Lukas Eder shows off a BigQuery feature which you can partially implement in Postgres:

In BigQuery syntax, we could now simply write

SELECT * EXCEPT (rk)
FROM (...) t
WHERE rk = 1
ORDER BY first_name, last_name

Which is really quite convenient! We want to project everything, except this one column. But none of the more popular SQL databases support this syntax.

Luckily, in PostgreSQL, we can use a workaround: Nested records:

SELECT (a).*, (f).* -- Unnesting the records again
FROM (
  SELECT
    a, -- Nesting the actor table
    f, -- Nesting the film table
    RANK() OVER (PARTITION BY actor_id ORDER BY length DESC) rk
  FROM film f
  JOIN film_actor fa USING (film_id)
  JOIN actor a USING (actor_id)
) t
WHERE rk = 1
ORDER BY (a).first_name, (a).last_name;

Notice how we’re no longer projecting A.* and F.* inside of the derived table T, but instead, the entire table (record). In the outer query, we have to use some slightly different syntax to unnest the record again (e.g. (A).FIRST_NAME), and we’re done.

Read the whole thing.  Lukas has a workaround for SQL Server, but I’d really like to see SELECT * EXCEPT [something] be viable syntax.  This is something I’d want to use more for ad hoc diagnostic queries, but I have one scenario where most columns on a table are narrow but then I have a big VARBINARY(MAX) (for good reason, I promise) that I almost never want to see in diagnostic queries.  I use a third-party SSMS plugin to populate all the columns and remove the one I don’t want, but it’d be nice to specify the other way because it’s so much faster to type.


Using map And flatMap In Scala

Shubham Verma explains the map and flatMap functions in Scala:

Consider two sets, A = {-2, -1, 0, 1, 2} and B = {0.5, 1, 1.5, 2.5, 4, 4.5, 5, 5.5}, and a function f: A => B

y = x ^ 2 + 0.5, where x is an element from set A and y corresponds to an element from set B. Now we see that function f is applied to every element of set A, but the result could be just a subset of set B.

From the above, we can draw the analogy that sets A and B can be seen as collections in the programming paradigm. Now what is f? f can be seen as a function that takes an element from A and returns an element that exists in B. The point to note here is that, since Scala promotes immutability, whenever we apply map (or any other transformer) to some collection of type A, it returns a new collection of the same kind with elements of type B. It would be helpful to understand it from the snippet below.

val result: List[B] = listOfA.map(f) // listOfA: List[A], f: A => B

So when a map operation is applied to a collection (here a List) of elements of type A, passing f as its argument, it applies that function to every element of the List and returns a new collection (again a List) of elements of type B.
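
Making that concrete with the sets above (my example, not from the post):

scala> val a = List(-2, -1, 0, 1, 2)
a: List[Int] = List(-2, -1, 0, 1, 2)

scala> val result = a.map(x => x * x + 0.5)
result: List[Double] = List(4.5, 1.5, 0.5, 1.5, 4.5)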

Read the whole thing.


The Basics Of Lambda Calculus

Kevin Sookocheff walks us through some of the basics of Lambda calculus:

Functions are a bit more complicated. Michaelson states that a λ function serves as an abstraction over a λ expression, which isn’t that informative unless we take some time to understand what abstraction actually means.

Programmers use abstraction all the time by generalizing from a specific instance of a problem to a parameterized version of it. Abstraction uses names to refer to concrete objects or values (you can call them parameters if you like), as a means to create generalizations of specific problems. You can then take this abstraction (you can call it a function if you like), and replace the names with concrete objects or values to create a particular concrete instance of the problem. Readers familiar with refactoring can view abstraction as an “Extract Method” refactoring that turns a fragment of code into a method with parameters that explain the purpose of the method.
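
In Scala terms (my illustration, not Kevin's), abstraction is exactly that move from a concrete expression to a function of its named parts:

// A concrete instance: 3 squared, plus one.
val concrete = 3 * 3 + 1 // 10

// The abstraction (λx. x * x + 1): the varying part gets a name.
val f = (x: Int) => x * x + 1

// Replacing the name with a value recovers the concrete instance.
f(3) // 10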

I think having a good understanding of Lambda calculus is a huge advantage for a data platform professional, as it gives you an inroad to learning data-centric functional programming languages (e.g., Scala, R, and F#) and neatly sidesteps the impedance mismatch problem with object-oriented languages.


Accessing SQL Server From Scala

Sidharth Khattri shows how to use Slick, Scala's functional-relational mapping library, to connect to SQL Server:

Now moving onto our FRM (Functional Relational Mapping) and repository setup, the following import will be used for MS SQL Server Slick driver’s API

import slick.jdbc.SQLServerProfile.api._

And thereafter, the FRM will look the same as the rest of the FRMs delineated in the official Slick documentation. For the example on this blog, let's use the following table structure

CREATE TABLE user_profiles (
  id INT IDENTITY (1, 1) PRIMARY KEY,
  first_name VARCHAR(100) NOT NULL,
  last_name VARCHAR(100) NOT NULL
)

whose functional relational mapping will look like this:

class UserProfiles(tag: Tag) extends Table[UserProfile](tag, "user_profiles") {
  def id: Rep[Int] = column[Int]("id", O.PrimaryKey, O.AutoInc)
  def firstName: Rep[String] = column[String]("first_name")
  def lastName: Rep[String] = column[String]("last_name")
  def * : ProvenShape[UserProfile] = (id, firstName, lastName) <> (UserProfile.tupled, UserProfile.unapply) // scalastyle:ignore
}
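
For context, the mapping above implies a case class, and queries go through a TableQuery (my sketch following standard Slick conventions; these pieces aren't shown in the excerpt):

// The case class the <> projection maps to.
case class UserProfile(id: Int, firstName: String, lastName: String)

val userProfiles = TableQuery[UserProfiles]

// A simple filter; .result yields a DBIO action to run against the database.
val byLastName = userProfiles.filter(_.lastName === "Khattri").result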

I’m definitely going to need to learn more about this.


How DynamoDB Indexing Works

Shubham Agarwal explains how indexing works within DynamoDB:

Global secondary index in DynamoDB: an index with a partition key and a sort key that can be different from those on the base table. A global secondary index is very helpful when you need to query your data without the primary key.

  • The primary key of a global secondary index can be a simple partition key or composite (partition key and sort key).

  • Global secondary indexes can be created at the same time that you create a table. You can also add a new global secondary index to an existing table, or delete an existing global secondary index.

  • A global secondary index lets you query over the entire table, across all partitions (see the sketch after this list).

  • The index partition key and sort key (if present) can be any base table attributes of type string, number, or binary.

  • With global secondary index queries or scans, you can only request the attributes that are projected into the index. DynamoDB will not fetch any attributes from the table.

  • There are no size restrictions for global secondary indexes.
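
Here is a sketch of that query-through-an-index point (my example using the AWS SDK for Java v2 from Scala; the table name, index name, and attributes are hypothetical):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, QueryRequest}
import scala.jdk.CollectionConverters._

val client = DynamoDbClient.create()

// Query the hypothetical "status-index" GSI instead of the base table's key.
val request = QueryRequest.builder()
  .tableName("Orders")
  .indexName("status-index")
  .keyConditionExpression("order_status = :s")
  .expressionAttributeValues(Map(":s" -> AttributeValue.builder().s("SHIPPED").build()).asJava)
  .build()

// Each item is a map of attribute name to AttributeValue.
val items = client.query(request).items().asScala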

Click through to learn more about these as well as local secondary indexes.


The Difference Between M And DAX With Cooking

Eugene Meidinger explains the difference between M and DAX as languages using a cooking metaphor:

I like to think of M as this sous chef. It does all the grunt work that we'd like to automate. Let's say that my boss asks for a utilization report for all of the technicians. What steps am I going to do in M?

  1. Extract the data from the line of business system
  2. Remove extraneous columns
  3. Rename columns
  4. Enrich the services table with a Billable / NonBillable column
  5. Generate a date table

This is all important work, but I would have to do the same work for a variety of reports. Many of the steps tell me nothing about the final product. I would generate a date table for most of my reports, for example.

I think the metaphor holds.
