Press "Enter" to skip to content

Category: Misc Languages

Understanding Monads in Scala

Anna Wykes continues a series on Scala for data engineers:

This is the second of my blogs in the Scala Parlour Series, in which we explore Scala, and why it is great for Data Engineering. If you haven’t already, please check out the first in the series here, in which you can read all about the core concepts of Scala, including who uses it and why 

In this article we will explore monads within the Functional Programming (FP) paradigm, and how they can be used in Scala to aid Data Engineering.  

Anna explains monads quite well here. This is a topic which is notoriously in how people perceive its difficulty, but conceptually it’s not as difficult as people take it to mean…if you understand a few concepts coming in.

Comments closed

An Overview of the T-SQL Script DOM

Dan Guzman provides a public service:

Scripts are parsed by invoking the Parse method of T-SQL script DOM library TSqlParser class. The parser understands the complex T-SQL abstract syntax tree and splits T-SQL source into atomic TSqlParserTokens of TSqlTokenTypes that represent keywords, identifiers, punctuation, literals, whitespace, etc. These low-level tokens are grouped into more meaningful TSqlFragment objects that represent language elements of the script DOM, such as batches, statements, clauses, etc. Fragments, rather than the low-level parser tokens, are most often used in practice, although the underlying tokens are available for specialized requirements

The Parse method returns a TSqlFragment object of type TSqlScript containing all fragments within the script. This top-level fragment of the DOM hierarchy provides programmatic access to all language element fragments in the script. Nearly 1,000 different fragment types exist today due to the many granular T-SQL language elements.

Dan provides several examples of how to use the script DOM, making this a must-read if you’re interested in writing code around SQL Server.

Comments closed

Bulk Loading SQL Server from .NET

Adrian Hills walks us through the SqlBulkCopy class:

Ever been in a situation where rumblings of “process X is too slow” suddenly build into a super-high priority ball of urgency when that next step up in data volume hits? Yeah, that can be fun. No, really, it can be fun because we have strategies to sort this stuff out, right?

In this blog post, I’m going to talk about one particular piece of functionality—SqlBulkCopy—that can help you with bulk data loading. If I had to single out my favorite .NET class, SqlBulkCopy would be at the top of the list. My goal is to introduce you to this class so that maybe it can become a part of your tool belt, too.

Click through to see how it works. If you’re familiar with SSIS, you’re already familiar with the concept if not the specifics.

Comments closed

Portfolio Optimization with SAS and Python

Sophia Rowland shows off the sastopypackage:

I started by declaring my parameters and sets, including my risk threshold, my stock portfolio, the expected return of my stock portfolio, and covariance matrix estimated using the shrinkage estimator of Ledoit and Wolf(2003). I will use these pieces of information in my objective function and constraints. Now I will need SWAT, sasoptpy, and my optimization model object.

Read on for a demo.

Comments closed

Cassandra Monitoring and Data Modeling

Instaclustr has put up a couple interesting posts on Cassandra. First, Anup Shirolkar explains how we can monitor Cassandra installations:

Cassandra is developed in Java and is a JVM based system. Each Cassandra node runs a single Cassandra process. JVM based systems are enabled with JMX (Java Management Extensions) for monitoring and management. Cassandra exposes various metrics using MBeans which can be accessed through JMX. Cassandra monitoring tools are configured to scrape the metrics through JMX and then filter, aggregate, and render the metrics in the desired format. There are a few performance limitations in the JMX monitoring method, which are referred to later. 

The metrics management in Cassandra is performed using Dropwizard library. The metrics are collected per node in Cassandra. However, those can be aggregated by the monitoring system. 

On the development side, the Instaclustr team walks us through data modeling guidelines:

The ultimate goal of Cassandra data modeling and analysis is to develop a complete, well organized, and high performance Cassandra cluster. Following the five Cassandra data modeling best practices outlined will hopefully help you meet that goal:

1. Cassandra is not a relational database, don’t try to model it like one
2. Design your model to meet 3 fundamental goals for data distribution
3. Understand the importance of the Primary Key in the overall data structure 
4. Model around your queries but don’t forget about your data
5. Follow a six step structured approach to building your model. 

Because Cassandra uses a variant of SQL, it’s easy to forget that data is stored completely differently and that design decisions are quite different from what we see in the relational world.

Comments closed

C# Notebooks with Cosmos DB

Hasan Savran takes us through Jupyter notebooks in Cosmos DB:

Jupyter Notebooks are in everywhere in these days. You can write chunk of code and run it on a web application without worrying about compiler is a great feeling. C# has been little bit late to the party, but we started to see C# Notebooks lately too. Azure Cosmos DB announced their version if C# Notebook this week.
     You can reach all notebook functionalities under the Data Explorer link, There are bunch of sample notebooks you will see under the Notebook link.

There are some limitations here, like needing to use the SQL API, but it’s an interesting approach to data access in Cosmos DB.

Comments closed

Big-O Notation in .NET

Camilo Reyes takes us through a useful concept in computer science as applied to .NET Core:

Performance sensitive code is often overlooked in business apps. This is because high-performance code might not affect outcomes. Concerns with execution times are ignorable if the code finishes in a reasonable time. Apps either meet expectations or not, and performance issues can go undetected. Devs, for the most part, care about business outcomes and performance is the outlier. When response times cross an arbitrary line, everything flips to less than desirable or unacceptable.

Luckily, the Big-O notation attempts to approach this problem in a general way. This focuses both on outcomes and the algorithm. Big-O notation attempts to conceptualize algorithm complexity without laborious performance tuning.

This is a rather high-level take on the idea, as it doesn’t cover any of the O(NlogN) or O(logN) algorithms out there. But if you are not familiar with the concept, it is good to know.

Comments closed

Writing a Custom Serializer Class for Kafka

Ramandeep Kaur shows how to create custom classes to serialize and deserialize data in Apache Kafka:

Need?

Basically, in order to prepare the message for transmission from the producer to the broker, we use serializers. In other words, before transmitting the entire message to the broker, let the producer know how to convert the message into a byte array we use serializers. Similarly, to convert the byte array back to the object we use the deserializers by the consumer.

Click through for an example.

Comments closed

foldLeft and foldRight in Scala

Sarfaraz Hussain explains the difference between foldLeft and foldRight in Scala:

The fold method is a Higher Order Function in Scala and it has two variant namely,
i. foldLeft
ii. foldRight
In this blog, we will look into them in detail and try to understand how they work.

Before moving ahead, I want to clarify that the fold method is just a wrapper to foldLeft, i.e. the fold method internally invokes the foldLeft method. So, now let’s get started.

Folding is an extremely powerful technique for getting rid of loops in code. Being comfortable with folding is (in my eyes) one of the signs which indicate that you’ve reached a mid-level understanding of functional programming.

Comments closed

Displaying Cosmos DB Spatial Data with .NET Core

Hasan Savran builds up a quick .NET Core app to retrieve spatial data from Cosmos DB and display it:

Cosmos DB stores geospatial data in GeoJSON format. You can not tell what raw GeoJSON represents because usually all it has is a type and bunch of coordinates. Azure Cosmos DB does not have any UI to help you what GeoJSON data looks like on a map either. Only option you have is a third party tool which might display data on a map or Azure Cosmos DB Jupyter Notebooks.

    I want to run a query in Azure Cosmos DB and see the results on a map. I decided to create a simple UI which displays spatial data on a map. I will show you how to do this step by step. I will use LeafLetJs as a map. It is open source and free! Also, I need to create .NET Core 3.1 web application and use Azure Cosmos DB Emulator for data.

Hasan walks us through the demo and promises to put the code in GitHub later.

Comments closed