Category: Misc Languages

Ring Buffers

Published 2016-12-15 by Kevin Feasel

This is of course not a new invention. The earliest instance I could find with a bit of searching was from 2004, with Andrew Morton mentioning in it a code review so casually that it seems to have been a well established trick. But the vast majority of implementations I looked at do not do this.

So here’s the question: Why do people use the version that’s inferior and more complicated? I’ve must have written a dozen ring buffers over the years, and before being forced to really think about it, I’d always just used the first definition. I can understand why a textbook wouldn’t take advantage of unsigned integer wraparound. But it seems like it should be exactly the kind of cleverness that hackers would relish using and passing on.

Check out the comments for more information, a bit of code golf, and multiple links on tying shoelaces.

Comments closed

Querying Genomic Data With Athena

Published 2016-12-08 by Kevin Feasel

Aaron Friedman explains how to use Amazon Athena to query S3 files:

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms. I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

This feels a lot like a data lake PaaS process where they’re spinning up a Hadoop cluster in the background, but one which you won’t need to manage. Cf. Azure Data Lake Analytics.

Comments closed

Debugging LINQ

Published 2016-12-07 by Kevin Feasel

Michael Sorens has an article on using OzCode to debug LINQ statements:

OzCode’s new LINQ debugging capability is tremendous, no doubt about it. But it is not a panacea; it is still constrained by Visual Studio’s own modeling capability. As a case in point, Figure 17 shows another example from my earlier article. This code comes from an open-source application I wrote called HostSwitcher. In a nutshell, HostSwitcher lets you re-route entries in your hosts file with a single click on the context menu attached to the icon in the system tray. I discussed the LINQ debugging aspects of this code in the same article I mentioned previously, LINQ Secrets Revealed: Chaining and Debugging, but if you want a full understanding of the entire HostSwitcher application, see my other article that discusses it at length, Creating Tray Applications in .NET: A Practical Guide.

This is quite interesting. My big problem with LINQ in the past was that Visual Studio’s debugger treated a LINQ statement as a black box, so if you got anything wrong inside a long chain of commands, good luck figuring it out. This lowers that barrier a bit, and once you get really comfortable with LINQ, it’s time to give F# a try.

Comments closed

The Central Limit Theorem

Published 2016-12-07 by Kevin Feasel

Vincent Granville explains the central limit theorem:

The theorem discussed here is the central limit theorem. It states that if you average a large number of well behaved observations or errors, eventually, once normalized appropriately, it has a standard normal distribution. Despite the fact that we are dealing here with a more advanced and exciting version of this theorem (discussing the Liapounov condition), this article is very applied, and can be understood by high school students.

In short, we are dealing here with a not-so-well-behaved framework, and we show that even in that case, the limiting distribution of the “average” can be normal (Gaussian.). More precisely, we show when it is and when it is not normal, based on simulations and non-standard (but easy to understand) statistical tests.

Read on for more details.

Comments closed

Azure Functions

Published 2016-12-02 by Kevin Feasel

Steph Locke has taken a shine to Azure Functions:

Azure Functions take care of all the hosting, all the retry logic, all the parallelisation, all the authentication gubbins, all the monitoring for you. The only bits of code you really have to write is the important stuff – the code that implements the business process. This makes a coding project go from >500 lines to <50, and it should be better quality too! This is super handy for data integration, and I would recommend it over and above Data Factory, unless you need to do some Hadoop stuff and maybe not even then.

The wag in me says that with F#, you could take it from 50 lines to 10… Read the whole thing.

Comments closed

Passing Values To Bash

Published 2016-10-31 by Kevin Feasel

Steph Locke shows how to send input parameters to Bash scripts:

This is a very quick post on how you can make a bash script that allows you to provide it values via the command line. Passing values to a bash script uses a 1-based array convention inside the script, that are referenced by prefixing with $ inside the script.

This means that if I provide .\dummyscript.sh value1 value2, inside the dummyscript.sh I can retrieve these by referencing their positions:

Read on for more information, including how to use named parameters. Given that Bash is now officially supported in Windows 10 (well, in beta form), it might be worth checking that scripting language out.

Comments closed

MariaDB Now Commercial

Published 2016-08-19 by Kevin Feasel

Simon Phipps reports that MariaDB is now a commercial product:

MariaDB Corp. has announced that release 2.0 of its MaxScale database proxy software is henceforth no longer open source. The organization has made it source-available under a proprietary license that promises each release will eventually become open source once it’s out of date.

MaxScale is at the pinnacle of MariaDB Corp.’s monetization strategy — it’s the key to deploying MariaDB databases at scale. The thinking seems to be that making it mandatory to pay for a license will extract top dollar from deep-pocketed corporations that might otherwise try to use it free of charge. This seems odd for a company built on MariaDB, which was originally created to liberate MySQL from the clutches of Oracle.

Interesting.

Comments closed

Analytic Tool Usage

Published 2016-08-16 by Kevin Feasel

Alex Woodie notes the increased popularity of Python for data analysis:

According to the results of the 2016 survey, R is the preferred tool for 42% of analytics professionals, followed by SAS at 39% and Python at 20%. While Python’s placing may at first appear to relegate the language to Bronze Medal status, it’s the delta here that really matters.

It’s interesting to see the breakdowns of who uses which language, comparing across industry, education, work experience, and geographic lines.

Comments closed

Getting Pagination Wrong

Published 2016-08-11 by Kevin Feasel

Lukas Eder discusses common pagination issues:

If your data source is a SQL database, you might have implemented pagination by using LIMIT .. OFFSET, or OFFSET .. FETCH or some ROWNUM / ROW_NUMBER() filtering (see the jOOQ manual for some syntax comparisons across RDBMS). OFFSET is the right tool to jump to page 317, but remember, no one really wants to jump to that page, and besides, OFFSET just skips a fixed number of rows. If there are new rows in the system between the time page number 316 is displayed to a user and when the user skips to page number 317, the rows will shift, because the offsets will shift. No one wants that either, when they click on “next”.

Instead, you should be using what we refer to as “keyset pagination” (as opposed to “offset pagination”).

He also has a good explanation of the seek method.

I will throw in one jab at Oracle (because hey, it’s been a while since I’ve lobbed a bomb at Oracle on this blog): it’d really suck to have a system where I legally wasn’t allowed to distribute relevant performance comparison benchmarks. Fortunately, I tend to work on better data stacks.

Comments closed

Thinking Functionally With Scala

Published 2016-07-20 by Kevin Feasel

Kevin Jacobs solves a simple problem using Scala in a few ways and explains functional programming concepts along the way:

Why is this code better than the functional approach? Note that it saves an enormous amount of time since this approach does not need to scan through all the integers! It are simply a few calculations (at which a computer is good at). All the code (the naive approach and the better approach) can be found on GitHub.

Having a solid understanding of mathematics and logic can help you come up with superior algorithms, but make sure you comment them in detail so that the next dev (who might not understand the underpinnings of your code) doesn’t replace it with a brute-force method because it’s “easier.”

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31