Press "Enter" to skip to content

Month: June 2022

Normalization Layers in Deep Learning Models

Zhe Ming Chng explains why data normalization matters in data science:

You’ve probably been told to standardize or normalize inputs to your model to improve performance. But what is normalization and how can we implement it easily in our deep learning models to improve performance? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we’ll explore more in this article.

Also, thinking about it, in neural networks, the output of each layer serves as the inputs into the next layer, so a natural question to ask is: If normalizing inputs to the model helps improve model performance, does standardizing the inputs into each layer help to improve model performance too?

Click through for the tutorial.
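As a quick illustration of the two ideas in the excerpt (normalizing the model’s inputs, and normalizing the inputs into each layer), here is a minimal Keras sketch. The layer types are standard Keras APIs, but the data and model architecture are made up for the example and are not taken from the tutorial:

    import numpy as np
    import tensorflow as tf

    # Toy feature matrix whose columns are on very different scales.
    X = np.random.rand(1000, 3) * np.array([1.0, 100.0, 10000.0])
    y = (X[:, 0] + X[:, 1] / 100.0 + X[:, 2] / 10000.0 > 1.5).astype("float32")

    # Normalization standardizes the raw inputs; adapt() learns the
    # per-feature mean and variance from the training data.
    normalizer = tf.keras.layers.Normalization()
    normalizer.adapt(X)

    model = tf.keras.Sequential([
        normalizer,                                    # normalize model inputs
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.BatchNormalization(),          # normalize inputs to the next layer
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, verbose=0)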


Comparing Data Analysis in Java and Python

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, SciPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.

Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python, it provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).

In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques on how to do the data analysis on each platform, compare how they scale, and the possibilities to apply parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.
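For a sense of what the Pandas side of such a comparison looks like, here is a rough sketch of loading a large tabular dataset and computing grouped summary statistics. The file name and column names are placeholders, not the ones used in the article:

    import pandas as pd

    # Load a large CSV of tabular data (file and columns are hypothetical).
    df = pd.read_csv("measurements.csv")

    # Grouped summary statistics: count, mean, standard deviation, min, max.
    stats = (
        df.groupby("sensor_id")["value"]
          .agg(["count", "mean", "std", "min", "max"])
          .reset_index()
    )
    print(stats.head())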


When SELECT * Doesn’t

Chad Callihan protects the reputation of SELECT *:

There are plenty of scenarios where using SELECT * can be an issue. Using SELECT * with EXISTS isn’t one of them.

When using EXISTS and SELECT * together, SQL Server is smart enough to realize what you’re doing and knows you don’t care about what’s in the SELECT.

Read on for an example. I’ve trained myself (been trained?) to keep using SELECT 1. The reason is that, although I know SELECT * works exactly the same way, using SELECT 1 consistently lets you search your code base for SELECT * and find the actual perpetrators: people writing queries that expect to return the entire result set and that may be susceptible to performance or future maintainability problems. Using SELECT 1 in the EXISTS clause means fewer false positives in that search.

That said, Joe Celko chimes in to provide some of the history behind SELECT * as the convention for references in the EXISTS clause.


Against Tibbling

Hugo Kornelis hates tibbling:

Probably the one I hate most. And one that is stubbornly persistent. Object name prefixing.

Or, to be more precise, the standard that enforces that all table names need to start with a prefix that designates them as a table, and all view names with a different prefix to clearly mark them as a view. Typically tbl_ and vw_ are used, though I have also seen just the letters t and v, and I have seen them with or without underscores.

I hate this coding standard (or rather, naming standard) with a vengeance. For a few reasons. The perceived benefit is in fact not a benefit at all. It is detrimental to a quick understanding of what I see on the screen. But my biggest objection is that it negates one of the greatest benefits of views.

Read on to understand why this is a bad idea. I completely agree with Hugo on this.


Coding Standards Writ Large

Kenneth Fisher points out the downside of coding standards:

I’m betting you can start to see the problem, right? Joe is supporting application A. His team wrote it, and they wrote it using a strict set of coding standards. Jane supports application C. Her team didn’t write it; it was transferred over during the reorg last March. Her team has a set of coding standards they enforce; unfortunately, application C wasn’t written with them. All of their new code is, however, because their manager wants strict enforcement of their coding standards. Oh, and Joe and Jane are both being moved to be part of a new team next week. They’ll be supporting some older code that, you guessed it, is using a whole different set of coding standards, if any.

Starting from the Coasean notion of the firm as a means of internalizing externalities and achieving economies of scale and scope, this is just about where you hit the margin for added productivity… Rephrasing this not to be in economist jibber-jabber: this kind of thing is a big part of why really large companies essentially spin off mini-companies that act nearly independently under the parent company’s umbrella. It’s essentially impossible to create and enforce a meaningful set of standards once you hit a certain threshold of developers, especially when it comes to the more opinion-heavy standards.


Tips for SQL Developers

Lee Markum has a few tips for you:

SQL Server Developers are under-rated.

That’s right! I’m a DBA and I said, “SQL Developers are under-rated.” Dedicated SQL Developers help I.T. teams by writing efficient code that gets just the data that is needed and in a way that leverages how the database engine works best. How do you ensure you’re doing great work for your company and building code that will stand the test of time?

I’m so glad you asked!

Click through for Lee’s advice. One big thing I’d add to Lee’s list is to understand the domain. Query writing skills are quite fungible across domains—moving from health care to auto parts sales, you don’t need to re-learn SQL—but understanding some of the arcana of the organization and its industry makes it a lot easier to know you’re writing good queries, getting valid data back, and not forgetting some important business rule.


SQL Server Client Tool Updates

Erin Stellato has a burndown list for us:

We introduced .NET Interactive Notebooks in Azure Data Studio through the .NET Interactive Notebooks extension. With multi-language support for Jupyter Notebooks you can now code in T-SQL, PowerShell, C#, JavaScript, and more. 

Never fear, folks: the .NET Interactive Notebooks extension does include the most important .NET language of all, F#. In all seriousness, F# is the type of language which was made for notebooks, especially with generative type providers.


Parameter Sensitivity Plan Optimization and Monitoring Scripts

Erik Darling gives us a warning:

You can read the full documentation here. But you don’t read the documentation, and the docs are missing some details at the moment anyway.

– It only works on equality predicates right now

– It only works on one predicate per query

– It only gives you three query plan choices, based on stats buckets

There are also some additional notes in the docs that I’m going to reproduce here, because this is where you’re gonna get tripped up, if your scripts associate statements in the cache with calling stored procedures, or use object identifiers from Query Store.

It’s not a deal-breaker but it does make things a lot harder for tool writers, as Erik points out. Hopefully there’s some way to tie this all together before GA.


An Overview of Clustering Algorithms

Gavita Regunath has a two-parter on clustering. First, an explanation of the concept:

Clustering, or cluster analysis, is an unsupervised machine learning method. As the name implies, unsupervised machine learning refers to how the model ‘learns’ the data; it is the opposite of supervised learning. With supervised learning, models are trained or “supervised” using labelled datasets (a known function output for our data). An example of a supervised learning method is a model trained to recognise animals based on labels identifying them as a cat, dog, or rabbit.

Unsupervised learning works with unlabelled data, where there are no known function outputs, and the aim is to identify patterns within a dataset. There are many unsupervised learning algorithms; however, the three main types are clustering, dimensionality reduction, and anomaly detection. The focus of this blog will be on clustering, as it is the most commonly used unsupervised learning technique.

Second, a review of ten clustering algorithms:

There are many clustering algorithms. In fact, there are more than 100 clustering algorithms that have been published so far. However, despite the various types of clustering algorithms, they can generally be categorised into four methods. Let’s look at these briefly:

Read on to learn more about clustering.
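To make the idea concrete, here is a small scikit-learn sketch of clustering: the algorithm receives no labels and groups points purely by similarity. It uses k-means, probably the best-known clustering algorithm; the data is synthetic and the example is not taken from Gavita’s posts:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Generate unlabelled points (we throw away the true labels on purpose).
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)        # cluster assignment for each point

    print(labels[:10])                    # cluster IDs of the first ten points
    print(kmeans.cluster_centers_)        # coordinates of the three cluster centres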
