Gradient Boosting And XGBoost

Shirin Glander has another English-language transcript from a German video, this time covering gradient boosting techniques:

Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.

Boosting builds models from individual so-called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially, by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve the generalizability of our model.

The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy: it is basically the partial derivative of our loss function, so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.
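
To make that concrete, here is a minimal sketch of gradient boosting for squared-error loss, where the negative gradient works out to be the residual. It uses shallow scikit-learn regression trees as the weak learners; the data and hyperparameters are invented for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

n_rounds, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_rounds):
    # For squared-error loss, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)  # a "weak learner"
    tree.fit(X, residuals)                     # learn from past mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

Each round fits a new tree to the current residuals (descending the gradient) and adds a damped version of its predictions to the ensemble (the boosting step); sampling a random subset of rows each round would turn this into the stochastic variant mentioned above.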

Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.

Visualizing Traditional Japanese Color Palettes

Chisato den Engelsen looks at 465 traditional color palettes used in Japan:

Since each of the colours had a name, I was also curious whether some characters are used more often than others. Colour names were written in two ways on this website: one in Kanji and the other in Hiragana.

I love wordcloud2 for visualizing word clouds, so I can see which characters appear more often than others.

It’s an interesting exercise and all of the R code is included.  Be sure to check out the list of colors with a character representing “rat” or “mouse” in the name.  H/T R-Bloggers
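
The post’s code is in R, but the character-frequency idea is easy to sketch in Python with collections.Counter; the colour names below are a tiny invented sample, not the 465 from the post.

from collections import Counter

colour_names = ["桜色", "鼠色", "藍鼠", "梅鼠", "桜鼠"]  # hypothetical sample

char_counts = Counter(ch for name in colour_names for ch in name)
for ch, count in char_counts.most_common():
    print(ch, count)  # 鼠 ("rat/mouse") tops even this tiny sample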

Testing Scalar UDF Performance In SQL Server 2019 CTP 2.1

Brian Hansen takes a look at the scalar UDF performance improvements in SQL Server 2019:

In one of my sessions, Set Me Up: How to Think in Sets, I discuss a variety of performance-inhibiting query constructs, including scalar UDFs. I thought it would be interesting to take the simple scalar function that I use in the demo and see what kind of difference that scalar inlining might make.

First, I restored the CorpDB database that I use in the session to my SQL Server 2019 CTP 2.1 instance and initially set the compatibility level to 140. I also ran script 001 from the demo to create the needed database tables (no need to create the CLR objects for this test). I then ran script 030 to execute the scalar UDF test. In a nutshell, this script

  • creates a UDF

  • runs a query that calls the UDF about 13,000 times, capturing the time required to do so

  • repeats this test five times

  • discards the fastest and slowest tests

  • reports the average time for the remaining three tests

If I’m reading Brian’s notes right, it’s still slower than writing the set-based solution yourself, but a huge improvement over the prior scalar function performance.
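
Brian’s trimmed-average methodology is also easy to mimic in a generic harness. Here is a hypothetical Python sketch, with run_udf_test() standing in for executing the actual query against CorpDB.

import time

def run_udf_test():
    ...  # placeholder: run the query that calls the UDF about 13,000 times

timings = []
for _ in range(5):                     # repeat the test five times
    start = time.perf_counter()
    run_udf_test()
    timings.append(time.perf_counter() - start)

trimmed = sorted(timings)[1:-1]        # discard the fastest and slowest runs
average = sum(trimmed) / len(trimmed)  # mean of the remaining three
print(f"average of middle three runs: {average:.3f}s")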

Monitoring At Stack Overflow

Nick Craver has been driven around the bend by monitoring, and we get to enjoy the fruits of it:

…but evidently some people think of other things. Those people are obviously wrong, but let’s continue. When I’m not a walking zombie after reading a 10,000 word blog post some idiot wrote, I see monitoring as the process of keeping an eye on your stuff, like a security guard sitting at a desk full of cameras somewhere. Sometimes they fall asleep–that’s monitoring going down. Sometimes they’re distracted with a doughnut delivery–that’s an upgrade outage. Sometimes the camera is on a loop–I don’t know where I was going with that one, but someone’s probably robbing you. And then you have the fire alarm. You don’t need a human to trigger that. The same applies when a door gets opened, maybe that’s wired to a siren. Or maybe it’s not. Or maybe the siren broke in 1984.

I know what you’re thinking: Nick, what the hell? My point is only that monitoring any application isn’t that much different from monitoring anything else. Some things you can automate. Some things you can’t. Some things have thresholds for which alarms are valid. Sometimes you’ll get those thresholds wrong (especially on holidays). And sometimes, when setting up further automation isn’t quite worth it, you just make using human eyes easier.

This is a really good post covering monitoring techniques at a high level and getting into specific implementations at Stack Overflow.

Power BI Aggregations

Reza Rad dives into aggregations in Power BI.  Part one introduces the topic:

Imagine a Fact table with 250 million rows. Such a fact table is big enough to be considered a good candidate for a DirectQuery connection. You don’t want to load such a big table into memory, and most probably, the Power BI file size would exceed the 1GB limitation. Now, think about your reporting solution for a second. Do you always query this fact table at the finest or minimum level of granularity? I mean, do you always look at every single transaction in this table when you report on it?

The answer is no. Most of the time, you are querying the data by other fields or columns. As an example, you query the Sales value in the fact table by Year. Other times, you query the fact table’s values by the customer’s education category, or by each product. When you look at real-world scenarios, most of the time you are querying the fact table by aggregations of dimension tables.

Then, Reza starts building an aggregation table:

Aggregation tables are the fast-performing solution for huge DirectQuery tables in Power BI. In the previous blog post, I explained what an aggregation is and why it is an important part of a Power BI implementation. Aggregations are part of the Composite model in Power BI. For the aggregation setup, your first step is to create an aggregated table. In this blog post, I’ll explain how that step can be done. If you want to learn more about Power BI, read the Power BI book, from Rookie to Rock Star.

There’s a lot of detail packed into the first posts in this series, so it looks like a good one to watch.
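
As a language-agnostic illustration of the idea (not Reza’s Power BI steps), here is a small pandas sketch that pre-aggregates an invented fact table by year, which is exactly the kind of table the aggregation feature points queries at.

import pandas as pd

# A toy stand-in for a 250-million-row fact table.
fact_sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2017-03-01", "2017-07-15", "2018-01-09"]),
    "customer_id": [1, 2, 1],
    "sales_amount": [100.0, 250.0, 175.0],
})

# The aggregated table: far fewer rows, answering "Sales by Year"
# without scanning individual transactions.
agg_sales_by_year = (
    fact_sales
    .assign(year=fact_sales["order_date"].dt.year)
    .groupby("year", as_index=False)["sales_amount"]
    .sum()
)
print(agg_sales_by_year)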

Preventing Issues With Columnstore Indexes

Kevin Chant has a post covering issues in his experience with the creation of columnstore indexes:

Error due to online statement

It’s a very common error, usually caused by somebody copying a rowstore “Create Index” command. The “online=on” option you use with rowstore indexes does not yet work when creating columnstore indexes. For the record, the online option will work in SQL Server 2019.

This one I’ve run into, as I like clustered columnstore indexes a lot but occasionally need single-row results from them.  If the table is empty, creating an index offline is no problem.  But once you get a billion or so rows in it, that’s a non-starter in a 24×7 uptime shop.  Read the whole thing.
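
For reference, here is a sketch of the fix Kevin describes, issued through pyodbc: create the columnstore index without the ONLINE option, since that option is unsupported for columnstore before SQL Server 2019. The table, index, and connection details are all invented.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Sales;Trusted_Connection=yes;",
    autocommit=True,
)

# Before SQL Server 2019, adding WITH (ONLINE = ON) to this statement
# raises an error; dropping the option lets the (offline) build proceed.
conn.execute("CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;")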

Azure Price Increases In CAD

Randolph West notes that prices for Azure services in Canadian dollars will increase by five percent:

Starting December 1, 2018, prices for Azure services in the Canadian dollar will increase by 5 percent to more closely align to Azure pricing in US dollars. Even after this adjustment, customers buying in the Canadian dollar will continue to find Azure offerings highly competitive.

Microsoft periodically assesses its pricing of products and services across the globe to ensure reasonable alignment across regions. This change to Azure prices is an outcome of this assessment.

The first thing that came to mind was a particular joke from the Simpsons.  For those who don’t remember, the Simpsons was a hilarious cartoon for about ten years before it was quietly killed and replaced with something almost but not quite the same, lacking most of the humor.

Do You Have Trace Flag 4199 Enabled?

Andy Galbraith recommends that you enable trace flag 4199 in SQL Server:

The session was titled “Modernize Your SQL Server with Bob Ward, the Tiger Team, and CSS Escalation Engineers” and it…was…awesome!
One of the presenters was Pedro Lopes (blog/@SQLPedro), a Senior PM for the Relational Engine.  In his part of the day he talked about several features of the engine and the optimizer, but the “What…what did he say?” moment for me was when he talked about trace flag 4199…and how we should have it turned on pretty much everywhere.
Wait…what?

If you aren’t aware of trace flag 4199 and are running a version of SQL Server prior to 2016, this is big.  One of our user group members called it out specifically at our last meeting.  As far as 2016+ instances go, Andy covers how that behavior is a little different, so check it out.
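
For the curious, enabling the flag globally is a one-liner. Here is a hedged sketch through pyodbc (the connection string is an assumption); the same DBCC command runs fine from SSMS, and the -T4199 startup parameter makes it survive restarts.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "Trusted_Connection=yes;",
    autocommit=True,
)
conn.execute("DBCC TRACEON (4199, -1);")  # -1 enables the flag globally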

An Overview Of Today’s Data And Event Ecosystem

John Hammink walks us through a laundry list of tools in the modern data platform:

InfluxDB

The rapid instrumentation of the physical world due to IoT and data-collecting applications has led to an explosion of time-stamped data. Time series databases serve this evolving niche, and among them, InfluxDB is emerging as a major player. InfluxDB, like others, can handle complex logic or business rules atop massive — and fast-growing — data sets, and InfluxDB adds the advantage of a range of ingestion methods, as well as the ability to append tags to different data points. Aiven also provides a managed version, Aiven InfluxDB.
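
As a quick illustration of that ingestion model, here is a sketch using the influxdb Python client to write one time-stamped point with tags attached; the host, database, measurement, and field names are all assumptions.

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="sensors")

point = {
    "measurement": "temperature",
    "tags": {"location": "office", "device": "sensor-1"},  # appended tags
    "time": "2018-11-30T12:00:00Z",                        # the timestamp
    "fields": {"celsius": 21.5},
}
client.write_points([point])  # ingest a single time-stamped data point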

Even with this list, there are still plenty more, including some of my favorites. H/T DZone

Apache Samza At 1.0

Jagadish Venkatraman announces Apache Samza 1.0:

We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a distributed stream processing framework that we developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014. Fast-forward to 2018, and we currently have over 3,000 applications in production leveraging Samza at LinkedIn. The use-cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more. Today, Samza integrates not only with Apache Kafka, but also with many other systems, including Azure EventHubs, Amazon Kinesis, HDFS, ElasticSearch, and Brooklin. Multiple companies like Slack, TripAdvisor, eBay, and Optimizely have adopted Samza.

We view Samza 1.0 as a step towards our vision of making stream processing universally accessible. In this post, we describe our journey in building and scaling a distributed stream processing system. We also present the key features in Samza 1.0: a rich high-level API, event-time-based processing, integration with Apache Beam, Samza SQL, a standalone mode to run Samza without YARN, and a new test framework for Samza applications.

This runs in the same space as Spark Streaming, Flink, and Kafka Streams, so there are plenty of competitors and a lot of innovation.
