Press "Enter" to skip to content

Day: January 23, 2020

Concepts in Support Vector Machines

Abhijit Telang takes us through the calculations involved in Support Vector Machines and then gives us an example in R:

So, let’s take that out and we are back to old, classical vector algebra. It’s like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known. 

The problem is which particular shape and length must be chosen to show maximum contrast between classes.

We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).

SVM is often great for two-class classification problems, and different variants also work well for multi-class problems.

Comments closed

Log Aggregation with Apache Flink

Gyula Fora and Matyas Orhidi have started a series on log aggregation with Apache Flink:

There are several off-the-shelf solutions available on the market for log aggregation, which come with their own stack of components and operational difficulties. For example, notable logging frameworks that are widely used in the industry are ELK stack and Graylog. 

Unfortunately, there is no clear cut solution that works for every application, and different logging solutions might be more suitable for certain use cases. The log processing of real-time applications should for instance also happen in real-time, otherwise, we lose timely information that may be required to successfully operate the system.

In this blog post, we dive deep into logging for real-time applications.

This post is mostly understanding and setup, but it leads into processing and visualization.

Comments closed

Migrating Oracle Exadata Workloads to Azure

Kellyn Pot’vin-Gorman shows the process of moving from an Exadata system to Oracle on Azure:

An Exadata is an engineered system-  database nodes, secondary cell nodes, (also referred to as storage nodes/cell disks), InfiniBand for fast network connectivity between the nodes, specialized cache, along with software features such as Real Application Clusters, (RAC), hybrid columnar compression, (HCC), storage indexes, (indexes in memory) offloading technology that has logic built into it to move object scans and other intensive workloads to cell nodes from the primary database nodes.  There are considerable other features, but understanding that Exadata is an ENGINEERED system, not a hardware solution is important and its both a blessing and a curse for those databases supported by one.  The database engineer must understand both Exadata architecture and software along with database administration.  There is an added tier of performance knowledge, monitoring and patching that is involved, including knowledge of the Cell CLI, the command line interface for the cell nodes.  I could go on for hours on more details, but let’s get down to what is required when I am working on a project to migrate an Exadata to Azure.

Click through for the process.

Comments closed

Power BI: Visual has Exceeded the Available Resources

Chris Webb explains why you might see an error in Power BI:

This visual has exceeded the available resources. Try filtering to decrease the amount of data displayed.Please try again later or contact support. If you contact support, please provide these details.More details Resource Governing: The query exceeded the maximum memory allowed for queries executed in the current workload group (Requested 1048580KB, Limit 1048576KB).

The official Power BI documentation has similar advice to what’s shown in this dialog about what to do here, but what’s really going on?

The information in the “More details” section of the section dialog gives you a clue: in this case it’s resource governance. When you run a DAX query in Power BI it will always use a certain amount of memory; inefficient DAX calculations can cause a query to try to grab a lot of memory. In Power BI Desktop these queries may run successfully but be slow, but the Power BI Service can’t just let a query use as many resources as it wants (if it did, it may affect the performance of other queries being run by other users) so there is a resource governor that will kill queries that are too resource hungry. In the case of the visual above the query behind it tried to use more than 1GB of memory and was killed by the resource governor.

Read on to understand where these limits are and how you can modify them.

Comments closed

Indexes for Memory-Optimized Tables

Monica Rathbun takes us through the options available when creating indexes on memory-optimized tables:

Before we dive into this subject it is VERY important to note the biggest differences.

First, ALL memory optimized indexes MUST be created when the table is created or migrated. You cannot add indexes in an existing table without dropping and recreating the table.

Secondly, currently you can only have 8 indexes per table including your primary key. Remember that every table must have a primary key to enforce a secondary copy for a minimum of schema durability This  means you can only really add 7 additional indexes so be sure to understand your workloads and plan indexing accordingly.

There are a few other differences as well, which Monica covers before detailing the specific index options.

Comments closed

Solving the Gaps and Islands Set of Problems

Ed Pollack continues a series on gap and island analysis:

Gaps and islands analysis supplies a mechanism to group data organically in ways that a standard GROUP BY cannot provide. Once we know how to perform an analysis and group data into islands, we can extend this into the realm of real data.

For all code examples in this article, we will use a set of baseball data that I’ve created and maintained over the years. This data is ideal for analytics as it is large and contains data quality that varies between very accurate and very sloppy. As a result, we are forced to consider data quality in our work, as well as scrutinize boundary conditions for correctness. This data will be used without much introduction as we will only reference two tables, and each is relatively straightforward.

The code in this article gets a bit complex, but Ed shows off some powerful techniques.

Comments closed

Copying Measure Definitions in Power BI

Erik Svensen takes us through an oddity in Power BI’s user interface:

Here is an idea you can vote for if you would find it useful as well – https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/13219620-duplicate-measure-and-format-copy

So we end up copying the formula from text in the formula bar

And click new measure and Paste it into the formula bar

But 8 of 10 times nothing is pasted (at least when I select) – WHY ???

This is a strange user experience. But regardless, I find it odd that you can’t copy a measure definition. If this is odd to you as well, upvote the Power BI suggestion.

Comments closed