Press "Enter" to skip to content

Month: July 2016

Fashion Design And Genetic Algorithms

Daragh Sibley, et al., discuss using genetic algorithms to help design clothing:

Recombination. Having selected a set of high-performing blouses, we can now consider how they should be recombined to form a new child. While a traditional genetic algorithm would stochastically search all combinations over many market generations, we can shortcut that process by algorithmically looking for features that have been historically preferred by our target client segment.

To achieve this, we find statistical regularities between the population of blouses' attributes (or configurations of attributes) and client feedback. For instance, we can model the relationship between attributes of our existing blouses and the feedback they received.

Genetic algorithms (and Koza-style genetic programming) have long been a favorite topic of mine. Integrating GAs with fashion design is not something that would have come to mind, but it's a very interesting application.
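Purely as illustration, the "statistical regularities" step has a simple relational analogue. The tables and columns below are hypothetical, not from the original post:

-- Hypothetical tables: one row per blouse attribute, one row per piece of
-- client feedback. Score each attribute value by the average rating of the
-- blouses that carry it; top-scoring values become candidate material for
-- recombination into a new design.
SELECT a.AttributeName,
       a.AttributeValue,
       AVG(CAST(f.Rating AS DECIMAL(4, 2))) AS AvgRating,
       COUNT(*) AS FeedbackCount
FROM dbo.BlouseAttributes AS a
    INNER JOIN dbo.ClientFeedback AS f
        ON f.BlouseId = a.BlouseId
GROUP BY a.AttributeName, a.AttributeValue
ORDER BY a.AttributeName, AvgRating DESC;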


Altering Columns In Large Tables

Kenneth Fisher discusses a problem he had with altering a column on a large table:

My first attempt was just a straight ALTER TABLE ALTER COLUMN. After about an hour I got back a log-full error. I then tried a 200 GB log and a 350 GB log. These failed at 3 and 5 hours. While this was going on, I checked on #sqlhelp to see if anyone knew any way to minimize the log usage so my command would finish.

The primary suggestions were:

  • Add a new column to the end of the table, populate it in batches, then remove the old column.
  • Create a new table, populate it, index it, drop the old table, rename the new table.

I will say that I have used suggestion #1 several times, particularly in zero-downtime situations. Once you're done backfilling the column, you can drop the old one and rename the new one in a single transaction. Read on for Kenneth's solution.
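Here is a minimal sketch of approach #1, assuming a hypothetical dbo.BigTable whose Id column needs to widen from INT to BIGINT; the names and batch size are illustrative:

-- Step 1: add the replacement column (nullable, so this is a metadata-only change).
ALTER TABLE dbo.BigTable ADD IdNew BIGINT NULL;
GO

-- Step 2: backfill in small batches so each transaction stays short and,
-- given log backups or simple recovery, the log can truncate as you go.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) dbo.BigTable
    SET IdNew = Id
    WHERE IdNew IS NULL;

    SET @rows = @@ROWCOUNT;
END
GO

-- Step 3: drop the old column and rename the new one in a single transaction.
BEGIN TRANSACTION;
    ALTER TABLE dbo.BigTable DROP COLUMN Id;
    EXEC sp_rename 'dbo.BigTable.IdNew', 'Id', 'COLUMN';
COMMIT;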


Top-Down ETL With PowerShell

Max Trinidad continues his series on top-down ETL using PowerShell:

After all the previous functions have been loaded, just type the following one-liner:

Process-PSObjectToSQL -SQLDataObj $SQLData;

This sample script code can serve as a starter template for loading data into SQL Server.

This sample SQL data load will fail. Here's where Try/Catch/Finally works for you, trapping what went wrong and providing the additional information you need to troubleshoot and fix the problem.
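Max's trapping happens in PowerShell, but the same pattern exists on the T-SQL side; here's a minimal sketch against a hypothetical target table:

BEGIN TRY
    -- Hypothetical load that may fail (say, with a constraint violation).
    INSERT INTO dbo.TargetTable (Id, Payload)
    VALUES (1, 'example row');
END TRY
BEGIN CATCH
    -- Capture the details needed to troubleshoot the failed load.
    SELECT ERROR_NUMBER()    AS ErrorNumber,
           ERROR_MESSAGE()   AS ErrorMessage,
           ERROR_PROCEDURE() AS ErrorProcedure,
           ERROR_LINE()      AS ErrorLine;
    THROW;  -- re-raise so the caller still sees the failure
END CATCH;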

Parts one and two are available as well.


Parallelism Configuration Options

Kendra Little discusses max degree of parallelism and cost threshold for parallelism:

When you run a query, SQL Server estimates how “expensive” it is in a fake costing unit, let’s call it Estimated QueryBucks.

If a query’s Estimated QueryBucks is over the “Cost Threshold for Parallelism” setting in SQL Server, it qualifies to potentially use multiple processors to run the query.

The number of processors it can use is defined by the instance level “Max Degree of Parallelism” setting.

When writing T-SQL, you can specify MAXDOP for individual statements as a query hint, to say that if that query qualifies to go parallel, it should use the number of processors specified in the hint and ignore the server-level setting. (You could use this to make it use more processors, or to never go parallel.)
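To make the two knobs concrete, here is what they look like in T-SQL; the values and table name are illustrative, not recommendations:

-- Instance-level settings.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;  -- in Estimated QueryBucks
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;

-- Statement-level override: if this query qualifies to go parallel,
-- cap it at four processors regardless of the instance-level setting.
SELECT COUNT(*)
FROM dbo.SomeLargeTable
OPTION (MAXDOP 4);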

Read the whole thing, or watch/listen to the video.


Storm 1.0 Microbenchmarks

Roshan Naik and Sapin Amin have Storm 1.0 benchmarks on a small cluster:

Numbers suggest that Storm has come a long way in terms of performance, but it still has room to go faster. Here are some of the broad areas that should improve performance in the future:

  • An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in Clojure code.

  • Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about the way in which topologies are scheduled on the cluster.

  • Based on microbenchmarking and discussions with other Storm developers, there appears to be potential for streamlining the internal queueing for faster message transfer.

  • Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that could reduce inter-task messaging and improve throughput.

Even before these potential improvements land, Storm has come a long way: their benchmarks show around 5x the throughput and a tiny fraction of the latency of Storm 0.9.1.


New JDBC Driver

Microsoft has released a new version of their SQL Server JDBC driver:

Table-Valued Parameters (TVPs)

TVP support allows a client application to send parameterized data to the server more efficiently by sending multiple rows to the server with a single call. You can use the JDBC Driver 6.0 to encapsulate rows of data in a client application and send the data to the server in a single parameterized command.
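The client-side API is Java, but the server-side objects a TVP call targets are plain T-SQL. The type and procedure below are hypothetical examples, not from Microsoft's announcement:

-- A user-defined table type describes the shape of the rows the client sends.
CREATE TYPE dbo.OrderLineType AS TABLE
(
    OrderId  INT,
    Quantity INT
);
GO

-- A procedure accepts the whole row set as a single READONLY parameter.
CREATE PROCEDURE dbo.InsertOrderLines
    @Lines dbo.OrderLineType READONLY
AS
BEGIN
    INSERT INTO dbo.OrderLines (OrderId, Quantity)
    SELECT OrderId, Quantity
    FROM @Lines;
END;

The JDBC Driver 6.0 then binds all of the client-side rows to @Lines in one parameterized call.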

There are a couple of interesting features in this driver that could help your Hadoop cluster play nice with SQL Server.


Limitations Of Statistics

Erik Darling discusses the limitations of statistics created outside of indexes:

Head rush moment: SQL may use information from histograms outside of the index it chooses for cardinality estimation.

Back to earth: If you just create a bunch of statistics instead of indexes, you’re (at best) using your Clustered Index for everything (which is still bad), or you’re using a HEAP for everything (which is usually worse). You’re still generally better off creating good indexes for your workload. They’ll get statistics objects created and associated with them, and if SQL thinks another column is interesting, it will create a single column statistics object for it, as long as you haven’t turned off auto create stats.

Sure, you can put on your black cloak and goat mask and create some multi-column or filtered statistics, but in the words of a wise man (Doug), you end up with more stats to maintain and understanding query behavior gets more difficult.

Erik argues that at the margin, keeping existing statistics up to date tends to be a better solution than creating new statistics.
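For concreteness, here are the kinds of statistics objects under discussion, against a hypothetical dbo.Sales table:

-- Multi-column statistics (the black-cloak-and-goat-mask option).
CREATE STATISTICS st_Sales_Region_Product
    ON dbo.Sales (Region, ProductId);

-- Filtered statistics.
CREATE STATISTICS st_Sales_Recent
    ON dbo.Sales (OrderDate)
    WHERE OrderDate >= '20160101';

-- Erik's preferred alternative: keep the statistics you already have up to date.
UPDATE STATISTICS dbo.Sales WITH FULLSCAN;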


Biml Object Model

Bill Fellows begins a series on exploring the Biml Object Model:

The most important precursor to doing this is that you will need a licensed installation of Mist. Full stop. The assemblies we're going to use have security built into them to tell whether they are licensed, and you cannot use the assemblies shipped with BidsHelper or BimlExpress, as they're hardwired to those specific apps.

We’re going to use two classes: AstRootNode and AstPackageNode.

Ast, what is that? Abstract Syntax Tree – it’s a compiler theory thing.

I’m interested in seeing where this series goes.


Tabular Caching

Bill Anton discusses using the Formula Engine cache with Analysis Services tabular models:

If tabular data is already in memory, what’s the point of having a cache at all? Memory is memory, right? Both are in main memory and access speed is the same, right?

Good question! Yes, access speed is the same. However, there are other benefits to a cache store.

For example, even though the data is already in memory, queries against the tabular model can still be slow… very slow even… usually in cases where the execution of the query is bound to the single-threaded formula engine. To be sure, this is not a tabular-specific problem… formula-engine-bound queries can be (and commonly are) found in both technologies, and the issue (usually) speaks more to the design of the model and/or the way the query is written (be that DAX for tabular or MDX for multidimensional). That said, performance problems related to FE-bound queries can be terribly difficult to resolve, as they usually require a redesign of the data model and a rewrite of the query or measure(s) involved.

Bill points out the limitations of this solution, but within those limitations this looks like it could be a huge time-saver for end users.
