Press "Enter" to skip to content

Day: October 4, 2018

Stateful Services With Kubernetes

Kevin Sookocheff explains some scenarios in which stateful Kubernetes services can work well:

With leader election, you begin with a set of candidates that wish to become the leader and each of these candidates race to see who will be the first to be declared the leader. Once a candidate has been elected the leader, it continually sends a heart beat signal to keep renewing their position as the leader. If that heart beat fails, the other candidates again race to become the new leader. Implementing a leader election algorithm usually requires either deploying software such as ZooKeeper, or etcd and using it to determine consensus, or alternately, implementing a consensus algorithm on your own. Neither of these are ideal: ZooKeeper and etcd are complicated pieces of software that can be difficult to operate, and implementing a consensus algorithm on your own is a road fraught with peril. Thankfully, Kubernetes already runs an etcd cluster that consistently stores Kubernetes cluster state, and we can leverage that cluster to perform leader election simply by leveraging the Kubernetes API server.

Kubernetes already uses the Endpoints resource to represent a replicated set of pods that comprise a service and we can re-use that same object to retrieve all the pods that make up your distributed system. Given this list of pods, we leverage two other properties of the Kubernetes API: ResourceVersions and Annotations. Annotations are arbitrary key/value pairs that can be used by Kubernetes clients, and ResourceVersions mark the unique version of every Kubernetes resource in the cluster. Given these two primitives, we can perform leader election in a fairly straightforward manner: query the Endpoints resource to get the list of all pods running your service, and set Annotations on those resources. Each change to an Annotation also updates the ResourceVersion metadata. Because the Kubernetes API server is backed by etcd, a strongly consistent datastore, you can use Annotations and the ResourceVersion metadata to implement a simple compare-and-swap algorithm.

Google has used this approach to implement leader election as a Kubernetes Service, and you can run that service as a sidecar to your application to perform leader election backed by etc. For more on running a leader election algorithm in Kubernetes, refer to this blog post.

This is one of the parts that container services like Docker are striving to answer, but I don’t think they have it quite nailed down yet.

Comments closed

Multi-Class Classification With vtreat

John Mount has an example of using the vtreat package for multi-class classification in R:

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition vtreat and can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Click through for an example of this in action.

Comments closed

Azure Data Factory Or Integration Services?

Teo Lachev contrasts use cases for Integration Services vesus Azure Data Factory V2:

So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately once that message made it out there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, or course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credits, ADF v2 has made significant enhancements in features, usability, and maintainability. There is an also a “lift and shift” option to run SSIS inside ADF but since this architecture requires a VM, I consider it a narrow case scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?

All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.

Comments closed

Forcing A Plan Is A Temporary Solution

Erin Stellato explains when she forces plans—and that this is not a permanent solution to a performance problem:

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution.  I don’t expect you to have plans forced for years, let alone months.  The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production.  If you go the “set it and forget it route”, theoretically a manually forced plan could get used for a very long time.  In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query.  I would be checking every couple weeks; once a month at most.  Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this).  When you force a plan manually, forcing can still fail.  For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in plan in the same manner, then forcing will fail.  Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail!  If you’re forcing plans and not familiar with the reasons that it can fail, note the last_force_failure_reason values listed for sys.query_store_plan.  If you have manually forced a plan for a query, and the force plan fails, it remains forced.  You have to manually un-force it to stop SQL Server from trying to use that plan.  As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.

There is much sound advice in this post.

Comments closed

Azure Data Factory V2 Dependencies

Meagan Longoria has important notes on how Azure Data Factory V2 Dependencies differ from SQL Server Integration Services precedent constraints:

This sounds similar to SSIS precedence constraints, but there are a couple of big differences.

  1. SSIS allows us to define expressions to be evaluated to determine if the next task should be executed.
  2. SSIS allows us to choose whether we handle multiple constraints as a logical AND or a logical OR. In other words, do we need all constraints to be true or just one.

ADF V2 activity dependencies are always a logical AND. While we can design control flows in ADF similar to how we might design control flows in SSIS, this is one of several differences. Let’s look at an example.

Meagan gives us three methods of replicating SSIS functionality using ADF V2, so check it out.

Comments closed