Tuesday, February 18, 2020

Down the rabbit hole: Template interface around a C library (libibverbs) in C++

Calling C code from within C++ happens quite a lot. With isolation of coding standards in mind, it makes sense to try to keep the C++ code separate from the C code. How?

Here I’m going to create an interface around libibverbs, to be able to frame the issue around a real-world problem and as you might guess from the title, this post has nothing else to do with libibverbs. I won’t be discussing its internals, neither am I going to assume any knowledge about it.

One problem, however, is running the examples. Since you probably don’t have infiniband hardware on your laptop, you won’t be able to run them successfully. You will be able to compile them by installing libibverbs through your distro’s package manager, and you’ll at least see the error handling in action, since the calls will fail when there is no device on your everyday computer for the userspace library to talk to.

Other libraries with “resource-based” APIs can probably be treated similarly. I just don’t want to invent an imaginary C API and cast doubt on the usefulness of the discussion.

Resource based APIs?

Many C libraries, especially the ones that provide access to a particular piece of hardware or to a resource at a lower level of abstraction, provide a resource management API that works in pairs: there are function calls that create/allocate/whatever the resources, and there are ones that delete/free/etc. them. I think the popularity of this way of thinking can be traced back to the UNIX syscall API and how its terminology has propagated up the stack as we’ve built different layers of abstraction.

libibverbs, the userspace library for talking to infiniband devices, works this way too (for the most part). The first thing you’d probably do while working with this library is open a device:

// $ man ibv_open_device
       struct ibv_context *ibv_open_device(struct ibv_device *device);
       // returns a pointer to the allocated device context, or NULL if the request fails.
       int ibv_close_device(struct ibv_context *context);
       // returns 0 on success, -1 on failure.

One thing you’ll probably do after opening a device is allocating a protection domain:

       struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);
       // returns a pointer to the allocated PD, or NULL if the request fails.
       int ibv_dealloc_pd(struct ibv_pd *pd);
       // returns 0 on success, or the value of errno on failure.

We don’t need to discuss what these mean. The point is these and many other resources work this way.

An interesting but rather natural note you’ll come across in the man pages says that if you happen to free these resources in the wrong order, say, ibv_close_device the struct ibv_context* before you get rid of an ibv_pd corresponding to that context with ibv_dealloc_pd, there might be errors. Reading on, you’ll see why this reinforces the argument that we’d better encapsulate this logic when writing code in C++.

(Psst.. If you know what RAII is click here to skip over the following material that you’re probably familiar with so you won’t kill me the next time we meet.)

The problem

There’s nothing wrong with the above API per se. In fact, there’s barely enough going on for anything to go wrong. But just from the above “look and feel”, you’re probably already concerned about using these in your C++ application, because of the same old dilemma: who’s responsible for cleaning up whom by calling what. The delicate task of resource management in the face of possible errors has to be done manually, and with the imposed order on creating/deleting the resources, it’s likely as complex as all the other low-level resource management scenarios (e.g. file descriptors) for which we have wrappers in the C++ standard library (e.g. std::fstream).

All the complexity, machinery, rules, and conventions in C++, alongside a number of the books about the language, are there to accomplish probably two and a half things, one of which is that in the following piece of code (from cppreference) …

#include <iostream>
#include <fstream>
#include <string>
int main() {
  std::string filename = "Test.b";
  {
    std::ofstream ostrm(filename, std::ios::binary);
    double d = 3.14;
    ostrm.write(reinterpret_cast<char*>(&d), sizeof d); // binary output
    ostrm << 123 << "abc" << '\n';                      // text output
  }
  // read back
  std::ifstream istrm(filename, std::ios::binary);
  double d;
  istrm.read(reinterpret_cast<char*>(&d), sizeof d);
  int n;
  std::string s;
  istrm >> n >> s;
  std::cout << " read back: " << d << " " << n << " " << s << '\n';
}
… you (unless willing to deal with consequences (exceptions)) can refrain from explicitly calling close on the std::ifstream and the std::ofstream objects after you’re done with them. Why this is desired and how managing resources correctly (mainly defining their construction, duplication, and destruction) makes a lot of things click and fall into place (e.g. the fact that there’s no finally block in C++) is a longer story, however by accepting that there’s value in managing resources like this, we now have the incentive to create a C++ friendly interface around libibverbs’s resources.

A simple take

From this point on I’ll be referencing code that lives at https://github.com/farnasirim/c-iface. This section’s code is under c-iface/a-simple-take.

(If you know how to write a class like std::unique_ptr from scratch, click here to skip over some stuff you already know)

The first thing that comes to my mind: let’s encapsulate resource management. As an example, let’s try to encapsulate the management of an ibv_pd. Here’s one way:

// c-iface/a-simple-take/ib.h
#ifndef IB_H_
#define IB_H_
#include <infiniband/verbs.h>
class IbPd {
 public:
  IbPd(struct ibv_context *);
  ~IbPd();
  struct ibv_pd *get() const;

  IbPd(const IbPd&) = delete;
  IbPd& operator=(const IbPd&) = delete;

 private:
  struct ibv_pd *pd_;
};
#endif  // IB_H_
// c-iface/a-simple-take/ib.cc
#include "ib.h"
#include <exception>
IbPd::IbPd(struct ibv_context *ib) {
  struct ibv_pd *maybe_pd = ibv_alloc_pd(ib);
  if(maybe_pd == NULL) {
    // for example
    throw std::exception();
  }
  pd_ = maybe_pd;
}
struct ibv_pd *IbPd::get() const {
  return pd_;
}
IbPd::~IbPd() {
  if(pd_ != NULL) {
    int status = ibv_dealloc_pd(pd_);
    if(status) {
      // Oops, cannot throw out of the dtor. Not much we can do anyway.
    }
  }
}
// c-iface/a-simple-take/main.cc
#include <infiniband/verbs.h>
#include "ib.h"
int main() {
  // In the real world this will be an actual ibv_context obtained by calling
  // `ibv_open_device`.
  struct ibv_context *device_context = NULL;
  IbPd pd(device_context);
  // Pass the underlying pointer where required in the future: pd.get()
  return 0;
}
A few important things to notice:
  • We had to explicitly delete IbPd(const IbPd&) = delete;. Otherwise we would have been allowed to write code like IbPd duplicated_pd = pd; in main(). This would be a disaster, as we’d be ibv_dealloc_pding the same underlying ibv_pd twice when the two IbPds (pd and duplicated_pd) eventually go out of scope. A similar story holds for the assignment operator, although there we could manage the underlying resource in a well-defined manner and provide a correct implementation. That’s not the main discussion here, so I’ve deleted it for now.
  • Notice that we were unable to effectively handle the possible error in the destructor. Pretty much the only solution would be providing a public method IbPd::close(), similar to fstream::close(), allowing the caller to explicitly run the cleanup and catch any exceptions it throws. This issue is important, but it’s orthogonal to what I’m discussing in this post. It will be present in the other code snippets too, but I’m going to ignore it here to concentrate on the main problem.
  • The fact that ibv_alloc_pd is being called behind the scenes is, well, hidden from the API user.
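As an aside on that second bullet, the close() idea can be sketched against a stand-in resource (fake_pd and fake_dealloc_pd below are made up for the demo, so you don’t need infiniband hardware to try it):

```cpp
#include <stdexcept>

// A stand-in for ibv_pd / ibv_dealloc_pd, made up for this demo: the deleter
// can fail, just like ibv_dealloc_pd can return non-zero.
struct fake_pd { bool poisoned; };

int fake_dealloc_pd(fake_pd *pd) { return pd->poisoned ? -1 : 0; }

class FakePd {
 public:
  explicit FakePd(fake_pd *pd): pd_(pd) { }

  // Explicit close, analogous to fstream::close(): run the deleter early and
  // surface the error to the caller instead of swallowing it in the dtor.
  void close() {
    if (pd_ == nullptr) return;
    fake_pd *pd = pd_;
    pd_ = nullptr;  // the destructor is now a no-op
    if (fake_dealloc_pd(pd)) {
      throw std::runtime_error("fake_dealloc_pd failed");
    }
  }

  ~FakePd() {
    if (pd_ != nullptr) {
      fake_dealloc_pd(pd_);  // error ignored: can't throw out of a dtor
    }
  }

  FakePd(const FakePd&) = delete;
  FakePd& operator=(const FakePd&) = delete;

 private:
  fake_pd *pd_;
};
```

Callers who care about the deallocation result call close() themselves; everyone else falls back to the best-effort destructor.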

A decent take

(No forward links this time: Even if you’re a C++ wizard, for the love of God read this section so you’ll know what the discussion is about)

We can greatly simplify IbPd by using std::unique_ptrs:

// c-iface/a-decent-take/ib.h
#ifndef IB_H_
#define IB_H_
#include <functional>
#include <memory>
#include <infiniband/verbs.h>
class IbPd {
 public:
  IbPd(struct ibv_context *);
  struct ibv_pd *get() const;

 private:
  std::unique_ptr<struct ibv_pd, std::function<void(struct ibv_pd*)>> pd_;
};
#endif  // IB_H_
// c-iface/a-decent-take/ib.cc
#include "ib.h"
#include <exception>
IbPd::IbPd(struct ibv_context *ib):
    pd_(nullptr,
      [](struct ibv_pd *pd) {
        if(pd != NULL) { ibv_dealloc_pd(pd); /* ignore return */ }
      }) {
  struct ibv_pd *maybe_pd = ibv_alloc_pd(ib);
  if(maybe_pd == NULL) {
    // for example
    throw std::exception();
  }
  pd_.reset(maybe_pd);
}
struct ibv_pd *IbPd::get() const {
  return pd_.get();
}
Once more:
  • This makes things much simpler. There’s no need to remember out of the blue that the copy constructor has to be deleted: the compiler cannot automatically generate it for us now, since std::unique_ptr deletes its own. The dangerous IbPd duplicate_pd = pd; will therefore not compile.
  • We have to remember to override std::unique_ptr’s deleter to keep the disastrous default, namely [](struct ibv_pd *pd) { delete pd; }, from taking place upon destruction.
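The same “override the deleter” point applies to any C-style create/destroy pair. Here it is with an API everyone has around, fopen/fclose, where the default delete would be just as disastrous:

```cpp
#include <cstdio>
#include <memory>

// A FILE* obtained from fopen must be released with fclose, not delete, so
// the default deleter of std::unique_ptr has to be replaced here too.
using FileHandle = std::unique_ptr<std::FILE, int (*)(std::FILE*)>;

FileHandle open_file(const char *path, const char *mode) {
  // fclose's int return value is ignored by unique_ptr, same caveat as above.
  return FileHandle(std::fopen(path, mode), &std::fclose);
}
```

Whether the file gets closed no longer depends on anyone remembering to call fclose on every return path.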

This is still far from good: the next thing you’ll do after creating a protection domain (pd) is creating a memory region (mr), and before you’re able to do that, you’ll have to write a whole bunch of boilerplate code to encapsulate the mr creation too:

       ibv_reg_mr, ibv_dereg_mr - register or deregister a memory region (MR)
       struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access);
       int ibv_dereg_mr(struct ibv_mr *mr);

So here we go again.

class IbMr {
 public:
  IbMr(const IbPdHandle&, void*, size_t, int);

 private:
  std::unique_ptr<struct ibv_mr, void_deleter<ibv_mr>> ibv_mr_;
};
IbMr::IbMr(const IbPdHandle& pd, void* addr, size_t length, int flags):
    ibv_mr_(nullptr, ibv_dereg_mr /* return value will be ignored by std::unique_ptr */) {
  struct ibv_mr *mr = ibv_reg_mr(pd.ibv_pd_.get(), addr, length, flags);
  if(mr == NULL) {
    // ... the same error handling dance as before
  }
}

That’s too much duplicate boilerplate code to achieve a seemingly simple, or at least easy to describe (in human language) operation. There has to be a better way.

Enter templates

Imagine an xkcd about C++ templates here. (I’m actually surprised that I couldn’t find one)

Well I wish I could explain this final part in actual snapshots, showing how I actually came up with it. Unfortunately I found myself unable to switch back and forth between documenting the process and actually going forward. Even if I had written everything down, my failed solutions would have made a huge pile of code, only useful for checking to see if your compiler will include specific personal life threats against you among the compile errors.

Therefore I’ll take a “horizontal” approach here, presenting the fragments of the solution component by component. Caveat: Some steps may seem magical. Bear with me.

What would make me happy

I’d like to refrain from writing multiple resource handler classes by hand. Here’s roughly what I’d like to do in ib.h to be able to use interfaces to ibv_alloc_pd and ibv_reg_mr in my C++ application:

using IbPd = some_magical_template<ibv_alloc_pd, maybe some other params>;
using IbMr = some_magical_template<ibv_reg_mr, maybe some other params>;

A possibly controversial design decision that I made was a one-to-one mapping from C++ classes to resource function pairs in the C library. That is, instead of having, for example, an IbPd class, I went for creating an IbAllocPd from the ibv_alloc_pd resource allocator. And I don’t want maybe some other params to be too expressive and “difficult” to write. Oh no. I want it to be a constant number of parameters, only dependent on the names of the allocator and the deleter. We’re way past the point of sanity. Let’s get crazy.

I hear you shouting and you have every right to hate me now:
  • Q: What about resources that can be created using multiple functions? (e.g. ibv_reg_mr, ibv_reg_mr_iova)
    A: Either forget about them for now (don’t use them interchangeably, and/or duplicate your interfaces, which surprisingly does work in most scenarios I thought about), or… well, give up and create a new resource type (e.g. IbPd), make the two “factory” classes inherit from the resource class, and use the resource superclass in interfaces. (This is still not a “solution”, because a decision to add a new “factory” for a type that already has one requires a surgery on every single interface that has been using the previous one.)

  • Q(Hyperbole? Concern?): Class names will end up sounding like function names.
    A: Well I would have liked them to sound like actual resource names too, but making them just a transliteration of the ibv_* functions makes writing the class definitions like using IbMr = some_magical_template<ibv_reg_mr, maybe some other params>; that go in ib.h almost trivial: no more decision making. If you have a pair of resource allocator/deleters, you can quickly get your resource manager class in compile time as a template instance. The only complex task that we have to go through to add a new function to the interface would be capitalizing the original name and removing the underscores from it, for which you can type the following into bash: sed -Ee 's/(^.)?(_.)?/\U\1\2\E/g' <(echo "ibv_function_name"). But really, naming them anything other than the function that is actually being called will mask important information at the call site. Why not make the constructor call self documenting?

I’ll explain how I did it, but it’s dangerous to go alone: we’ll need a couple of handy tricks up our sleeves.

Capturing variadic template args

It’s easy to realize that we’ll need a way to somehow capture any number of parameters of any type, as the resource allocator functions come in all sorts of shapes and sizes:

       struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access);
       struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);

(Once again, the code is here if you’d like to clone and play around https://github.com/farnasirim/c-iface.)

This is not trivial because we’re not able to directly “name” the variadic args. Let’s write the following FnpTypes traits class to capture the return type and the argument types of any function pointer using template specialization:

// 1
template<typename FactoryType>
struct FnpTypes;
// 2
template <typename Res, typename... Args>
struct FnpTypes<Res (*)(Args...)> {
  using ReturnType = Res;
  using FuncArgs = Args; // or = Args...; won't compile either way
};

With the hope of using the likes of FnpTypes<decltype(&ibv_open_device)>::ReturnType or FnpTypes<decltype(&ibv_reg_mr)>::FuncArgs later on.

During the template declaration (1), we say: hey, FnpTypes is a struct template and takes a single type argument. That’s all. EXCEPT we’ll only provide the definition for function pointers (2). If you thought you knew what template specialization was, but you’re still skeptical of what’s going on with the template parameters in 2, have a look here:

// 1
template<typename T>
class Vector;
// 2
template<>
class Vector<int> {
    // definition of a vector class, only for ints
};
It’s the exact same thing and now we can conclude that the template params in 2 are just “parametrizing” (=helping to define) the type(s) that the specialization will apply to.
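To convince yourself that the ReturnType half of this pattern matching already works, here’s a minimal self-contained check (ReturnTypeOf and half are names I made up for the demo):

```cpp
#include <type_traits>

template<typename FactoryType>
struct ReturnTypeOf;

// Specialization only for function pointers: Res and Args... "pattern match"
// the pieces of the function pointer type apart.
template<typename Res, typename... Args>
struct ReturnTypeOf<Res (*)(Args...)> {
  using type = Res;
};

double half(int x) { return x / 2.0; }

// The compiler agrees that the return type was extracted correctly:
static_assert(std::is_same<ReturnTypeOf<decltype(&half)>::type, double>::value,
              "return type extracted from the function pointer");
```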

Back to the main issue: using FuncArgs = Args; won’t compile. Let’s go to the solution:

// c-iface/a-template-take/fnp_types.h
#ifndef FNP_TYPES_H_
#define FNP_TYPES_H_
template<typename... Args>
struct Pack { };
template<typename FactoryType>
struct FnpTypes;
template <typename Res, typename... Args>
struct FnpTypes<Res (*)(Args...)> {
  using ReturnType = Res;
  using PackedArgs = Pack<Args...>;
};
#endif  // FNP_TYPES_H_

How would this help? Well, first of all, its mere presence doesn’t cause a compile error. More importantly, the Pack can be pulled away by pattern matching in the type parameters. Here’s an example usage:

#include <functional>
#include <utility>
// The second template parameter has a default type of ↓
template<typename FunctionPointer, typename = typename FnpTypes<FunctionPointer>::PackedArgs>
class Example;
using SpecType = int (*)(int);
// Example specialization, where the first argument is "int (*)(int)", and the
// second one "pattern matches away" the `Pack<>`, and names the inner
// `Args...`, allowing us to use it directly:
template<typename... Args>
class Example<SpecType, Pack<Args...>> {
 public:
  Example(SpecType fptr): fptr_(fptr) { }
  int forwarder(Args&&... args) {
    return fptr_(std::forward<Args>(args)...);
  }

 private:
  SpecType fptr_;
};

(On a side note, if you liked how that little arrow was correctly aligned with the start of the type description, you’re gonna like the alt text of https://xkcd.com/688/)

An example usage of the above:

#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
int main() {
  Example<decltype(&close)> e(close);
  e.forwarder(open("/dev/stdin", O_RDONLY));  // close(open(...));
  return 0;
}

This concludes our trait for type utilities:

// c-iface/a-template-take/fnp_types.h
#ifndef FNP_TYPES_H_
#define FNP_TYPES_H_
template<typename... Args>
struct Pack { };
template<typename FactoryType>
struct FnpTypes;
template <typename Res, typename... Args>
struct FnpTypes<Res (*)(Args...)> {
  using ReturnType = Res;
  using PackedArgs = Pack<Args...>;
};
#endif  // FNP_TYPES_H_

One more thing that’s not hard to realize: we need some way of intervening when the creation/deletion of a resource goes wrong, at least for debugging purposes. For now let’s assume that, for all of the resource creation/deletion functions, we’re going to log a message and exit the program if anything goes wrong. Here’s one way to do this:

// c-iface/a-template-take/ib_utils.h
#ifndef IB_UTILS_H_
#define IB_UTILS_H_
#include <cstdlib>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
template<typename T>
using VoidDeleter = std::function<void(T*)>;
template<typename T>
VoidDeleter<T> int_deleter_wrapper(int (*orig_deleter)(T*), std::string msg) {
  return [orig_deleter, msg_capture = std::move(msg)](T *obj) -> void {
    if(orig_deleter(obj)) {
      std::cerr << msg_capture << " failed" << std::endl;
      std::exit(EXIT_FAILURE);
    }
  };
}
template<typename Res, typename ...Args>
using ResourceFactory = std::function<Res*(Args...)>;
template<typename Res, typename ...Args>
ResourceFactory<Res, Args...> factory_wrapper(Res *(*orig_factory)(Args...), std::string msg) {
  return [orig_factory, msg_capture = std::move(msg)](Args&&... args) -> Res* {
    Res *ret = orig_factory(std::forward<Args>(args)...);
    if(ret == NULL) {
      std::cerr << msg_capture << " failed" << std::endl;
      std::exit(EXIT_FAILURE);
    }
    return ret;
  };
}
#endif  // IB_UTILS_H_

All they do is forward the args to the underlying function pointer and print a message if anything goes wrong. The usage would look something like:

    auto new_deleter = int_deleter_wrapper(ibv_dealloc_pd, "ibv_dealloc_pd");

And then you’ll probably pass the new_deleter as the deleter to a std::unique_ptr somewhere.

But that requires writing that little string literal. Why do I not like it? The compiler will never complain about its correctness. (What if you copy/paste the above line for different types and forget to change the message string?) So how do we do this? (If you’re thinking of using macros, click the button to claim your reward: )

// c-iface/a-template-take/fnp_traits.h
#ifndef FNP_TRAITS_H_
#define FNP_TRAITS_H_
#include <typeinfo>
#include <algorithm>
#include <string>
// Want to accept any function pointer as a value:
// For every type T, accept the type and a value of that type: template<typename T, T f>
// As you can see this is not "tight": f can have any type really, but that's
// good enough for a function that will only be used for debugging purposes.
// And if you're concerned, you can make it as "tight" as you need with very
// little work.
template<typename T, T f>
struct FnpTraits {
  static std::string name() {
    std::string pretty_name = __PRETTY_FUNCTION__;
    // The value of f will be written in it like so:
    // f = ibv_alloc_pd
    std::string start_marker = "f = ", maybe_end_chars = ";,]";
    // Different delimiters are there because of clang/gcc, and the possibility
    // that f might be the last thing written in __PRETTY_FUNCTION__,
    // as it will look something like [ T = text text, f = text text ].
    auto start_index = pretty_name.find(start_marker) + start_marker.size();
    auto end_index = pretty_name.find_first_of(maybe_end_chars, start_index);
    return pretty_name.substr(start_index, end_index - start_index);
  }
};
#endif  // FNP_TRAITS_H_

This is not exactly the same thing, but it will extract the name of f from the pretty name, and it works in both gcc and clang (and probably nothing else). Of course you’ll not use this for anything serious, right? RIGHT?

But seriously, these last two class templates work well for debugging purposes and make things convenient, especially since they allow you to specialize per class for a correct/different behavior.

Believe it or not, at this point we have all we need for the IbResource class template that will provide a C++ style resource management scheme for our C types from libibverbs:

// c-iface/a-template-take/ib.h
#ifndef IB_H_
#define IB_H_
#include <memory>
#include <functional>
#include <type_traits>
#include <utility>
#include <infiniband/verbs.h>
#include "fnp_traits.h"
#include "fnp_types.h"
#include "ib_utils.h"
// Ask for the type of the Factory (allocator func), and the factory itself.
// Same goes for the deleter.
// Finally, using the packing technique, ask for the Args that are
// extracted from the function pointer in the `FnpTypes` trait.
template<typename Factory, Factory f, typename Deleter, Deleter d,
  typename = typename FnpTypes<decltype(f)>::PackedArgs>
class IbResource;
// The only reason that we need to specialize (as opposed to defining the
// template right away) is pattern matching the Args... out of
// FnpTypes<decltype(f)>::PackedArgs.
template<typename Factory, Factory f, typename Deleter, Deleter d, typename... Args>
class IbResource <Factory, f, Deleter, d, Pack<Args...>> {
 public:
  // Removing the pointer, and then working with the type makes things a bit
  // more readable in the unique_ptr and the get function. Again, we'll have to
  // do a bit more work to do the correct thing in the rare cases of functions
  // that return a struct as opposed to a pointer to a struct.
  using ResourceType = std::remove_pointer_t<typename FnpTypes<decltype(f)>::ReturnType>;

  IbResource(Args... args): ptr_(
      factory_wrapper(f, FnpTraits<Factory, f>::name())(std::forward<Args>(args)...),
      int_deleter_wrapper(d, FnpTraits<Deleter, d>::name())) { }

  ResourceType *get() const {
    return ptr_.get();
  }

 private:
  // This is hiding the flexibility of std::unique_ptr from the user of
  // IbResource. Might make more sense to directly ask for the final deleter
  // type when you're creating a concrete IbResource from the IbResource
  // template.
  std::unique_ptr<ResourceType, VoidDeleter<ResourceType>> ptr_;
};
// A function like ibv_device_context_by_name is not present in the ibv library.
// The way it works is you obtain a list of devices, iterate over them, and open
// whichever you chose.
// What do we do? We create our own function, do the resource allocation
// manually, and wrap it in an IbResource similar to the original ibv_* allocators.
// See ib.cc for the implementation.
struct ibv_context *ibv_device_context_by_name_(const char *name);
using IbvDeviceContextByName = IbResource<
                      decltype(&ibv_device_context_by_name_), &ibv_device_context_by_name_,
                      decltype(&ibv_close_device), &ibv_close_device>;
using IbvAllocPd = IbResource<
                      decltype(&ibv_alloc_pd), &ibv_alloc_pd,
                      decltype(&ibv_dealloc_pd), &ibv_dealloc_pd>;
using IbvRegMr = IbResource<
                      decltype(&ibv_reg_mr), &ibv_reg_mr,
                      decltype(&ibv_dereg_mr), &ibv_dereg_mr>;
#endif  // IB_H_

And of course the usage will look like this:

// c-iface/a-template-take/main.cc
#include <infiniband/verbs.h>
#include "ib.h"
int main() {
  IbvDeviceContextByName device_context("mlnx_0");
  IbvAllocPd pd(device_context.get());
  return 0;
}

Are the naming flaw (the fact that, for example, IbvAllocPd is the name of an object that is essentially a struct ibv_pd) and the possibility of a future shotgun surgery because of multiple allocators worth the easy creation of new interfaces? I think it depends on the usage and how deep these classes are expected to go in your class hierarchy. Intuitively, the fact that you’re using a very low-level resource hints that regardless of the interface you choose, it’s a good idea to keep it encapsulated and far away from the logical core of the application. This does make sense in my case, particularly with libibverbs. YMMV.

Here’s your well deserved cat video. Looking for more? Bottom of the page, this blog post.

Tuesday, July 30, 2019

Moving existing multi-node data-intensive applications to Kubernetes on AWS


The title screams many things. One of those things is Cassandra:


a daughter of the Trojan king Priam, who was given the gift of prophecy by Apollo.
When she cheated him, however, he turned this into a curse by causing her prophecies,
though true, to be disbelieved.

Naming that database “Cassandra”, therefore, does not sound very smart in retrospect, but it does sound surprisingly honest to me when I look back at my past experience with it. The latest curse which “the daughter of the Trojan king” has cast upon me is demanding to be moved from a classical deployment on EC2 instances on AWS to a deployment on kubernetes on AWS (EKS). In my case, “existing multi-node data-intensive application” translates to cassandra, and this is an overview of my solution and how I got to it.

Just a cassandra

You can get yourself a vanilla cassandra by using the universal install script:

which suggests:

docker run --rm -it --name cassandra cassandra:3.0.10

Cassandra on kubernetes

So kubernetes is all fun and games until you need to actually persist your data. The lack of “fun and games” translates to writing a few more lines of configuration if you’ve only used kubernetes hosted in third party clouds. Otherwise, it means you start to reconsider your whole kubernetes migration plan. Thankfully, in my current scenario, I have the chance to be naive and abstract persistence headaches away to AWS.

With that in place, you can get yourself a Cassandra cluster in the cloud fairly easily by following the guide in the Kubernetes documentation. You’ll use StatefulSets to name the pods ordinally:

You’ll use PersistentVolumeClaims to ask for persistent storage:

And mount it on the correct path:
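Put together, the relevant pieces look roughly like this (names, image, and sizes are placeholders here rather than the exact manifests from the guide):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.0.10
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra   # where cassandra keeps its data
  # each pod gets its own PVC, named <template-name>-<set-name>-<ordinal>
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
```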

That’s about it. Everything else is mostly application configuration.

An important point here is volume-pod affinity. That means after pod cassandra-2 gets deleted and recreated, it will be handed the same volume, both in the sense that its modifications to the volume will persist between restarts, and in the sense that it won’t be handed a different pod’s volume. This means the application running inside the pod can correctly count on the hostname to stay consistent. There are a few pitfalls though, the most important being the fact that the IP addresses will not necessarily be preserved, so an application instance is not allowed to “think” at a lower level than the cluster DNS when referring to its peers (cassandra-0 talks to cassandra-1.cassandra.default.svc.cluster.local, as opposed to talking to an IP address, as that address won’t be preserved through a delete & restart of cassandra-1).

With that, you have pods running a cassandra cluster, which will tolerate failures on the node (pod) level and can be scaled up easily.

These give us an idea of what the final state of things should look like. Now let’s try to reach this state, starting from the “legacy” setup: three bare EC2 instances running cassandra, with huge loads of data accumulated in each one’s block storage. The terminal state would hopefully be a functional cluster inside kubernetes, with all the data of course. I’m going to refer to the former as “the legacy cluster” and to the latter as “the new cluster”.

Moving to kubernetes: First attempt

The legacy database was under production load, so we didn’t really want to think of adding the new cluster nodes to the previous cluster, replicating the data to the new cluster, and then removing the legacy nodes. We wanted it to be clean and safe.

The first “safe” idea therefore was to use cassandra’s snapshot tool. On the legacy nodes:

Piece of cake, right?


This was transferring data to the new cluster at a rate of about one million bytes every second. Fortunately we had amazing mathematicians on our team, who correctly pointed out that this would take months – a bit too much.

Moving to kubernetes: Second attempt

Well of course, network latency. Everyone knows about network latency. The next idea therefore was copying the data over to the new nodes and performing the loading of cassandra sstables locally from there. Unfortunately, we hit the 1MB/s ceiling again as we tried moving the snapshots to the new server over https. Another attempt was to mount a new empty EBS volume on a legacy instance and copy the snapshot to the empty volume (to be brought over to the new cluster later on, where we would execute sstableloader -d localhost ... to avoid streaming sstables over the network). This attempt also failed miserably, as the data transfer between EBS volumes was taking place as slowly as 1MB/s.

Moving to kubernetes: Third attempt

We took a snapshot of the whole ebs volume and created a new volume from that. Fortunately it finished (within hours). We attached the new volume to the new instances and ran sstableloader. Nope. Same problem.

Moving to kubernetes: Fourth attempt and beyond

It seemed like the only plausible way to do this would be using the exact snapshots in kubernetes, without copying/loading/moving anything. To move a data-intensive application to kubernetes (in the AWS context at least), it probably makes sense to create the cloud environment in a way that the application instance, given the correct data from its legacy state, would just work correctly without any further supervision. At that point, the problem to solve would just be moving the data to the cloud correctly (assuming you have previously solved the networking problem to make your distributed application cloud native), regardless of the actual application being moved. It is important to notice that we’re erasing a problem at a higher level (application) by attacking it from a lower layer in the stack (storage).

To do this, we need to “feed” the replicated legacy volumes to the new cluster through kubernetes PersistentVolumes. Can we trick the StatefulSet controller into using our supplied volume instead of its own? Let’s look at the code:
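The naming logic in question boils down to something like the following; this is a paraphrase of the helper in pkg/controller/statefulset/stateful_set_utils.go rather than the exact upstream source, but the scheme it implements is claim-template-name, set-name, ordinal:

```go
package main

import "fmt"

// getPersistentVolumeClaimName paraphrases the StatefulSet controller's
// naming rule: the PVC a pod looks for is <claimTemplateName>-<setName>-<ordinal>.
func getPersistentVolumeClaimName(setName, claimTemplateName string, ordinal int) string {
	return fmt.Sprintf("%s-%s-%d", claimTemplateName, setName, ordinal)
}

func main() {
	// Pod cassandra-0 of a StatefulSet named "cassandra" with a claim
	// template named "cassandra-data" looks for this PVC:
	fmt.Println(getPersistentVolumeClaimName("cassandra", "cassandra-data", 0))
}
```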

This is how the name of a PVC of an ordinal pod in the StatefulSet is inferred. This function is used later when the controller tries to find the correct volumes to attach to a pod:

Therefore, all the StatefulSet controller knows about the PVCs is their names! That’s great! We can create our own PVCs from the PVs that we have created from the new volumes, and by naming them properly, we’re basically introducing our own manual volume provisioning.

The first step is creating the PersistentVolumes. Here’s a gist:
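A PersistentVolume along these lines, where the volume id and capacity are placeholders standing in for the real ones created from the legacy snapshots:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cassandra-data-volume-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    # the EBS volume created from the legacy snapshot (placeholder id)
    volumeID: vol-0123456789abcdef0
    fsType: ext4
```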

Where we’re providing the volume id of the newly created volume that needs to be mounted on one of the new pods.

After that we need to create a PVC from the above PV. Here’s the important parts:
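A sketch of such a claim, assuming a StatefulSet named cassandra with a volumeClaimTemplate named cassandra-data, so that pod cassandra-0 finds it by name:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # claim-template-name + set-name + ordinal: exactly the name the
  # StatefulSet controller will go looking for.
  name: cassandra-data-cassandra-0
spec:
  accessModes:
  - ReadWriteOnce
  # bind to the pre-created PV from the previous step
  volumeName: cassandra-data-volume-0
  resources:
    requests:
      storage: 100Gi
```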

The name, as we concluded, is of particular importance here.

At this point, you’re surprisingly almost done! You may also want to add particular labels to your custom volumes and corresponding selectors to the PVC template that goes inside the StatefulSet if you want absolute full control over how the storage is handled in a particular StatefulSet.

Sunday, March 3, 2019

Python, min-cost max-flow, and a loose end


There was this particular weighted matching problem that I needed to solve some time ago. The problem specification is not relevant here, but it's got only a few more constraints than a normal maximum weighted matching, and instead of actually thinking, I decided to reduce the whole thing to min-cost flow, similar to what one would do for a normal max matching.
For my actual usage, everything was fun and games. My C++ implementation would give me the answer almost instantly on a synthesized small input and in a few minutes on my actual data.
Now for some reason, I needed to implement the same thing in Python too. You see where this is going, right?
This is basically my status report while trying to understand why the exact same code runs 1000s of times (not kidding) slower in Python.

Some background

I used the augmenting path algorithm for finding a min-cost flow in the graph. On the very high level, it looks like this:
While we haven't reached the desired flow:
    find the cheapest path from source to sink
    send flow along the path
Your algorithm for the shortest path subtask has to support negative edges (or you can use potential functions to handle those separately). I used the D'Esopo-Pape algorithm there.
On my small input, the C++ version takes about 0.1s and the Python version takes about 1000s.
Here’s a part of the Python version:
And the shortest path subroutine:
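Both routines together, as a minimal sketch (this is an illustration, not the original code; the variable names d, p, m, q and the attributes adj, capacity, cost match the ones that show up in the disassembly later in the post):

```python
from collections import deque

class Graph:
    def __init__(self, n):
        self.n = n
        self.adj = [[] for _ in range(n)]            # adjacency lists
        self.capacity = [[0] * n for _ in range(n)]  # residual capacities
        self.cost = [[0] * n for _ in range(n)]      # per-unit edge costs

    def add_edge(self, u, v, cap, cost):
        # one matrix cell per ordered pair, so no parallel edges
        self.adj[u].append(v)
        self.adj[v].append(u)
        self.capacity[u][v] = cap
        self.cost[u][v] = cost
        self.cost[v][u] = -cost  # residual (reverse) edge

    def shortest_paths_from(self, s):
        # D'Esopo-Pape: m[v] is 2 if v was never queued,
        # 1 if currently queued, 0 if it has been dequeued before
        INF = float("inf")
        d = [INF] * self.n
        p = [-1] * self.n
        m = [2] * self.n
        d[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            m[u] = 0
            for v in self.adj[u]:
                if self.capacity[u][v] > 0 and d[v] > d[u] + self.cost[u][v]:
                    d[v] = d[u] + self.cost[u][v]
                    p[v] = u
                    if m[v] == 2:      # never seen before: push to the back
                        m[v] = 1
                        q.append(v)
                    elif m[v] == 0:    # seen before: push to the front
                        m[v] = 1
                        q.appendleft(v)
        return d, p

    def min_cost_flow(self, s, t, desired_flow):
        flow = total_cost = 0
        while flow < desired_flow:
            d, p = self.shortest_paths_from(s)
            if d[t] == float("inf"):
                break  # no augmenting path left
            # bottleneck capacity along the cheapest path
            push = desired_flow - flow
            v = t
            while v != s:
                push = min(push, self.capacity[p[v]][v])
                v = p[v]
            # send the flow along the path
            v = t
            while v != s:
                self.capacity[p[v]][v] -= push
                self.capacity[v][p[v]] += push
                v = p[v]
            flow += push
            total_cost += push * d[t]
        return flow, total_cost
```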
And I assure you that the C++ version is essentially the same code, only written in C++.

Show me the slow

Well, I would guess that shortest_paths_from is the suspect here and accounts for most of the total running time. Let's verify that. I wrote this decorator to measure the time a piece of code takes to run:
Let’s skip the cryptic syntax and see the usage:
And you can see the results with:
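All three pieces can be sketched like this (the accumulator, the stub function, and the report format are assumptions, not the original code):

```python
import time
from collections import defaultdict
from functools import wraps

# accumulated running time per function name (hypothetical accumulator)
totals = defaultdict(float)

def timed(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return f(*args, **kwargs)
        finally:
            totals[f.__name__] += time.perf_counter() - start
    return wrapper

# usage: decorate the suspect...
@timed
def shortest_paths_from_stub():
    time.sleep(0.01)  # stand-in for the real work

shortest_paths_from_stub()

# ...and report at the end
def report():
    for name, seconds in totals.items():
        print(f"{name} took {seconds:.3f} seconds")

report()
```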
You can pull off lots of ideas in the decorator, including saving the different function params/args (excuse my illiteracy for not knowing which one to use when we're in the "middle" of passing the values to the function) along with their corresponding running times. (Fun fact: this actually makes sense here. The bipartite graph is super "simple" in the beginning. As we send more flow, the shortest path problem becomes "harder", that is, it involves more edges in the residual graph, and this impacts D'Esopo-Pape greatly)
This is pretty handy as you can measure time on a finer grain (than functions) too. For example, to measure the time it takes to manipulate the paths and apply the flow inside the while loop in min_cost_flow, we can indent the relevant lines into a local closure, capture the local variables, and measure the time it takes for the closure to run:
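The closure trick, sketched (hypothetical names; path_stuff stands for the indented path-manipulation lines, and the surrounding function is a skeleton, not the real min_cost_flow):

```python
import time

timings = {}

def min_cost_flow_skeleton(rounds):
    total = 0

    # the relevant lines, indented into a local closure; it captures
    # `total` (and, in the real code, the graph state) from the outer scope
    def path_stuff():
        nonlocal total
        total += sum(range(1000))

    start = time.perf_counter()
    for _ in range(rounds):
        path_stuff()
    timings["path_stuff"] = time.perf_counter() - start
    return total

print(min_cost_flow_skeleton(100))
```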
Pop quiz: Do you see why you should be careful when using this (or any method of measuring time) when functions call each other recursively?
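A hint, as a sketch (hypothetical accumulating decorator, same shape as the one above): when a timed function calls another timed function, the inner call's time lands in both totals.

```python
import time
from functools import wraps

totals = {}

def timed(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return f(*args, **kwargs)
        finally:
            totals[f.__name__] = totals.get(f.__name__, 0.0) + (time.perf_counter() - start)
    return wrapper

@timed
def inner():
    time.sleep(0.02)

@timed
def outer():
    inner()  # inner's elapsed time is also counted inside outer's total

outer()
# totals["outer"] already contains all of totals["inner"], so summing
# the entries overstates the real wall-clock time; with recursion the
# same function double-counts itself
```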
Anyway, here's the result in our case (for a VERY small input):
shortest_paths_from took 1.428 seconds
min_cost_flow took 1.431 seconds
I’ve heard deques are slow:
deque operations took 0.012 seconds
shortest_paths_from took 1.800 seconds
path_stuff took 0.000 seconds
min_cost_flow took 1.804 seconds
Well, not really. In hindsight though, this is obvious. Each operation on the queue may be expensive, but their count is way less than the number of times we just check whether or not we can update the shortest path using an edge (the relaxation test in the inner loop).
This proved to be the case when I looked at the number of times this check is executed on average, and simulated the process with a deterministic number of iterations and a few memory accesses:
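Something like the following (a stand-in sketch: a deterministic loop doing a few indexed reads and a comparison per iteration, mimicking the relaxation check; the iteration count is a placeholder, not my measured number):

```python
import time

def simulate(iterations, n=100):
    # mimic the inner-loop check: a few list lookups, a compare,
    # and (rarely) a write
    capacity = [[1] * n for _ in range(n)]
    cost = [[1] * n for _ in range(n)]
    d = [10] * n
    start = time.perf_counter()
    for i in range(iterations):
        u = i % n
        v = (i * 7 + 1) % n
        if capacity[u][v] > 0 and d[v] > d[u] + cost[u][v]:
            d[v] = d[u] + cost[u][v]
    return time.perf_counter() - start

print(simulate(1_000_000))
```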
The running time of this little piece of code alone is almost equal to that of the whole min-cost max-flow program.
To understand what’s happening here, I decided to look at the byte code of the original function:
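CPython's standard dis module does the disassembly. For instance, on a stripped-down stand-in for the inner loop (not the original function):

```python
import dis

def relax(adj, capacity, cost, d, p, u):
    # a minimal function with the same shape as the inner loop
    for v in adj[u]:
        if capacity[u][v] > 0 and d[v] > d[u] + cost[u][v]:
            d[v] = d[u] + cost[u][v]
            p[v] = u

dis.dis(relax)  # prints the bytecode, one instruction per line
```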
Here’s the important part:

 37          94 SETUP_LOOP             162 (to 258)
             96 LOAD_FAST                0 (self)
             98 LOAD_ATTR                7 (adj)
            100 LOAD_FAST                6 (u)
            102 BINARY_SUBSCR
            104 GET_ITER
        >>  106 FOR_ITER               148 (to 256)
            108 STORE_FAST               7 (v)

 38         110 LOAD_FAST                0 (self)
            112 LOAD_ATTR                8 (capacity)
            114 LOAD_FAST                6 (u)
            116 BINARY_SUBSCR
            118 LOAD_FAST                7 (v)
            120 BINARY_SUBSCR
            122 LOAD_CONST               2 (0)
            124 COMPARE_OP               4 (>)
            126 POP_JUMP_IF_FALSE      106

 39         128 LOAD_FAST                2 (d)
            130 LOAD_FAST                7 (v)
            132 BINARY_SUBSCR
            134 LOAD_FAST                2 (d)
            136 LOAD_FAST                6 (u)
            138 BINARY_SUBSCR
            140 LOAD_FAST                0 (self)
            142 LOAD_ATTR                9 (cost)
            144 LOAD_FAST                6 (u)
            146 BINARY_SUBSCR
            148 LOAD_FAST                7 (v)
            150 BINARY_SUBSCR
            152 BINARY_ADD
            154 COMPARE_OP               4 (>)
            156 POP_JUMP_IF_FALSE      106

 40         158 LOAD_FAST                2 (d)
            160 LOAD_FAST                6 (u)
            162 BINARY_SUBSCR
            164 LOAD_FAST                0 (self)
            166 LOAD_ATTR                9 (cost)
            168 LOAD_FAST                6 (u)
            170 BINARY_SUBSCR
            172 LOAD_FAST                7 (v)
            174 BINARY_SUBSCR
            176 BINARY_ADD
            178 LOAD_FAST                2 (d)
            180 LOAD_FAST                7 (v)
            182 STORE_SUBSCR

 41         184 LOAD_FAST                6 (u)
            186 LOAD_FAST                3 (p)
            188 LOAD_FAST                7 (v)
            190 STORE_SUBSCR

 42         192 LOAD_FAST                4 (m)
            194 LOAD_FAST                7 (v)
            196 BINARY_SUBSCR
            198 LOAD_CONST               3 (2)
            200 COMPARE_OP               2 (==)
            202 POP_JUMP_IF_FALSE      224

 43         204 LOAD_CONST               4 (1)
            206 LOAD_FAST                4 (m)
            208 LOAD_FAST                7 (v)
            210 STORE_SUBSCR

 44         212 LOAD_FAST                5 (q)
            214 LOAD_METHOD              4 (append)
            216 LOAD_FAST                7 (v)
            218 CALL_METHOD              1
            220 POP_TOP
            222 JUMP_ABSOLUTE          106

 45     >>  224 LOAD_FAST                4 (m)
            226 LOAD_FAST                7 (v)
            228 BINARY_SUBSCR
            230 LOAD_CONST               2 (0)
            232 COMPARE_OP               2 (==)
            234 POP_JUMP_IF_FALSE      106

 46         236 LOAD_CONST               4 (1)
            238 LOAD_FAST                4 (m)
            240 LOAD_FAST                7 (v)
            242 STORE_SUBSCR

 47         244 LOAD_FAST                5 (q)
            246 LOAD_METHOD             10 (appendleft)
            248 LOAD_FAST                7 (v)
            250 CALL_METHOD              1
            252 POP_TOP
            254 JUMP_ABSOLUTE          106
        >>  256 POP_BLOCK
        >>  258 JUMP_ABSOLUTE           64
        >>  260 POP_BLOCK

All I really saw at first glance was the number of LOAD operations. I'm sure you now agree, with a completely unbiased opinion.
Now can we reduce the number of LOADs? Turns out we can. If we start thinking like the compiler, and keep an open mind about losing code beauty while accounting for the shortcomings of the actual compiler, we can see that there are a few extra LOADs, as the variable u is invariant throughout the inner loop. Therefore we can change this
To this
extracting out the duplicate references.
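The change, sketched on a stand-in class (relax_before/relax_after are my illustrative names; the before version re-reads self.capacity[u], self.cost[u], and d[u] on every check, the after version hoists those loop-invariant lookups into locals):

```python
class Graph:
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]
        self.capacity = [[0] * n for _ in range(n)]
        self.cost = [[0] * n for _ in range(n)]

    def relax_before(self, u, d, p):
        # every check costs several LOAD_ATTR/BINARY_SUBSCR pairs
        for v in self.adj[u]:
            if self.capacity[u][v] > 0 and d[v] > d[u] + self.cost[u][v]:
                d[v] = d[u] + self.cost[u][v]
                p[v] = u

    def relax_after(self, u, d, p):
        # hoist the u-indexed lookups; this is safe as long as d[u]
        # cannot change inside the loop (no negative-cost self-loops)
        adj_u = self.adj[u]
        capacity_u = self.capacity[u]
        cost_u = self.cost[u]
        d_u = d[u]
        for v in adj_u:
            if capacity_u[v] > 0 and d[v] > d_u + cost_u[v]:
                d[v] = d_u + cost_u[v]
                p[v] = u
```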
Let’s see the results:
real    0m0.790s
user    0m0.776s
sys     0m0.010s
Unreal! The compiler hadn't been doing any of these things. That's about a 2x performance improvement over the original code (about 1.5 seconds) on the small synthesized test case.
On the bigger test case, this improves the running time from ~15 minutes to ~7 minutes.

Bottom line

Unfortunately, I did lots more without anything to show for it. Almost everything else that I tried (from using C native types to fooling around with how I stored things) resulted in degraded performance.
I ended up achieving the performance I was looking for by using another algorithm, but the question of why this (still) runs thousands of times slower than the C++ version remains open for me.