Tuesday, July 30, 2019

Moving existing multi-node data-intensive applications to Kubernetes on AWS

Moving existing multi-node data-intensive applications to Kubernetes on AWS

The title screams many things. One of those things is Cassandra:


a daughter of the Trojan king Priam, who was given the gift of prophecy by Apollo.
When she cheated him, however, he turned this into a curse by causing her prophecies,
though true, to be disbelieved.

Naming that database “Cassandra” therefore, does not sound very smart in retrospect, but does sound surprisingly honest to me when I look back at my past experience with it. The latest curse which “The daughter of the Trojan king” has cast upon me is demanding to be moved from a classical deployment on EC2 Instances on AWS to deployment on kubernetes on AWS (eks). In my case “existing multi-node data-intensive application” translates to cassandra and this is an overview of my solution and how I got to it.

Just a cassandra

You can get yourself a vanilla cassandra by using the universal install script:

which suggests:

docker run --rm -it --name cassandra cassandra:3.0.10

Cassandra on kubernetes

So kubernetes is all fun and games until you need to actually persist your data. The lack of “fun and games” translates to writing a few more lines of configuration if you’ve only used kubernetes hosted in third party clouds. Otherwise, it means you start to reconsider your whole kubernetes migration plan. Thankfully, in my current scenario, I have the chance to be naive and abstract persistence headaches away to AWS.

With that in place, you can get yourself a Cassandra cluster in the cloud fairly easily following the guide in Kuberentes documents. You’ll use StatefulSets to name the pods ordinally:

You’ll use PersistentVolumeClaims to ask for persistent storage:

And mount it on the correct path:

That’s about it. Everything else is mostly application configuration.

An important point here is volume-pod affinity. That means after pod cassandra-2 gets deleted and recreated, it will be passed the same volume, both in the sense that its modifications on the volume will be persisted between restarts, and it won’t be handed a different pod’s volume. That means the application running inside the pod can correctly count on the hostname to stay consistent. There are a few pitfalls though. The most important being the fact that the IP addresses will not necessarily be preserved and an application instance is therefore not allowed to “think” in a lower level than the Cluster DNS when referring to its peers (cassandra-0 talks to cassandra-1.cassandra.default.svc.cluster.local, as opposed to talking to, as this ip address won’t be preserved through delete & restart of cassandra-1).

With that, you have pods running a cassandra cluster, which will tolerate failures on the node (pod) level and can be scaled up easily.

These give us an idea of what would the final state of things should look like. Now let’s try to reach this state, starting from the “legacy” setup: Three Bare EC2 Instances running cassandra with huge loads of data accumulated in each one’s block storage. The terminal state would hopefully be a functional cluster inside kubernetes, with all the data of course. I’m going to refer to the former as “the legacy cluster” and to the latter as “the new cluster”.

Moving to kubernetes: First attempt

The legacy database was under production load, so we didn’t really want to think of adding the new cluster nodes to the previous cluster, replicating the data to the new cluster, and then removing the legacy nodes. We wanted it to be clean and safe.

The first “safe” idea therefore was to use cassandra’s snapshot tool. On the legacy nodes:

Piece of cake, right?


This was transferring data to the new cluster with the rate of about One Million Bytes every second. Fortunately we had amazing mathematicians in our team, who correctly pointed out that this would take months – a bit too much.

Moving to kubernetes: Second attempt

Well of course, network latency. Everyone knows about network latency. The next idea therefore was copying the data over to the new nodes and perform the loading of cassandra sstables from there. Unfortunately, we hit the 1MB/S again as we tried moving the snapshots to the new server over https. Another attempt was to mount a new empty ebs volume on a legacy instance and to copy the snapshot to the empty volume (to be brought over to the new cluster later on where we would execute sstableloader -d localhost ... to try to escape sstable over network). This attempt also failed miserably as the data transfer between ebs volumes was taking place as slow as 1MB/s.

Moving to kubernetes: Third attempt

We took a snapshot of the whole ebs volume and created a new volume from that. Fortunately it finished (within hours). We attached the new volume to the new instances and ran sstableloader. Nope. Same problem.

Moving to kubernetes: Fourth attempt and beyond

It seemed like the only plausible way to this would be using the exact snapshots in kubernetes without copying/loading/moving anything: To move a data-intensive application to kubernetes (in the aws context at least), it probably makes sense to create the cloud environment in a way that the application instance, given the correct data from its legacy state, would just work correctly without any further supervision. At that point, the problem to solve would just be moving the data to the cloud correctly (assuming you have previously solved the networking problem to make your distributed application cloud native), regardless of the actual application that is being moved. It is important to notice that we’re actually erasing a problem in a higher level (application) by attacking from a lower layer in the stack (storage).

To do this, we need to “feed” the replicated legacy volumes to the new cluster through kubernetes PersistentVolumes. Can we trick the StatefulSet controller into using our supplied volume instead of its own? Let’s look at the code:

This is how the name of a PVC of an ordinal pod in the StatefulSet is inferred. This function is used later when the controller tries to find the correct volumes to attach to a pod:

Therefore all the StatefulSet controller knows about the PVCs is names! That’s great! We can create our own PVCs from the PVs that we have created from the new volumes, and by naming the properly, we’re basically introducing our manual volume provisioning.

The first step is creating the PersistentVolumes. Here’s a gist:

Where we’re providing the volume id of the newly created volume that needs to be mounted on one of the new pods.

After that we need to create a PVC from the above PV. Here’s the important parts:

The name, as we concluded, is of particular importance here.

At this point, you’re surprisingly almost done! You may also want to add particular labels to your custom volumes and corresponding selectors to the PVC template that goes inside the StatefulSet if you want absolute full control over how the storage is handled in a particular StatefulSet.

Sunday, March 3, 2019

Python, min-cost max-flow, and a loose end


There was this particular weighted matching problem that I needed to solve some time ago. The problem specification is not relevant here but its got only a few more constraints than a normal maximum weighted matching and instead of actually thinking, I decided to reduce the whole thing to min-cost flow, similar to what one would do for a normal max matching.
For my actual usage, everything was fun and games. My C++ implementation would give me the answer almost instantly on a synthesized small input and in a few minutes on my actual data.
Now for some reason, I needed to implement the same thing in Python too. You see where this is going, right?
This is basically my status report while trying to understand why the exact same code Runs 1000s (not kidding) of times slower in Python.

Some background

I used the augmenting path algorithm for finding a min-cost flow in the graph. On the very high level, it looks like this:
While we haven't reached the desired flow:
    find the cheapest path from source to sink
    send flow along the path
Your algorithm for the shortest path subtask has to support negative edges (or you can use potential functions to handle those separately). I used D´Esopo-Pape algorithm there.
On my small input, the C++ version takes about 0.1s and the Python version takes about 1000s.
Here’s a part of the Python version:
And the shortest path subroutine:
And I assure you that the C++ version is essentially the same code, only written in C++.

Show me the slow

Well, I would guess that the shortest_paths_from is the suspect here and has the most share in the total running time. Let’s verify that. I wrote this decorator to measure the time a piece of code takes to run:
Let’s skip the cryptic syntax and see the usage:
And you can see the results with:
You can pull off lots of ideas in the decorator, including saving different function params/args (excuse my illiteracy for not knowing which one to use when we’re in the “middle” of passing the values to the function) along with their corresponding running times. (Fun fact: This actually makes sense here. The bipartite graph is super “simple” in the beginning. This causes the shortest path problem to become “harder”, that is, to involve more edges in the residual graph as we send more flow, and this impacts D´Esopo-Pape greatly)
This is pretty handy as you can measure time on a finer grain (than functions) too. For example to measure the time it takes to manipulate the paths and apply the flow inside the while loop in min_cost_flow, we can indent the relevant lines into a local closure, capture the local variables, and measure the time it takes for the closure run:
Pop quiz: Do you see why you should be careful when using this (or any method of measuring time) when functions call each other recursively?
Anyway, Here’s the result in our case (for a VERY small input):
shortest_paths_from took 1.428 seconds
min_cost_flow took 1.431 seconds
I’ve heard deques are slow:
deque operations took 0.012 seconds
shortest_paths_from took 1.800 seconds
path_stuff took 0.000 seconds
min_cost_flow took 1.804 seconds
Well, not really. In hindsight though, this is obvious. Each operation on the queue may be expensive, but their count is way less than the number of times we just check weather or not we can update the shortest path using an edge:
This is proved to be the case when I looked at the number of times this is executed on average and simulated the process with deterministic number of iterations and a few memory accesses:
The running time for this little piece of code is almost equal to the whole min-cost max-flow algorithm program.
To understand what’s happening here, I decided to look at the byte code of the original function:
Here’s the important part:

 37          94 SETUP_LOOP             162 (to 258)
             96 LOAD_FAST                0 (self)
             98 LOAD_ATTR                7 (adj)
            100 LOAD_FAST                6 (u)
            102 BINARY_SUBSCR
            104 GET_ITER
        >>  106 FOR_ITER               148 (to 256)
            108 STORE_FAST               7 (v)

 38         110 LOAD_FAST                0 (self)
            112 LOAD_ATTR                8 (capacity)
            114 LOAD_FAST                6 (u)
            116 BINARY_SUBSCR
            118 LOAD_FAST                7 (v)
            120 BINARY_SUBSCR
            122 LOAD_CONST               2 (0)
            124 COMPARE_OP               4 (>)
            126 POP_JUMP_IF_FALSE      106

 39         128 LOAD_FAST                2 (d)
            130 LOAD_FAST                7 (v)
            132 BINARY_SUBSCR
            134 LOAD_FAST                2 (d)
            136 LOAD_FAST                6 (u)
            138 BINARY_SUBSCR
            140 LOAD_FAST                0 (self)
            142 LOAD_ATTR                9 (cost)
            144 LOAD_FAST                6 (u)
            146 BINARY_SUBSCR
            148 LOAD_FAST                7 (v)
            150 BINARY_SUBSCR
            152 BINARY_ADD
            154 COMPARE_OP               4 (>)
            156 POP_JUMP_IF_FALSE      106

 40         158 LOAD_FAST                2 (d)
            160 LOAD_FAST                6 (u)
            162 BINARY_SUBSCR
            164 LOAD_FAST                0 (self)
            166 LOAD_ATTR                9 (cost)
            168 LOAD_FAST                6 (u)
            170 BINARY_SUBSCR
            172 LOAD_FAST                7 (v)
            174 BINARY_SUBSCR
            176 BINARY_ADD
            178 LOAD_FAST                2 (d)
            180 LOAD_FAST                7 (v)
            182 STORE_SUBSCR

 41         184 LOAD_FAST                6 (u)
            186 LOAD_FAST                3 (p)
            188 LOAD_FAST                7 (v)
            190 STORE_SUBSCR

 42         192 LOAD_FAST                4 (m)
            194 LOAD_FAST                7 (v)
            196 BINARY_SUBSCR
            198 LOAD_CONST               3 (2)
            200 COMPARE_OP               2 (==)
            202 POP_JUMP_IF_FALSE      224

 43         204 LOAD_CONST               4 (1)
            206 LOAD_FAST                4 (m)
            208 LOAD_FAST                7 (v)
            210 STORE_SUBSCR

 44         212 LOAD_FAST                5 (q)
            214 LOAD_METHOD              4 (append)
            216 LOAD_FAST                7 (v)
            218 CALL_METHOD              1
            220 POP_TOP
            222 JUMP_ABSOLUTE          106

 45     >>  224 LOAD_FAST                4 (m)
            226 LOAD_FAST                7 (v)
            228 BINARY_SUBSCR
            230 LOAD_CONST               2 (0)
            232 COMPARE_OP               2 (==)
            234 POP_JUMP_IF_FALSE      106

 46         236 LOAD_CONST               4 (1)
            238 LOAD_FAST                4 (m)
            240 LOAD_FAST                7 (v)
            242 STORE_SUBSCR

 47         244 LOAD_FAST                5 (q)
            246 LOAD_METHOD             10 (appendleft)
            248 LOAD_FAST                7 (v)
            250 CALL_METHOD              1
            252 POP_TOP
            254 JUMP_ABSOLUTE          106
        >>  256 POP_BLOCK
        >>  258 JUMP_ABSOLUTE           64
        >>  260 POP_BLOCK

All I really saw at the first glance was the number of LOAD operations. I’m sure you now agree with a completely unbiased opinion.
Now can we reduce the number of LOADs? Turns out we can. If we start thinking like the compiler and keep an open mind about losing code beauty while accounting for the shortcomings of the actual compiler, we can see that there are a few extra LOADs as the variable u is invariant throughout the inner loop. Therefore we can change this
To this
extracting out the duplicate references.
Let’s see the results:
real    0m0.790s
user    0m0.776s
sys     0m0.010s
Unreal! The compiler hasn’t been doing any of these things. That’s about a 2x improvement in performance in the small synthesized test case from the original code (about 1.5 seconds).
On the bigger test case, this improves the running time from ~15 minutes to ~7 minutes.

Bottom line

Unfortunately I did lots more without anything to show for it. Almost everything else that I did (from using c native types to fooling around with how I stored things) resulted in degrading the performance.
I ended up achieving the performance that I was looking for by using another algorithm, but the question of why this (still) runs thousands of times slower than the cpp version remains open for me.

Friday, January 25, 2019

Simple, stupid, effective tests in go


This is not about fancy TDD practices, what test frameworks you must use, or why [insert your favorite oldish (as in 2010s) jargon] cannot answer the exceeding amount of sophistication in our new software and [insert your favorite 2020s acronym] is the new shit.

This is about how we can overcome simple, yet annoying problems that come up in the process of writing code by writing tests.
I’m gonna talk about go here. Probably none of these can be applied to any other language/environment directly because of the differences in the language features, best practices, and/or conventions. However, I believe the analogy is a reusable one and therefore more general than just golang since I have caught myself thinking about the same practices in other situations.
Throughout this post I’ll sometimes refer to a hypothetical caching interface defined as:
in the package github.com/someone/cache. (No it does not actually exist [as of this writing])

1. Test struct “constructors”

The first thing that we write in a new .go file is often a struct. Suppose we’re going to implement the above interface on top of a redis backend. Let’s call the implementing struct RedisCacheService. Let’s start off by writing a test:
file: cache/redis/service_test.go
file: cache/redis/service.go
This is a very simple test. It may even look too simple since there is not even an assertion in the body. That’s because the compiler can do all the necessary work here by checking the function signature.
Writing this seemingly trivial test against the NewRedisCacheService(string, string) function not only allows you to advance src_test.go and src.go together (in that order) from early on in a full TDD manner, but also allows you to think about the usage of the “ctor” before actually writing it. Its easy to stay with an obviously problematic “ctor” interface as a result of justifying it in hindsight after you’ve written it down.
Also regarding the general testing philosophy, you may not be the actual user of NewRedisCacheService and therefore may have no way of finding out about breaking changes to this interface when you recompile.
This could also be library code and it could well be the case that you’re testing each functionality of the CacheService by manually initializing the RedisCacheService struct (as opposed to calling NewRedisCacheService) for example to mock some resource internal to RedisCacheService to which you don’t have access through the “ctor”.
The consistency of the ctor is easy to overlook when you’re writing tests, since it’s not an actual “functionality” of a defined interface, and it is obviously very destructive if an unwanted change is introduced.

2. Test interface implementations

Your implementation of the CacheService needs to be correct in terms of language semantics: The struct needs to define the methods specified by the interface.
This is also hard to keep track of as there might not always be explicit calls to every interface functions or pieces of code to “cast” up from structs to interfaces for the implementations of the functions to be checked. Also in the early developments of a new module, you would risk pushing semantically incorrect code to the upstream since there could be no usage of the new code yet.
Let’s keep things simple:
Add to file: cache/redis/service.go Again, there’s no need to assert anything. The compiler is doing all the work by checking the whether or not RedisCacheService implements cache.CacheService. Note how we kept it as minimal as possible by not calling the ctor and creating the object directly.

3. Test generated assets: mocks

When you specify the interfaces through which your program modules are supposed to talk to each other in the top level and only communicate through those interfaces, interesting stuff tends to happen. Here’s an example from digest:
A particularly pleasant thing is that you can generate all (well, except third parties) the mocks that you would need (you only communicate through these, remember?) from this file.
The weird thing about auto generated stuff however is that you don’t know when they get out dated, and I don’t think you should. Let’s write a test to handle this.
I tend to write this kind of test in a file right next to the services file. In digest the file is named digest.go in the root, and therefore:
file: digest_test.go
Now everytime you run your tests (speaking of which, I really suggest making use of goconvey which runs them automatically for you), you’re going to be notified of a possible change in your interfaces’ api that you had forgot to reflect in your mocks by regenrating them.

4. Test generated assets: other things

I really liked go-bindata and still find it quite useful eventhough its been archived. However a problem similar to that of mocks is waiting to hunt you down if you use generated/bundled static assets through language apis. Forgetting to regenrate your bindings can cause problems that are hard to track down and waste some time that we might as well spend watching cat videos on youtube.
This one came really handy in shopapi where I was using graphql. graph-gophers/graphql-go nicely verifies your code against your schema but as I was using go-bindata to feed my schema to graphql-go, I would repeatedly forget to regenerate the static assets, which basically made the library unaware of my new changes to the graphql schema. The solution, as you might guess, is pretty simple:
Now let’s watch some cat videos.

Thursday, January 17, 2019

Changes to my notes in google docs sent to my mailbox


Organizing information. I tend to write stuff down a lot. Things that I learn, books I read, etc. and I found out that paper does not allow me to Ctrl-F (or / or Ctrl-S or Ctrl-R. Sorry I don’t ^W) and is hard to carry around. It’s not easy to replicate paper and even worse: I have observed that when you edit one paper copy, the content on the other copies stays the same.
My solution was to write all of my notes in google docs, and when I started doing so I naturally started to complicate what I needed from my digital pen and paper: I hardly ever find the time to go over my previous notes, so why not make it a little easier to read every piece of note that I take at least once?
My solution? I thought that it would be convenient to receive an email containing my newest notes every night.


At first I was planning to wrap this into a web app that everyone could use, but I remembered how I think of app developers/users who require/give unnecessary access to personal data, and I decided not to create another information honeypot.
The first roadblock was that google drive api does not allow you to retrieve google docs revisions. That would leave me to either generate the diffs myself, or program to a browser engine to acquire the revisions through the UI (similar to this unbelievably amazing piece of code). I lack superpowers so I chose to go with the former. (But to save my ego I’m going to pretend that I did that to make the solution more versatile)
To access the drive api you have to use oauth2 and therefore need to go through the somewhat complex and slow process of creating a project in your google developer dashboard and enable the drive api key for it. You’ll obtain a credentials.json that contains:
So keep it safe.
You can then download (includes waiting for google drive go api bindings to download the whole universe to satisfy its dependencies), compile and install digest by doing:
After which given that $GOPATH/bin is in your $PATH, you’ll be able to execute digest --help.
Remember that credentials.json file? put it here:
Your first execution will probably look something like:
where $GOOGLE_DRIVE_FOLDER_NAME is the name of the folder which contains all of your notes and preferably nothing else. (If you are organized (read: obsessive) enough to write everything in Google docs, and more importantly are willing to review what you write in a systematic manner, I assume that you have 10+ levels of nesting in your documents structure. Just use the name of the top level folder).
Also viper allows you to seamlessly accept environment variables for the flags:
Which means $DIG_SMTP_PASS conveniently can be set as an environment variable instead of passing in --smtp-pass=yourpass.
And that’s it. This first execution takes the first snapshot, and does not yet send you an email. Your subsequent usages will probably look like this:
This will send you an email containing the diff with the last snapshot. A snapshot is taken (and put in ~/.digest/data/) whenever digest is successfully run. If no difference is found between two consecutive snapshots, you’ll enjoy some free panic in your email:

And the content of the email gets a little bit more interesting when you’ve actually applied some changes to your docs:
I issue the $ digest command on my pc everyday, just before I call it a night and read the emailed diff on my phone on my way back home. However if you only hang out with the very cool kids you may already be thinking about:
Feed it to the cron engine on your ser… who am I kidding. If you’re into that kind of stuff you already know how to work your magic.
Anyway, if digest fits in your routine, do tell me about it! Specially if it involves crazy automation scenarios (Bonus points if you manage to trigger it with your office door lock).
UPD: Again if the following condition is true about you, you won’t need no mortal’s guidance but if you’re a systemd freak, you may want to use hooks to digest your notes before a shutdown/hibernate. Hint: DefaultDependencies, oneshot.


Here’s part of the project directory structure:
├── cmd
│   └── digest
│       ├── digest.go
│       └── main.go
├── diff
│   ├── diff.go
├── drive
│   ├── auth.go
│   ├── drive.go
├── smtp
│   ├── smtp.go
I have implemented the following services in their respective directories and have hooked everything up together (a little uglier than usual I must say) in cmd/digest.
I have used go modules for maintaining the dependencies of this project. If you haven’t heard of them I suggest starting at Russ Cox’s talk on the subject.