##
*Update: *The code is here https://github.com/farnasirim/max-flow-optimization.

*Update:*The code is here https://github.com/farnasirim/max-flow-optimization.

## What

There was this particular weighted matching problem that I needed to solve some time ago. The problem specification is not relevant here but its got only a few more constraints than a normal maximum weighted matching and instead of actually thinking, I decided to reduce the whole thing to min-cost flow, similar to what one would do for a normal max matching.For my actual usage, everything was fun and games. My C++ implementation would give me the answer almost instantly on a synthesized small input and in a few minutes on my actual data.

Now for

*some*reason, I needed to implement the same thing in Python too. You see where this is going, right?

This is basically my status report while trying to understand why the exact same code Runs 1000s (not kidding) of times slower in Python.

## Some background

I used the augmenting path algorithm for finding a min-cost flow in the graph. On the very high level, it looks like this:```
While we haven't reached the desired flow:
find the cheapest path from source to sink
send flow along the path
```

Your algorithm for the shortest path subtask has to support negative edges (or you can use potential functions to handle those separately). I used D´Esopo-Pape algorithm there.On my small input, the C++ version takes about

`0.1s`

and the Python version takes about `1000s`

.Here’s a part of the Python version:

```
def min_cost_flow(self, desired_flow, src, sink):
# Initialization
self.adj = [[]] * self.n
self.cost = [[0] * self.n for i in range(self.n)]
self.capacity = [[0] * self.n for i in range(self.n)]
# Some boring init code omitted...
flow, cost = 0, 0
d, p = [], []
while flow < desired_flow:
d, p = self.shortest_paths_from(src)
f = desired_flow - flow
cur = sink
while cur != src:
f = min(f, self.capacity[p[cur]][cur])
cur = p[cur]
# Send the flow
flow += f
cost += f * d[sink]
cur = sink
while cur != src:
self.capacity[p[cur]][cur] -= f
self.capacity[cur][p[cur]] += f
cur = p[cur]
return flow, cost;
```

```
def shortest_paths_from(self, v0):
d = [INF] * self.n
p = [-1] * self.n
d[v0] = 0
m = [2] * self.n
q = collections.deque()
q.append(v0)
while len(q) > 0:
u = q.popleft()
m[u] = 0;
for v in self.adj[u]:
if self.capacity[u][v] > 0 and \
d[v] > d[u] + self.cost[u][v]:
d[v] = d[u] + self.cost[u][v]
p[v] = u
if m[v] == 2:
m[v] = 1
q.append(v)
elif m[v] == 0:
m[v] = 1
q.appendleft(v)
return d, p
```

## Show me the slow

Well, I would guess that the`shortest_paths_from`

is the suspect here and has the most share in the total running time. Let’s verify that. I wrote this decorator to measure the time a piece of code takes to run:```
global_info = collections.Counter()
def measure_time_by_key(time_key):
def func_decorator(func):
def decorator(*args, **kwargs):
st = time.process_time_ns()
return_value = func(*args, **kwargs)
fin = time.process_time_ns()
global_info[time_key] += fin - st
return return_value
return decorator
return func_decorator
```

```
@measure_time_by_key("min_cost_flow")
def min_cost_flow(self, desired_flow, src, sink):
...
@measure_time_by_key("shortest_paths_from")
def shortest_paths_from(self, v0):
...
```

You can pull off lots of ideas in the decorator, including saving different function params/args (excuse my illiteracy for not knowing which one to use when we’re in the “middle” of passing the values to the function) along with their corresponding running times. (Fun fact: This actually makes sense here. The bipartite graph is super “simple” in the beginning. This causes the shortest path problem to become “harder”, that is, to involve more edges in the residual graph as we send more flow, and this impacts D´Esopo-Pape greatly)

This is pretty handy as you can measure time on a finer grain (than functions) too. For example to measure the time it takes to manipulate the paths and apply the flow inside the

`while`

loop in `min_cost_flow`

, we can indent the relevant lines into a local closure, capture the local variables, and measure the time it takes for the closure run:```
while ...:
...
f = desired_flow - flow
@measure_time_by_key("applying the flow")
def _internal_apply_flow():
nonlocal flow, cost, f
cur = sink
while cur != src:
f = min(f, self.capacity[p[cur]][cur])
cur = p[cur]
# Send the flow
flow += f
cost += f * d[sink]
cur = sink
while cur != src:
self.capacity[p[cur]][cur] -= f
self.capacity[cur][p[cur]] += f
cur = p[cur]
_internal_apply_flow()
```

Anyway, Here’s the result in our case (for a VERY small input):

```
shortest_paths_from took 1.428 seconds
min_cost_flow took 1.431 seconds
```

I’ve heard deques are slow:```
deque operations took 0.012 seconds
shortest_paths_from took 1.800 seconds
path_stuff took 0.000 seconds
min_cost_flow took 1.804 seconds
```

Well, not really. In hindsight though, this is obvious. Each operation on the queue may be expensive, but their count is way less than the number of times we just check weather or not we can update the shortest path using an edge:This is proved to be the case when I looked at the number of times this is executed on average and simulated the process with deterministic number of iterations and a few memory accesses:

```
def slow():
for i in range(3363):
for v in adj:
u = (v + i)//100
result = capacity[v][u] > 0 and \
d[v] > d[v] + cost[v][u]
```

To understand what’s happening here, I decided to look at the byte code of the original function:

Here’s the important part:

```
37 94 SETUP_LOOP 162 (to 258)
96
```**LOAD**_FAST 0 (self)
98 **LOAD**_ATTR 7 (adj)
100 **LOAD**_FAST 6 (u)
102 BINARY_SUBSCR
104 GET_ITER
>> 106 FOR_ITER 148 (to 256)
108 STORE_FAST 7 (v)
38 110 **LOAD**_FAST 0 (self)
112 **LOAD**_ATTR 8 (capacity)
114 **LOAD**_FAST 6 (u)
116 BINARY_SUBSCR
118 **LOAD**_FAST 7 (v)
120 BINARY_SUBSCR
122 **LOAD**_CONST 2 (0)
124 COMPARE_OP 4 (>)
126 POP_JUMP_IF_FALSE 106
39 128 **LOAD**_FAST 2 (d)
130 **LOAD**_FAST 7 (v)
132 BINARY_SUBSCR
134 **LOAD**_FAST 2 (d)
136 **LOAD**_FAST 6 (u)
138 BINARY_SUBSCR
140 **LOAD**_FAST 0 (self)
142 **LOAD**_ATTR 9 (cost)
144 **LOAD**_FAST 6 (u)
146 BINARY_SUBSCR
148 **LOAD**_FAST 7 (v)
150 BINARY_SUBSCR
152 BINARY_ADD
154 COMPARE_OP 4 (>)
156 POP_JUMP_IF_FALSE 106
40 158 **LOAD**_FAST 2 (d)
160 **LOAD**_FAST 6 (u)
162 BINARY_SUBSCR
164 **LOAD**_FAST 0 (self)
166 **LOAD**_ATTR 9 (cost)
168 **LOAD**_FAST 6 (u)
170 BINARY_SUBSCR
172 **LOAD**_FAST 7 (v)
174 BINARY_SUBSCR
176 BINARY_ADD
178 **LOAD**_FAST 2 (d)
180 **LOAD**_FAST 7 (v)
182 STORE_SUBSCR
41 184 **LOAD**_FAST 6 (u)
186 **LOAD**_FAST 3 (p)
188 **LOAD**_FAST 7 (v)
190 STORE_SUBSCR
42 192 **LOAD**_FAST 4 (m)
194 **LOAD**_FAST 7 (v)
196 BINARY_SUBSCR
198 **LOAD**_CONST 3 (2)
200 COMPARE_OP 2 (==)
202 POP_JUMP_IF_FALSE 224
43 204 **LOAD**_CONST 4 (1)
206 **LOAD**_FAST 4 (m)
208 **LOAD**_FAST 7 (v)
210 STORE_SUBSCR
44 212 **LOAD**_FAST 5 (q)
214 **LOAD**_METHOD 4 (append)
216 **LOAD**_FAST 7 (v)
218 CALL_METHOD 1
220 POP_TOP
222 JUMP_ABSOLUTE 106
45 >> 224 **LOAD**_FAST 4 (m)
226 **LOAD**_FAST 7 (v)
228 BINARY_SUBSCR
230 **LOAD**_CONST 2 (0)
232 COMPARE_OP 2 (==)
234 POP_JUMP_IF_FALSE 106
46 236 **LOAD**_CONST 4 (1)
238 **LOAD**_FAST 4 (m)
240 **LOAD**_FAST 7 (v)
242 STORE_SUBSCR
47 244 **LOAD**_FAST 5 (q)
246 **LOAD**_METHOD 10 (appendleft)
248 **LOAD**_FAST 7 (v)
250 CALL_METHOD 1
252 POP_TOP
254 JUMP_ABSOLUTE 106
>> 256 POP_BLOCK
>> 258 JUMP_ABSOLUTE 64
>> 260 POP_BLOCK

All I really saw at the first glance was the number of **LOAD**operations. I’m sure you now agree with a completely unbiased opinion.

Now can we reduce the number of

**LOAD**s? Turns out we can. If we start thinking like the compiler and keep an open mind about losing code beauty while accounting for the shortcomings of the actual compiler, we can see that there are a few extra

*LOAD*s as the variable

`u`

is invariant throughout the inner loop. Therefore we can change this```
...
while len(q) > 0:
u = q.popleft()
m[u] = 0;
for v in self.adj[u]:
if self.capacity[u][v] > 0 and \
d[v] > d[u] + self.cost[u][v]:
d[v] = d[u] + self.cost[u][v]
...
```

```
...
while len(q) > 0:
u_capacities = self.capacity[u]
u_current_d = d[u]
u_costs = self.cost[u]
for v in self.adj[u]:
if u_capacities[v] > 0 and \
d[v] > u_current_d + u_costs[v]:
d[v] = u_current_d + u_costs[v]
...
```

Let’s see the results:

```
real 0m0.790s
user 0m0.776s
sys 0m0.010s
```

Unreal! The compiler hasn’t been doing any of these things. That’s about a 2x improvement in performance in the small synthesized test case from the original code (about 1.5 seconds).On the bigger test case, this improves the running time from ~15 minutes to ~7 minutes.

## Bottom line

Unfortunately I did lots more without anything to show for it. Almost everything else that I did (from using c native types to fooling around with how I stored things) resulted in degrading the performance.I ended up achieving the performance that I was looking for by using another algorithm, but the question of why this (still) runs thousands of times slower than the cpp version remains open for me.

Hey there! Try to run it with pypy also, was a huge boost here on the shared code ;)

ReplyDeleteHey Ebrahim! Thank you for the suggestion! I'm planning to do a second pass over the same problem and write about it. Will try out pypy too!

Delete