In 1954, three RAND researchers — George Dantzig, Ray Fulkerson, and Selmer Johnson — solved the largest Traveling Salesman Problem ever attempted: 49 cities, optimally. The number of possible tours through 49 cities is $48!/2$, around $10^{60}$ — more than the number of atoms in the Earth — so they did not check them all. They invented ways to prove that almost none of them needed checking. That bound-and-prune idea matured into branch-and-bound, and seventy years later it is still how every commercial Mixed-Integer Linear Program solver — Gurobi, CPLEX, FICO Xpress, the multi-billion-dollar industry that powers airline scheduling, logistics planning, and supply chain optimization — finds the optimal solution.
This post is the foundation for a research diary I wrote about CHOP, an attempt to use deep reinforcement learning to make branch-and-bound smarter. To enjoy that post, you need to know what branch-and-bound is and why a smarter version of it is even worth chasing. So we’ll get there together.
What This Post Covers
- Why optimization is the secret backbone of modern industry — and why the math behind it is surprisingly accessible
- Linear programs vs integer programs — why one is easy and the other is the rest of the iceberg
- The branch-and-bound algorithm — step-by-step on a tiny example with diagrams
- Heuristics: the human decisions inside the algorithm — node selection, variable selection, and why they matter
- When the strongest classical heuristic isn’t actually best — the perverse instances where folklore fails
This post assumes nothing. If you’ve never seen a linear program, you’ll leave knowing what one is. If you have, skim the first few sections.
The Quiet Empire of Optimization
Most of the frictionless moments in your day were arranged by an optimization solver. Your morning flight got a gate, a crew, and a tail number from one solver. The package you ordered yesterday took a route through five sortation hubs picked by another. The wind farm contract your utility signed last quarter exists because a third decided which turbines to build, in what order, and where to draw the transmission lines.
These problems all have the same shape: make a discrete sequence of decisions to maximize value or minimize cost, subject to constraints. They’re called combinatorial optimization problems because the answer is some combination of choices. They are also, for reasons that will become clear, brutally hard. The decision variables are usually binary — assign this driver to that route, or don’t; build this turbine, or don’t — and the number of possible combinations explodes faster than physics will let you check them.
When the constraints and objective are linear, the problem is called a Mixed-Integer Linear Program (MILP). MILPs are the workhorse formulation: expressive enough to capture most of industrial planning, restrictive enough that we have algorithms that actually solve them. Branch-and-bound is that algorithm.
To appreciate it, we need to start one rung lower.
Linear Programs: The Easy Case
A linear program (LP) asks for the largest value of a linear objective $\mathbf{c}^\top \mathbf{x}$ subject to linear constraints $\mathbf{A}\mathbf{x} \le \mathbf{b}$ with continuous variables $\mathbf{x} \ge \mathbf{0}$. Geometrically, the constraints carve out a convex region in $\mathbb{R}^n$ — think of it as a many-faced gem — and the objective points in some direction. The optimum sits at one of the gem’s corners.
Here’s a two-variable LP: maximize $3x + 4y$ subject to $x + 2y \le 10$, $3x + y \le 15$, with $x, y \ge 0$:
A linear program with two constraints. The feasible region is the hatched quadrilateral; the optimum sits at the corner where both constraints bind.
The simplex algorithm, invented by Dantzig in 1947, walks from corner to corner, always improving the objective, and proves it has reached the optimum when no neighboring corner is better. In practice it solves problems with millions of variables in seconds. LPs, in short, are solved — the polynomial-time interior-point methods that came later gave us a second fast algorithm, not a rescue for a slow one.
The LP optimum of our example is exactly $(x, y) = (4, 3)$, with objective value $24$. Notice that both coordinates happened to be integers. That is luck.
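The corner-hopping picture is concrete enough to brute-force. The sketch below finds the optimum of the example above by enumerating every intersection of two constraints and keeping the best feasible one (fine for two variables; real solvers use simplex):

```python
from itertools import combinations

# Constraints a*x + b*y <= c, including x >= 0 and y >= 0.
cons = [(1, 2, 10), (3, 1, 15), (-1, 0, 0), (0, -1, 0)]
EPS = 1e-9

corners = []
for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
    det = a1 * b2 - b1 * a2
    if abs(det) < EPS:
        continue  # parallel pair, no intersection point
    # Cramer's rule for the 2x2 system of the two binding constraints.
    x = (c1 * b2 - b1 * c2) / det
    y = (a1 * c2 - c1 * a2) / det
    if all(a * x + b * y <= c + EPS for a, b, c in cons):
        corners.append((x, y))

best = max(corners, key=lambda p: 3 * p[0] + 4 * p[1])
# best == (4.0, 3.0), objective value 24
```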
Integer Programs: The Hard Case
Now add one constraint: $x$ and $y$ must be integers. The feasible region is no longer a continuous polygon — it’s the lattice points inside the polygon, the dots:
The same problem with the added constraint that x, y must be integers. The feasible region collapses to a finite set of lattice points.
This particular instance is friendly: the LP optimum landed on a lattice point, so the integer optimum is the same. In general, the LP optimum is some real-valued $(x^\ast, y^\ast)$ with fractional coordinates, and the true integer optimum is some lattice point nearby — but possibly far from $(x^\ast, y^\ast)$, with a substantially worse objective value. The gap between the LP optimum and the integer optimum is the integrality gap, and bridging it is what makes integer programming hard.
How hard? NP-hard. There is no known polynomial-time algorithm for general MILP, and under the consensus assumption (P ≠ NP) there isn't one. The number of integer points inside the feasible region grows combinatorially with the number of variables: a problem with $n$ binary variables has up to $2^n$ candidate solutions. At $n = 50$ that's already $10^{15}$; at $n = 100$, checking one candidate per nanosecond would take tens of trillions of years — thousands of times the age of the universe.
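A back-of-the-envelope check on that growth rate, as a sketch (the constants are rough):

```python
SECONDS_PER_YEAR = 3.156e7
CHECKS_PER_SECOND = 1e9  # one candidate per nanosecond

def years_to_enumerate(n_binary_vars):
    """Time to brute-force all 2^n assignments of n binary variables."""
    return 2 ** n_binary_vars / CHECKS_PER_SECOND / SECONDS_PER_YEAR

# n = 50: roughly two weeks of checking. Brutal but conceivable.
# n = 100: roughly 4e13 years, thousands of ages of the universe.
```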
So we don’t check them all. We do something cleverer.
Branch-and-Bound: Prove You Don’t Have to Look
Branch-and-bound has two ideas, both from 1960 (the Land and Doig paper), and both deceptively simple.
Idea 1 — Branching: split a hard problem into two easier ones. Take the LP relaxation. If a variable $x_i$ comes back fractional — say $x_i = 2.7$ in the LP solution — then in any integer solution it must satisfy either $x_i \le 2$ or $x_i \ge 3$. Both can’t be true; one must be. So the original problem decomposes into two sub-problems with the extra constraint added on each side. Recurse.
Idea 2 — Bounding: prove a whole subtree is hopeless. When you solve the LP relaxation of a sub-problem, you get an upper bound on what the integer solution in that subtree can possibly achieve. (LP is a relaxation of MILP, so its optimum is at least as good as the integer optimum.) Meanwhile, you have an incumbent — the best integer-feasible solution you’ve found so far. If the LP upper bound of a sub-problem is $\le$ the incumbent value, no integer solution in that subtree can beat the incumbent. Prune the entire subtree without exploring it.
Together: build a search tree, expand it, prune ruthlessly. You only explore the subtrees that could contain a better solution than what you already have.
Here is the algorithm running on a tiny problem — maximize $5x_1 + 4x_2$ subject to $x_1 + x_2 \le 5$, $10x_1 + 6x_2 \le 45$, with $x_1, x_2 \in \{0, 1, 2, \dots\}$:
A small branch-and-bound tree. Each node is a sub-problem; we solve its LP relaxation to get a bound; if it's fractional, we branch.
A few things to notice:
- The LP relaxation is doing the heavy lifting. Every sub-problem just adds bound constraints to the parent’s LP. Solving each LP is fast; solving the integer version of even one of them would defeat the purpose.
- The bound prunes aggressively. Node B₁₁ found an integer solution with value 20, but the incumbent was already 23, so we discard it without exploring further. The incumbent acts as a moving floor.
- We branch on a fractional variable. If the LP solution is already integer, we’re done — no branching needed. Some problem classes are lucky like that. Most aren’t.
- The tree shape depends on choices we made. We branched on $x_1$ first, but we could have branched on $x_2$. We expanded the right subtree before the left. Different choices, different trees.
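The whole loop is small enough to sketch in code. The version below solves the tiny example above using best-bound node selection; the toy `solve_lp` enumerates constraint intersections, which only works for two variables and stands in for a real simplex backend:

```python
import heapq
import itertools
import math

# maximize 5*x1 + 4*x2  s.t.  x1 + x2 <= 5,  10*x1 + 6*x2 <= 45,  x >= 0
C = (5, 4)
BASE = [(1, 1, 5), (10, 6, 45), (-1, 0, 0), (0, -1, 0)]  # a*x1 + b*x2 <= c
EPS = 1e-9

def solve_lp(cons):
    """Toy 2-variable LP solver: the optimum sits at a corner, so enumerate
    all constraint intersections and keep the best feasible one.
    Returns (value, x1, x2), or None if the LP is infeasible."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in itertools.combinations(cons, 2):
        det = a1 * b2 - b1 * a2
        if abs(det) < EPS:
            continue
        x = (c1 * b2 - b1 * c2) / det
        y = (a1 * c2 - c1 * a2) / det
        if all(a * x + b * y <= c + EPS for a, b, c in cons):
            val = C[0] * x + C[1] * y
            if best is None or val > best[0]:
                best = (val, x, y)
    return best

def branch_and_bound():
    incumbent, incumbent_val = None, float("-inf")
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    root = solve_lp(BASE)    # assumes the root LP is feasible
    heap = [(-root[0], next(tie), root, BASE)]  # best-bound: max LP value first
    while heap:
        _, _, (val, x, y), cons = heapq.heappop(heap)
        if val <= incumbent_val + EPS:
            continue  # bound: this subtree cannot beat the incumbent
        frac = next((i for i, v in enumerate((x, y))
                     if abs(v - round(v)) > 1e-6), None)
        if frac is None:  # integer-feasible: new incumbent
            incumbent, incumbent_val = (round(x), round(y)), val
            continue
        v = (x, y)[frac]
        lo = (1, 0, math.floor(v)) if frac == 0 else (0, 1, math.floor(v))
        hi = (-1, 0, -math.ceil(v)) if frac == 0 else (0, -1, -math.ceil(v))
        for extra in (lo, hi):  # branch: x_i <= floor(v)  or  x_i >= ceil(v)
            child = solve_lp(cons + [extra])
            if child is not None:
                heapq.heappush(heap, (-child[0], next(tie), child, cons + [extra]))
    return incumbent, incumbent_val

# branch_and_bound() returns ((3, 2), 23.0) for this instance
```

Swapping the heap key changes the node-selection rule; swapping which fractional variable you split on changes the branching rule. Everything else stays fixed, which is exactly why those two decisions are where the heuristics live.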
That last point is the entire reason this post exists. Branch-and-bound’s correctness doesn’t depend on those choices — any choice will find the optimum eventually. But its speed depends entirely on them. A great solver explores few nodes; a bad one explores exponentially many. The choices are heuristics, and the heuristics are the soul of the algorithm.
The Two Decisions That Define a Solver
Inside the main loop of every modern MILP solver, two decisions are made over and over:
Variable selection (branching): which fractional variable do we branch on? If the LP returned $(x_1, x_2, x_3) = (1.5, 0.5, 4.2)$, all three are candidates. Branching on $x_1$ gives one tree shape; branching on $x_3$ gives a different one. The “correct” choice would require knowing the future — which split will close the gap fastest? — and we don’t have it. Heuristics try to estimate.
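One classical estimate is the "most fractional" rule: branch on the variable whose fractional part is closest to $0.5$, on the theory that it's the most undecided. (It's known to be weak in practice, but it is a one-liner.) A sketch, assuming nonnegative variables:

```python
def most_fractional(x):
    """Index of the variable whose fractional part is closest to 0.5,
    or None if every entry is already (near-)integral."""
    candidates = [i for i, v in enumerate(x) if abs(v - round(v)) > 1e-6]
    if not candidates:
        return None
    # v - int(v) is the fractional part for v >= 0; ties pick the first index.
    return min(candidates, key=lambda i: abs((x[i] - int(x[i])) - 0.5))

# most_fractional([1.5, 0.5, 4.2]) returns 0 (ties at 0.5 go to the first)
# most_fractional([1.0, 2.0]) returns None (nothing to branch on)
```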
Node selection: of all the open sub-problems, which one do we expand next? When you’ve branched a few levels deep, your “to-do list” of unexplored sub-problems can have hundreds of entries. You have to pick one. The “correct” choice would, again, require seeing the future. Heuristics estimate.
CHOP focuses on the second decision. It’s the cleaner one to study — the action space is just “pick one of the open nodes”, and the per-step compute is small enough to do interesting RL on a laptop.
The Classical Menu of Node-Selection Heuristics
Four standard rules cover most of what classical solvers actually do:
- Best-bound. Always expand the open node with the highest LP value. The intuition: that node is the most “promising” — it has the most room to improve over the incumbent. Best-bound is the gold standard in MILP folklore. If you only know one heuristic, know this one.
- Depth-first. Always expand the deepest open node. The intuition: dive to a leaf, find an integer solution, raise the incumbent, then start pruning. Memory-efficient (the open list stays shallow), and good when finding any integer solution is the bottleneck.
- Breadth-first. Always expand the shallowest open node. Explore level by level. Rarely the right answer in MILP, but a useful baseline.
- Random. Pick uniformly at random from the open list. The control group.
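All four rules can be phrased the same way: a priority function over the open list, where the solver pops the maximum. A sketch, with hypothetical `lp_bound` and `depth` node fields named for illustration:

```python
import random

# Each rule scores an open node; the solver expands the max-priority node.
RULES = {
    "best_bound":    lambda node: node["lp_bound"],   # most promising bound
    "depth_first":   lambda node: node["depth"],      # deepest node
    "breadth_first": lambda node: -node["depth"],     # shallowest node
    "random":        lambda node: random.random(),    # the control group
}

def select_node(open_list, rule):
    """Pick the next sub-problem to expand under the given rule."""
    return max(open_list, key=RULES[rule])

open_list = [{"lp_bound": 23.3, "depth": 1},
             {"lp_bound": 23.0, "depth": 2}]
# best_bound picks the depth-1 node; depth_first picks the depth-2 node
```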
If you ran these four on the airline-scheduling MILPs that real solvers face, best-bound would win most of the time. The asymptotic theory backs this up: in the limit of perfect bound functions, best-bound minimizes the number of nodes explored. It’s the right thing to do given the bound.
But “given the bound” is doing a lot of work in that sentence.
When Best-Bound Isn’t Best
The Set Cover problem asks: given a universe of elements and a collection of sets that each cover some elements at some cost, pick a minimum-cost collection of sets that together cover the whole universe. It’s the canonical NP-hard problem behind crew scheduling, sensor placement, and a hundred other applications.
Set Cover’s LP relaxation is famously fractional. If you solve the LP without the integer constraint, you typically get a solution where many variables sit around $0.5$ — half this set, half that one. The LP value ends up close to the integer optimum, but the LP solution is far from an integer point. Worse, the LP value barely changes between adjacent search nodes — branching on one variable nudges the LP value by a tiny amount, then the next node looks almost identical.
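A minimal instance that exhibits this fractionality: cover the three edges of a triangle by choosing vertices, a three-set set cover. The LP optimum puts $0.5$ on every vertex for value $1.5$, while the best integer cover costs $2$. The brute-force sketch below searches the half-integral grid, which suffices here because vertex-cover LPs are known to have half-integral optimal corners:

```python
from itertools import product

# Tiny set cover: universe = the three edges of a triangle,
# sets = its vertices (each vertex covers its two incident edges).
edges = [(0, 1), (1, 2), (0, 2)]

def min_cover(levels):
    """Minimize x0 + x1 + x2 subject to x_i + x_j >= 1 for every edge,
    with each x_i restricted to the given value levels."""
    feasible = (x for x in product(levels, repeat=3)
                if all(x[i] + x[j] >= 1 for i, j in edges))
    return min(sum(x) for x in feasible)

lp_value = min_cover((0, 0.5, 1))  # half-integral grid covers the LP corners
ip_value = min_cover((0, 1))
# lp_value == 1.5 (all three variables at 0.5); ip_value == 2
```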
Best-bound, in this regime, dives into deep fractional subtrees. Every step looks promising by the LP value, but progress toward an actual integer solution is slow. Depth-first, random, and breadth-first, by contrast, reach integer-feasible leaves earlier — depth-first by design, the other two by accident. They raise the incumbent sooner, and the incumbent’s job is to prune the rest.
Empirically, on SetCover(50 elements × 80 sets, density 0.10) with my in-house simplex backend:
| Heuristic | Mean nodes ± std | vs best_bound |
|---|---|---|
| best_bound | 19.1 ± 16.1 | 1.00x |
| depth_first | 10.9 ± 10.5 | 1.75x better |
| breadth_first | 11.7 ± 9.8 | 1.63x better |
| random | 13.2 ± 9.8 | 1.45x better |
Best-bound — the textbook recommendation — explores roughly twice as many nodes as any of the alternatives on this distribution. It’s not a quirk of one instance; the result is stable over 40 random instances drawn from the same distribution.
Two things to take away:
- Heuristic quality is problem-distribution-dependent. No single rule wins everywhere. The shape of the LP relaxation, the structure of the constraints, the typical integrality gap — these vary by problem class, and the right heuristic varies with them.
- There is room above the classical baseline. On Set Cover specifically, every alternative we tried beat best-bound. That’s the regime where a learned heuristic — one that adapts to the problem distribution — could plausibly help.
Where We Go Next
Picking heuristics by hand is the way it has always been done. Practitioners stare at instances, eyeball patterns, write code. The result is the impressive-but-finicky hybrid branching rule inside SCIP, the proprietary recipes inside Gurobi. Each rule is a person’s intuition crystallized into a function.
But intuition has a ceiling. And modern industrial MILP problems don’t come from nowhere — they come from distributions. An airline solves a similar scheduling problem every Monday. A logistics company solves the same vehicle routing structure every morning. The instances aren’t arbitrary; they’re samples from a stationary process that the operator has been running for years.
That’s exactly the regime where machine learning should win. Given a distribution of similar instances, can we learn a heuristic that’s better than the hand-crafted one for that specific distribution? The 2019 NeurIPS paper from Maxime Gasse and collaborators showed convincingly that the answer is “yes, by a clear margin” for variable selection (branching) on Set Cover, Combinatorial Auctions, Capacitated Facility Location, and Maximum Independent Set. They imitated the strong-branching expert with a graph neural network and beat the SCIP defaults on every problem class they tested.
CHOP picks up the question for node selection and tries to answer it on a laptop, with reinforcement learning instead of imitation, using whatever combination of architectures we can stand up in a few iterations.
That story — from “it doesn’t even import” to seven trained architectures, three trainers, and a ~2x improvement over best-bound — is Part 2.
A Note on Honesty
Optimization research, and combinatorial optimization in particular, is a field where it is easy to overclaim. Numbers move with random seeds. Variance can hide trends. Benchmarks are gameable. Throughout the next post I try to report results honestly — including the results that don’t support the headline. If you pick up only one thing from this series, let it be that any claim of “X% better than baseline” should be read with a question mark until you’ve seen the seed-by-seed table. The Bipartite-GCN-with-attention architecture I’ll show you in Part 2 is the headline champion on one seed. On a different seed it underperforms the simpler bare-encoder version. We’ll get to that.
For now: branch-and-bound is the algorithm. Heuristics are the soul. Best-bound is the textbook. And on Set Cover, the textbook is wrong.