Using OR-Tools CP-SAT for Scheduling Problems

A atalaykutlay.com ↗

▲ 87 points • 28 comments • by akutlay • 2w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with some AI-generated content detected

17 %

AI likelihood · overall

Mixed

89% human-written 11% AI-generated

SEGMENTS · HUMAN 4 of 5

SEGMENTS · AI 1 of 5

WORD COUNT 1,560

PEAK AI % 100% · §3

Analyzed

May 13

backend: pangram/v3.3

Segments scanned

5 windows

avg 312 words each

Distribution

89 / 11%

human / AI fraction

Verdict

Mixed

Pangram v3.3

Article text · 1,560 words · 5 segments analyzed

Human AI-generated

§1 Human · 14%

2026-05-12

I’ve been working on improving how we schedule maintenance in Akamai’s cloud infrastructure, especially disruptive maintenance on hypervisor hosts serving hundreds of thousands of guest VMs. The problem is fairly complex, with competing constraints such as capacity, customer disruption SLAs, and concurrency limits across hosts, racks, and datacenters.

While prototyping solutions, I tried several optimization tools, including commercial and open-source MIP solvers. After exploring different options, I found Google’s OR-Tools library, particularly its CP-SAT solver, to be a strong fit for scheduling problems. In this post, I’ll walk through a simple scheduling model in OR-Tools and explain why it works well for this kind of problem.

Maintenance Scheduling in Cloud Infrastructure

First, let me explain the problem I’m trying to solve. Cloud providers manage physical servers called “hypervisor hosts” that run virtual machines (VMs) for customers. These hosts need periodic maintenance to ensure security and reliability. Some of these maintenance tasks can be done with live patches that don’t require a reboot. Handling live patches is relatively straightforward, and the main challenge is safely rolling out updates. However, some maintenance tasks require a reboot of the host. Before the host is rebooted, all VMs on it need to be migrated to other hosts. This process is disruptive to customers, so it requires careful scheduling to minimize the impact while still making sure all maintenance tasks are completed in a reasonable time frame.

The process of evacuating multiple hosts for maintenance involves a complex scheduling problem with several constraints. The main challenges can be summarized as the “3Cs”:

Capacity: Evacuating hosts requires spare capacity in the datacenter to accommodate the migrated VMs.

Concurrency: Migrations are resource-intensive operations. They use CPU, disk I/O, and network bandwidth. Only a limited number of migrations can be performed concurrently without overloading the hosts or the network.

Conflict: Customers tolerate a certain amount of disruption during maintenance, but planned maintenance should not cause too much disruption. This means only a small portion of any given customer’s VMs can be migrated at the same time.

With these challenges in mind, the goal of the scheduling problem is to find a schedule for the maintenance tasks that minimizes the total time to complete all maintenance while respecting these constraints.

§2 Human · 24%

If you’re interested in this problem, feel free to check out my SRECon26 presentation: Keeping a Hypervisor Fleet Up to Date with Minimal Customer Disruption

How to Model the Problem

Operations Research (OR) is a well-established field that uses algorithms and mathematical models to solve complex decision-making problems. OR researchers have been working on scheduling problems for decades, and they have come up with various “problem types” that capture the essence of different scheduling challenges. Some well-known problem types include Job Shop Scheduling, Flow Shop Scheduling, and Resource-Constrained Project Scheduling. Knowing the right problem type for your scheduling problem helps you leverage the existing research and algorithms that have already been developed for it.

Maintenance scheduling in the cloud is close to a Resource-Constrained Project Scheduling Problem (RCPSP) without the precedence constraints. In RCPSP, you have a set of tasks that need to be scheduled, each with its own duration and resource requirements. The goal is to find a schedule that minimizes the overall project duration while respecting resource constraints. Similarly, in maintenance scheduling, each VM migration can be seen as a task that requires certain resources (see the 3Cs above), and the goal is to minimize the total time needed to complete all maintenance tasks.

Modeling the Problem with OR-Tools CP-SAT Solver

OR-Tools is an open-source software suite developed by Google that provides a collection of tools for solving combinatorial optimization problems. One of its key components is the CP-SAT solver, which is a versatile portfolio solver that can handle a wide range of scheduling problems. For any given problem, CP-SAT tries a range of algorithms and shares information between them.

CP-SAT is particularly well-suited for scheduling problems because it has specialized variables and constraints for modeling time, which makes modeling scheduling problems more intuitive and efficient.

Let me give you code examples to illustrate this point. First, let’s set up a toy problem for ourselves, where we have a single hypervisor host that needs to be maintained, and we have 3 VMs that need to be migrated off of it.

§3 AI · 100%

Let’s assume each migration takes 10 time units, and we want to consider solutions that can be completed within 100 time units.

hosts = ["host_1"] vms = {"host_1": ["vm_1", "vm_2", "vm_3"]} migration_duration = 10 planning_horizon = 100

The core of the problem is to schedule the migrations of these VMs while respecting the constraints related to capacity, concurrency, and conflict. So we need to model the timing of these migrations and the resources they consume. This is where CP-SAT’s interval variables come in handy. In CP-SAT, you can use interval variables to represent tasks that have a start time, end time, and duration. Let’s create an interval variable for each VM migration and save them in a dictionary for easy access later on.

vm_interval_vars = {} # Dictionary to hold interval variables for each VM

for host in hosts: vm_interval_vars[host] = {} for vm in vms[host]: start_var = model.new_int_var(0, planning_horizon, f"{vm}_start") end_var = model.new_int_var(0, planning_horizon, f"{vm}_end") migration_duration = 10

vm_interval_var = model.new_interval_var( start=start_var, size=migration_duration, end=end_var, name=f"{vm}_interval" )

vm_interval_vars[host][vm] = vm_interval_var

Executing the above code will give you a dictionary vm_interval_vars that contains interval variables for each VM’s migration. Each interval variable has a start time, end time, and a fixed duration of 10 time units.

{ 'host_1': { 'vm_1': vm_1_interval(start = vm_1_start, size = 10, end = vm_1_end), 'vm_2': vm_2_interval(start = vm_2_start, size = 10, end = vm_2_end), 'vm_3': vm_3_interval(start = vm_3_start, size = 10, end = vm_3_end) } }

Interval variables make it easy to express constraints related to concurrency and resource usage.

§4 Human · 8%

For example, you can use the AddNoOverlap constraint to ensure that no two migrations that require the same resource, such as network bandwidth, overlap in time.

for host, vms_dict in vm_interval_vars.items(): vm_intervals = list(vms_dict.values()) model.AddNoOverlap(vm_intervals)

Or better, you can use the AddCumulative constraint to model limited resource capacity and ensure that total resource usage at any given time does not exceed the available capacity. The constraint below sets a limit on the number of concurrent migrations that can happen at the same time. To match the AddNoOverlap constraint above, we can set the maximum number of concurrent migrations to 1:

maximum_migrations_per_host = 1

for host, vms_dict in vm_interval_vars.items(): vm_intervals = list(vms_dict.values())

model.AddCumulative( vm_intervals, # List of interval variables [1] * len(vm_intervals), # Resource usage for each interval variable maximum_migrations_per_host # Maximum resource capacity )

The second argument in AddCumulative is the list of resource usages for each interval variable, which in this case is simply 1 for each migration since each migration consumes one unit of the resource. A more powerful use of AddCumulative is to assign different resource usages to different migrations, which is more useful for modeling real constraints such as throughput. If some VMs can be migrated with higher throughput than others, instead of limiting the number of concurrent migrations, you can limit total throughput at any given time:

migration_throughput = {"vm_1": 5, "vm_2": 3, "vm_3": 2} host_maximum_throughput = 5

for host, vms_dict in vm_interval_vars.items(): vm_intervals = list(vms_dict.values()) vm_throughputs = [migration_throughput[vm] for vm in vms_dict.keys()]

model.AddCumulative(vm_intervals, vm_throughputs, host_maximum_throughput)

Since the maximum throughput is 5, we would expect the schedule to migrate VMs 2 and 3 at the same time, but not VM 1 with any other VM, because VM 1 alone consumes all available throughput.

§5 Human · 3%

At this point, we have a tiny model that captures one aspect of the maintenance scheduling problem, which is the concurrency constraint. Let’s solve this model and see what the solution looks like:

solver = cp_model.CpSolver() status = solver.Solve(model) print(status) # Print the schedule if status == cp_model.OPTIMAL or status == cp_model.FEASIBLE: for host, vms_dict in vm_interval_vars.items(): print(f"Schedule for {host}:") for vm, interval_var in vms_dict.items(): start = solver.Value(interval_var.StartExpr()) end = solver.Value(interval_var.EndExpr()) print(f" {vm}: Start at {start}, end at {end}") else: print("No solution found.")

The solution prints the start and end times for each VM migration, which gives you a schedule that respects the concurrency constraint.

CpSolverStatus.OPTIMAL Schedule for host_1: vm_1: Start at 10, end at 20 vm_2: Start at 0, end at 10 vm_3: Start at 0, end at 10

As expected, we see VM 2 and VM 3 being migrated at the same time, while VM 1 is migrated separately to respect the maximum throughput constraint.

Why OR-Tools?

To understand why OR-Tools is a great choice, it helps to understand the alternatives. Another common technique for solving scheduling problems is Mixed Integer Programming (MIP). There are many open-source MIP solvers out there, but none of them give you tools to model time and scheduling constraints as naturally.

There are two ways to solve this scheduling problem with MIP: a time-indexed formulation and a time-continuous formulation. The time-indexed formulation is intuitive, but it does not scale well as the number of tasks and the planning horizon increase. The time-continuous formulation is more efficient, but it is less intuitive and requires a lot of effort to get right.

“Easy” MIP Method: Time-Indexed Formulation

The easy way to implement a scheduling problem with MIP is to use a time-indexed formulation, where you create binary variables that indicate whether a task is active at a specific time.