Scheduling and Parallelization

Overview

Chronon automatically parallelizes simulation based on dependency analysis and lookahead scheduling.

For wall-clock scheduler diagnosis, Chronon can also emit a Perfetto/Chrome Trace timeline of logical execution streams, unit tick slices, cross-thread dependency spin waits, and epoch spans. See Scheduler Timeline Trace.

Dependency Graph

The dependency graph records unit interconnections and runs Floyd-Warshall over them to compute all-pairs minimum path delays:

class DependencyGraph {
public:
    // Build from units and connections
    void build(const std::vector<Unit*>& units,
               const std::vector<ConnectionBase*>& connections);

    // Lookahead queries
    uint32_t lookahead(Unit* source, Unit* dest) const;  // Min path delay
    bool hasPath(Unit* source, Unit* dest) const;

    // Dependency queries
    std::vector<Unit*> getDependencies(Unit* unit) const;  // All predecessors
    std::vector<Unit*> getDependents(Unit* unit) const;     // All successors

    // Direct neighbor queries (returns pairs of Unit* and delay)
    std::vector<std::pair<Unit*, uint32_t>> predecessors(Unit* unit) const;
    std::vector<std::pair<Unit*, uint32_t>> successors(Unit* unit) const;

    // Graph access
    size_t unitIndex(Unit* unit) const;
    Unit* unitAt(size_t index) const;
    const std::vector<std::vector<uint32_t>>& distances() const;

    // Modification
    void addConnection(Unit* source, Unit* dest, uint32_t delay);
    void recomputeLookahead();
};
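The lookahead() query can be read as a lookup into an all-pairs minimum-delay table. The following is a minimal sketch of that computation, assuming units have already been mapped to dense indices. The names Edge, kNoPath, and computeLookahead are illustrative and are not part of Chronon's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sentinel for "no path"; not part of Chronon's API.
constexpr uint32_t kNoPath = std::numeric_limits<uint32_t>::max();

// Hypothetical edge record: unit indices plus the connection delay in cycles.
struct Edge {
    std::size_t source;
    std::size_t dest;
    uint32_t delay;
};

// Returns dist[i][j] = minimum total delay over all non-empty paths i -> j.
// The diagonal is not seeded with 0, so dist[i][i] ends up as the minimum
// cycle delay through unit i, or kNoPath if i lies on no cycle.
std::vector<std::vector<uint32_t>> computeLookahead(std::size_t n,
                                                    const std::vector<Edge>& edges) {
    std::vector<std::vector<uint32_t>> dist(n, std::vector<uint32_t>(n, kNoPath));
    for (const Edge& e : edges) {
        // Parallel connections between the same pair keep the smallest delay.
        if (e.delay < dist[e.source][e.dest]) dist[e.source][e.dest] = e.delay;
    }
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            if (dist[i][k] == kNoPath) continue;
            for (std::size_t j = 0; j < n; ++j) {
                if (dist[k][j] == kNoPath) continue;
                // 64-bit sum so two large delays cannot wrap around.
                const uint64_t via = uint64_t{dist[i][k]} + dist[k][j];
                if (via < dist[i][j]) dist[i][j] = static_cast<uint32_t>(via);
            }
        }
    }
    return dist;
}
```

For the loose cycle A <--(3,2)--> B from the classification table, this table would hold 3 for A to B, 2 for B to A, and 5 on the diagonal for both units. A diagonal entry of 0 would indicate a tight cycle.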

Cycle Classification

Cycles are classified based on total delay:

Type	Total Delay	Handling	Example
Tight	= 0	Invalid topology (initialization error)	A <--(0)--> B
Loose	> 0	Lookahead (parallel)	A <--(3,2)--> B

Tight Cycles

When total delay = 0, the topology has combinational feedback with no registered state boundary. Chronon rejects this during TickSimulation::initialize() because the tick() API permits side effects and provides no pure combinational function contract that convergence iteration could rely on.

Loose Cycles (Lookahead)

When total delay > 0, units can run ahead within the lookahead window.

Cycle Analyzer

Uses Tarjan's SCC and Johnson's algorithm to detect and classify cycles:

struct CycleInfo {
    std::vector<Unit*> units;     // Units in cycle order
    std::vector<uint32_t> delays; // Delays between consecutive units
    uint32_t total_delay;          // Sum of all delays

    bool isTight() const { return total_delay == 0; }
    uint32_t minEdgeDelay() const;  // Minimum delay on any edge
    bool contains(Unit* unit) const;
};

struct AnalysisResult {
    // All detected cycles
    std::vector<CycleInfo> all_cycles;

    // Cycles classified by type
    std::vector<CycleInfo> tight_cycles;   // delay = 0, invalid for TickSimulation
    std::vector<CycleInfo> loose_cycles;   // delay > 0, can use lookahead

    // Units involved in tight cycles
    std::set<Unit*> tight_cycle_units;

    // Independent groups (no dependencies between groups)
    std::vector<std::vector<Unit*>> independent_groups;

    // Strongly connected components
    std::vector<std::vector<Unit*>> sccs;

    // Lookahead map: {source, dest} -> minimum path delay
    std::map<std::pair<Unit*, Unit*>, uint32_t> lookahead;

    // Query methods
    bool inTightCycle(Unit* unit) const;
    uint32_t safeLookahead(Unit* source, Unit* dest) const;
    bool canParallelize(Unit* a, Unit* b) const;
};

// CycleAnalyzer - static analysis methods
class CycleAnalyzer {
public:
    static AnalysisResult analyze(const DependencyGraph& dep_graph,
                                   size_t max_cycles = 1000);
    static bool hasSelfLoop(const DependencyGraph& dep_graph, Unit* unit);
    static uint32_t minCycleLength(const DependencyGraph& dep_graph, Unit* unit);

    // TickSimulation rejects tight_cycles during initialization.
};

Zero-Delay Handling

Acyclic delay=0 paths are valid and execute in topological order, so a producer can make a message visible to its consumer in the same cycle. Zero-delay feedback loops are invalid: insert delay>0 on at least one feedback edge, or collapse the combinational logic into a single unit with explicit internal ordering.
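The rule above can be illustrated with a topological sort restricted to delay=0 edges. This is a sketch using Kahn's algorithm, not Chronon's actual analyzer (which the text describes as Tarjan plus Johnson). Edge and zeroDelayOrder are hypothetical names, and units are assumed to be dense indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <queue>
#include <vector>

// Hypothetical edge record: unit indices plus the connection delay in cycles.
struct Edge {
    std::size_t source;
    std::size_t dest;
    uint32_t delay;
};

// Orders units so that every delay=0 producer runs before its consumer.
// Returns std::nullopt when the delay=0 edges contain a feedback loop,
// which corresponds to a tight cycle.
std::optional<std::vector<std::size_t>> zeroDelayOrder(std::size_t n,
                                                       const std::vector<Edge>& edges) {
    std::vector<std::vector<std::size_t>> out(n);
    std::vector<std::size_t> indegree(n, 0);
    for (const Edge& e : edges) {
        if (e.delay != 0) continue;  // delayed edges do not constrain same-cycle order
        out[e.source].push_back(e.dest);
        ++indegree[e.dest];
    }
    std::queue<std::size_t> ready;
    for (std::size_t u = 0; u < n; ++u) {
        if (indegree[u] == 0) ready.push(u);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t u = ready.front();
        ready.pop();
        order.push_back(u);
        for (std::size_t v : out[u]) {
            if (--indegree[v] == 0) ready.push(v);
        }
    }
    // Units left unordered are stuck behind a zero-delay loop.
    if (order.size() != n) return std::nullopt;
    return order;
}
```

A loop closed by an edge with delay>0 still yields a valid order, because only the delay=0 edges are considered when ordering units within a cycle.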

TickSimulation Execution Model

TickSimulation uses exec::static_thread_pool from the stdexec library for epoch-free parallel execution. The epoch-free scheduler and the sequential reference path both live inside TickSimulation; there is no separate scheduler class.

// TickSimulation uses stdexec
::exec::static_thread_pool pool_{num_threads};

// Epoch-free: one persistent worker launch for the complete run.
auto work = stdexec::bulk(stdexec::just(), stdexec::par, worker_count,
                          [this](size_t worker) { executeThreadRun_(worker); });
stdexec::sync_wait(stdexec::starts_on(pool_.get_scheduler(), std::move(work)));

Execution Modes

TickSimulation selects one of two execution paths during initialization:

Mode	Condition	Description
Sequential	Single-threaded, parallelism not beneficial, or an epoch-free safety gate fails	Units execute in topological order per cycle
Epoch-free lookahead	Parallelism is beneficial and every dependency/transport gate is proven	Persistent workers advance tight clusters from predecessor progress

Lazy Wakeup And Multi-Rate Ticks

Chronon can skip a unit's user tick() body on cycles where the unit is known to be inactive while still advancing its local cycle and scheduler progress. This keeps the runtime tick-driven: there is no global event queue and no callback is executed on the producer thread.

Units can use activity controls from their own tick() body:

class TimerDevice : public TickableUnit {
public:
    TimerDevice() : TickableUnit("timer") {
        setTickInterval(1000);  // only run tick() on global cycles divisible by 1000
    }

    void tick() override {
        if (!has_work) {
            sleepUntil(localCycle() + 500);
            return;
        }
        process();
        sleepForever();  // wait for a port arrival or explicit wakeAt()
    }
};

The scheduler evaluates a unit as active when both conditions are true:

global_cycle >= unit.nextActiveCycle()
global_cycle % unit.tickInterval() == 0
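The two conditions combine into a single predicate. A minimal sketch, assuming a plain struct that holds the values returned by nextActiveCycle() and tickInterval(). ActivityState and isActive are illustrative names, not Chronon's scheduler code.

```cpp
#include <cstdint>

// Hypothetical snapshot of a unit's activity controls.
struct ActivityState {
    uint64_t next_active_cycle = 0;  // raised by sleepUntil() in this sketch
    uint64_t tick_interval = 1;      // 1 means the unit may tick every cycle
};

// True when both conditions from the text hold for this global cycle.
bool isActive(const ActivityState& s, uint64_t global_cycle) {
    return global_cycle >= s.next_active_cycle &&
           global_cycle % s.tick_interval == 0;
}
```

With next_active_cycle = 1500 and tick_interval = 1000, the unit is inactive at cycle 1000 (still asleep) and at cycle 1500 (not a multiple of the interval), and becomes active at cycle 2000.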

When the unit is inactive, Chronon runs a cheap idle path that only advances the unit's local cycle. The cluster's completed-cycle progress still advances, so downstream lookahead dependencies do not stall behind idle units.

In the progress-based lookahead scheduler, if every unit in a cluster is inactive, Chronon advances the whole cluster in one batch to the next active unit cycle, dependency boundary, or epoch end. The scheduler timeline still records this fast path because dependency progress is advancing, but under the "unit idle" category rather than "unit". The slice detail includes cycles=N for batched idle advances.
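The batch target for an all-idle cluster is the earliest of the three limits named above. The sketch below shows one way to compute it, under the assumption that each unit's activity is fully described by a wake cycle and a tick interval. UnitActivity, firstActiveCycle, and idleBatchTarget are hypothetical names, and sleepForever() is deliberately not modeled.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-unit activity snapshot.
struct UnitActivity {
    uint64_t next_active_cycle;  // earliest cycle the unit may wake
    uint64_t tick_interval;      // unit only ticks on multiples of this (>= 1)
};

// Smallest cycle >= from that satisfies both activity conditions.
uint64_t firstActiveCycle(const UnitActivity& u, uint64_t from) {
    const uint64_t c = std::max(from, u.next_active_cycle);
    const uint64_t rem = c % u.tick_interval;
    return rem == 0 ? c : c + (u.tick_interval - rem);
}

// Cycle to which an all-idle cluster can skip in one batch: the earliest of
// any unit's next active cycle, the dependency boundary, and the window end.
uint64_t idleBatchTarget(const std::vector<UnitActivity>& units, uint64_t now,
                         uint64_t dependency_boundary, uint64_t window_end) {
    uint64_t target = std::min(dependency_boundary, window_end);
    for (const UnitActivity& u : units) {
        target = std::min(target, firstActiveCycle(u, now));
    }
    return target;
}
```

For a cluster at cycle 1001 whose units next wake at 5000 and at the next multiple of 1000, the target is 2000, which would show up as cycles=999 in the slice detail if the dependency boundary allows it.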

Port delivery is an input-driven wakeup source. When a connection successfully enqueues a message with arrival cycle A for a scheduler-controlled destination, Chronon wakes that destination unit at A; the destination still receives the message through its InPort during its own later tick() context. This gives event-like behavior without executing target-unit code from the producer thread.

For always-active units, port delivery avoids the wakeup atomic entirely. A unit starts accepting port wakeups after it uses sleepUntil(), sleepForever(), or setTickInterval(N > 1). If delayed port messages were already queued before the unit first goes to sleep, the sleep target is seeded from the pending input arrival cycles, so those messages still wake the unit at their arrival cycle. Explicit wakeAt() requests are tracked independently and multiple future requests are preserved even if they are issued before the unit first sleeps.

wakeAt() is intentionally only a scheduler hint. If a model communicates through shared memory or another side channel outside Chronon ports, the model must still expose the causal relationship to the scheduler, for example by converting the side-channel write into a port message or an explicit wake source with a conservative dependency. Otherwise an isolated sleeping unit may have already advanced past the event's nominal cycle under lookahead and will process the wake at its next scheduled opportunity.

Epoch-Free Lookahead

Epoch-free lookahead launches persistent workers once for the complete run. The run is a single window in which run-ahead is bounded by lookahead_floor_ + max_lookahead_cycles (refreshed lazily as the global-minimum cluster advances), dependency progress, and direct-lane transport headroom. MPSC payloads need neither centralized arbitration nor a run-end flush. Results remain bit-identical to Sequential; only wall-clock behavior changes.

Normative scheduler-equivalence contract

For a fixed model, initial state, input stream, and simulated-clock configuration, sequential execution and every legal epoch-free execution must produce exactly the same model-visible behavior. This is a correctness requirement, not a best-effort determinism property. It covers:

delivery cycle and same-cycle receive order;
every send()/backpressure result;
state committed at each simulated clock edge;
receiver filter and cancellation results;
architectural output and termination cycle.

Worker placement, host wall time, wait samples, and migration diagnostics are scheduler metadata and are intentionally excluded. Static worker-count changes and whole-cluster migrations at legal scheduler fences must not alter the model-visible trace.

CI enforces this contract with an epoch-free differential harness. Each Unit writes to its own pre-reserved event stream, so recording needs no shared lock or atomic. Streams are canonicalized after the run by (cycle, component, sequence) and compared with the sequential reference. A failure reports the first divergent cycle, component, expected event, and actual event instead of only a final checksum. Deterministic migration tests split an epoch-free run at declared cycles and move a complete cluster while all workers are quiescent; this exercises live queues and receiver filters without adding a callback or branch to the production worker loop.
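The canonicalize-then-compare step can be sketched as follows. This is an illustration of the idea, not the harness itself: TraceEvent, its payload field, canonicalize, and firstDivergence are assumed names, and a real event record would carry model-specific data.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical event record written by one Unit into its own stream.
struct TraceEvent {
    uint64_t cycle;
    uint32_t component;
    uint32_t sequence;    // per-component order within a cycle
    std::string payload;  // stand-in for the model-visible event content
};

// Sort key used for canonical ordering.
inline auto key(const TraceEvent& e) {
    return std::tie(e.cycle, e.component, e.sequence);
}

// Merges per-unit streams and sorts them by (cycle, component, sequence).
std::vector<TraceEvent> canonicalize(const std::vector<std::vector<TraceEvent>>& streams) {
    std::vector<TraceEvent> merged;
    for (const auto& s : streams) merged.insert(merged.end(), s.begin(), s.end());
    std::sort(merged.begin(), merged.end(),
              [](const TraceEvent& a, const TraceEvent& b) { return key(a) < key(b); });
    return merged;
}

// Index of the first event that differs, or std::nullopt if the traces match.
std::optional<std::size_t> firstDivergence(const std::vector<TraceEvent>& expected,
                                           const std::vector<TraceEvent>& actual) {
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (key(expected[i]) != key(actual[i]) || expected[i].payload != actual[i].payload) {
            return i;
        }
    }
    if (expected.size() != actual.size()) return n;  // one trace ended early
    return std::nullopt;
}
```

Because sorting happens after the run, the order in which workers filled their streams does not matter. Only the events themselves are compared, and the returned index identifies the cycle and component to report.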

The old per-cycle and per-epoch barrier schedulers have been removed. If the safety gate rejects epoch-free execution, Chronon selects Sequential during initialization and reports the veto reason. epoch_size now serves only as the polling interval for host predicates and for Sequential runUntilTermination(). Scheduler timeline tracing does not veto epoch-free execution.

Dynamic rebalance is controlled by enable_dynamic_rebalance, which defaults to true (see Configuration). When it is enabled and the epoch-free dependency gate holds, Chronon commits whole-cluster migrations only at scheduler fence points. A safety-gate rejection selects Sequential and does not run dynamic migration.

Each EpochFree worker keeps a private shadow of the last acquired progress value for every predecessor cluster. Cluster progress is release-published and monotonic. Therefore, when the shadow already reaches the cycle required by a dependency, the worker can reuse that lower bound without reading the predecessor's frequently-written atomic cache line. When it is insufficient, the worker performs an acquire load and updates its shadow. The synthetic lookahead floor has a separate reserved slot. Dynamic migration needs no cache invalidation because cluster ids and their progress slots remain stable and the published cycle never moves backward.
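The shadow works because a monotonic counter never invalidates an earlier observation. A minimal sketch of the pattern follows, with hypothetical names (ProgressSlot, ProgressShadow, publish) and an assumed 64-byte cache line. The remote-load counter exists only to make the effect observable.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per cluster, padded to its own cache line (64 bytes is assumed).
struct alignas(64) ProgressSlot {
    std::atomic<uint64_t> completed_cycle{0};
};

// Release-publish so a reader that acquires the value also sees the
// cluster's writes up to that cycle.
inline void publish(ProgressSlot& slot, uint64_t cycle) {
    slot.completed_cycle.store(cycle, std::memory_order_release);
}

// Worker-private cache of the last acquired progress per predecessor cluster.
class ProgressShadow {
public:
    explicit ProgressShadow(std::size_t cluster_count) : shadow_(cluster_count, 0) {}

    // True when predecessor `pred` has completed at least `required` cycles.
    bool reached(const std::vector<ProgressSlot>& slots, std::size_t pred,
                 uint64_t required) {
        // Progress is monotonic, so a stale shadow is still a valid lower bound.
        if (shadow_[pred] >= required) return true;
        shadow_[pred] = slots[pred].completed_cycle.load(std::memory_order_acquire);
        ++remote_loads_;
        return shadow_[pred] >= required;
    }

    std::size_t remoteLoads() const { return remote_loads_; }

private:
    std::vector<uint64_t> shadow_;
    std::size_t remote_loads_ = 0;  // diagnostic only, for this sketch
};
```

After one acquire load observes cycle 10, every later query for a cycle at or below 10 is answered locally. Only a query for cycle 11 or later touches the predecessor's cache line again.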

When it engages. Epoch-free is selected only when all of the following hold; otherwise Chronon selects Sequential without changing model results:

enable_epoch_free_lookahead is set and max_lookahead_cycles > 0;
every MPSC input port has fully-resolved per-connection producer progress;
cross-thread buffer headroom suffices for every connection (see below);
the dependency/headroom graph contains no zero-slack cluster cycle.

Cross-thread buffer headroom. In epoch-free lookahead, a producer can run ahead of a consumer and leave entries buffered in the connection's cross-thread ring — a direct per-Connection SPSC lane for a multi-producer port, or the SPSC lock-free ring for a single-producer cross-thread edge. For bounded InPorts, these rings are sized at initialization so the declared capacity fits. For unlimited-capacity InPorts, the physical lock-free rings remain bounded by the default ring size, and the port never model-side back-pressures, so a producer could silently overflow the physical ring. The fixed-layout gate therefore vetoes epoch-free unless each connection can absorb the configured run-ahead. The headroom (in cycles) a connection supports is roughly:

headroom = min(InPort capacity, ring slots) / per_cycle_send_rate - edge_delay

where per_cycle_send_rate is the source OutPort's per-cycle send cap (an uncapped source forces a veto), ring slots is the usable physical ring capacity, and edge_delay accounts for not-yet-due entries the consumer cannot drain. Same-thread connections drain synchronously and impose no bound. If any cross-thread connection cannot expose a safe progress dependency, the simulation selects Sequential. To use epoch-free with unlimited-capacity cross-thread edges, give the producing OutPort a per-cycle send cap and keep max_lookahead_cycles + edge_delay within the default physical ring, or use an explicit bounded InPort capacity large enough for the desired run-ahead.
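The headroom formula can be worked through in code. The sketch below makes several assumptions the text does not state: integer division, a veto represented by std::nullopt, and a gate that compares headroom against max_lookahead_cycles. ConnectionLimits, headroomCycles, and absorbsRunAhead are illustrative names, and 1024 is only an example ring size.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical description of one cross-thread connection.
struct ConnectionLimits {
    std::optional<uint64_t> inport_capacity;  // nullopt = unlimited-capacity InPort
    uint64_t ring_slots;                      // usable physical ring capacity
    uint64_t per_cycle_send_rate;             // 0 = uncapped source OutPort
    uint64_t edge_delay;                      // connection delay in cycles
};

// headroom = min(InPort capacity, ring slots) / per_cycle_send_rate - edge_delay
// Returns std::nullopt for an uncapped source, which forces a veto.
std::optional<int64_t> headroomCycles(const ConnectionLimits& c) {
    if (c.per_cycle_send_rate == 0) return std::nullopt;
    const uint64_t slots = c.inport_capacity
                               ? std::min(*c.inport_capacity, c.ring_slots)
                               : c.ring_slots;
    return static_cast<int64_t>(slots / c.per_cycle_send_rate) -
           static_cast<int64_t>(c.edge_delay);
}

// Assumed gate: the connection must absorb the configured run-ahead.
bool absorbsRunAhead(const ConnectionLimits& c, uint64_t max_lookahead_cycles) {
    const std::optional<int64_t> h = headroomCycles(c);
    return h.has_value() && *h >= static_cast<int64_t>(max_lookahead_cycles);
}
```

For an unlimited InPort on a 1024-slot ring with a send cap of 4 per cycle and an edge delay of 6, the headroom is 1024 / 4 - 6 = 250 cycles. That is enough for max_lookahead_cycles = 100 but not for 300.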

Scheduler Timeline Diagnostics

In progress-based lookahead mode, each tight-coupling cluster publishes its completed cycle in a cache-line-aligned progress atomic. A worker stream scans the clusters assigned to it and executes any cluster whose direct predecessor clusters have reached the required cycle. The worker-local progress shadow described above avoids a remote atomic load when a prior observation already proves readiness; it does not relax the dependency or predict future progress. If no local cluster is ready, the stream spins until one becomes ready. The scheduler timeline records that time as cluster dependency events and includes the blocking predecessor cluster in the event detail.

This keeps delay=0 groups atomic while allowing independent clusters assigned to the same stream to advance out of order. Dynamic rebalance, when enabled, migrates whole clusters at scheduler fence points; it does not split delay=0 clusters or migrate individual units.

Typical investigation workflow:

Capture a short window with simulation.observation.timeline.scheduler.enabled=true.
Open the resulting .pftrace file in ui.perfetto.dev.
Inspect whether stream N lanes overlap tightly.
Identify streams with long cluster dependency slices.
Correlate the waiting streams with unit placement and connection delays.

Timeline stream N lane names are zero-based Chronon logical stream ids. The scheduler lane is separate from worker streams and only records scheduler-side spans.

Example:

./my_sim config.yaml --no-observe \
  -p simulation.observation.timeline.scheduler.enabled=true \
  -p simulation.observation.timeline.scheduler.file=out/scheduler_timeline.pftrace \
  -p simulation.observation.timeline.scheduler.end_cycle=2000

Long wait slices usually indicate that one predecessor cluster is on the critical path, that low-delay edges are forcing near-lockstep execution, or that the current partition assigned too much work to a dependency anchor stream.

Configuration

struct TickSimulationConfig {
    // Thread pool configuration
    size_t num_threads = std::thread::hardware_concurrency();

    // Scheduler selection (placement is always cluster-aware).
    //   enable_parallel=false -> Sequential
    //   all epoch-free gates proven -> Epoch-free
    //   any epoch-free gate rejected -> Sequential
    bool enable_parallel = true;
    bool enable_lookahead = true;  // false is a compatibility request for Sequential

    // Lookahead configuration
    uint32_t max_lookahead_cycles = 100;    // Max cycles a unit can run ahead
    uint64_t epoch_size = 64;               // Host predicate / Sequential poll interval
    bool enable_epoch_free_lookahead = true;  // false forces Sequential

    // Debug options
    bool trace_execution = false;           // Log execution mode selection

    // Cluster-aware partitioning (default: enabled)
    bool enable_weighted_partitioning = true;
    PartitionSolverType partition_solver = PartitionSolverType::SA;
    double initial_partition_sync_cost_ns = 8.0;  // Locality weight for placement

    // Dynamic rebalancing
    bool enable_dynamic_rebalance = true;
    double rebalance_imbalance_threshold = 1.03;
    uint64_t rebalance_check_interval_cycles = 2048;
    double rebalance_min_gain = 0.01;
    uint64_t rebalance_cooldown_cycles = 0;
};

These settings can be configured via YAML or set directly in code. Sequential and epoch-free execution produce identical cycle-accurate model behavior; they differ only in wall-clock performance and scheduler diagnostics.

Dynamic rebalance is enabled by default. It samples unit tick cost periodically, combines measured active cost with dependency topology and wait attribution, and migrates whole tight clusters at scheduler fence points when the predicted objective gain clears the configured thresholds. Set enable_dynamic_rebalance: false when a fixed initial layout is more important; epoch-free lookahead itself remains the default path when the safety gate holds. rebalance_min_gain can suppress migrations with too little predicted speedup, and rebalance_cooldown_cycles can enforce a minimum cycle gap between applied rebalances.

Exception Handling in Execution Paths

Both execution paths capture exceptions with crash context:

Mode	Strategy	Overhead
Sequential	try-catch outside outer loop	Zero (Itanium zero-cost ABI)
Epoch-free (`stdexec::bulk`)	try-catch around each worker loop; requests stop to break peer spin-waits, then rethrows through stdexec	Zero per non-exception iteration

A unified stdexec::inplace_stop_source handles both exception-driven abort and unit-initiated termination. Worker spin-waits check token.stop_requested() to exit promptly on either condition. Exceptions are wrapped as TickException with unit name and cycle, then rethrown on the main thread by stdexec::sync_wait.

Cluster-Aware Cost Partitioning

When enable_weighted_partitioning = true (default) and at least 4 units exist, TickSimulation uses a unified cluster-aware + cost-aware partitioning pipeline:

Algorithm Pipeline

Cost model selection: Uses deterministic unit cost 1.0 plus initial_partition_sync_cost_ns by default, or caller-supplied measured costs from setPrecomputedUnitCosts(...)
Solver selection: Runs partition_solver (SA by default, Weighted optional) against the same partition input
Tight cluster detection: Groups units with delay=0 connections into clusters (units within a cluster must share a thread)
Cluster-level graph partitioning: Treats each cluster as a super-node with aggregated cost and delay-aware edges
Thread assignment: Maps cluster assignments back to per-unit thread assignments
Queue optimization: Selects optimal queue type per connection based on thread placement
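The tight cluster detection step amounts to finding connected components over delay=0 connections. One common way to do that is a disjoint-set (union-find) structure, sketched below. This is an assumed technique rather than Chronon's confirmed implementation, and Connection, DisjointSet, and tightClusters are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical connection record over dense unit indices.
struct Connection {
    std::size_t source;
    std::size_t dest;
    uint32_t delay;
};

// Minimal union-find with path halving.
class DisjointSet {
public:
    explicit DisjointSet(std::size_t n) : parent_(n) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) { parent_[find(a)] = find(b); }

private:
    std::vector<std::size_t> parent_;
};

// Returns a cluster id per unit. Units joined by delay=0 connections share
// an id; ids are dense and numbered in order of first appearance.
std::vector<std::size_t> tightClusters(std::size_t n, const std::vector<Connection>& conns) {
    DisjointSet ds(n);
    for (const Connection& c : conns) {
        if (c.delay == 0) ds.unite(c.source, c.dest);
    }
    constexpr std::size_t kUnassigned = static_cast<std::size_t>(-1);
    std::vector<std::size_t> root_to_id(n, kUnassigned);
    std::vector<std::size_t> cluster(n, 0);
    std::size_t next_id = 0;
    for (std::size_t u = 0; u < n; ++u) {
        const std::size_t root = ds.find(u);
        if (root_to_id[root] == kUnassigned) root_to_id[root] = next_id++;
        cluster[u] = root_to_id[root];
    }
    return cluster;
}
```

Each resulting cluster id can then serve as a super-node for the cluster-level partitioning step, with unit costs summed per cluster.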

WeightedPartitioner

Four-phase algorithm in src/sender/schedule/WeightedPartitioner.hpp:

Phase 1 (LPT): Longest Processing Time first. Sorts units by decreasing cost and assigns each to the thread with the minimum current load. Pure makespan minimization (no coupling considered). On m threads, LPT is a (4/3 - 1/(3m))-approximation of the optimal makespan for the multiprocessor scheduling problem.
Phase 2 (FM Refinement): Iteratively moves units from heaviest to lightest thread when the move reduces max thread time (up to 5 passes). Accounts for sync cost changes from the move.
Phase 3 (Pairwise Swap): Tries swapping units between all pairs of threads to escape local minima (handles balanced-but-suboptimal assignments where tightly coupled units were arbitrarily separated by LPT).
Phase 4 (Multi-Unit Relocate): Tries removing pairs of units from the heaviest thread and distributing them to the two lightest threads. Handles cases where no single move or swap improves the makespan.
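Phase 1 is the classic LPT greedy and is short enough to show in full. The sketch below covers only that phase and ignores sync cost, as the description states. Partition, lptAssign, and makespan are illustrative names rather than the contents of WeightedPartitioner.hpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type for this sketch.
struct Partition {
    std::vector<std::size_t> thread_of;  // thread index per unit
    std::vector<double> load;            // accumulated cost per thread
};

// Longest Processing Time first: visit units by decreasing cost and place
// each on the currently lightest thread. Requires threads >= 1.
Partition lptAssign(const std::vector<double>& cost, std::size_t threads) {
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // stable_sort keeps equal-cost units in index order, so results are deterministic.
    std::stable_sort(order.begin(), order.end(),
                     [&cost](std::size_t a, std::size_t b) { return cost[a] > cost[b]; });

    Partition p{std::vector<std::size_t>(cost.size(), 0), std::vector<double>(threads, 0.0)};
    for (std::size_t unit : order) {
        const auto lightest = static_cast<std::size_t>(
            std::min_element(p.load.begin(), p.load.end()) - p.load.begin());
        p.thread_of[unit] = lightest;
        p.load[lightest] += cost[unit];
    }
    return p;
}

// Makespan = load of the heaviest thread.
double makespan(const Partition& p) {
    return *std::max_element(p.load.begin(), p.load.end());
}
```

For unit costs {7, 5, 4, 3, 3, 2} on two threads, the greedy produces loads of 12 and 12, which happens to be optimal here. The later phases exist for inputs where it is not.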

Delay-Aware Sync Cost Model

The partition adjacency graph uses directed edges — each Connection object creates one adjacency entry (source → destination). For bidirectional communication (e.g., wakeup buses), separate Connection objects in each direction naturally produce edges in both directions. This avoids double-counting bus connections, which expand to N×M individual connections at config load time.

Cross-thread synchronization cost scales inversely with connection delay:

sync_cost(edge) = platform_sync_ns * num_connections * delay_factor(min_delay)

Delay	Factor	Rationale
0	100.0	Inline/same-cycle: prohibitively expensive to split
1	1.0	Tight spin-waiting every cycle
N > 1	1/N	Higher delay = less frequent synchronization

This ensures delay=0 connections force co-location, while high-delay connections can tolerate cross-thread placement.
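The factor table and the sync_cost formula translate directly into two small functions. delayFactor and syncCost are illustrative names. The 8.0 ns value used in the example below is the default of initial_partition_sync_cost_ns, used here only as a sample input.

```cpp
#include <cstddef>
#include <cstdint>

// Factor from the table: 100.0 for delay 0, otherwise 1/N (so delay 1 gives 1.0).
double delayFactor(uint32_t min_delay) {
    if (min_delay == 0) return 100.0;
    return 1.0 / static_cast<double>(min_delay);
}

// sync_cost(edge) = platform_sync_ns * num_connections * delay_factor(min_delay)
double syncCost(double platform_sync_ns, std::size_t num_connections, uint32_t min_delay) {
    return platform_sync_ns * static_cast<double>(num_connections) * delayFactor(min_delay);
}
```

With platform_sync_ns = 8.0, two connections at delay 4 cost 8.0 * 2 * 0.25 = 4.0, while a single delay=0 connection costs 800.0, which is what pushes the solver toward co-location.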

Parallel Execution Decision

With weighted partitioning, parallel execution is beneficial when:

max_thread_cost * 1.10 < total_sequential_cost

The 10% overhead factor accounts for synchronization costs. This heuristic correctly accepts parallelization at moderate imbalance (e.g., 1.75x speedup) while rejecting extreme cases where one thread dominates.
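As a worked example, the inequality can be evaluated over per-thread costs. The sketch assumes that total_sequential_cost is the sum of the per-thread costs. parallelIsBeneficial is a hypothetical helper name.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Applies max_thread_cost * 1.10 < total_sequential_cost.
// Assumes the sequential cost equals the sum of all per-thread costs.
bool parallelIsBeneficial(const std::vector<double>& thread_cost) {
    if (thread_cost.empty()) return false;
    const double total = std::accumulate(thread_cost.begin(), thread_cost.end(), 0.0);
    const double max_cost = *std::max_element(thread_cost.begin(), thread_cost.end());
    return max_cost * 1.10 < total;
}
```

Thread costs of {4, 3} give an ideal speedup of 7 / 4 = 1.75x and pass the test (4.4 < 7). Costs of {95, 5} fail it (104.5 is not below 100). Equivalently, the ideal speedup total / max must exceed 1.10.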

Placement Fallbacks

When weighted partitioning is disabled or fewer than 4 units exist:

Condition	Path
`tight_connections` present	Topology-only cluster assignment
Otherwise	Greedy thread assignment with unit-count heuristic

The unit-count heuristic requires that no thread holds more than 50% of the units and that every active thread has at least 3 units.
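Both conditions can be checked in one pass over the per-thread unit counts. In this sketch a thread with zero units is treated as inactive, which is an assumption. unitCountHeuristic is an illustrative name.

```cpp
#include <cstddef>
#include <vector>

// True when no thread holds more than 50% of the units and every active
// thread has at least 3 units. Threads with zero units count as inactive.
bool unitCountHeuristic(const std::vector<std::size_t>& units_per_thread) {
    std::size_t total = 0;
    for (std::size_t n : units_per_thread) total += n;
    if (total == 0) return false;
    for (std::size_t n : units_per_thread) {
        if (n == 0) continue;             // inactive thread
        if (n * 2 > total) return false;  // more than 50% of all units
        if (n < 3) return false;          // too few units to be worth a thread
    }
    return true;
}
```

A 3/3 split passes, a 4/2 split fails on both conditions, and a single thread holding all units always fails the 50% rule.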

Queue Optimization

Based on thread assignment, connections are optimized:

Connection Type	Queue Implementation	Overhead
Same thread (intra-cluster)	SingleThreadMessageQueue	Zero (no synchronization)
Cross-thread SPSC	LockFreeMessageQueue	Atomics only
Cross-thread MPSC	MultiProducerQueueAdapter	Per-thread queues + merge

Impact: Eliminates ~18% mutex overhead from message queues when units are properly clustered.
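The selection in the table above can be sketched as a function of thread placement. The decision rule here is an assumption inferred from the table, in particular that a port with several producers uses the multi-producer adapter as soon as any producer is on another thread. QueueKind and selectQueue are hypothetical names that stand in for the three queue classes.

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for SingleThreadMessageQueue, LockFreeMessageQueue, and
// MultiProducerQueueAdapter in this sketch.
enum class QueueKind { SingleThread, LockFreeSpsc, MultiProducer };

// Picks a queue kind from the consumer's thread and its producers' threads.
QueueKind selectQueue(std::size_t consumer_thread,
                      const std::vector<std::size_t>& producer_threads) {
    bool all_local = true;
    for (std::size_t t : producer_threads) {
        if (t != consumer_thread) all_local = false;
    }
    if (all_local) return QueueKind::SingleThread;  // no synchronization needed
    // Assumed rule: one remote producer is SPSC, anything else is MPSC.
    return producer_threads.size() == 1 ? QueueKind::LockFreeSpsc
                                        : QueueKind::MultiProducer;
}
```

Under this rule, co-locating a producer with its consumer is the only way to get the zero-overhead queue, which is why clustering and queue selection are run as one pipeline.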

When parallelism is not beneficial, simulation falls back to optimized sequential execution with single-thread queues and non-atomic cycle counters.

See port-system.md for detailed queue implementation and performance characteristics.

Performance

Threads	Throughput
1	~9.35 Mcycles/sec
2	~10.00 Mcycles/sec (7% faster with lock-free)
4+	Workload dependent

Multi-thread performance depends on the dependency structure. Tightly coupled units (delay=0) must execute on the same thread.
