added cross compatiblity between shutdown flag and state entries

2026-01-15 00:22:55 +01:00
parent f4b20f824d
commit 75ab1969c7
14 changed files with 850 additions and 543 deletions
--- a/SHUTDOWN_STATE_COORDINATION.md
+++ b/SHUTDOWN_STATE_COORDINATION.md
@@ -0,0 +1,373 @@
+# Shutdown Flag & State Management Orchestration
+
+## Problem Statement
+
+Previously, the shutdown flag and StateManager worked independently:
+- **Shutdown Flag**: `Arc<AtomicBool>` signals code to stop execution
+- **StateManager**: Tracks completion of work with hash validation and dependencies
+
+This caused a critical issue: **when shutdown occurred mid-process, no state was recorded**, so on restart the entire step would be retried from scratch, losing all progress.
+
+## Solution: Coordinated Lifecycle Management
+
+### Overview
+
+The shutdown flag and StateManager now work together in a coordinated lifecycle:
+
+```
+Work In Progress
+        ↓
+    Shutdown Signal (Ctrl+C)
+        ↓
+    Record Incomplete State
+        ↓
+    Return & Cleanup
+        ↓
+    Next Run: Retry From Checkpoint
+```
+
+### Core Concepts
+
+#### 1. **StateEntry Lifecycle**
+
+Each checkpoint has two completion states:
+
+```rust
+// Happy Path: Work Completed Successfully
+StateEntry {
+    completed: true,                    // ✓ Finished
+    completed_at: Some(timestamp),      // When it finished
+    validation_status: Valid,           // Hash is current
+}
+
+// Shutdown Path: Work Interrupted
+StateEntry {
+    completed: false,                   // ✗ Incomplete
+    completed_at: None,                 // Never finished
+    validation_status: Invalid {        // Won't be skipped
+        reason: "Incomplete due to shutdown"
+    }
+}
+```
+
+#### 2. **State Management Functions**
+
+Two key functions orchestrate the shutdown/completion dance:
+
+```rust
+// Normal Completion (happy path)
+manager.update_entry(
+    "step_name".to_string(),
+    content_reference,
+    DataStage::Data,
+    None,
+).await?;
+
+// Shutdown Completion (incomplete work)
+manager.mark_incomplete(
+    "step_name".to_string(),
+    Some(content_reference),
+    Some(DataStage::Data),
+    "Incomplete: processed 50 of 1000 items".to_string(),
+).await?;
+```
+
+### Implementation Pattern
+
+Every long-running function should follow this pattern:
+
+```rust
+pub async fn process_large_dataset(
+    paths: &DataPaths,
+    shutdown_flag: &Arc<AtomicBool>,
+) -> Result<usize> {
+    // 1. Initialize state manager and content reference
+    let manager = StateManager::new(&paths.integrity_dir()).await?;
+    let step_name = "process_large_dataset";
+    let content_ref = directory_reference(&output_dir, None, None);
+    
+    let mut processed_count = 0;
+    
+    // 2. Main processing loop
+    loop {
+        // CRITICAL: Check shutdown at key points
+        if shutdown_flag.load(Ordering::SeqCst) {
+            logger::log_warn("Shutdown detected - marking state as incomplete").await;
+            
+            // Record incomplete state for retry
+            manager.mark_incomplete(
+                step_name.to_string(),
+                Some(content_ref.clone()),
+                Some(DataStage::Data),
+                format!("Incomplete: processed {} items", processed_count),
+            ).await?;
+            
+            return Ok(processed_count);
+        }
+        
+        // 3. Do work...
+        processed_count += 1;
+    }
+    
+    // 4. If we reach here, work is complete
+    // Shutdown check BEFORE marking complete
+    if shutdown_flag.load(Ordering::SeqCst) {
+        manager.mark_incomplete(
+            step_name.to_string(),
+            Some(content_ref),
+            Some(DataStage::Data),
+            format!("Incomplete during final stage: processed {} items", processed_count),
+        ).await?;
+    } else {
+        // Only mark complete if shutdown was NOT signaled
+        manager.update_entry(
+            step_name.to_string(),
+            content_ref,
+            DataStage::Data,
+            None,
+        ).await?;
+    }
+    
+    Ok(processed_count)
+}
+```
+
+### Why Two Functions Are Different
+
+| Aspect | `update_entry()` | `mark_incomplete()` |
+|--------|------------------|-------------------|
+| **Use Case** | Normal completion | Shutdown/abort |
+| `completed` | `true` | `false` |
+| `completed_at` | `Some(now)` | `None` |
+| `validation_status` | `Valid` | `Invalid { reason }` |
+| Next Run | **Skipped** (already done) | **Retried** (incomplete) |
+| Hash Stored | Always | Optional (may fail to compute) |
+| Semantics | "This work is finished" | "This work wasn't finished" |
+
+### Shutdown Flag Setup
+
+The shutdown flag is initialized in `main.rs`:
+
+```rust
+let shutdown_flag = Arc::new(AtomicBool::new(false));
+
+// Ctrl+C handler
+fn setup_shutdown_handler(
+    shutdown_flag: Arc<AtomicBool>,
+    pool: Arc<ChromeDriverPool>,
+    proxy_pool: Option<Arc<DockerVpnProxyPool>>,
+) {
+    tokio::spawn(async move {
+        tokio::signal::ctrl_c().await.ok();
+        logger::log_info("Ctrl+C received – shutting down gracefully...").await;
+        
+        // Set flag to signal all tasks to stop
+        shutdown_flag.store(true, Ordering::SeqCst);
+        
+        // Wait for tasks to clean up
+        tokio::time::sleep(tokio::time::Duration::from_secs(2)).await;
+        
+        // Final cleanup
+        perform_full_cleanup(&pool, proxy_pool.as_deref()).await;
+        std::process::exit(0);
+    });
+}
+```
+
+### Multi-Level Shutdown Checks
+
+For efficiency, shutdown is checked at different levels:
+
+```rust
+// 1. Macro for quick checks (returns early)
+check_shutdown!(shutdown_flag);
+
+// 2. Loop check (inside tight processing loops)
+if shutdown_flag.load(Ordering::SeqCst) {
+    break;
+}
+
+// 3. Final completion check (before marking complete)
+if shutdown_flag.load(Ordering::SeqCst) {
+    manager.mark_incomplete(...).await?;
+} else {
+    manager.update_entry(...).await?;
+}
+```
+
+### Practical Example: Update Companies
+
+The `update_companies` function shows the full pattern:
+
+```rust
+pub async fn update_companies(
+    paths: &DataPaths,
+    config: &Config,
+    pool: &Arc<ChromeDriverPool>,
+    shutdown_flag: &Arc<AtomicBool>,
+) -> anyhow::Result<usize> {
+    let manager = StateManager::new(&paths.integrity_dir()).await?;
+    let step_name = "update_companies";
+    let content_reference = directory_reference(...);
+    
+    // Process companies...
+    loop {
+        if shutdown_flag.load(Ordering::SeqCst) {
+            logger::log_warn("Shutdown detected").await;
+            break;
+        }
+        // Process items...
+    }
+    
+    // Final checkpoint
+    let (final_count, _, _) = writer_task.await.unwrap_or((0, 0, 0));
+    
+    // CRITICAL: Check shutdown before marking complete
+    if shutdown_flag.load(Ordering::SeqCst) {
+        manager.mark_incomplete(
+            step_name.to_string(),
+            Some(content_reference),
+            Some(DataStage::Data),
+            format!("Incomplete: processed {} items", final_count),
+        ).await?;
+    } else {
+        manager.update_entry(
+            step_name.to_string(),
+            content_reference,
+            DataStage::Data,
+            None,
+        ).await?;
+    }
+    
+    Ok(final_count)
+}
+```
+
+### State Tracking in `state.jsonl`
+
+With this pattern, the state file captures work progression:
+
+**Before Shutdown:**
+```jsonl
+{"step_name":"update_companies","completed":false,"validation_status":{"Invalid":"Processing 523 items..."},"dependencies":["lei_figi_mapping_complete"]}
+```
+
+**After Completion:**
+```jsonl
+{"step_name":"update_companies","completed":true,"completed_at":"2026-01-14T21:30:45Z","validation_status":"Valid","dependencies":["lei_figi_mapping_complete"]}
+```
+
+**After Resume:**
+- System detects `completed: false` and `validation_status: Invalid`
+- Retries `update_companies` from checkpoint
+- Uses `.log` files to skip already-processed items
+- On success, updates to `completed: true`
+
+## Benefits
+
+### 1. **Crash Safety**
+- Progress is recorded at shutdown
+- No lost work on restart
+- Checkpoints prevent reprocessing
+
+### 2. **Graceful Degradation**
+- Long-running functions can be interrupted
+- State is always consistent
+- Dependencies are tracked
+
+### 3. **Debugging**
+- `state.jsonl` shows exactly which steps were incomplete
+- Reasons are recorded for incomplete states
+- Progress counts help diagnose where it was interrupted
+
+### 4. **Consistency**
+- `update_entry()` only used for complete work
+- `mark_incomplete()` only used for interrupted work
+- No ambiguous states
+
+## Common Mistakes to Avoid
+
+### ❌ Don't: Call `update_entry()` without shutdown check
+```rust
+// BAD: Might mark shutdown state as complete!
+manager.update_entry(...).await?;
+```
+
+### ✅ Do: Check shutdown before `update_entry()`
+```rust
+// GOOD: Only marks complete if not shutting down
+if !shutdown_flag.load(Ordering::SeqCst) {
+    manager.update_entry(...).await?;
+}
+```
+
+### ❌ Don't: Forget `mark_incomplete()` on shutdown
+```rust
+if shutdown_flag.load(Ordering::SeqCst) {
+    return Ok(()); // Lost progress!
+}
+```
+
+### ✅ Do: Record incomplete state
+```rust
+if shutdown_flag.load(Ordering::SeqCst) {
+    manager.mark_incomplete(...).await?;
+    return Ok(());
+}
+```
+
+### ❌ Don't: Store partial data without recording state
+```rust
+// Write output, but forget to track in state
+write_output(...).await?;
+// If shutdown here, next run won't know it's incomplete
+```
+
+### ✅ Do: Update state atomically
+```rust
+// Update output and state together
+write_output(...).await?;
+manager.update_entry(...).await?;  // Or mark_incomplete if shutdown
+```
+
+## Testing the Orchestration
+
+### Test 1: Normal Completion
+```bash
+cargo run  # Let it finish
+grep completed state.jsonl  # Should show "true"
+```
+
+### Test 2: Shutdown & Restart
+```bash
+# Terminal 1:
+cargo run  # Running...
+# Wait a bit
+
+# Terminal 2:
+pkill -f "web_scraper"  # Send shutdown
+
+# Check state:
+grep update_companies state.jsonl  # Should show "completed: false"
+
+# Restart:
+cargo run  # Continues from checkpoint
+```
+
+### Test 3: Verify No Reprocessing
+```bash
+# Modify a file to add 1000 test items
+# Run first time - processes 1000, shutdown at 500
+# Check state.jsonl - shows "Incomplete: 500 items"
+# Run second time - should skip first 500, process remaining 500
+```
+
+## Summary
+
+The coordinated shutdown & state system ensures:
+
+1. **Work is never lost** - Progress recorded at shutdown
+2. **No reprocessing** - Checkpoints skip completed items
+3. **Transparent state** - `state.jsonl` shows exactly what's done
+4. **Easy debugging** - Reason for incompleteness is recorded
+5. **Graceful scaling** - Works with concurrent tasks and hard resets