NixOS CI/CD Automated Deployment with deploy-rs

Overview

Implement a push-based automated deployment pipeline using deploy-rs for the homelab NixOS fleet. The pipeline builds on every push/PR, deploys on merge to main, and supports test-<hostname> branches for non-persistent trial deployments.


Architecture

┌─────────────┐     push     ┌──────────────────┐
│  Developer   │────────────▶│  Forgejo (Git)    │
└─────────────┘              └────────┬─────────┘
                                      │
                     ┌────────────────┼────────────────┐
                     ▼                ▼                 ▼
              ┌─────────────┐  ┌───────────┐   ┌──────────────┐
              │ CI: Build   │  │ CI: Check │   │ CI: Deploy   │
              │ all hosts   │  │ flake +   │   │ (main only)  │
              │ (every push)│  │ deployChk │   │ via deploy-rs│
              └──────┬──────┘  └───────────┘   └──────┬───────┘
                     │                                 │ SSH
                     ▼                                 ▼
              ┌─────────────┐              ┌──────────────────┐
              │  Harmonia   │◀─── push ────│  Target Hosts    │
              │ Binary Cache│─── pull ────▶│ (NixOS machines) │
              └─────────────┘              └──────────────────┘

Key Design Decisions

Test branch activation (test-<hostname>)

deploy-rs's activate.nixos calls switch-to-configuration switch by default, which updates the bootloader. For test branches, we create a separate profile using activate.custom that calls switch-to-configuration test instead — this activates the configuration immediately but does not update the bootloader. On reboot, the host falls back to the last switch-deployed generation.

Magic rollback still works on test deployments: deploy-rs confirms the host is reachable after activation and auto-reverts if it can't connect.

# Test activation: active now, but a reboot reverts to the previous boot entry.
# `nixos` is the host's entry in self.nixosConfigurations.
deploy-rs.lib.x86_64-linux.activate.custom nixos.config.system.build.toplevel ''
  cd /tmp
  $PROFILE/bin/switch-to-configuration test
''
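
The choice between the two activation modes can be wrapped in one small function, so the main and test-<hostname> paths share everything except the final command. The following is a minimal sketch, not the final implementation: the `persistent` argument is a hypothetical name, and the hard-coded x86_64-linux system is an assumption about the fleet.

```nix
# Hypothetical helper. `deploy-rs` is the flake input and `nixos` is one
# entry of self.nixosConfigurations.
{ deploy-rs, nixos, persistent ? true }:
let
  # Assumption: all hosts are x86_64-linux.
  activate = deploy-rs.lib.x86_64-linux.activate;
in
if persistent then
  # main branch: switch-to-configuration switch (updates the bootloader)
  activate.nixos nixos
else
  # test-<hostname> branch: activate now, keep the old boot entry
  activate.custom nixos.config.system.build.toplevel ''
    cd /tmp
    $PROFILE/bin/switch-to-configuration test
  ''
```

The result of either branch is a value suitable for a profile's `path` attribute.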

Zero duplication in flake.nix

Use builtins.mapAttrs over self.nixosConfigurations to generate deploy.nodes automatically. Host metadata (IP, whether to enable deploy) is stored once per host config.

Renovate bot compatibility

The pipeline is fully compatible with Renovate:

  • Minor/patch updates: Renovate opens a PR → CI builds all hosts → Renovate auto-merges → CI deploys (uses switch, updates bootloader)
  • Major updates: Renovate opens a PR → CI builds all hosts → PR waits for manual review → merge → CI deploys with switch (persists across reboot)
  • The deploy step differentiates using the branch name, not the commit source, so Renovate PRs behave identically to human PRs

System version upgrades (kernel, etc.)

When a deployment requires a reboot (e.g., kernel upgrade):

  1. CI deploys with --boot flag → calls switch-to-configuration boot (sets new generation as boot default without activating)
  2. A separate reboot step (manual or scheduled) activates the change

Important

deploy-rs does not auto-detect whether a reboot is needed. The workflow can either check whether the kernel or initrd changed and conditionally use --boot, or always use switch and document that the operator should reboot when the kernel, initrd, or kernel modules of the activated system (/run/current-system) differ from those of the booted system (/run/booted-system).
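
One way to make the "does this host need a reboot?" question answerable on the host is a small script that compares the booted and the activated system. The sketch below assumes the standard NixOS symlinks /run/booted-system and /run/current-system; the script name `nixos-needs-reboot` is hypothetical.

```nix
{ pkgs, ... }:
{
  environment.systemPackages = [
    # Hypothetical helper script: exits 0 when a reboot is needed, 1 otherwise.
    (pkgs.writeShellScriptBin "nixos-needs-reboot" ''
      for part in kernel initrd kernel-modules; do
        booted=$(readlink -f "/run/booted-system/$part")
        current=$(readlink -f "/run/current-system/$part")
        if [ "$booted" != "$current" ]; then
          echo "reboot required: $part changed"
          exit 0
        fi
      done
      echo "no reboot required"
      exit 1
    '')
  ];
}
```

An operator (or a later monitoring step) could run `nixos-needs-reboot` after a switch deployment. Deciding on --boot before the deployment would instead require comparing the new closure against the host's booted system, which this sketch does not cover.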


Security & Trust Boundaries

Trust model diagram

┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 1                      │
│               Developer Workstations                 │
│  • Holds sops-nix age keys (decrypt secrets)        │
│  • Holds GPG/SSH keys for signed commits            │
│  • Can manually deploy via deploy-rs                │
│  • Can push to any branch                           │
└──────────────────────┬──────────────────────────────┘
                       │ git push (signed commits)
                       ▼
┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 2                      │
│              Forgejo + CI Runner                     │
│  • Holds CI SSH deploy key (DEPLOY_SSH_KEY secret)  │
│  • Does NOT hold sops-nix age keys                  │
│  • Branch protection: main requires PR + checks     │
│  • Can only deploy via the deploy user              │
│  • Builds are sandboxed in Nix                      │
└──────────────────────┬──────────────────────────────┘
                       │ SSH as "deploy" user
                       ▼
┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 3                      │
│               Target NixOS Hosts                     │
│  • deploy user: system user, no shell login         │
│  • sudo: ONLY nix-env --set and                     │
│          switch-to-configuration (NOPASSWD)          │
│  • No write access to /etc, home dirs, etc.         │
│  • sops secrets decrypted at activation via host    │
│    age keys (not CI keys)                           │
└─────────────────────────────────────────────────────┘

What each actor can do

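| Actor        | Can build | Can deploy            | Can decrypt secrets     | Can access hosts      |
|--------------|-----------|-----------------------|-------------------------|-----------------------|
| Developer    | Yes       | Yes (manual)          | Yes (personal age keys) | Yes (personal SSH)    |
| CI runner    | Yes       | Yes (deploy user)     | No                      | Limited (deploy user) |
| deploy user  | No        | Yes (sudo restricted) | No                      | N/A (runs on host)    |
| Host age key | No        | No                    | Yes (own secrets only)  | N/A                   |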

Hardening measures

  1. Branch protection on main: require PR, passing checks, optional signed commits
  2. deploy user (users/deploy/default.nix): restricted sudoers, no home dir, system user
  3. CI secret isolation: SSH key only, no age keys in CI — secrets are decrypted on-host at activation time by sops-nix using host-specific age keys
  4. Magic rollback: if a deploy renders the host unreachable, deploy-rs auto-reverts within the confirm timeout
  5. nix flake check + deployChecks: validate the flake structure and deploy configuration before any deployment

Note

The deploy user SSH key is stored as a Forgejo Actions secret. Even if the CI runner is compromised, the attacker can only push Nix store paths and trigger switch-to-configuration — they cannot decrypt secrets, access user data, or escalate beyond what the restricted sudoers rules allow.


Proposed Changes

1. Flake configuration

[MODIFY] flake.nix

  • Add deploy-rs to flake inputs
  • Auto-generate deploy.nodes from self.nixosConfigurations using mapAttrs (zero duplication)
  • Add checks output via deploy-rs.lib.deployChecks
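  • Read each host's SSH target from its own configuration (homelab.deploy.targetHost) instead of repeating it in flake.nix

# Sketch of the deploy output (no per-host duplication)
# `lib` is assumed to be nixpkgs.lib, bound elsewhere in flake.nix
deploy.nodes = builtins.mapAttrs (_name: nixos: {
  hostname = nixos.config.homelab.deploy.targetHost; # defined per host
  sshUser = "deploy";   # unprivileged SSH login
  user = "root";        # activation runs as root via sudo
  magicRollback = true; # revert if the host cannot confirm the deploy
  autoRollback = true;  # revert if activation itself fails
  profiles.system = {
    # Assumes every host is x86_64-linux
    path = deploy-rs.lib.x86_64-linux.activate.nixos nixos;
  };
}) (lib.filterAttrs
  # Only hosts that enable the deploy user become deploy targets
  (_name: nixos: nixos.config.homelab.users.deploy.enable or false)
  self.nixosConfigurations);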
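
The checks output mentioned in the list above can be generated for every system that deploy-rs supports. This follows the pattern from the deploy-rs README; the only assumption is that the flake input is named `deploy-rs` and that `self.deploy` is defined as sketched above.

```nix
# Fragment of the flake outputs, written as a function of the inputs.
{ self, deploy-rs, ... }:
{
  # One set of deploy checks per system exposed by deploy-rs.lib.
  checks = builtins.mapAttrs
    (_system: deployLib: deployLib.deployChecks self.deploy)
    deploy-rs.lib;
}
```

With this in place, `nix flake check` evaluates the deploy schema and activation checks alongside the flake's other checks.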

2. Deploy user module

[MODIFY] default.nix

  • Add option homelab.deploy.targetHost (string, the IP/hostname for deploy-rs to SSH into)
  • Support multiple SSH authorized keys (CI key + personal workstation keys)
  • Add the deploy user to nix.settings.trusted-users (needed so the host accepts the closures that deploy-rs copies to it over SSH, which are not necessarily signed by the cache key)
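
The additions listed above might look roughly like the following module fragment. It is a sketch under assumptions: `homelab.users.deploy.enable` and the `deploy` user itself are taken to exist already in the current module, and the option name `authorizedKeys` is hypothetical.

```nix
{ config, lib, ... }:
let
  cfg = config.homelab.users.deploy;
in
{
  options.homelab = {
    # Hypothetical option name; holds the CI key plus workstation keys.
    users.deploy.authorizedKeys = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ ];
      description = "SSH public keys allowed to log in as the deploy user.";
    };

    # Read by flake.nix when generating deploy.nodes.
    deploy.targetHost = lib.mkOption {
      type = lib.types.str;
      example = "192.168.0.10";
      description = "IP address or hostname that deploy-rs connects to over SSH.";
    };
  };

  # cfg.enable is assumed to be declared by the existing module.
  config = lib.mkIf cfg.enable {
    users.users.deploy.openssh.authorizedKeys.keys = cfg.authorizedKeys;
    nix.settings.trusted-users = [ "deploy" ];
  };
}
```

Because `targetHost` has no default, a host that enables the deploy user without setting it fails at evaluation time rather than at deploy time.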

3. Enable deploy user on target hosts

[MODIFY] Host default.nix files (per host)

  • Enable homelab.users.deploy.enable = true on all deployable hosts
  • Set homelab.deploy.targetHost to each host's IP (e.g., "192.168.0.10" for Ingress)
  • Currently only Niko has deploy enabled; extend to all non-Template hosts

4. Binary cache (Harmonia)

[NEW] modules/services/harmonia/default.nix

  • Create a homelab.services.harmonia module wrapping services.harmonia
  • Generate a signing key pair for the cache
  • Configure an Nginx reverse proxy with HTTPS (via ACME or internal cert)
  • Configure all hosts to use the cache as a substituter via nix.settings.substituters

Tip

Harmonia is chosen over attic (simpler, no database needed) and nix-serve (better performance, streaming, zstd compression). It serves your /nix/store directly, so the CI runner can nix copy built closures to the cache host after a successful build.
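
A minimal sketch of such a wrapper module follows. The options `domain` and `signKeyPath` under homelab.services.harmonia are hypothetical names, the ACME setup (security.acme) is assumed to exist elsewhere, and the key pair is assumed to have been created beforehand with `nix-store --generate-binary-cache-key`.

```nix
{ config, lib, ... }:
let
  cfg = config.homelab.services.harmonia;
in
{
  options.homelab.services.harmonia = {
    enable = lib.mkEnableOption "Harmonia binary cache";
    domain = lib.mkOption {
      type = lib.types.str;
      example = "cache.example.internal";
      description = "Host name under which the cache is served.";
    };
    signKeyPath = lib.mkOption {
      # A string, so the private key is never copied into the Nix store.
      type = lib.types.str;
      description = "Path to the private signing key on the cache host.";
    };
  };

  config = lib.mkIf cfg.enable {
    services.harmonia = {
      enable = true;
      # Recent NixOS releases take a list (signKeyPaths);
      # older releases use the singular signKeyPath option instead.
      signKeyPaths = [ cfg.signKeyPath ];
      settings.bind = "127.0.0.1:5000"; # only reachable through Nginx
    };

    services.nginx = {
      enable = true;
      virtualHosts.${cfg.domain} = {
        forceSSL = true;
        enableACME = true; # assumes security.acme is configured elsewhere
        locations."/".proxyPass = "http://127.0.0.1:5000";
      };
    };
  };
}
```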

[NEW] modules/common/nix-cache.nix

  • Configure all hosts to use the binary cache as a substituter
  • Add the cache's public signing key to trusted-public-keys
  • Usable by personal devices too (add the cache URL + public key to their nix.conf)
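
The client side is small enough to show in full. In this sketch the cache URL and the public key are placeholders, not real values; the public key is the one produced when the signing key pair is generated.

```nix
{ ... }:
{
  nix.settings = {
    # Placeholder URL: replace with the real cache address.
    substituters = [ "https://cache.example.internal" ];
    # Placeholder key: "<key name>:<base64 public key>".
    trusted-public-keys = [ "cache.example.internal-1:REPLACE_WITH_PUBLIC_KEY" ];
  };
}
```

NixOS merges these list options with the defaults, so cache.nixos.org remains available as a substituter.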

5. CI Workflows

[MODIFY] build.yml

  • Use the dynamic determine-hosts job output for the build matrix (already partially implemented)
  • Add nix flake check step for deployChecks validation
  • Build all hosts on every push/PR
  • Optionally push built closures to the Harmonia cache after successful build

[NEW] deploy.yml

  • Trigger: push to main or test-* branches (after build passes)
  • Load DEPLOY_SSH_KEY from Forgejo Actions secrets
  • For main: deploy . (all hosts, switch-to-configuration switch)
  • For test-<hostname>: deploy only the matching host with a test profile (switch-to-configuration test) — no bootloader update
  • Magic rollback enabled by default
  • Optional --boot mode for kernel upgrades (triggered by label or manual dispatch)

[NEW] check.yml

  • Runs nix flake check (includes deployChecks)
  • Runs nix eval to validate all configurations parse correctly
  • Can be required as a status check for Renovate auto-merge rules

6. Monitoring

[NEW] modules/services/monitoring/default.nix

  • Enable node exporter on all hosts for Prometheus scraping
  • Export NixOS generation info: current generation, boot generation, system version
  • Optionally integrate with the existing infrastructure (e.g., Prometheus on Production)

Script/service to export NixOS deploy state:

# Metrics like:
# nixos_current_generation{host="Niko"} 42
# nixos_boot_generation{host="Niko"} 42    # same = no pending reboot
# nixos_config_age_seconds{host="Niko"} 3600

When nixos_current_generation != nixos_boot_generation, the host has a test deployment active (or needs a reboot).
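
One possible way to produce these metrics is the node exporter's textfile collector, fed by a periodic systemd service. The sketch below makes several assumptions: the directory path and service name are hypothetical, and "boot generation" is interpreted as the system profile generation whose store path matches /run/booted-system. The nixos_config_age_seconds metric is left out.

```nix
{ config, pkgs, ... }:
let
  textfileDir = "/var/lib/node-exporter-textfile"; # hypothetical path
  hostName = config.networking.hostName;
in
{
  services.prometheus.exporters.node = {
    enable = true;
    enabledCollectors = [ "textfile" ];
    extraFlags = [ "--collector.textfile.directory=${textfileDir}" ];
  };

  systemd.tmpfiles.rules = [ "d ${textfileDir} 0755 root root -" ];

  systemd.services.nixos-generation-metrics = {
    description = "Export NixOS generation numbers for node exporter";
    startAt = "*:0/5"; # every five minutes
    path = [ pkgs.coreutils pkgs.gnused ];
    script = ''
      # Extract N from a link name of the form system-N-link.
      gen_of() { basename "$1" | sed -E 's/^system-([0-9]+)-link$/\1/'; }

      current=$(gen_of "$(readlink /nix/var/nix/profiles/system)")
      booted_path=$(readlink -f /run/booted-system)
      boot=""
      for link in /nix/var/nix/profiles/system-*-link; do
        if [ "$(readlink -f "$link")" = "$booted_path" ]; then
          boot=$(gen_of "$link")
        fi
      done

      # Write to a temporary file first so the collector never sees a partial file.
      tmp=$(mktemp ${textfileDir}/nixos.prom.XXXXXX)
      {
        printf 'nixos_current_generation{host="%s"} %s\n' "${hostName}" "$current"
        if [ -n "$boot" ]; then
          printf 'nixos_boot_generation{host="%s"} %s\n' "${hostName}" "$boot"
        fi
      } > "$tmp"
      chmod 0644 "$tmp"
      mv "$tmp" ${textfileDir}/nixos.prom
    '';
  };
}
```

If the booted generation has been garbage-collected, the boot metric is simply omitted, which an alert rule can treat as "unknown".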


7. Local VM Testing

[NEW] test/vm-test.nix

NixOS has built-in VM testing via nixos-rebuild build-vm and the NixOS test framework. The approach:

  1. Build a VM from any host config:

    nix build .#nixosConfigurations.Testing.config.system.build.vm
    ./result/bin/run-Testing-vm
    
  2. NixOS integration test (test/vm-test.nix):

    • Spins up a minimal VM cluster (e.g., two nodes)
    • Runs deploy-rs against one VM from the other
    • Validates activation, rollback, and connectivity
    • Uses the NixOS test framework (pkgs.testers.runNixOSTest, Python test driver)

  3. Full CI pipeline test locally with act:

    # Run the workflow locally using act (Forgejo Actions uses GitHub Actions-compatible workflow syntax)
    act push --container-architecture linux/amd64

Note

The existing build.yml already uses catthehacker/ubuntu:act-24.04 containers, suggesting act is already part of the workflow. VM tests don't require actual network access to target hosts.
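
As a starting point for test/vm-test.nix, the two-node skeleton below covers only the connectivity part of the integration test: it boots a deployer and a target VM and checks that the target's SSH port is reachable. The node names and test name are hypothetical, and the actual deploy-rs invocation is deliberately left out, since it needs pre-built closures and keys inside the sandbox.

```nix
{ pkgs, ... }:
pkgs.testers.runNixOSTest {
  name = "deploy-connectivity"; # hypothetical test name

  nodes = {
    # Machine that would run deploy-rs.
    deployer = { ... }: { };

    # Machine that would receive the deployment.
    target = { ... }: {
      services.openssh.enable = true;
    };
  };

  testScript = ''
    start_all()
    target.wait_for_unit("sshd.service")
    target.wait_for_open_port(22)
    deployer.wait_for_unit("multi-user.target")
    # Node names resolve to the VMs on the test network.
    deployer.succeed("ping -c 1 target")
  '';
}
```

Activation and rollback checks would extend the test script once the deploy step itself runs inside the VM.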


Verification Plan

Automated Tests

  • nix flake check — validates flake + deployChecks schema
  • nix build .#nixosConfigurations.<host>.config.system.build.toplevel for each host
  • NixOS VM integration test (test/vm-test.nix)

Manual Verification (guinea pig: Development or Testing)

  1. Push to test-Development → verify deploy-rs runs switch-to-configuration test on 192.168.0.91
  2. Reboot Development → verify it falls back to previous generation (test branch behavior)
  3. Merge to main → verify deploy-rs deploys to all enabled hosts with switch
  4. Intentionally deploy a config that breaks SSH connectivity → verify magic rollback reverts it within the confirm timeout
  5. Push to Harmonia cache → verify another host can pull the closure
  6. Check monitoring metrics show correct generation numbers