NixOS CI/CD Automated Deployment with deploy-rs
Overview
Implement a push-based automated deployment pipeline using deploy-rs for the homelab NixOS fleet. The pipeline builds on every push/PR, deploys on merge to main, and supports test-<hostname> branches for non-persistent trial deployments.
Architecture
┌─────────────┐    push     ┌──────────────────┐
│  Developer  │────────────▶│  Forgejo (Git)   │
└─────────────┘             └────────┬─────────┘
                                     │
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
             ┌─────────────┐   ┌───────────┐   ┌──────────────┐
             │ CI: Build   │   │ CI: Check │   │ CI: Deploy   │
             │ all hosts   │   │ flake +   │   │ (main only)  │
             │ (every push)│   │ deployChk │   │ via deploy-rs│
             └──────┬──────┘   └───────────┘   └──────┬───────┘
                    │                                 │ SSH
                    ▼                                 ▼
             ┌─────────────┐              ┌──────────────────┐
             │  Harmonia   │◀─── push ────│   Target Hosts   │
             │ Binary Cache│─── pull ────▶│ (NixOS machines) │
             └─────────────┘              └──────────────────┘
Key Design Decisions
Test branch activation (test-<hostname>)
deploy-rs's activate.nixos calls switch-to-configuration switch by default, which updates the bootloader. For test branches, we create a separate profile using activate.custom that calls switch-to-configuration test instead — this activates the configuration immediately but does not update the bootloader. On reboot, the host falls back to the last switch-deployed generation.
Magic rollback still works on test deployments: deploy-rs confirms the host is reachable after activation and auto-reverts if it can't connect.
# Test activation: active now, but a reboot reverts to the previous boot entry
activate.custom base.config.system.build.toplevel ''
  # Run from /tmp, as deploy-rs's own activate.nixos script does
  cd /tmp
  $PROFILE/bin/switch-to-configuration test
''
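One way the test activation could sit next to the persistent one is as a second profile on the same node. The sketch below is an assumption rather than the project's actual layout: the file name lib/mk-node.nix, the profile name test, and the system argument are hypothetical, while activate.nixos and activate.custom are the deploy-rs library functions already named above.

```nix
# lib/mk-node.nix (hypothetical helper file)
# Returns the profiles for one deploy-rs node: a persistent "system" profile
# and a non-persistent "test" profile.
{ deploy-rs, system ? "x86_64-linux" }:
nixos:
let
  activate = deploy-rs.lib.${system}.activate;
in
{
  profiles = {
    # Persistent: runs switch-to-configuration switch, updates the bootloader.
    system = {
      user = "root";
      path = activate.nixos nixos;
    };

    # Non-persistent: activates now, leaves the bootloader untouched.
    test = {
      user = "root";
      path = activate.custom nixos.config.system.build.toplevel ''
        cd /tmp
        $PROFILE/bin/switch-to-configuration test
      '';
    };
  };
}
```

With such a layout, a test branch would target only the test profile (for example deploy .#Niko.test), and main would target the system profile (deploy .#Niko.system).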
Zero duplication in flake.nix
Use builtins.mapAttrs over self.nixosConfigurations to generate deploy.nodes automatically. Host metadata (IP, whether to enable deploy) is stored once per host config.
Renovate bot compatibility
The pipeline is fully compatible with Renovate:
- Minor/patch updates: Renovate opens a PR → CI builds all hosts → Renovate auto-merges → CI deploys (uses switch, updates bootloader)
- Major updates: Renovate opens a PR → CI builds → waits for manual review → merge → deploy with switch (persists across reboot)
- The deploy step differentiates using the branch name, not the commit source, so Renovate PRs behave identically to human PRs
System version upgrades (kernel, etc.)
When a deployment requires a reboot (e.g., kernel upgrade):
- CI deploys with the --boot flag → calls switch-to-configuration boot (sets the new generation as boot default without activating it)
- A separate reboot step (manual or scheduled) activates the change
Important
deploy-rs does not auto-detect whether a reboot is needed. The workflow can check whether the kernel or initrd changed and conditionally use --boot instead, or always use switch and document that the operator should reboot when nixos-rebuild would have shown reboot required.
Security & Trust Boundaries
Trust model diagram
┌─────────────────────────────────────────────────────┐
│ TRUST ZONE 1                                        │
│ Developer Workstations                              │
│ • Holds sops-nix age keys (decrypt secrets)         │
│ • Holds GPG/SSH keys for signed commits             │
│ • Can manually deploy via deploy-rs                 │
│ • Can push to any branch                            │
└──────────────────────┬──────────────────────────────┘
                       │ git push (signed commits)
                       ▼
┌─────────────────────────────────────────────────────┐
│ TRUST ZONE 2                                        │
│ Forgejo + CI Runner                                 │
│ • Holds CI SSH deploy key (DEPLOY_SSH_KEY secret)   │
│ • Does NOT hold sops-nix age keys                   │
│ • Branch protection: main requires PR + checks      │
│ • Can only deploy via the deploy user               │
│ • Builds are sandboxed in Nix                       │
└──────────────────────┬──────────────────────────────┘
                       │ SSH as "deploy" user
                       ▼
┌─────────────────────────────────────────────────────┐
│ TRUST ZONE 3                                        │
│ Target NixOS Hosts                                  │
│ • deploy user: system user, no shell login          │
│ • sudo: ONLY nix-env --set and                      │
│   switch-to-configuration (NOPASSWD)                │
│ • No write access to /etc, home dirs, etc.          │
│ • sops secrets decrypted at activation via host     │
│   age keys (not CI keys)                            │
└─────────────────────────────────────────────────────┘
What each actor can do
| Actor | Can build | Can deploy | Can decrypt secrets | Can access hosts |
|---|---|---|---|---|
| Developer | ✅ | ✅ (manual) | ✅ (personal age keys) | ✅ (personal SSH) |
| CI runner | ✅ | ✅ (deploy user) | ❌ | Limited (deploy user) |
| deploy user | ❌ | ✅ (sudo restricted) | ❌ | N/A (runs on host) |
| Host age key | ❌ | ❌ | ✅ (own secrets only) | N/A |
Hardening measures
- Branch protection on main: require PR, passing checks, optional signed commits
- deploy user (users/deploy/default.nix): restricted sudoers, no home dir, system user
- CI secret isolation: SSH key only, no age keys in CI; secrets are decrypted on-host at activation time by sops-nix using host-specific age keys
- Magic rollback: if a deploy renders the host unreachable, deploy-rs auto-reverts within the confirm timeout
- nix flake check + deployChecks: validate the flake structure and deploy configuration before any deployment
Note
The deploy user SSH key is stored as a Forgejo Actions secret. Even if the CI runner is compromised, the attacker can only push Nix store paths and trigger switch-to-configuration; they cannot decrypt secrets, access user data, or escalate beyond what the restricted sudoers rules allow.
Proposed Changes
1. Flake configuration
[MODIFY] flake.nix
- Add deploy-rs to flake inputs
- Auto-generate deploy.nodes from self.nixosConfigurations using mapAttrs (zero duplication)
- Add checks output via deploy-rs.lib.deployChecks
- Define a helper that reads each host's config.networking for hostname/IP
# Sketch of the deploy output (no per-host duplication)
deploy.nodes = builtins.mapAttrs (name: nixos: {
  hostname = nixos.config.homelab.deploy.targetHost; # defined per host
  sshUser = "deploy";
  user = "root";
  magicRollback = true;
  autoRollback = true;
  profiles.system = {
    path = deploy-rs.lib.x86_64-linux.activate.nixos nixos;
  };
}) (lib.filterAttrs
  (name: nixos: nixos.config.homelab.users.deploy.enable or false)
  self.nixosConfigurations);
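The checks output listed above can be wired up the way the deploy-rs README suggests. The fragment below belongs inside the flake's outputs and assumes self.deploy is defined as in the preceding sketch; it introduces no other names.

```nix
# Inside `outputs = { self, deploy-rs, ... }: { ... }` in flake.nix.
# deploy-rs.lib is keyed by system, so this yields checks.<system>.*
# for every system deploy-rs provides a library for.
checks = builtins.mapAttrs
  (system: deployLib: deployLib.deployChecks self.deploy)
  deploy-rs.lib;
```

nix flake check then evaluates these checks together with the rest of the flake.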
2. Deploy user module
[MODIFY] default.nix
- Add option homelab.deploy.targetHost (string, the IP/hostname for deploy-rs to SSH into)
- Support multiple SSH authorized keys (CI key + personal workstation keys)
- Add nix.settings.trusted-users option for the deploy user (needed for nix copy from cache)
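The three additions could look roughly like the following module fragment. This is a sketch under assumptions: the option name homelab.users.deploy.authorizedKeys is hypothetical, and the existing enable option, user definition, and sudo rules are assumed to stay where they already are in the module.

```nix
# users/deploy/default.nix (additions only; the existing enable option,
# user definition, and sudo rules are assumed to remain unchanged)
{ config, lib, ... }:
let
  cfg = config.homelab.users.deploy;
in
{
  options.homelab = {
    deploy.targetHost = lib.mkOption {
      type = lib.types.str;
      example = "192.168.0.10";
      description = "IP address or hostname deploy-rs connects to over SSH.";
    };

    # Hypothetical option name for the list of allowed public keys.
    users.deploy.authorizedKeys = lib.mkOption {
      type = lib.types.listOf lib.types.str;
      default = [ ];
      description = "SSH public keys (CI and workstations) allowed to log in as deploy.";
    };
  };

  config = lib.mkIf cfg.enable {
    users.users.deploy.openssh.authorizedKeys.keys = cfg.authorizedKeys;

    # Lets the deploy user copy closures into this host's store.
    nix.settings.trusted-users = [ "deploy" ];
  };
}
```

targetHost has no default on purpose: it is only evaluated for hosts that pass the deploy-enabled filter in flake.nix, so a deployable host that forgets to set it fails at evaluation time instead of deploying to the wrong address.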
3. Enable deploy user on target hosts
[MODIFY] Host default.nix files (per host)
- Enable homelab.users.deploy.enable = true on all deployable hosts
- Set homelab.deploy.targetHost to each host's IP (e.g., "192.168.0.10" for Ingress)
- Currently only Niko has deploy enabled; extend to all non-Template hosts
4. Binary cache (Harmonia)
[NEW] modules/services/harmonia/default.nix
- Create homelab.services.harmonia module wrapping services.harmonia
- Generates a signing key pair for the cache
- Configures Nginx reverse proxy with HTTPS (via ACME or internal cert)
- All hosts configured to use the cache as a substituter via nix.settings.substituters
Tip
Harmonia is chosen over attic (simpler, no database needed) and nix-serve (better performance, streaming, zstd compression). It serves your /nix/store directly, so the CI runner can nix copy built closures to the cache host after a successful build.
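A minimal sketch of the wrapper module follows. The domain and signKeyPath options are hypothetical names, port 5000 on loopback is an assumed bind address, and the signing key pair is assumed to be created out of band (for example with nix-store --generate-binary-cache-key) and delivered to the host as a secret. Option names under services.harmonia have changed between nixpkgs releases, so they should be checked against the release in use.

```nix
# modules/services/harmonia/default.nix (sketch)
{ config, lib, ... }:
let
  cfg = config.homelab.services.harmonia;
in
{
  options.homelab.services.harmonia = {
    enable = lib.mkEnableOption "Harmonia binary cache";

    # Hypothetical options; the names are illustrative.
    domain = lib.mkOption {
      type = lib.types.str;
      example = "cache.homelab.example";
      description = "Virtual host name the cache is served under.";
    };
    signKeyPath = lib.mkOption {
      type = lib.types.str;
      example = "/run/secrets/harmonia-signing-key";
      description = "Path to the secret signing key on the cache host.";
    };
  };

  config = lib.mkIf cfg.enable {
    services.harmonia = {
      enable = true;
      # Older nixpkgs releases use the singular signKeyPath instead.
      signKeyPaths = [ cfg.signKeyPath ];
      # Listen on loopback only; Nginx terminates TLS in front.
      settings.bind = "127.0.0.1:5000";
    };

    services.nginx = {
      enable = true;
      virtualHosts.${cfg.domain} = {
        forceSSL = true;
        # Assumes security.acme is already configured on this host.
        enableACME = true;
        locations."/".proxyPass = "http://127.0.0.1:5000";
      };
    };
  };
}
```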
[NEW] modules/common/nix-cache.nix
- Configure all hosts to use the binary cache as a substituter
- Add the cache's public signing key to trusted-public-keys
- Usable by personal devices too (add the cache URL + public key to their nix.conf)
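The client side is small. In the sketch below, the cache URL and the public key string are placeholders, not real values; the key name before the colon must match the name used when the signing key was generated.

```nix
# modules/common/nix-cache.nix (sketch; URL and key are placeholders)
{ ... }:
{
  nix.settings = {
    # Queried in addition to the default cache.nixos.org substituter.
    substituters = [ "https://cache.homelab.example" ];

    # Public half of the cache's signing key, in "<name>:<base64>" form.
    trusted-public-keys = [ "cache.homelab.example-1:REPLACE_WITH_PUBLIC_KEY" ];
  };
}
```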
5. CI Workflows
[MODIFY] build.yml
- Use the dynamic determine-hosts job output for the build matrix (already partially implemented)
- Add nix flake check step for deployChecks validation
- Build all hosts on every push/PR
- Optionally push built closures to the Harmonia cache after successful build
[NEW] deploy.yml
- Trigger: push to main or test-* branches (after build passes)
- Load DEPLOY_SSH_KEY from Forgejo Actions secrets
- For main: deploy . (all hosts, switch-to-configuration switch)
- For test-<hostname>: deploy only the matching host with a test profile (switch-to-configuration test), with no bootloader update
- Magic rollback enabled by default
- Optional --boot mode for kernel upgrades (triggered by label or manual dispatch)
[NEW] check.yml
- Runs nix flake check (includes deployChecks)
- Runs nix eval to validate all configurations parse correctly
- Can be required as a status check for Renovate auto-merge rules
6. Monitoring
[NEW] modules/services/monitoring/default.nix
- Enable node exporter on all hosts for Prometheus scraping
- Export NixOS generation info: current generation, boot generation, system version
- Optionally integrate with the existing infrastructure (e.g., Prometheus on Production)
Script/service to export NixOS deploy state:
# Metrics like:
# nixos_current_generation{host="Niko"} 42
# nixos_boot_generation{host="Niko"} 42 # same = no pending reboot
# nixos_config_age_seconds{host="Niko"} 3600
When nixos_current_generation != nixos_boot_generation, the host has a test deployment active (or needs a reboot).
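One possible way to produce the two generation metrics is the node exporter's textfile collector, fed by a small timer-driven service. The sketch below rests on several assumptions: the directory /var/lib/node-exporter-textfile and the unit name nixos-generation-metrics are hypothetical, the boot generation is taken to be the generation the system profile link points at, and the current generation is reported as 0 when the running system matches no system generation (as would happen with a test activation from a separate profile). nixos_config_age_seconds is not covered.

```nix
# modules/services/monitoring/default.nix (sketch)
{ config, pkgs, ... }:
let
  # Assumed location for textfile-collector output; any directory works.
  textfileDir = "/var/lib/node-exporter-textfile";
  host = config.networking.hostName;
in
{
  services.prometheus.exporters.node = {
    enable = true;
    enabledCollectors = [ "textfile" ];
    extraFlags = [ "--collector.textfile.directory=${textfileDir}" ];
  };

  systemd.tmpfiles.rules = [ "d ${textfileDir} 0755 root root -" ];

  systemd.services.nixos-generation-metrics = {
    description = "Export NixOS generation numbers for node exporter";
    startAt = "*:0/5"; # every five minutes
    path = [ pkgs.coreutils ];
    serviceConfig.Type = "oneshot";
    script = ''
      # Generation the system profile points at, e.g. "system-42-link".
      boot_link=$(readlink /nix/var/nix/profiles/system)
      boot_gen=''${boot_link#system-}
      boot_gen=''${boot_gen%-link}

      # Generation whose store path is currently running; 0 if none matches.
      current_gen=0
      running=$(readlink -f /run/current-system)
      for link in /nix/var/nix/profiles/system-*-link; do
        if [ "$(readlink -f "$link")" = "$running" ]; then
          n=''${link##*/system-}
          current_gen=''${n%-link}
        fi
      done

      # Write to a temporary file first so the collector never reads a
      # half-written file; only *.prom files are picked up.
      tmp=$(mktemp ${textfileDir}/nixos.prom.XXXXXX)
      {
        echo "nixos_current_generation{host=\"${host}\"} $current_gen"
        echo "nixos_boot_generation{host=\"${host}\"} $boot_gen"
      } > "$tmp"
      chmod 0644 "$tmp"
      mv "$tmp" ${textfileDir}/nixos.prom
    '';
  };
}
```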
7. Local VM Testing
[NEW] test/vm-test.nix
NixOS has built-in VM testing via nixos-rebuild build-vm and the NixOS test framework. The approach:
- Build a VM from any host config:
  nix build .#nixosConfigurations.Testing.config.system.build.vm
  ./result/bin/run-Testing-vm
- NixOS integration test (test/vm-test.nix):
  - Spins up a minimal VM cluster (e.g., two nodes)
  - Runs deploy-rs against one VM from the other
  - Validates activation, rollback, and connectivity
  - Uses the NixOS test framework (Python test driver)
- Full CI pipeline test locally with act:
  # Run the GitHub Actions workflow locally using act
  act push --container-architecture linux/amd64
Note
The existing build.yml already uses catthehacker/ubuntu:act-24.04 containers, suggesting act is already part of the workflow. VM tests don't require actual network access to target hosts.
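A skeleton for the integration test might start as below. It only brings up two nodes and checks that the deployer can reach the target's SSH port; the deploy-rs invocation, key setup, and rollback assertions are deliberately left out. The node names and the test name are hypothetical, and pkgs.testers.runNixOSTest is the nixpkgs entry point to the NixOS test framework.

```nix
# test/vm-test.nix (skeleton; the deploy-rs invocation itself is not shown)
{ pkgs }:
pkgs.testers.runNixOSTest {
  name = "deploy-connectivity";

  nodes = {
    deployer = { ... }: { };
    target = { ... }: {
      services.openssh.enable = true;
    };
  };

  # Python test driver: each node name becomes a machine object.
  testScript = ''
    start_all()
    target.wait_for_unit("sshd.service")
    target.wait_for_open_port(22)
    deployer.wait_for_unit("multi-user.target")
    deployer.succeed("ping -c 1 target")
  '';
}
```

Exposing the result under the flake's checks output would let nix flake check run it alongside deployChecks.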
Verification Plan
Automated Tests
- nix flake check: validates flake + deployChecks schema
- nix build .#nixosConfigurations.<host>.config.system.build.toplevel for each host
- NixOS VM integration test (test/vm-test.nix)
Manual Verification (guinea pig: Development or Testing)
- Push to test-Development → verify deploy-rs runs switch-to-configuration test on 192.168.0.91
- Reboot Development → verify it falls back to the previous generation (test branch behavior)
- Merge to main → verify deploy-rs deploys to all enabled hosts with switch
- Intentionally break a config → verify magic rollback activates
- Push to Harmonia cache → verify another host can pull the closure
- Check monitoring metrics show correct generation numbers