forked from Bos55/nix-config
docs: add binary cache implementation documentation
parent c8836f2543
commit b58d56fa53
4 changed files with 459 additions and 0 deletions
docs/binary-cache/implementation_plan.md (new file, 288 lines)

# NixOS CI/CD Automated Deployment with deploy-rs

## Overview

Implement a push-based automated deployment pipeline using **deploy-rs** for the homelab NixOS fleet. The pipeline builds on every push/PR, deploys on merge to `main`, and supports `test-<hostname>` branches for non-persistent trial deployments.

---

## Architecture

```
┌─────────────┐    push     ┌──────────────────┐
│  Developer  │────────────▶│  Forgejo (Git)   │
└─────────────┘             └────────┬─────────┘
                                     │
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
             ┌─────────────┐   ┌───────────┐  ┌──────────────┐
             │ CI: Build   │   │ CI: Check │  │ CI: Deploy   │
             │ all hosts   │   │ flake +   │  │ (main only)  │
             │ (every push)│   │ deployChk │  │ via deploy-rs│
             └──────┬──────┘   └───────────┘  └──────┬───────┘
                    │                                │ SSH
                    ▼                                ▼
             ┌─────────────┐              ┌──────────────────┐
             │  Harmonia   │◀─── push ────│   Target Hosts   │
             │ Binary Cache│─── pull ────▶│ (NixOS machines) │
             └─────────────┘              └──────────────────┘
```

---

## Key Design Decisions

### Test branch activation (`test-<hostname>`)

deploy-rs's `activate.nixos` calls `switch-to-configuration switch` by default, which updates the bootloader. For test branches, we create a **separate profile** using `activate.custom` that calls `switch-to-configuration test` instead; this activates the configuration immediately but **does not update the bootloader**. On reboot, the host falls back to the last `switch`-deployed generation.

Magic rollback still works on test deployments: deploy-rs confirms the host is reachable after activation and auto-reverts if it can't connect.

```nix
# Test activation: active now, but reboot reverts to previous boot entry
activate.custom base.config.system.build.toplevel ''
  cd /tmp
  $PROFILE/bin/switch-to-configuration test
''
```

### Zero duplication in `flake.nix`

Use `builtins.mapAttrs` over `self.nixosConfigurations` to generate `deploy.nodes` automatically. Host metadata (IP, whether to enable deploy) is stored once per host config.

### Renovate bot compatibility

The pipeline is fully compatible with Renovate:
- **Minor/patch updates**: Renovate opens a PR → CI builds all hosts → Renovate auto-merges → CI deploys (uses `switch`, updates bootloader)
- **Major updates**: Renovate opens a PR → CI builds → waits for manual review → merge → deploy with `switch` (persists across reboot)
- The deploy step differentiates using the **branch name**, not the commit source, so Renovate PRs behave identically to human PRs

### System version upgrades (kernel, etc.)

When a deployment requires a reboot (e.g., kernel upgrade):
1. CI deploys with the `--boot` flag → calls `switch-to-configuration boot` (sets the new generation as boot default without activating it)
2. A separate reboot step (manual or scheduled) activates the change

> [!IMPORTANT]
> deploy-rs does not auto-detect whether a reboot is needed. The workflow can check if the kernel or initrd changed and conditionally use `--boot` instead, or always use `switch` and document that the operator should reboot when `nixos-rebuild` would have shown `reboot required`.
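
One way to implement the kernel/initrd check mentioned above is to compare the relevant symlinks of two system closures. The sketch below assumes the usual NixOS layout, in which a system closure exposes `kernel`, `initrd`, and `kernel-modules` entries. The function name `needs_reboot` is hypothetical, and how the workflow obtains the two paths is left open.

```bash
# needs_reboot BOOTED NEW   (hypothetical helper, not part of deploy-rs)
# Exit 0 if the kernel, initrd, or kernel modules differ between two NixOS
# system closures (for example /run/booted-system and a newly built toplevel).
needs_reboot() {
  local booted="$1" new="$2" part
  for part in kernel initrd kernel-modules; do
    if [ "$(readlink -f "$booted/$part")" != "$(readlink -f "$new/$part")" ]; then
      return 0   # this part changed, so only a reboot fully applies the update
    fi
  done
  return 1       # same kernel, initrd, and modules: a live switch is enough
}
```

On a host, `needs_reboot /run/booted-system /run/current-system` would report whether the running kernel differs from the activated configuration. A workflow could apply the same comparison to the freshly built closure to choose between `switch` and `--boot`.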

---

## Security & Trust Boundaries

### Trust model diagram

```
┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 1                     │
│               Developer Workstations                │
│  • Holds sops-nix age keys (decrypt secrets)        │
│  • Holds GPG/SSH keys for signed commits            │
│  • Can manually deploy via deploy-rs                │
│  • Can push to any branch                           │
└──────────────────────┬──────────────────────────────┘
                       │ git push (signed commits)
                       ▼
┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 2                     │
│                 Forgejo + CI Runner                 │
│  • Holds CI SSH deploy key (DEPLOY_SSH_KEY secret)  │
│  • Does NOT hold sops-nix age keys                  │
│  • Branch protection: main requires PR + checks     │
│  • Can only deploy via the deploy user              │
│  • Builds are sandboxed in Nix                      │
└──────────────────────┬──────────────────────────────┘
                       │ SSH as "deploy" user
                       ▼
┌─────────────────────────────────────────────────────┐
│                    TRUST ZONE 3                     │
│                 Target NixOS Hosts                  │
│  • deploy user: system user, no shell login         │
│  • sudo: ONLY nix-env --set and                     │
│    switch-to-configuration (NOPASSWD)               │
│  • No write access to /etc, home dirs, etc.         │
│  • sops secrets decrypted at activation via host    │
│    age keys (not CI keys)                           │
└─────────────────────────────────────────────────────┘
```

### What each actor can do

| Actor | Can build | Can deploy | Can decrypt secrets | Can access hosts |
|---|---|---|---|---|
| Developer | ✅ | ✅ (manual) | ✅ (personal age keys) | ✅ (personal SSH) |
| CI runner | ✅ | ✅ (deploy user) | ❌ | Limited (deploy user) |
| deploy user | ❌ | ✅ (sudo restricted) | ❌ | N/A (runs on host) |
| Host age key | ❌ | ❌ | ✅ (own secrets only) | N/A |

### Hardening measures

1. **Branch protection** on `main`: require PR, passing checks, optional signed commits
2. **deploy user** ([`users/deploy/default.nix`](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/users/deploy/default.nix)): restricted sudoers, no home dir, system user
3. **CI secret isolation**: SSH key only, no age keys in CI; secrets are decrypted on-host at activation time by sops-nix using host-specific age keys
4. **Magic rollback**: if a deploy renders the host unreachable, deploy-rs auto-reverts within the confirm timeout
5. **`nix flake check` + `deployChecks`**: validate the flake structure and deploy configuration before any deployment

> [!NOTE]
> The deploy user SSH key is stored as a Forgejo Actions secret. If the CI runner is compromised, the attacker is limited to what the restricted sudoers rules allow: pushing Nix store paths and triggering `switch-to-configuration`. The runner holds no age keys, so it cannot decrypt secrets itself. Because an activated configuration runs as root on the host, the deploy key should still be treated as a highly privileged credential.

---

## Proposed Changes

### 1. Flake configuration

#### [MODIFY] [flake.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/flake.nix)

- Add `deploy-rs` to flake inputs
- Auto-generate `deploy.nodes` from `self.nixosConfigurations` using `mapAttrs` (**zero duplication**)
- Add `checks` output via `deploy-rs.lib.deployChecks`
- Define a helper that reads each host's `config.networking` for hostname/IP

```nix
# Sketch of the deploy output (no per-host duplication)
deploy.nodes = builtins.mapAttrs (name: nixos: {
  hostname = nixos.config.homelab.deploy.targetHost; # defined per host
  sshUser = "deploy";
  user = "root";
  magicRollback = true;
  autoRollback = true;
  profiles.system = {
    path = deploy-rs.lib.x86_64-linux.activate.nixos nixos;
  };
}) (lib.filterAttrs
  (name: nixos: nixos.config.homelab.users.deploy.enable or false)
  self.nixosConfigurations);
```

---

### 2. Deploy user module

#### [MODIFY] [default.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/users/deploy/default.nix)

- Add option `homelab.deploy.targetHost` (string, the IP/hostname for deploy-rs to SSH into)
- Support multiple SSH authorized keys (CI key + personal workstation keys)
- Add the deploy user to `nix.settings.trusted-users` (needed so the host accepts closures copied to it with `nix copy`)

---

### 3. Enable deploy user on target hosts

#### [MODIFY] Host `default.nix` files (per host)

- Enable `homelab.users.deploy.enable = true` on all deployable hosts
- Set `homelab.deploy.targetHost` to each host's IP (e.g., `"192.168.0.10"` for Ingress)
- Currently only `Niko` has deploy enabled; extend to all non-`Template` hosts

---

### 4. Binary cache (Harmonia)

#### [NEW] [modules/services/harmonia/default.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/modules/services/harmonia/default.nix)

- Create a `homelab.services.harmonia` module wrapping `services.harmonia`
- Generate a signing key pair for the cache
- Configure an Nginx reverse proxy with HTTPS (via ACME or internal cert)
- Configure all hosts to use the cache as a substituter via `nix.settings.substituters`

> [!TIP]
> Harmonia is chosen over attic (simpler, no database needed) and nix-serve (better performance, streaming, zstd compression). It serves your `/nix/store` directly, so the CI runner can `nix copy` built closures to the cache host after a successful build.

#### [NEW] [modules/common/nix-cache.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/modules/common/nix-cache.nix)

- Configure all hosts to use the binary cache as a substituter
- Add the cache's public signing key to `trusted-public-keys`
- Usable by personal devices too (add the cache URL + public key to their `nix.conf`)
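
For a personal device that is not managed by this flake, the same two settings can be added to `nix.conf` by hand. The sketch below only prints the lines; the function name `cache_conf`, the URL, and the key shown in the usage note are placeholders, not values from this repository.

```bash
# cache_conf URL PUBLIC_KEY   (hypothetical helper)
# Print nix.conf lines that add a binary cache on top of the defaults.
# The "extra-" prefix appends to the existing lists instead of replacing them.
cache_conf() {
  printf 'extra-substituters = %s\n' "$1"
  printf 'extra-trusted-public-keys = %s\n' "$2"
}
```

For example, `cache_conf https://cache.example.internal 'cache.example.internal-1:<public key>'` produces two lines that could be appended to `/etc/nix/nix.conf`. A per-user `nix.conf` may not be enough, because Nix only honors extra substituters from users it trusts.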

---

### 5. CI Workflows

#### [MODIFY] [build.yml](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/.github/workflows/build.yml)

- Use the dynamic `determine-hosts` job output for the build matrix (already partially implemented)
- Add `nix flake check` step for deployChecks validation
- Build all hosts on every push/PR
- Optionally push built closures to the Harmonia cache after successful build
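
The optional cache push could look roughly like the sketch below. It assumes the CI runner can reach the cache machine over SSH as a user that is allowed to add paths to that machine's store, which Harmonia then serves. The function name `push_to_cache` and the SSH destination in the usage note are placeholders.

```bash
# push_to_cache CACHE_HOST HOST...   (hypothetical helper)
# Copy each host's system closure to the cache machine over SSH.
# CACHE_HOST is an SSH destination such as user@host.
push_to_cache() {
  local cache="$1" host
  shift
  for host in "$@"; do
    # nix copy realises the installable first, then transfers its closure
    nix copy --to "ssh://${cache}" \
      ".#nixosConfigurations.${host}.config.system.build.toplevel" || return 1
  done
}
```

A build job would call it as `push_to_cache deploy@cache.example.internal Niko` after the build step succeeds.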

#### [NEW] [deploy.yml](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/.github/workflows/deploy.yml)

- Trigger: push to `main` or `test-*` branches (after build passes)
- Load `DEPLOY_SSH_KEY` from Forgejo Actions secrets
- **For `main`**: `deploy .` (all hosts, `switch-to-configuration switch`)
- **For `test-<hostname>`**: deploy only the matching host with a **test profile** (`switch-to-configuration test`), with no bootloader update
- Magic rollback enabled by default
- Optional `--boot` mode for kernel upgrades (triggered by label or manual dispatch)
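
The branch-based dispatch described in this list can be sketched as a small shell function. It assumes the test activation is exposed as a second deploy-rs profile named `test` on each node; that profile name and the function name `deploy_command` are hypothetical. The `.#<node>.<profile>` target syntax is the one deploy-rs uses for selecting a single profile.

```bash
# deploy_command BRANCH   (hypothetical helper)
# Print the deploy-rs invocation for a branch, or fail for branches that
# should not deploy at all.
deploy_command() {
  case "$1" in
    main)
      echo "deploy ." ;;                   # all nodes, regular system profile
    test-?*)
      echo "deploy .#${1#test-}.test" ;;   # one node, assumed "test" profile
    *)
      return 1 ;;
  esac
}
```

In the workflow, the branch name would come from the runner's context, and the printed command would be executed only when the function succeeds, so feature branches never trigger a deployment.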

#### [NEW] [check.yml](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/.github/workflows/check.yml)

- Runs `nix flake check` (includes deployChecks)
- Runs `nix eval` to validate all configurations parse correctly
- Can be required as a status check for Renovate auto-merge rules

---

### 6. Monitoring

#### [NEW] [modules/services/monitoring/default.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/modules/services/monitoring/default.nix)

- Enable node exporter on all hosts for Prometheus scraping
- Export NixOS generation info: current generation, boot generation, system version
- Optionally integrate with the existing infrastructure (e.g., Prometheus on Production)

Script/service to export NixOS deploy state:
```bash
# Metrics like:
# nixos_current_generation{host="Niko"} 42
# nixos_boot_generation{host="Niko"} 42  # same = no pending reboot
# nixos_config_age_seconds{host="Niko"} 3600
```

When `current_generation != boot_generation`, the host has a test deployment active (or needs a reboot).
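
A minimal sketch of such an exporter script follows. It assumes that "boot generation" means the generation the `system` profile currently points to and that "current generation" means the generation whose store path matches the active system; both readings are assumptions, as are the function names. A value of `0` is emitted when no generation matches, which is what a test deployment from a separate profile would produce. The `nixos_config_age_seconds` metric is left out.

```bash
# generation_for PROFILES_DIR SYSTEM   (hypothetical helper)
# Print the number N of the first system-N-link (in glob order) in
# PROFILES_DIR that resolves to the same store path as SYSTEM.
generation_for() {
  local profiles_dir="$1" system link name
  system="$(readlink -f "$2")" || return 1
  for link in "$profiles_dir"/system-*-link; do
    if [ "$(readlink -f "$link")" = "$system" ]; then
      name="${link##*/}"       # system-42-link
      name="${name#system-}"   # 42-link
      echo "${name%-link}"     # 42
      return 0
    fi
  done
  return 1
}

# emit_metrics HOST PROFILES_DIR CURRENT_SYSTEM   (hypothetical helper)
# Print generation metrics in the Prometheus text format.
emit_metrics() {
  local host="$1" profiles_dir="$2" current_system="$3" boot current
  boot="$(generation_for "$profiles_dir" "$profiles_dir/system")" || boot=0
  current="$(generation_for "$profiles_dir" "$current_system")" || current=0
  printf 'nixos_current_generation{host="%s"} %s\n' "$host" "$current"
  printf 'nixos_boot_generation{host="%s"} %s\n' "$host" "$boot"
}
```

On a host this would be called as `emit_metrics "$(hostname)" /nix/var/nix/profiles /run/current-system`, with the output redirected into a file that node exporter's textfile collector reads.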

---

### 7. Local VM Testing

#### [NEW] [test/vm-test.nix](file:///c:/Users/tibod/Documents/projects/Bos55/bos55-nix-config-cicd/test/vm-test.nix)

NixOS has built-in VM testing via `nixos-rebuild build-vm` and the NixOS test framework. The approach:

1. **Build a VM from any host config**:
   ```bash
   nix build .#nixosConfigurations.Testing.config.system.build.vm
   ./result/bin/run-Testing-vm
   ```

2. **NixOS integration test** (`test/vm-test.nix`):
   - Spins up a minimal VM cluster (e.g., two nodes)
   - Runs deploy-rs against one VM from the other
   - Validates activation, rollback, and connectivity
   - Uses the NixOS test framework (Python test driver)

3. **Full CI pipeline test locally with `act`**:
   ```bash
   # Run the GitHub Actions workflow locally using act
   act push --container-architecture linux/amd64
   ```

> [!NOTE]
> The existing `build.yml` already uses `catthehacker/ubuntu:act-24.04` containers, suggesting `act` is already part of the workflow. VM tests don't require actual network access to target hosts.

---

## Verification Plan

### Automated Tests
- `nix flake check`: validates flake + deployChecks schema
- `nix build .#nixosConfigurations.<host>.config.system.build.toplevel` for each host
- NixOS VM integration test (`test/vm-test.nix`)
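
The per-host build check can be scripted without hard-coding host names. The sketch below assumes it runs from the flake root with flakes enabled; the function names `list_hosts` and `build_all_hosts` are hypothetical, and no filtering (for example of `Template`) is applied.

```bash
# list_hosts   (hypothetical helper)
# Print one nixosConfigurations attribute name per line.
list_hosts() {
  nix eval --raw ".#nixosConfigurations" \
    --apply 'cfgs: builtins.concatStringsSep "\n" (builtins.attrNames cfgs)'
}

# build_all_hosts   (hypothetical helper)
# Build every host's toplevel; keep going after a failure so all broken
# hosts are reported, then return non-zero if any build failed.
build_all_hosts() {
  local host failed=0
  for host in $(list_hosts); do
    nix build --no-link \
      ".#nixosConfigurations.${host}.config.system.build.toplevel" || failed=1
  done
  return "$failed"
}
```

Running `build_all_hosts` locally before pushing gives the same signal as the CI build matrix, at the cost of building hosts sequentially.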

### Manual Verification (guinea pig: `Development` or `Testing`)
1. Push to `test-Development` → verify deploy-rs runs `switch-to-configuration test` on 192.168.0.91
2. Reboot `Development` → verify it falls back to previous generation (test branch behavior)
3. Merge to `main` → verify deploy-rs deploys to all enabled hosts with `switch`
4. Intentionally break a config → verify magic rollback activates
5. Push to Harmonia cache → verify another host can pull the closure
6. Check monitoring metrics show correct generation numbers