May 2026 · 5-part series: migrating CI/CD from GitHub to Codeberg
Kikitoru is a Japanese-language learning platform that runs on GKE. Our CI/CD pipeline pushes code 10-20 times per day. When GitHub works, it works. When it doesn't, we're blocked — and we can't fix it, because we don't control the infrastructure.
Over the past several months, we noticed a pattern: our deploys were failing, but GitHub's status page was green. This wasn't a one-off. It happened repeatedly, and each time, we lost time investigating our own code and configuration before realizing the problem was upstream.
GitHub's status page reports uptime measured by their own internal health checks — synthetic monitors hitting their APIs from within their own network. From GitHub's perspective, Actions is "99.5-99.9% available." From our perspective — an external CI/CD pipeline pushing containers to a different cloud provider — availability was noticeably worse.
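To put those percentages in concrete terms, here is a small sketch of the arithmetic. The function names and the probe sample are hypothetical and chosen for illustration. None of the numbers are measurements from our pipeline.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime that a given availability allows over `days` days."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def observed_availability(probe_results: list[bool]) -> float:
    """Percentage of successful probes, as seen from the client side."""
    if not probe_results:
        raise ValueError("no probe results")
    return 100 * sum(probe_results) / len(probe_results)


if __name__ == "__main__":
    for pct in (99.9, 99.5):
        print(f"{pct}% over 30 days allows {downtime_minutes(pct):.1f} min of downtime")
    # Hypothetical sample: 1000 external probes, 20 of which failed.
    sample = [True] * 980 + [False] * 20
    print(f"observed from outside: {observed_availability(sample):.1f}%")
```

Even the best case in that range allows about 43 minutes of downtime per 30-day month, and the worst case allows about 3.6 hours. An external probe counts every failure the client sees, so its figure can be lower than the provider's own.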
Here's what we experienced:
Our builds would get stuck in "waiting for runner" for minutes at a time. The runner would eventually pick up the job, but the delay was unpredictable. Sometimes it was 30 seconds; sometimes it was 10 minutes. No incident was ever declared on the status page.
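One way to make this kind of delay visible is to compute the queue time per job from two timestamps: when the job was created and when a runner started it. The sketch below assumes each job record is a dict with `name`, `created_at`, and `started_at` keys holding ISO 8601 strings. That record shape and the 120-second threshold are assumptions for illustration, not a description of any particular API.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def queue_delay_seconds(created_at: str, started_at: str) -> float:
    """Seconds a job spent waiting before a runner picked it up."""
    return (parse_ts(started_at) - parse_ts(created_at)).total_seconds()


def slow_jobs(jobs: list[dict], threshold_seconds: float = 120) -> list[tuple[str, float]]:
    """Return (job name, delay) for every job that waited longer than the threshold."""
    flagged = []
    for job in jobs:
        delay = queue_delay_seconds(job["created_at"], job["started_at"])
        if delay > threshold_seconds:
            flagged.append((job["name"], delay))
    return flagged
```

Logging these delays over time gives you your own record of queueing behaviour, independent of what any status page reports.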
docker push to ghcr.io would intermittently return 503 errors. Retrying usually worked after a few attempts, but in the worst cases pushes kept failing for hours, which made our pipeline unreliable. The GitHub status page never acknowledged any of it.
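The retry behaviour described above can be sketched as a generic retry loop with exponential backoff. This is an illustration, not the exact logic in our pipeline. `retry_with_backoff` and `push_image` are hypothetical helper names, and the attempt count and delays are placeholder values.

```python
import subprocess
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or the attempts run out.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0)
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


def push_image(image: str) -> None:
    """Run `docker push`; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(["docker", "push", image], check=True)
```

A call such as `retry_with_backoff(lambda: push_image("ghcr.io/example/app:latest"))` would then retry a failing push, where the image name is a placeholder. Backoff smooths over short blips, but it does nothing for an outage that lasts hours, which is the case that hurt us most.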
Push events would silently not be delivered to our downstream systems. This meant our pipeline wouldn't even trigger. We'd notice hours later that a deploy we expected never happened. Again, no incident, no acknowledgment.
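A silent gap like this can be caught by reconciling what was pushed against what the pipeline actually ran. The sketch below assumes you can collect two things on your side: a mapping from commit SHA to push time, and the set of SHAs that triggered a run. The function name, the data shapes, and the five-minute grace period are all assumptions for illustration.

```python
def find_missed_pushes(
    pushed: dict[str, float],
    triggered: set[str],
    now: float,
    grace_seconds: float = 300,
) -> list[str]:
    """Return SHAs pushed more than `grace_seconds` ago with no pipeline run, oldest first.

    `pushed` maps commit SHA to push time (Unix seconds); `triggered` holds
    the SHAs for which a pipeline run was observed.
    """
    missed = [
        (pushed_at, sha)
        for sha, pushed_at in pushed.items()
        if sha not in triggered and now - pushed_at > grace_seconds
    ]
    return [sha for _, sha in sorted(missed)]
```

Run on a schedule, a check like this turns "we noticed hours later" into an alert within minutes of the grace period expiring.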
The core issue isn't that GitHub has outages — every platform does. The issue is the trust gap. When the status page says "All Systems Operational" while your builds are clearly broken, you lose confidence in two things simultaneously:
This double hit is worse than either problem alone. When the status page is unreliable, you can't tell whether a failure is "your problem" or "their problem." Debugging becomes guesswork. You waste hours checking your own configuration, credentials, and container images, only to discover the registry was having a bad day and nobody told you.
When we moved to a self-hosted Forgejo runner on GKE, our CI availability became tied to GCP's SLA:
More importantly, when something breaks, we can see it, debug it, and fix it. The runner is our pod in our namespace. The logs are in our Cloud Logging project (formerly Stackdriver). The metrics are in our Grafana. If the spot pool scales up slowly, we can see the cold start in our own dashboards. No more mystery outages.
Self-hosting means we own the maintenance. We need to upgrade the runner, manage the K8s manifests, and handle DinD (Docker-in-Docker) complexity. But these are known, solvable engineering problems — not mystery outages that take hours to even identify.
We'd have moved even if self-hosting cost more. The reliability and transparency gains alone justified the migration. The roughly tenfold drop in cost was a bonus, not the reason.