
GitHub Reliability & Our Breaking Point

May 2026 · 5-part series: migrating CI/CD from GitHub to Codeberg

The Reliability Problem

Kikitoru is a Japanese-language learning platform that runs on GKE. Our CI/CD pipeline pushes code 10-20 times per day. When GitHub works, it works. When it doesn't, we're blocked — and we can't fix it, because we don't control the infrastructure.

Over the past several months, we noticed a pattern: our deploys were failing, but GitHub's status page was green. This wasn't a one-off. It happened repeatedly, and each time, we lost time investigating our own code and configuration before realizing the problem was upstream.

Status Page vs Reality

GitHub's status page reports uptime measured by their own internal health checks — synthetic monitors hitting their APIs from within their own network. From GitHub's perspective, Actions is "99.5-99.9% available." From our perspective — an external CI/CD pipeline pushing containers to a different cloud provider — availability was noticeably worse.
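To put the quoted range in concrete terms, the sketch below converts an availability percentage into the downtime it permits. The 30-day window and the function name are assumptions chosen for illustration; they are not figures published by GitHub.

```python
def downtime_hours(availability_pct: float, window_days: int = 30) -> float:
    """Return the hours of downtime an availability percentage permits.

    window_days is an assumed measurement window, not a vendor-defined one.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    window_hours = window_days * 24
    return window_hours * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for pct in (99.5, 99.9):
        hours = downtime_hours(pct)
        print(f"{pct}% over 30 days permits {hours:.2f} hours of downtime")
```

Under these assumptions, 99.5% permits 3.6 hours of downtime in a 30-day window and 99.9% permits roughly 43 minutes. For a pipeline that pushes 10-20 times per day, even the lower figure can overlap several deploys.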

Here's what we experienced:

Actions Runners Queueing Silently

Our builds would get stuck in "waiting for runner" for minutes at a time. The runner would eventually pick up the job, but the delay was unpredictable. Sometimes it was 30 seconds; sometimes it was 10 minutes. No incident was ever declared on the status page.
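One way to make this kind of delay visible is to compute queue time from job timestamps. The following is a minimal sketch, assuming each job record carries ISO 8601 `created_at` and `started_at` fields; the sample records and the 120-second threshold are hypothetical.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    # Timestamps are assumed to end in "Z"; swap it for an explicit offset
    # so datetime.fromisoformat accepts it on older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def queue_delay_seconds(job: dict) -> float:
    """Seconds between a job being created and a runner starting it."""
    return (parse_ts(job["started_at"]) - parse_ts(job["created_at"])).total_seconds()


def summarize(jobs: list, threshold_seconds: float = 120.0) -> dict:
    """Median and maximum queue delay, plus a count of jobs over the threshold."""
    delays = [queue_delay_seconds(job) for job in jobs]
    return {
        "median": median(delays),
        "max": max(delays),
        "slow": sum(delay > threshold_seconds for delay in delays),
    }


if __name__ == "__main__":
    # Hypothetical records: one 30-second wait and one 10-minute wait.
    sample = [
        {"created_at": "2026-05-01T10:00:00Z", "started_at": "2026-05-01T10:00:30Z"},
        {"created_at": "2026-05-01T11:00:00Z", "started_at": "2026-05-01T11:10:00Z"},
    ]
    print(summarize(sample))
```

Tracking these numbers over time gives evidence of queueing problems that does not depend on whether an incident was declared.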

GHCR Push Failures

docker push to ghcr.io would intermittently return 503 errors. A retry usually succeeded eventually, but in the worst cases pushes kept failing for hours, which made the pipeline unreliable. The GitHub status page never acknowledged these failures.
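The retry behaviour described above can be sketched as a small wrapper with exponential backoff. This is an illustration under assumptions, not the pipeline's actual code: the function names, attempt count, delays, and image name are all hypothetical.

```python
import subprocess
import time


def retry_with_backoff(operation, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call operation() until it returns True; return the attempt that succeeded.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if operation():
            return attempt
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"operation failed after {attempts} attempts")


def push_image(image: str) -> bool:
    """Run `docker push` once and report whether it exited successfully."""
    result = subprocess.run(["docker", "push", image], check=False)
    return result.returncode == 0


# Example usage (hypothetical image name; requires a Docker CLI and login):
# retry_with_backoff(lambda: push_image("ghcr.io/example/app:latest"))
```

Backoff of this kind smooths over short registry blips, but it cannot help with failures that last for hours, which is why it was only a partial mitigation.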

Webhooks Dropped Without Acknowledgment

Push events would silently not be delivered to our downstream systems. This meant our pipeline wouldn't even trigger. We'd notice hours later that a deploy we expected never happened. Again, no incident, no acknowledgment.
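A simple way to catch dropped events sooner is to reconcile what was pushed against what the downstream system actually received. The sketch below assumes two lists of commit SHAs are available, one from the repository and one recorded by the webhook receiver; both data sources and the function name are hypothetical.

```python
def missing_deliveries(pushed_shas, delivered_shas):
    """Return pushed commit SHAs with no recorded delivery, in push order."""
    delivered = set(delivered_shas)
    return [sha for sha in pushed_shas if sha not in delivered]


if __name__ == "__main__":
    # Hypothetical data: three pushes, only two of which reached the receiver.
    pushed = ["a1b2c3", "d4e5f6", "0789ab"]
    delivered = ["a1b2c3", "0789ab"]
    for sha in missing_deliveries(pushed, delivered):
        print(f"no webhook delivery recorded for commit {sha}")
```

Running a check like this on a schedule turns "we noticed hours later" into an alert within one polling interval.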

The Trust Gap

The core issue isn't that GitHub has outages — every platform does. The issue is the trust gap. When the status page says "All Systems Operational" while your builds are clearly broken, you lose confidence in two things simultaneously:

  1. The platform's reliability — it's worse than reported
  2. The platform's transparency — they're not telling you when things are broken

This double hit is worse than either problem alone. When the status page is unreliable, you can't tell whether a failure is "your problem" or "their problem." Debugging becomes guesswork. You waste hours checking your own configuration, your credentials, and your container images, only to discover that the registry was having a bad day and nobody told you.

Why Self-Hosting Solves This

When we moved to a self-hosted Forgejo runner on GKE, our CI availability became tied to GCP's SLA.

More importantly, when something breaks, we can see it, debug it, and fix it. The runner is our pod in our namespace. The logs are in our Cloud Logging (formerly Stackdriver). The metrics are in our Grafana. If the spot pool scales up slowly, we can see the cold start in our own dashboards. No more mystery outages.

What We Gave Up

Self-hosting means we own the maintenance. We need to upgrade the runner, manage the K8s manifests, and handle DinD (Docker-in-Docker) complexity. But these are known, solvable engineering problems — not mystery outages that take hours to even identify.

What We Gained

Cost Was Secondary

We'd have moved even if self-hosting cost more. The reliability and transparency gains alone justified the migration. That the cost also dropped by roughly a factor of 10 was a bonus, not the reason.
