
Infrastructure Root Cause: GitHub Reliability Limits

May 2026 · Part 1 of a 5-part series on migrating CI/CD from GitHub to Codeberg

The Reliability Problem

Kikitoru, a Japanese-language learning application running on Google Kubernetes Engine, executes its core CI/CD pipeline 10 to 20 times per working day. Under normal conditions, GitHub Actions performs acceptably as a managed service. When GitHub's own infrastructure fails, however, the pipeline is blocked outright, and because we have no host-level control over that infrastructure, the failure cannot be remediated from our side.

Over a multi-month observation window, a distinct failure pattern emerged: deployment pipelines failed while GitHub's status page reported normal operation. This discrepancy repeatedly forced the engineering team to spend time debugging our own application configuration and security credentials before isolating the root cause as an upstream provider degradation.

Health Metrics: Status Page vs. Operational Reality

GitHub reports service availability based on synthetic internal health checks executing within its own network boundary. This methodology yields an official availability metric between 99.5% and 99.9%.

From the outside, where an independent pipeline must compile assets and push containers across cloud provider boundaries, the degradation we observed was significantly higher than those figures suggest.
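To put the reported range in concrete terms, the percentages can be converted into a downtime allowance. The following is a minimal sketch; the function name and the 30-day window are assumptions chosen for illustration, not figures taken from GitHub.

```python
def downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# The two ends of the reported availability range, over an assumed 30-day window.
for pct in (99.5, 99.9):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} min per 30 days")
# 99.5% -> 216.0 min per 30 days
# 99.9% -> 43.2 min per 30 days
```

Even at the reported figures, the allowance ranges from under an hour to more than three and a half hours per month. A pipeline that runs 10 to 20 times per working day is likely to notice downtime on that scale.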

Documented Subsystem Failures

Silent Runner Queue Latency: Automated jobs frequently sat in an unlogged queued state (waiting for a runner). The delays varied unpredictably between 30 seconds and 10 minutes. Because the service eventually recovered on its own, these events never triggered provider incident reports.
Intermittent Container Registry Degradation (GHCR): docker push operations targeting ghcr.io intermittently returned HTTP 503 Service Unavailable. Retries occasionally succeeded, but unacknowledged registry outages frequently persisted for hours, blocking artifact promotion.
Unacknowledged Webhook Drops: Git push events routinely failed to deliver their payloads to downstream orchestration systems. When this happened the pipeline never started at all, so deployments were silently skipped and went undetected until someone noticed the missing builds manually.
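The retry behaviour described for the registry failures can be sketched as a capped exponential backoff that gives up with an explicit error. This is an illustrative sketch, not our pipeline code: attempt_push is a hypothetical callable that performs one push and returns an HTTP status code, and the attempt limits and delays are assumed values.

```python
import random
import time
from typing import Callable


class UpstreamUnavailable(Exception):
    """Raised when every attempt ended in a retryable upstream status."""


# Statuses treated as transient upstream faults (an assumption for this sketch).
RETRYABLE_STATUSES = {502, 503, 504}


def push_with_backoff(
    attempt_push: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attempt_push until it returns a non-retryable status.

    Returns the first non-retryable status, whether success or hard failure.
    Raises UpstreamUnavailable if all attempts hit a retryable status.
    """
    for attempt in range(1, max_attempts + 1):
        status = attempt_push()
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_attempts:
            # Capped exponential backoff with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise UpstreamUnavailable(
        f"{max_attempts} attempts returned a retryable status"
    )


# Simulated registry: two 503 responses, then success. Sleeping is disabled here.
responses = iter([503, 503, 201])
status = push_with_backoff(lambda: next(responses), sleep=lambda _: None)
print(status)  # 201
```

Raising a dedicated exception once the retries are exhausted labels the failure as upstream in the job log. It does not help with an outage that lasts for hours, which is the case that blocked artifact promotion.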

The Transparency Trust Gap

Infrastructure failures occur across all cloud systems; however, the primary operational constraint is the transparency gap. When a third-party status dashboard displays All Systems Operational during an active platform disruption, it introduces a dual dependency failure:

Availability Verification Failure: True platform availability falls below the reported metrics.
Telemetry Visibility Failure: The provider's own status telemetry cannot be trusted to show which side of the system boundary a fault is on.

This blind spot removes the ability to determine quickly whether a pipeline failure is an internal configuration error or an upstream service fault. Diagnosis degenerates into speculative troubleshooting, wasting engineering hours validating local configurations, access tokens, and image layers before an unannounced registry outage is finally identified.
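One way to shorten that triage is to classify a failure by the HTTP status it returned before anyone inspects local configuration. The sketch below is a first-pass heuristic only. The status-to-domain mapping is an assumption, and it cannot cover failures that return no status at all, such as queue delays or dropped webhooks.

```python
from enum import Enum


class FaultDomain(Enum):
    LOCAL = "local configuration or credentials"
    UPSTREAM = "upstream provider"
    UNKNOWN = "undetermined"


def classify_failure(status_code: int) -> FaultDomain:
    """First-pass guess at which side of the boundary a failed request points to."""
    if status_code in (401, 403, 404):
        # Authentication, authorization, or a wrong repository path:
        # check tokens and image names first.
        return FaultDomain.LOCAL
    if 500 <= status_code <= 599:
        # Server-side errors: suspect the provider before touching local config.
        return FaultDomain.UPSTREAM
    return FaultDomain.UNKNOWN


print(classify_failure(503).value)  # upstream provider
print(classify_failure(401).value)  # local configuration or credentials
```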

Architectural Remediation via Self-Hosting

Running a private Forgejo runner inside our existing GKE environment ties CI availability directly to the published Service Level Agreements (SLAs) of the cloud provider that already hosts our workloads:

Google Kubernetes Engine (GKE): 99.95% monthly uptime SLA.
Cloud SQL Engine: 99.99% monthly uptime SLA.
Google Artifact Registry (GAR): Localized within the same regional infrastructure and identity access management (IAM) boundary as the target cluster.
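When a pipeline depends on several services in series, their availabilities multiply. The following sketch combines the two SLA figures listed above under two assumptions: that failures are independent, and that the pipeline needs both services at once. An SLA is a contractual floor, not a measurement, so the result is a rough planning number.

```python
from math import prod


def composite_availability(*availabilities_pct: float) -> float:
    """Combined availability (percent) of services that must all be up at once.

    Assumes failures are independent, which real outages may not be.
    """
    return 100 * prod(a / 100 for a in availabilities_pct)


# GKE (99.95%) and Cloud SQL (99.99%) treated as serial dependencies.
combined = composite_availability(99.95, 99.99)
print(f"{combined:.4f}%")  # 99.9400%
```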

More significantly, this layout shifts the diagnostic boundary into our own observability stack. The runner operates as a native pod within a managed Kubernetes namespace. Execution output routes directly to internal log aggregators, and runtime performance metrics stream to localized Grafana instances.

System behaviors such as cold-start latency during spot instance provisioning are now visible on internal dashboards instead of being hidden behind a provider's status page.
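As an illustration of the kind of measurement this makes possible, cold-start latency can be derived from the timestamps Kubernetes records on a pod object. The sketch below assumes the pod has already been fetched as a plain dictionary, for example parsed from JSON. The sample pod and its timestamps are hypothetical, not measurements from our cluster.

```python
from datetime import datetime


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes RFC 3339 timestamp such as 2026-05-01T09:00:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def cold_start_seconds(pod: dict) -> float:
    """Seconds from pod creation until its first container started running.

    Raises ValueError if no container is in the running state yet.
    """
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    started = min(
        parse_k8s_time(cs["state"]["running"]["startedAt"])
        for cs in pod["status"]["containerStatuses"]
        if "running" in cs.get("state", {})
    )
    return (started - created).total_seconds()


# Hypothetical pod object, trimmed to the fields used above.
sample_pod = {
    "metadata": {"creationTimestamp": "2026-05-01T09:00:00Z"},
    "status": {
        "containerStatuses": [
            {"state": {"running": {"startedAt": "2026-05-01T09:01:35Z"}}}
        ]
    },
}
print(cold_start_seconds(sample_pod))  # 95.0
```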

Engineering Trade-Off Analysis

1. Operational Overhead (Liabilities)

Moving away from a managed service means the engineering team takes on full lifecycle maintenance of the CI/CD plane. This includes applying runner updates, maintaining Kubernetes manifests, and resolving Docker-in-Docker (DinD) storage and security issues. These are well-defined, predictable engineering tasks rather than untraceable infrastructure failures.

2. Infrastructure Optimization (Gains)

Observability: Complete access to all runner stdout logs, pod lifecycle events, and node autoscaling metrics via internal monitoring.
Control: Direct administrative authority to scale worker limits, pin execution dependencies, and optimize compute configurations without external dependencies.
Expenditure Efficiency: Compute resource costs on a dedicated spot pool scale down to zero during idle windows, averaging approximately $3.00 per month under normal operation.
SLA Alignment: The delivery pipeline shares the identical operational risk profile and uptime metrics as the production workload environments.

3. Financial Implications

The decision to migrate was driven entirely by system reliability and telemetry visibility requirements. The architecture shift would have been executed even if self-hosting incurred higher monthly compute costs. The 10x reduction in operational expenditures remains a secondary structural benefit rather than the primary driver of the project.