Back to projects
Mar 15, 2020
3 min read

Hyperscale Network Observability Platform

Reduced Oracle Cloud Infrastructure Gen1 & Gen2 outages by 58% using open source observability tools and a small, focused team

The Mission

When Oracle recruited me to lead OCI Gen1 & Gen2 Network Reliability Engineering, the mandate was clear: transform infrastructure reliability for some of the world’s most critical workloads.

I built a team of 12 exceptional SREs and together we delivered:

  • 58% reduction in network-related outages
  • 25% reduction in total SaaS outage minutes
  • Zero compromises on security or performance

This is the story of how we did it, working alongside the former Amazon and Google builders who were creating OCI from the ground up.

The Scale

12 engineers. 116,000 devices. 64 data centers. Zero margin for error.

Our infrastructure powered:

  • Every Oracle SaaS customer globally
  • The White House communications systems
  • U.S. Marine Corps operations
  • Critical systems for every major bank
  • Sovereign state infrastructure

One minute of downtime could impact national security or global financial markets. There was no room for “good enough.”

Our Approach: Radical Simplicity

While others threw complexity at scale, we went the opposite direction. Our entire observability stack was built on three principles:

1. Open Source Excellence

  • Prometheus for metrics (handling 10+ billion data points daily)
  • Alertmanager for intelligent routing
  • Grafana for visualization
  • Python for everything else

2. Force Multipliers, Not More People Each engineer owned ~10,000 devices. This only worked because we:

  • Built tools that built tools
  • Made observability self-service
  • Automated ourselves out of repetitive work
  • Turned every incident into a permanent fix

3. SRE as a Religion We lived by Google’s SRE principles, adapted for Oracle scale:

  • Error budgets that meant something
  • Blameless postmortems that actually fixed problems
  • Change controls that prevented disasters
  • JIRA workflows that didn’t suck

The magic wasn’t in any single tool—it was in the discipline of doing simple things exceptionally well.

The Results Speak

For Oracle:

  • 25% reduction in total SaaS outage minutes
  • 58% fewer network incidents
  • MTTR dropped from hours to minutes
  • Scaled confidently through 300% growth

For Our Customers:

  • The White House stayed connected
  • Marine Corps operations ran uninterrupted
  • Banks processed billions without network failures
  • Sovereign states maintained critical infrastructure

For the Industry:

  • Proved open source could handle nation-state scale
  • Set new benchmarks for lean engineering teams
  • Influenced how Oracle approaches all infrastructure

Lessons Learned

Leading this team taught me fundamental truths about hyperscale operations:

  1. Small teams can move mountains when given the right tools and trust
  2. Simple, reliable tools beat complex systems every time
  3. Open source at scale isn’t just viable—it’s superior
  4. Trust is earned through consistent execution, not promises
  5. Every outage is a teacher if you’re willing to learn

The experience of keeping the White House, Marine Corps, and global financial systems online with 12 engineers and open source tools shapes everything I build today. It’s proof that with the right approach, small teams can deliver nation-state level reliability.

Let's Build AI That Works

Interested in building similar solutions?