The Mission
When Oracle recruited me to lead OCI Gen1 & Gen2 Network Reliability Engineering, the mandate was clear: transform infrastructure reliability for some of the world’s most critical workloads.
I built a team of 12 exceptional SREs and together we delivered:
- 58% reduction in network-related outages
- 25% reduction in total SaaS outage minutes
- Zero compromises on security or performance
This is the story of how we did it, working alongside the former Amazon and Google builders who were creating OCI from the ground up.
The Scale
12 engineers. 116,000 devices. 64 data centers. Zero margin for error.
Our infrastructure powered:
- Every Oracle SaaS customer globally
- The White House communications systems
- U.S. Marine Corps operations
- Critical systems for every major bank
- Sovereign state infrastructure
One minute of downtime could impact national security or global financial markets. There was no room for “good enough.”
Our Approach: Radical Simplicity
While others threw complexity at scale, we went the opposite direction. Our entire observability stack was built on three principles:
1. Open Source Excellence
- Prometheus for metrics (handling 10+ billion data points daily)
- Alertmanager for intelligent routing
- Grafana for visualization
- Python for everything else
2. Force Multipliers, Not More People Each engineer owned ~10,000 devices. This only worked because we:
- Built tools that built tools
- Made observability self-service
- Automated ourselves out of repetitive work
- Turned every incident into a permanent fix
3. SRE as a Religion We lived by Google’s SRE principles, adapted for Oracle scale:
- Error budgets that meant something
- Blameless postmortems that actually fixed problems
- Change controls that prevented disasters
- JIRA workflows that didn’t suck
The magic wasn’t in any single tool—it was in the discipline of doing simple things exceptionally well.
The Results Speak
For Oracle:
- 25% reduction in total SaaS outage minutes
- 58% fewer network incidents
- MTTR dropped from hours to minutes
- Scaled confidently through 300% growth
For Our Customers:
- The White House stayed connected
- Marine Corps operations ran uninterrupted
- Banks processed billions without network failures
- Sovereign states maintained critical infrastructure
For the Industry:
- Proved open source could handle nation-state scale
- Set new benchmarks for lean engineering teams
- Influenced how Oracle approaches all infrastructure
Lessons Learned
Leading this team taught me fundamental truths about hyperscale operations:
- Small teams can move mountains when given the right tools and trust
- Simple, reliable tools beat complex systems every time
- Open source at scale isn’t just viable—it’s superior
- Trust is earned through consistent execution, not promises
- Every outage is a teacher if you’re willing to learn
The experience of keeping the White House, Marine Corps, and global financial systems online with 12 engineers and open source tools shapes everything I build today. It’s proof that with the right approach, small teams can deliver nation-state level reliability.