Lead Site Reliability Engineer
Bayone Solutions Inc
San Ramon, CA (Onsite)
Full-Time
Job Description:
- As a Senior/Lead Site Reliability Engineer, you'll take ownership of the reliability, performance, and scalability of high-traffic retail platforms.
- This role demands deep experience in cloud-native environments, a strong observability mindset (with New Relic as a must), and the ability to lead both incident response and system design discussions with client teams.
- You'll serve as a technical leader and mentor, partnering with engineering, DevOps, and product teams to build resilient systems for real-time retail operations, including eCommerce platforms like Shopify (bonus).
Key Responsibilities:
- Lead reliability and observability strategy for large-scale retail systems.
- Architect and implement robust monitoring using New Relic: dashboards, SLOs, alerts, synthetic monitoring, etc.
- Guide incident response processes and run blameless postmortems.
- Own availability, performance, and scalability of customer-facing apps and services.
- Design infrastructure for high availability using Kubernetes, Docker, and IaC tools (Terraform, CloudFormation).
- Collaborate with client engineering teams to optimize system behavior during retail surges (e.g., Black Friday).
- Mentor junior SREs and set operational best practices.
- Partner with dev and QA to integrate performance testing and failure injection into CI/CD workflows.
- Advocate for DevOps/SRE best practices (shift-left monitoring, chaos testing, performance budgets).
Requirements:
- 8+ years in Site Reliability Engineering, DevOps, or Platform Engineering.
- Expertise with New Relic; must be able to architect observability end-to-end.
- Proven experience supporting retail or eCommerce platforms at scale.
- Strong coding/scripting (Python, Bash, or Go).
- Production experience with AWS/GCP/Azure and Kubernetes.
- Deep understanding of infrastructure automation (Terraform, Ansible, or Pulumi).
- Strong communication skills, client-facing presence, and leadership ability.
Nice to Have:
- Experience with Shopify or headless commerce stacks.
- Experience leading distributed teams.
- Familiarity with traffic-heavy retail events and strategies (caching, autoscaling, edge optimization).
- Experience integrating monitoring into microservices, APIs, and frontend apps.