Lead Site Reliability Engineer - Remote

Location: Remote, USA

Description

Job #: 83573
If you are looking for a high-impact Site Reliability role with a global leader in digital transformation, EPAM is the perfect next step in your career! As an EPAMer, you’ll have the opportunity to work with a supportive team on a variety of interesting projects for some of the biggest brands in the world. Apply now!

Req.#365457185

Responsibilities

  • Lead development teams through architectural reviews and recommendations
  • Define what it means for a service to be available and develop, monitor, and alert on SLIs/SLOs
  • Define, track, and enforce error budgets
  • Review code instrumentation with development teams and ensure necessary dashboards are created to monitor SLI/SLO/SLAs
  • Establish, test, and tune alerting for varying tiers of applications
  • Participate in the on-call rotation
  • Document and maintain runbooks and procedures, automating as much as possible
  • Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection)
  • Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning
  • Design and implement peak readiness reviews for anticipated high-volume times
  • Lead weekly operational state reviews covering performance trends, anomalies, errors and other availability events with SREs, product owners, and development teams
  • Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
  • Socialize SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision-making and planning

Requirements

  • 5+ years of SRE or Systems Engineering experience
  • 2+ years as team lead or SRE champion
  • Bachelor's degree in Computer Science, similar technical field of study, or equivalent practical experience
  • Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
  • Strong communication and collaboration skills for varying groups of stakeholders
  • Self-motivated, with the ability to prioritize effectively among competing demands
  • Experience with implementing SRE practices for services and applications deployed in production in the cloud
  • Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems

Benefits

  • Medical, Dental and Vision Insurance (Subsidized)
  • Health Savings Account
  • Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
  • Short-Term and Long-Term Disability (Company Provided)
  • Life and AD&D Insurance (Company Provided)
  • Employee Assistance Program
  • Unlimited access to LinkedIn learning solutions
  • Matched 401(k) Retirement Savings Plan
  • Paid Time Off
  • Legal Plan and Identity Theft Protection
  • Accident Insurance
  • Employee Discounts
  • Pet Insurance
  • Employee Stock Purchase Program

About EPAM

  • EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential

Additional

  • This position operates in a remote capacity, but you must live within driving distance of an EPAM office. Your recruiter will discuss specific details about work location during the initial interview process.
