Lead Site Reliability Engineer - Remote
Location: Remote, Canada

Description

Job #: 83574
If you are looking for a high-impact Site Reliability role with a global leader in digital transformation, EPAM is the perfect next step in your career! As an EPAMer, you’ll have the opportunity to work with a supportive team on a variety of interesting projects for some of the biggest brands in the world. Apply now!

Req.#365457185

Responsibilities

  • Lead development teams through architectural reviews and recommendations
  • Define what it means for a service to be available and develop, monitor, and alert on SLIs/SLOs
  • Define, track, and enforce error budgets
  • Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs/SLOs/SLAs
  • Establish, test, and tune alerting for varying tiers of applications
  • Participate in the on-call rotation
  • Document and maintain runbooks and procedures, automating as much as possible
  • Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection)
  • Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning
  • Design and implement peak readiness reviews for anticipated high-volume times
  • Lead weekly operational state reviews covering performance trends, anomalies, errors and other availability events with SREs, product owners, and development teams
  • Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
  • Socialize SRE culture across teams within the organization to publicize the value of SRE, and mentor and train other engineers in proactive reliability decision-making and planning

Requirements

  • 5+ years of SRE or Systems Engineering experience
  • 2+ years as team lead or SRE champion
  • Bachelor's degree in Computer Science, similar technical field of study, or equivalent practical experience
  • Proven experience troubleshooting, mitigating, and resolving issues in a distributed system
  • Strong communication and collaboration skills for varying groups of stakeholders
  • Self-motivated, with the ability to prioritize effectively among competing demands
  • Experience with implementing SRE practices for services and applications deployed in production in the cloud
  • Must understand most SRE concepts, including SLI/SLO/SLA, Error Budget, MTTD/MTTR/MTBF, Toil, Capacity Planning, Observability, Monitoring/Alerting, Release Engineering, and Incident Management/Blameless Post-Mortems

Benefits

  • Extended Healthcare with Prescription Drugs, Dental and Vision Insurance (Company Paid)
  • Life and AD&D Insurance (Company Paid)
  • Employee Assistance Program (Company Paid)
  • Long-Term Disability
  • Registered Retirement Savings Plan (RRSP) with company match
  • Paid Time Off
  • Critical Illness Insurance
  • Employee Discounts
  • Unlimited access to LinkedIn learning solutions
  • Employee Stock Purchase Program

About EPAM

  • EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
