
Site Reliability Engineer Dallas, TX, USA

Site Reliability Engineer Description

Job #: 74441
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.


· We are looking for an experienced Site Reliability Engineer (SRE) to be a key member of our SRE team: help build it, educate the development and management teams on SRE thinking, establish best practices, and help open the new clinics on time.
· We use Microsoft Azure as our main cloud delivery platform. You will be required to establish the needed infrastructure (Continuous Integration/Continuous Delivery, monitoring, alerting, etc.) and to ensure the stability of our services, which clinical patients will now depend on.
· We are responsible for ensuring high availability and low latency for the multitude of HTTP requests/events we receive every day.

Req. #294949567

What You’ll Do

  • Automate everything, especially in Azure: build platform tools and dashboards
  • Be proficient with Splunk and with building dashboards (Grafana and/or Prometheus)
  • Apply Azure knowledge to production issues related to hardware and resource constraints: take corrective action and understand the operational perspective and telemetry
  • Create and execute a forward-looking technology roadmap (we love DCOS & Docker)
  • Keep up with industry trends to ensure we are using the best tools and services
  • Own and improve our integration, deployment and monitoring story
  • Work closely with developers to solve systems problems
  • Collaborate with the security team to deliver world-class software


  • To support this explosive growth, we are looking for an SRE who has implemented SRE practices for services and applications deployed in production to the cloud (Azure and GCP). Candidates must understand SRE concepts well, including SLIs/SLOs/SLAs, error budgets, toil, capacity planning, monitoring/observability, release engineering, and incident management. This is both a hands-on and an advisory role, so strong communication and facilitation skills are required.

Nice to have

  • Understanding of the healthcare domain in general and of EPIC EMR specifically, including its tools for observability (System Pulse)
  • Terraform
  • Google Cloud Platform

What We Offer

  • Medical, Dental and Vision Insurance (Subsidized)
  • Health Savings Account
  • Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
  • Short-Term and Long-Term Disability (Company Provided)
  • Life and AD&D Insurance (Company Provided)
  • Employee Assistance Program
  • Unlimited access to LinkedIn learning solutions
  • Matched 401(k) Retirement Savings Plan
  • Paid Time Off
  • Legal Plan and Identity Theft Protection
  • Accident Insurance
  • Employee Discounts
  • Pet Insurance
