Site Reliability Engineer

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of critical services by bridging development and operations.

Job Description

The role focuses on building scalable infrastructure, applying SRE practices such as defining service level objectives (SLOs) and service level indicators (SLIs), and reducing operational toil. The SRE works closely with development and operations teams to improve reliability and foster a culture of continuous learning.

Day-to-day Duties

  • Design and implement resilient system architectures for high availability and scalability.
  • Develop automation tools and scripts to improve operational efficiency.
  • Define, track, and analyze SLOs and SLIs for performance and reliability.
  • Conduct post-mortem analyses and implement improvements based on findings.
  • Collaborate on best practices for system reliability and incident management.
  • Troubleshoot and resolve database, network, and deployment issues.
  • Ensure issues are resolved within the targets set by Service Level Agreements (SLAs).
  • Identify and address system performance bottlenecks with actionable recommendations.
  • Maintain documentation for processes and incident responses.

Successful candidates shall possess:

  • Proficiency in at least one programming language such as Python, Go, or Java.
  • Experience in system architecture with a focus on reliability and scalability.
  • Strong understanding of SRE principles (SLOs, SLIs, toil reduction).
  • Experience with cloud environments (AWS, Azure, Google Cloud).
  • Expertise in Linux system administration.
  • Problem-solving skills with a proactive approach to operational challenges.
  • Ability to work independently and collaborate in a team environment.

Preferred skills:

  • Familiarity with monitoring tools and performance optimization.
  • Experience with system administration automation and scripting.
  • Knowledge of networking concepts and troubleshooting.
  • Hands-on experience with cloud platforms and services.
  • Familiarity with DevOps practices (CI/CD, infrastructure as code, containerization).
