Job Description
We are seeking a dedicated Site Reliability Engineer (SRE) to enhance the reliability, scalability, and efficiency of our cloud-based systems.
The ideal candidate will possess a strong background in cloud infrastructure, automation, and incident management, with a focus on optimizing both system performance and developer productivity.
Key Responsibilities
- Design, implement, and manage scalable and secure cloud infrastructure using Infrastructure as Code (IaC) methodologies.
- Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and uphold software reliability and performance.
- Monitor infrastructure costs, providing transparency and implementing strategies for cost optimization.
- Streamline development workflows to reduce cognitive load for engineers, enhancing efficiency and effectiveness.
- Build and maintain robust Continuous Integration/Continuous Deployment (CI/CD) pipelines to expedite the delivery of code to customers.
- Develop comprehensive observability solutions for end-to-end system monitoring, ensuring issues are detected and addressed promptly.
- Lead and continuously improve the incident management process to minimize system downtime and impact.
- Participate in the on-call rotation, acting as a first responder to swiftly address and resolve system issues.
- Create and maintain incident response playbooks, and conduct post-mortem analyses to prevent recurrence.
Competencies
- Adaptability
- Ambition
- Effective Communication
- Mentorship
- Ownership
- Technical Proficiency
- Productivity
- Trustworthiness
