The Site Reliability Engineer is a critical role on the technology team, responsible for application performance, availability, reliability and system uptime. The candidate will provide consultation and strategic recommendations by quickly assessing and remediating complex platform availability issues. The Site Reliability Engineer Lead will dive head-first into creating or applying innovative solutions and techniques that advance the reliability of Digital products.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Own entire platforms (production environments): deploy, automate, maintain and manage production systems to ensure their availability, performance, scalability and security.
- Help evolve our configuration management (CM) efforts and our move to containers
- Represent production support and site reliability in stand-ups, planning sessions, code reviews, and architecture reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Collaborate with Agile teams to define technical requirements and best practices for containerized and cloud-native applications
- Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement.
- Help the head of operations select enthusiastic, technically knowledgeable team members, and guide existing team members.
Qualification & Experience:
- Good experience with distributed systems such as RabbitMQ, Kafka and Redis