Lead Site Reliability Engineer
About this position
Responsibilities
• Lead the design, implementation, and operation of highly available and scalable infrastructure solutions to support our organization's applications and services.
• Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
• Improve the monitoring, alerting, and resilience of systems.
• Practice sustainable incident response and blameless postmortems.
Requirements
• 5-10 years of experience in a Site Reliability Engineering (SRE) role, with a proven track record of designing and implementing scalable and reliable infrastructure solutions.
• Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
• Experience in designing, analyzing, and troubleshooting microservices.
• Understanding of monitoring, logging, and tracing systems (such as ELK, Prometheus, Grafana, and Jaeger) that help teams detect problems quickly.
• Linux and network administration experience for troubleshooting is an advantage.
• Familiarity with a cloud platform (AWS or Google Cloud) and Kubernetes is an advantage.
• Experience programming in Go or a similar language is an advantage.
• Experience designing and managing MongoDB and MySQL databases is an advantage.
• Knowledge of security and security testing is an advantage.