Cloud Site Reliability Engineer (SRE)
About this position
The Cloud Site Reliability Engineer (SRE) is responsible for designing, implementing, and managing cloud infrastructure while ensuring system performance and reliability.
Responsibilities
• Design, implement, and manage cloud infrastructure using best practices.
• Monitor system performance and reliability; troubleshoot and resolve incidents promptly.
• Collaborate with development teams to ensure services are built with reliability and scalability in mind.
• Automate manual processes through scripting and tool development.
• Implement and manage CI/CD pipelines to streamline deployment processes.
• Conduct capacity planning and performance tuning to optimize system efficiency.
• Develop and maintain documentation for system architecture, processes, and procedures.
• Participate in on-call rotations to respond to system alerts and outages.
• Foster a culture of reliability by driving initiatives around incident management, postmortems, and root cause analysis.
Requirements
• Bachelor’s degree in Computer Science, Engineering, or a related field.
• Proven experience with cloud platforms (AWS, Azure, GCP).
• Strong understanding of system architecture, networking, and security best practices.
• Proficiency in programming/scripting languages (Python, Go, Bash, etc.).
• Experience with containerization and orchestration tools (Docker, Kubernetes).
• Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK stack).
• Experience with infrastructure-as-code tools (Terraform, CloudFormation).
• Knowledge of DevOps practices and methodologies.
• Familiarity with agile development processes.
• Excellent problem-solving skills and attention to detail.
• Strong communication and collaboration skills.
• Willingness to relocate.