Cloud Site Reliability Engineer (SRE)
About this position
The Cloud Site Reliability Engineer (SRE) is responsible for designing, implementing, and managing cloud infrastructure while ensuring system performance, reliability, and scalability.
Responsibilities
• Design, implement, and manage cloud infrastructure using best practices.
• Monitor system performance and reliability; troubleshoot and resolve incidents promptly.
• Collaborate with development teams to ensure services are built with reliability and scalability in mind.
• Automate manual processes through scripting and tool development.
• Implement and manage CI/CD pipelines to streamline deployment processes.
• Conduct capacity planning and performance tuning to optimize system efficiency.
• Develop and maintain documentation for system architecture, processes, and procedures.
• Participate in on-call rotations to respond to system alerts and outages.
• Foster a culture of reliability by driving initiatives around incident management, postmortems, and root cause analysis.
Requirements
• Bachelor’s degree in Computer Science, Engineering, or a related field.
• Proven experience with cloud platforms (AWS, Azure, GCP).
• Strong understanding of system architecture, networking, and security best practices.
• Proficiency in programming/scripting languages (Python, Go, Bash, etc.).
• Experience with containerization and orchestration tools (Docker, Kubernetes).
• Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK stack).
• Experience with infrastructure as code tools (Terraform, CloudFormation).
• Knowledge of DevOps practices and methodologies.
• Familiarity with agile development processes.
• Excellent problem-solving skills and attention to detail.
• Strong communication and collaboration skills.
• Good command of English (minimum TOEIC score of 750).
• Goal-oriented, team-minded, eager to learn, and flexible.