1,495 DevOps Engineer jobs in Singapore
AliCloud DevOps Engineers
Posted today
Job Description
Menrva Group is working with a key partner of AliCloud to help them build out a team of high-performing Technical Consultants with hands-on AliCloud implementation and migration experience. Following significant recent investment in Southeast Asia, AliCloud is growing its presence outside the China market, and Singapore is a key location for accelerating its international expansion. The ideal candidate will have deep expertise in AliCloud services, infrastructure-as-code, and CI/CD pipelines, and will drive a culture of automation and reliability.
Key Responsibilities
- AliCloud Infrastructure Management: Design, implement, and manage scalable, highly available, and secure infrastructure on AliCloud using services like ECS, VPC, RDS, ACK (Alibaba Cloud Container Service for Kubernetes), OSS, and CDN.
- Automation and IaC: Develop and maintain Infrastructure-as-Code (IaC) using Terraform (or similar tools like CloudFormation/ROS) to provision and manage AliCloud resources consistently and efficiently.
- CI/CD Pipeline Development: Design, build, and optimize robust and automated Continuous Integration and Continuous Delivery (CI/CD) pipelines using tools like Jenkins, GitLab CI, or Alibaba Cloud's CodePipeline/CodeFlow.
- Monitoring and Logging: Implement comprehensive monitoring, alerting, and logging solutions using AliCloud services (e.g., CloudMonitor, Log Service) and third-party tools (e.g., Prometheus, Grafana, ELK stack).
- Security and Compliance: Enforce security best practices across the infrastructure using services like RAM, Security Center, and WAF. Ensure compliance with internal and industry standards.
- Troubleshooting and Optimization: Act as a subject matter expert for deep-level troubleshooting, performance tuning, and capacity planning for mission-critical applications.
- Collaboration and Mentorship: Work closely with development and operations teams. Mentor junior engineers, promote DevOps best practices, and drive cultural change toward automation and SRE principles.
- Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or a similar role, with at least 3 years of hands-on experience specifically with Alibaba Cloud (AliCloud).
- AliCloud Expertise: Deep knowledge of core AliCloud services (e.g., ECS, VPC, OSS, RDS, SLB, ApsaraDB, ACK/Container Service). AliCloud certification (ACP/ACE) is highly preferred.
- Containerization & Orchestration: Expert-level experience with Docker and Kubernetes (AKS, EKS, or specifically AliCloud ACK).
- Infrastructure-as-Code (IaC): Proficient in writing and maintaining Terraform configurations for complex cloud environments.
- Scripting: Strong scripting skills in languages like Python, Bash, or Go.
- CI/CD: Proven experience designing and implementing complex CI/CD workflows.
- Operating Systems: Strong background in Linux system administration.
- Networking: Solid understanding of cloud networking concepts (VPC, subnets, routing, NAT, Load Balancing).
Please register your interest by uploading your latest CV to the portal.
Cloud Site Reliability Engineer
Posted today
Job Description
As a Cloud Site Reliability Engineer, you will be instrumental in ensuring the reliability, scalability, and performance of our hybrid cloud infrastructure across Azure and AWS. You will collaborate with engineering and cloud platform teams to build resilient, observable, and automated systems that support rapid delivery and high availability of services.
Key Responsibilities:
- Lead SRE initiatives to improve availability, reliability, and performance of cloud-native and hybrid applications.
- Design and implement observability frameworks across Azure and AWS using tools like CloudWatch, Azure Monitor, Prometheus, and Grafana.
- Drive automation and infrastructure-as-code practices to reduce operational toil and streamline deployments.
- Collaborate with application teams to define and implement SLIs, SLOs, and Error Budgets for cloud-hosted services.
- Champion chaos engineering and resilience testing across Azure and AWS environments.
- Work with enterprise teams to deploy and scale SRE enablers such as service mesh, auto-scaling, and CI/CD pipelines.
- Establish and enforce cloud infrastructure deployment standards, including blue-green and canary deployments.
- Support cloud migration strategies, cutover planning, and testing for applications transitioning between Azure and AWS.
Requirements:
- Minimum 10 years of experience in SRE or Cloud Engineering, preferably within the banking or financial services sector.
- Deep expertise in Azure and AWS cloud platforms, including compute, networking, storage, and security services.
- Strong understanding of ITIL and SRE frameworks, with the ability to integrate traditional operations with modern cloud practices.
- Proven leadership in coordinating with application teams and vendors for cloud deployment and migration planning.
- Hands-on experience with infrastructure-as-code tools (e.g., Terraform, Bicep, CloudFormation) and scripting (Bash, Python).
- Certifications in AWS (e.g., Solutions Architect, DevOps Engineer) and Azure (e.g., Azure Administrator, Azure Solutions Architect) are highly desirable.
- Experience with monitoring and alerting tools across both cloud platforms.
- Solid grasp of SRE principles: Toil reduction, SLIs/SLOs, Error Budgets, MTTD/MTTR.
- Strong interpersonal and communication skills to foster collaboration across teams and stakeholders.
- Agile mindset with experience in DevOps, CI/CD, and cloud-native development practices.
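For candidates gauging the SLO and Error Budget vocabulary in this listing, a minimal Python sketch of the underlying arithmetic may help. The function names, the 30-day window, and the sample figures are illustrative assumptions and are not taken from the role description.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures actually observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO over 30 days.
    print(round(error_budget_minutes(0.999), 1))
    # Hypothetical traffic: 1,000,000 requests, 500 of them failed.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

Under these assumed numbers, a 99.9% SLO tolerates roughly 43 minutes of downtime per 30 days, and 500 failures out of a million requests would consume about half of the budget.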
Cloud Site Reliability Engineer
Posted today
Job Description
Cloud Site Reliability Engineer (AWS)
Job Purpose
Ensure reliable, secure, and automated cloud operations supporting mission-critical systems and compliance needs.
Job Responsibilities
Manage and support AWS cloud services ensuring uptime, scalability, and security compliance.
Design and maintain Infrastructure-as-Code pipelines using Terraform, CloudFormation, and Ansible.
Oversee operating system patching cycles across Linux and Windows environments efficiently.
Monitor and troubleshoot performance issues, proactively preventing incidents and downtime.
Document processes, maintain runbooks, and adhere to strict compliance and audit standards.
Job Requirements
8+ years’ cloud/DevOps experience with 5+ years in regulated environments.
Strong AWS expertise across compute, networking, databases, and security services.
Hands-on experience with Terraform, CloudFormation, Ansible, and automation practices.
Certified AWS Solutions Architect (Associate/Professional) with RHCE or Windows certification.
Familiarity with ITIL, incident response, and compliance-driven cloud operations.
The successful Cloud Site Reliability Engineer (AWS) must possess deep AWS expertise, strong automation skills, and proven experience in uptime-critical, compliance-driven environments.
Join a cutting-edge technology team in a growing market, ensuring uptime, automation, and reliability of large-scale AWS platforms—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.
Cloud Site Reliability Engineer
Posted today
Job Description
Cloud Site Reliability Engineer (AWS)
An excellent Cloud Site Reliability Engineer opportunity has just arisen in a global brand supporting mission‐critical government systems.
Job Purpose
Ensure reliable, secure, and automated cloud operations supporting mission‐critical systems and compliance needs.
Responsibilities
Manage and support AWS cloud services ensuring uptime, scalability, and security compliance.
Design and maintain Infrastructure‐as‐Code pipelines using Terraform, CloudFormation, and Ansible.
Oversee operating system patching cycles across Linux and Windows environments efficiently.
Monitor and troubleshoot performance issues, proactively preventing incidents and downtime.
Document processes, maintain runbooks, and adhere to strict compliance and audit standards.
Qualifications
8+ years’ cloud/DevOps experience with 5+ years in regulated environments.
Strong AWS expertise across compute, networking, databases, and security services.
Hands‐on experience with Terraform, CloudFormation, Ansible, and automation practices.
Certified AWS Solutions Architect (Associate/Professional) with RHCE or Windows certification.
Familiarity with ITIL, incident response, and compliance‐driven cloud operations.
The successful Cloud Site Reliability Engineer (AWS) must possess deep AWS expertise, strong automation skills, and proven experience in uptime‐critical, compliance‐driven environments.
Join a cutting‐edge technology team in a growing market, ensuring uptime, automation, and reliability of large‐scale AWS platforms—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.
Site Reliability Engineer
Posted today
Job Description
Company Description
Higogame is a trailblazer in the mobile gaming and entertainment industry. Since our inception in late 2020, we have been dedicated to transforming the gaming landscape in Southeast Asia and beyond, delivering innovative and immersive experiences that engage millions of players around the globe.
- Our revenue has seen remarkable growth year after year, with operations extending across multiple regions worldwide.
- In just three years, we've risen to become one of the top two games of our kind in the local market.
- We proudly serve around 2 million active users daily and have a total monthly active user base of 5 million worldwide.
- Our team consists of over 200 talented employees, including a robust R&D division of more than 100 experts.
- We offer exceptional career development opportunities and foster a multinational culture that empowers everyone to reach their full potential.
Join us as we continue to push the boundaries of mobile gaming.
Job Responsibilities:
Responsible for the full lifecycle management of the company's global/multi-region infrastructure. Lead the setup of the Singapore physical data center and deep operations of Google Cloud (GCP) platform. Drive automation and intelligent operations systems to ensure high availability, low cost, and scalable business operations. The role requires both traditional data center operations experience and cloud-native technical vision, acting as a key technical backbone connecting physical resources with cloud capabilities.
I. Core Responsibilities
1. Physical Data Center Planning & Implementation
- Lead end-to-end management of self-built/hosted data centers: requirements analysis, architecture design (network/power/cooling/cabling), equipment selection (servers/switches/UPS), construction acceptance, and post-operations optimization.
- Design multi-data-center disaster recovery architectures (e.g., active-active, or three data centers across two sites), including cross-site synchronization and failover strategies to ensure business continuity.
- Manage internal resource backup/disaster recovery, including art assets, code, and other data assets.
2. Google Cloud (GCP) Deep Operations & Optimization
- Design and manage GCP architecture (Compute Engine, VPC, Cloud Storage, GKE, BigQuery, etc.), supporting cloud migration and hybrid cloud deployment of core business systems.
- Lead full lifecycle management of cloud resources, including cost optimization (reserved instances, autoscaling, idle resource reclamation), performance tuning (network latency, storage IOPS, compute utilization), and security hardening (IAM governance, encryption policies, vulnerability scanning).
- Build cloud-native ops systems using Cloud Monitoring/Logging for real-time alerting and fault detection.
3. Automation & Intelligent Operations Systems
- Lead development and integration of operations toolchains (e.g., Ansible/Puppet automation, Prometheus+Grafana monitoring, ELK logging) to shift operations from manual to platform-based and intelligent.
- Integrate CI/CD pipelines with cloud platforms, optimizing deployment efficiency and stability of containerized (K8s) and serverless (Cloud Functions) workloads.
- Lead root cause analysis (RCA) and postmortems of major incidents, deliver improvement plans, and strengthen contingency planning and drills (e.g., data center power outage, cloud region failure).
4. Cross-Team Collaboration & Technical Enablement
- Collaborate with R&D, QA, and Product teams to provide infrastructure support for rapid business delivery.
- Develop operations standards and technical documentation, drive team knowledge sharing, and mentor junior engineers.
II. Requirements
Basic Qualifications
- Bachelor's degree or higher in Computer Science, Network Engineering, Cloud Computing, or related fields.
- 5+ years in IT operations, including 3+ years in physical data center build/ops, and 2+ years of hands-on GCP experience (must provide project examples).
- Experience in large-scale distributed systems, with solid knowledge of Linux, network protocols (TCP/IP, SDN), and high availability database architectures (MySQL/Redis).
- Must be able to converse in Mandarin, as the role involves travel to China and communication with Chinese-speaking stakeholders.
- Must be able to travel (up to 50% of the time).
Technical Skills
Data Center
- Familiar with infrastructure (power/cooling/fire safety/cabling), and optimization metrics like PUE/CUE.
- Experience in IDC hosting, custom data center builds, or third-party acceptance audits. Knowledge of industry standards (e.g., GB50174 Data Center Design Standard).
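For reference, the PUE and CUE metrics named in this listing are commonly defined (following The Green Grid) as ratios over IT equipment energy. The short Python sketch below shows the arithmetic; the function names and the sample figures are hypothetical and are not part of the posting.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("it_equipment_kwh must be positive")
    return total_facility_kwh / it_equipment_kwh


def cue(total_co2e_kg: float, it_equipment_kwh: float) -> float:
    """Carbon Usage Effectiveness: kg CO2e emitted per kWh of IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("it_equipment_kwh must be positive")
    return total_co2e_kg / it_equipment_kwh


if __name__ == "__main__":
    # Hypothetical month: 1,500 kWh drawn by the whole facility,
    # 1,000 kWh of that by IT equipment, 600 kg CO2e emitted.
    print(pue(1500.0, 1000.0))
    print(cue(600.0, 1000.0))
```

A PUE closer to 1.0 means less energy is spent on overhead such as cooling and power distribution relative to the IT load.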
Google Cloud (GCP)
- Proficient in GCP core services: GCE, VPC, Cloud SQL/Spanner, GKE.
- Skilled in GCP cost management (Budgets & Alerts, preemptible VMs, storage tiers).
- Strong in GCP security: IAM, VPC Service Controls, Cloud Firewall, KMS.
Automation & Toolchains
- Skilled with Terraform/Ansible for IaC, scripting in Shell/Python/Go for ops tooling.
- Experienced in Prometheus+Grafana monitoring, ELK/OpenTelemetry for logging & tracing.
- Hands-on Kubernetes operations (scaling, node management, Helm) and CI/CD pipeline integration (Jenkins/GitLab CI).
Soft Skills
- Strong troubleshooting and resilience, able to quickly resolve complex incidents (e.g., data center outage, regional cloud failure).
- Excellent cross-team communication and project leadership skills.
- Fast learner, stays updated on cloud-native (CNCF), AIOps, and industry trends.
III. Nice-to-Haves
- GCP certifications (e.g., Professional Cloud Architect, Associate Cloud Engineer) or ITIL/ISO 20000.
- Led/participated in large-scale data center builds (multi-million) or GCP ops projects with million+ annual cloud spend.
- Experience in hybrid cloud (GCP + on-premises) or edge computing ops.
- Published blogs, open-source contributions, or active participation in tech communities (GitHub, CNCF events).
IV. What We Offer
- Competitive salary
- Global platform: Participate in building multi-region intelligent operations infrastructure.
- Growth: Internal tech sharing, external conferences, certification & training support.
- Work environment: Flat management, flexible hours, free snacks, comprehensive medical, hospital and dental coverage
Site Reliability Engineer
Posted today
Job Description
Imagine what you could accomplish here. Bring your passion, creativity, and dedication, and there will be no limit to what you can achieve. This is not just another SRE role - it's a chance to help redefine how reliability engineering is practiced at hyper-scale. Our team is building the platforms that will autonomously operate Apple's core information security systems, setting a new bar for how critical services are managed.
Description
We are seeking exceptional engineers who thrive at the intersection of reliability, software development and automation - individuals driven to push the boundaries of what's possible. The ideal candidate has a strong foundation in modern SRE practices and a proven ability to design and implement software that solves operational challenges. You'll break new ground using the most advanced tools and approaches available, developing automation that doesn't just keep pace with scale but anticipates, reacts and stays ahead of it. You will work closely with Security Engineering, Threat Detection, Incident Response and other internal functions to ensure the scalability, availability and security of the tools and infrastructure that support Apple's cybersecurity mission. Join us, and help build the future of self-managing systems at one of the most innovative companies in the world.
Responsibilities
- Our team is highly collaborative, working closely with partner teams to deliver the best results for Apple. We strive to find the best solution while also considering the need to get things done efficiently for each engineering challenge we face. Good ideas are valued and rewarded.
- As an SRE in Apple Information Security, you will:
- Operate, monitor, and triage all aspects of our production and non-production environments
- Pioneer and implement the next generation telemetry system for AIS services
- Establish alert handling procedures, run-books, and collaborate with our global security team
- Automate deployment and orchestration of services into the cloud environment as well as other routine processes
- Actively participate in capacity planning and disaster recovery exercises
- Interact with and support partner teams across the enterprise
- Cultivate and maintain relationships with internal and external third-party vendors
Minimum Qualifications
- Bachelor's degree in Computer Science, or a related field, or equivalent practical experience
- Proven experience in Site Reliability Engineering or a related field
- Strong programming skills: Python, Go or Swift
- Experience working with cloud compute environments like AWS, GCP or Azure
- Experience with infrastructure as code (IaC), configuration management, CI/CD, and automation, e.g., Terraform, Pulumi, CloudFormation, Ansible, Chef, Puppet, Jenkins
- Cloud deployment and CI/CD problem diagnosis and troubleshooting
Preferred Qualifications
- Experience or experimentation building systems that leverage Agentic AI principles, tools, platforms and frameworks
- Strong understanding and experience in implementing monitoring and observability tools like Splunk, Grafana, Prometheus
- Building and operating container orchestration systems (Docker, Kubernetes, Vagrant and micro-services)
- Experience administering and troubleshooting Linux systems including the usage of standard Linux utilities
- Experience in shell scripting (e.g., bash/zsh) and system administration
- Experience with measuring, analyzing, and optimizing system performance
- Passion for high-quality code, tests, documentation and production services
- Participation in an on-call rotation
Site Reliability Engineer
Posted today
Job Description
Responsibilities:
• Maintain platforms or products after go-live by measuring and monitoring their availability, performance and overall system health.
• Recover platforms or products during production incidents to meet targeted service-level agreements.
• Set up, enhance and maintain observability tools.
• Assist in incident response, perform root cause analysis, and postmortem documentation.
• Develop tools/applications/scripts to improve operational efficiency.
• Maintain and enhance CI/CD pipelines.
• Collaborate with software engineers to design scalable and resilient systems.
• Participate in on-call and on-site rotations and contribute to reducing alert fatigue.
• Document processes, configurations, and best practices.
• Support other software efficiency improvement initiatives.
Requirements:
• 1-3 years' experience in software development, DevOps or SRE.
• Curious, a strong communicator, ready to work in a fast-paced environment, and willing to pick up new skills and technologies as necessary.
• Degree in Electrical / Electronics / Computer Engineering / Computer Science or a relevant discipline.
• Basic understanding of Linux/Unix systems and shell scripting.
• Familiarity with cloud platforms (e.g., AWS, Azure, GCP).
• Exposure to containerization tools (e.g., Docker, Kubernetes).
• Experience with monitoring tools (e.g., Prometheus, Grafana, ELK).
• Knowledge of CI/CD tools (e.g., Jenkins, GitLab, Bitbucket, Jira).
• Programming/scripting skills in Python, Java, or Bash.
• Understanding of networking fundamentals and system security.
• Self-motivated, independent and a good team player
• Able to work under pressure in a fast-paced environment
• Innovative, proactive mindset and with a focus on continuous improvement
• Strong analytical and problem-solving skills
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Be responsible for the reliability, availability, user experience, capacity planning, toil reduction, process enhancement, and digitalization of cloud-based internet services.
- Take on the SRE role for assigned cloud services, owning the KPIs for reliability, issue-to-resolution, service deployment, business continuity management, security policy planning, capacity planning, and toil reduction through automation.
- Introduce service governance initiatives based on the latest technologies to steadily improve the reliability and user experience of Huawei mobile services on cloud, delivering a world-class, highly reliable user experience.
- Make effective use of our world-class AIOps and autonomous service governance platform to devise new ways to streamline processes and improve alert accuracy, time-series trend analysis, anomaly detection, and risk identification.
- Support platform/service expansions, migrations to new architectures, upgrades and drill activities across different technology domains.
- Incorporate mature chaos engineering for risk identification, IPDRR for security, and comprehensive automation frameworks to reduce operations effort to the lowest possible level, freeing the team's time for engineering-focused work.
Requirements and Qualifications
- Bachelor's or Master's degree in computer science engineering or a related major.
- Have knowledge of Linux, networking, databases, containers, container management systems, etc.
- Have knowledge of at least one programming or scripting language or automation tool, such as Java, Python, Shell, Ansible, or Terraform.
- Have knowledge of big data analytics.
- Have explored new technology trends, open-source technologies, and methodologies in the internet services domain.
Site Reliability Engineer
Posted today
Job Description
Join a global leader in gaming to manage the reliability of game-related platforms and infrastructure across both cloud and on-premise environments.
Responsibilities:
- Responsible for deployment, change management, issue triage and infrastructure management of overseas games and their related components and systems, e.g. the game monitoring system and login services.
- Responsible for monitoring and dashboarding for game observability, ensuring the game is reliable, scalable and secure.
- Understand the game architecture; analyze, evaluate and respond to potential risks, such as hidden issues and performance bottlenecks.
- Responsible for daily communication and coordination between various teams.
Requirements:
- Bachelor's Degree or above in Computer Science or comparable field.
- More than 3 years of operations experience with Linux and Windows operating systems.
- Have a high sense of responsibility and teamwork spirit.
- Proficiency in scripting languages such as Bash, Python, and SQL.
- Good understanding of cloud environment, such as AWS or Azure.
- Experience with containerization technologies such as Docker and orchestration platforms like Kubernetes is a plus.
- Experience with worldwide online game live operations is a plus.
Rajasekar Shirley Monisha License No.: 02C3423 Personnel Registration No.: R
Please note that your response to this advertisement and communications with us pursuant to this advertisement will constitute informed consent to the collection, use and/or disclosure of personal data by ManpowerGroup Singapore for the purpose of carrying out its business, in compliance with the relevant provisions of the Personal Data Protection Act 2012. To learn more about ManpowerGroup's Global Privacy Policy, please visit