1,380 SRE jobs in Singapore
Site Reliability Engineer (SRE)
Posted 1 day ago
Job Description
Overview
Site Reliability Engineer (SRE) — An excellent opportunity in a cutting-edge, fast-growing cloud environment.
Job Purpose: Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.
Responsibilities
- Manage and support AWS services, ensuring uptime, performance, and security compliance.
- Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
- Maintain operating systems, patches, and certificates across Linux and Windows environments.
- Document processes, create runbooks, and ensure adherence to security and compliance standards.
- Mentor junior engineers while resolving incidents and supporting production-critical systems.
Qualifications
- 8+ years’ experience in cloud operations, SRE, or DevOps environments.
- Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
- Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
- Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
- Excellent communication, problem-solving, and teamwork skills, with compliance-focused mindset.
The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.
Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.
Site Reliability Engineer (SRE)
Posted today
Job Description
Overview
Site Reliability Engineer (SRE): An excellent opportunity in a cutting-edge, fast-growing cloud environment.
Job Purpose:
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.
Responsibilities
Manage and support AWS services, ensuring uptime, performance, and security compliance.
Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
Maintain operating systems, patches, and certificates across Linux and Windows environments.
Document processes, create runbooks, and ensure adherence to security and compliance standards.
Mentor junior engineers while resolving incidents and supporting production-critical systems.
Qualifications
8+ years’ experience in cloud operations, SRE, or DevOps environments.
Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
Excellent communication, problem-solving, and teamwork skills, with compliance-focused mindset.
The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.
Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.
Site Reliability Engineer (SRE)
Posted 5 days ago
Job Description
Site Reliability Engineer (SRE)
An excellent Site Reliability Engineer (SRE) opportunity is available in a cutting-edge, fast-growing cloud environment.
Job Purpose:
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.
Job Responsibilities:
- Manage and support AWS services, ensuring uptime, performance, and security compliance.
- Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
- Maintain operating systems, patches, and certificates across Linux and Windows environments.
- Document processes, create runbooks, and ensure adherence to security and compliance standards.
- Mentor junior engineers while resolving incidents and supporting production-critical systems.
Job Requirements:
- 8+ years’ experience in cloud operations, SRE, or DevOps environments.
- Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
- Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
- Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
- Excellent communication, problem-solving, and teamwork skills, with compliance-focused mindset.
The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.
Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.
***
Site Reliability Engineer (SRE) (GovTech)
Posted 13 days ago
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
- Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
- Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
- Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
- Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
- Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
- Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
- Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
- Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
- Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
- Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
- Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
- Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
- Strong documentation skills and experience in knowledge sharing across teams.
- Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
- Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
- Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
- Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
- Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
- Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
- Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
- Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
- Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft skills
- Proactive in identifying problems and recommending strategic solutions.
- Excellent problem-solving skills with a robust analytical mindset.
- Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
- Ability to remain calm and effective under pressure, especially during incident response.
- Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
- Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
- Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Data Security Software Products
Site Reliability Engineer (SRE) (GovTech)
Posted 24 days ago
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities:
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements:
• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft skills:
• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Senior Site Reliability Engineer (SRE)
Posted today
Job Description
The following responsibilities and requirements describe the role of a Senior Site Reliability Engineer (SRE) with 10–15 years of experience. The candidate will focus on building, managing, and optimizing reliable, scalable, and secure systems across multi-cloud, hybrid cloud, and on-premises data center environments.
Job Summary
We are seeking a highly experienced Senior Site Reliability Engineer (SRE) with 10–15 years of expertise in building, managing, and optimizing reliable, scalable, and secure systems. This role requires strong proficiency in end-to-end SRE practices across multi-cloud, hybrid cloud, and on-premises data center environments. The ideal candidate will drive automation, observability, and resiliency while working closely with development, infrastructure, and operations teams to ensure seamless system performance and availability.
Responsibilities
Lead the design, implementation, and management of SRE practices across cloud, hybrid, and on-premises data center environments.
Build and optimize scalable, highly available, and secure infrastructure supporting critical enterprise applications.
Develop automation frameworks to streamline deployment, monitoring, incident response, and system recovery.
Define and enforce SLAs, SLOs, and SLIs to ensure service reliability and performance.
Implement observability solutions, including monitoring, logging, tracing, and alerting for proactive issue detection and resolution.
Partner with development teams to design and deliver resilient systems, ensuring reliability is integrated into every stage of the lifecycle.
Perform root cause analysis (RCA) and drive post-incident reviews to ensure continuous improvement.
Support capacity planning, performance tuning, and cost optimization across hybrid and multi-cloud environments.
Mentor junior engineers and lead best practices in automation, security, and operational excellence.
Collaborate with security and compliance teams to ensure infrastructure and operations align with organizational and regulatory standards.
Requirements
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related discipline.
10–15 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Proven expertise in managing multi-cloud (AWS, Azure, GCP), hybrid cloud, and data center environments.
Strong knowledge of Linux/Unix and Windows systems administration.
Proficiency in automation and configuration management tools (Terraform, Ansible, Puppet, Chef, SaltStack).
Hands-on experience with CI/CD pipelines, containerization (Docker, Kubernetes, OpenShift), and orchestration.
Deep knowledge of observability tools (Prometheus, Grafana, ELK, Splunk, Datadog, New Relic).
Strong understanding of networking, load balancing, storage, and security in enterprise-scale environments.
Experience defining and managing SLA/SLO/SLI frameworks.
Excellent problem-solving, incident management, and troubleshooting skills in complex distributed systems.
Strong communication and leadership skills, with experience mentoring and guiding teams.
Knowledge of compliance and governance frameworks (ISO, SOC, GDPR) is a plus.
Preferred Skills
Experience in chaos engineering and resilience testing.
Knowledge of cloud-native security practices and Zero Trust architecture.
Background in financial services, government, or large-scale enterprise IT operations.
Seniority level
Senior
Employment type
Full-time
Job function
Information Technology
Industries: IT Services and IT Consulting
Senior Site Reliability Engineer (SRE)
Posted today
Job Description
Responsibilities
Design, implement, and maintain highly available, scalable, and secure infrastructure
Develop and improve observability (monitoring, logging, alerting) across all services
Own incident response lifecycle: detection, mitigation, root cause analysis, and postmortems
Collaborate with software teams to implement SLOs, SLIs, and improve system reliability
Build and maintain CI/CD pipelines to support fast, safe deployments
Manage cloud infrastructure using infrastructure-as-code (e.g., Terraform, Pulumi)
Automate operational tasks using scripting or configuration management tools
Ensure robust backup, disaster recovery, and security controls are in place
Participate in on-call rotations and continuously improve incident response processes
Job Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or relevant Infrastructure roles
Strong hands-on experience with cloud platforms (AWS, GCP, or Azure)
Proficient in infrastructure-as-code tools (e.g., Terraform, CloudFormation)
Solid knowledge of Linux systems administration and networking fundamentals
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
Familiar with container orchestration tools like Kubernetes and Docker
Experience working with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD)
Solid understanding of SRE concepts such as SLAs, SLOs, and error budgets
Strong problem-solving skills, proactive mindset, and attention to detail
Excellent communication and collaboration skills, especially in cross-functional teams
If you are passionate about technology and meet the above requirements, please don’t hesitate to apply. Please note that only shortlisted candidates will be contacted. Appreciate your understanding. Data provided is for recruitment purposes only.
Dada Consultants Pte Ltd
EA License No.: 18S9037 | EA Registration No.: R | Business Registration Number: W
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology
Industries
Technology, Information and Media
Tech Lead (SRE) - Cloud Infrastructure
Posted 8 days ago
Job Description
ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.
Why Join Us
Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.
Team Introduction
The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.
Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.
The Role
As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.
Responsibilities
- Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
- Oversee software system development and organizational unit integration.
- Develop long-term technical strategies with clear milestones to enhance team capabilities.
- Guide Proof-of-Concept and solution development, ensuring security and risk considerations.
- Establish protocols for access management, configuration, disaster recovery, and fault handling.
- Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
- Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
- Improve communication and collaboration with business teams, refining processes and business architecture.
What you should have:
- Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
- Proficiency in Linux systems, networking, and managing large-scale distributed systems.
- Strong planning, summarization, and project management skills.
- Responsibility, proactive attitude, and problem-solving skills.
- Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.
ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.
Tech Lead (SRE) - Cloud Infrastructure
Posted today
Job Description
ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.
Why Join Us
Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.
Team Introduction
The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.
Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.
The Role
As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.
Responsibilities
Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
Oversee software system development and organizational unit integration.
Develop long-term technical strategies with clear milestones to enhance team capabilities.
Guide Proof-of-Concept and solution development, ensuring security and risk considerations.
Establish protocols for access management, configuration, disaster recovery, and fault handling.
Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
Improve communication and collaboration with business teams, refining processes and business architecture.
Qualifications
What you should have:
Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
Proficiency in Linux systems, networking, and managing large-scale distributed systems.
Strong planning, summarization, and project management skills.
Responsibility, proactive attitude, and problem-solving skills.
Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.
ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.
Senior Cloud Infrastructure Engineer (SRE)
Posted today
Job Description
Base pay range
SGD100,000.00/yr - SGD120,000.00/yr
In Digital Resiliency Engineering (DRE), we combine software and systems engineering to build and operate large-scale and distributed systems designed and/or built by the Singapore Government. We ensure Government services are reliable, meet expected performance and satisfy customer needs.
If you are someone with strong DevOps, Infrastructure engineering and/or SRE background, have experience operating mission critical production technology infrastructure at scale, and are looking for opportunities to work with a team of practitioners and leading industry experts, we welcome you to join us.
In this role, you will build central services for observability and automation of infrastructure services. You will be part of a rotation with other engineers in providing rapid response to major incidents impacting critical Government Services. You will provide technical leadership for the team and work closely with technical leads to operate highly available solutions. You will also provide guidance to other team members on managing availability and performance of mission critical services, building automation and monitoring solutions to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
You will also manage execution of project priorities, deadlines and deliverables. You will also lead designs of major components, systems and features to improve availability, scalability, latency and efficiency of services designed and built by the Government.
Key Responsibilities
Build Service Level Indicators (SLI), Service Level Objectives (SLO), Error Budgets, and Post-mortem Incident processes.
As part of an on-call roster, ensure reliability and performance of critical Government Services. Provide operational support and engineering for large-scale and distributed systems to drive incident resolution effectively.
Gather and analyse metrics and logs from Operating Systems and/or applications for capacity planning, performance tuning and fault isolation.
Build automation to manage services, infrastructure, and/or applications.
Improve reliability and quality of services using proactive monitoring.
Measure and optimize system performance, with continuous improvement and pushing SRE practice forward.
Build SRE playbook for the Whole-of-Government to leverage as reference for SRE.
Identify potential and emerging technologies relevant to innovation for the Government.
Work in a cross-functional service team consisting of software engineers, infrastructure engineers, DevOps, and other specialists.
Requirements
10+ years of experience in technology operations as an Infrastructure Engineer or Site Reliability Engineer - with experience operating large-scale mission critical production systems.
Expertise in building and operating automated monitoring and incident detection systems, creating runbooks and running incident management processes.
Expertise in designing automation solutions using provisioning tools, continuous integration tools (CI/CD), and scripting languages.
Experience leading highly complex technical projects with multiple dependencies and stakeholders.
Knowledgeable and experienced in working within an Agile development environment, focusing on dynamic and rapid quality delivery.
Proficient in building and managing highly available and scalable IT infrastructure and/or application, with knowledge in Container and Virtualization technologies.
Proficiency in Python, PowerShell, or Ruby.
Proficiency with Infrastructure as Code (IaC) tools such as SaltStack, Puppet, Terraform, or Ansible.
Able to work independently and deliver results within specified deadlines.
Ability to prioritize work and strong problem-solving skills.
Good verbal and written communication skills with users, vendors and management.
Ability to communicate complex interaction concepts clearly and persuasively across different audiences and GovTech.
Join us and discover a meaningful and exciting career with Assurity Trusted Solutions!
The remuneration package will be commensurate with your qualifications and experience. Interested applicants, please click "Apply Now".
We thank you for your interest and please note that only shortlisted candidates will be notified.
By submitting your application, you agree that your personal data may be collected, used and disclosed by Assurity Trusted Solutions Pte. Ltd. (ATS), GovTech and their service providers and agents in accordance with ATS's privacy statement which can be found at: or such other successor site.
Benefits
A wholly-owned subsidiary of GovTech
We promote a learning culture and encourage you to grow and learn
Annual Leave Benefits with additional perks such as Family Care and Birthday Leave
Contract staff enjoy the same benefits as permanent employees