1,380 SRE jobs in Singapore

Site Reliability Engineer (SRE)

Singapore, Singapore PERSOLKELLY SINGAPORE PTE. LTD.

Posted 1 day ago

Job Description

Overview

Site Reliability Engineer (SRE) — An excellent opportunity in a cutting-edge, fast-growing cloud environment.

Job Purpose

Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.

Responsibilities
  • Manage and support AWS services, ensuring uptime, performance, and security compliance.
  • Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
  • Maintain operating systems, patches, and certificates across Linux and Windows environments.
  • Document processes, create runbooks, and ensure adherence to security and compliance standards.
  • Mentor junior engineers while resolving incidents and supporting production-critical systems.
Qualifications
  • 8+ years’ experience in cloud operations, SRE, or DevOps environments.
  • Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
  • Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
  • Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
  • Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.

The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.

Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at

PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)

By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.


Site Reliability Engineer (SRE)

Singapore, Singapore PERSOLKELLY SINGAPORE PTE. LTD.

Posted today

Job Description

Overview
Site Reliability Engineer (SRE): an excellent opportunity in a cutting-edge, fast-growing cloud environment.
Job Purpose
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.
Responsibilities
Manage and support AWS services, ensuring uptime, performance, and security compliance.
Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
Maintain operating systems, patches, and certificates across Linux and Windows environments.
Document processes, create runbooks, and ensure adherence to security and compliance standards.
Mentor junior engineers while resolving incidents and supporting production-critical systems.
Qualifications
8+ years’ experience in cloud operations, SRE, or DevOps environments.
Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.
The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.
Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.

Site Reliability Engineer (SRE)

$9000 Monthly PERSOLKELLY SINGAPORE PTE. LTD.

Posted 5 days ago

Job Description

Site Reliability Engineer (SRE)

An excellent Site Reliability Engineer (SRE) opportunity is available in a cutting-edge, fast-growing cloud environment.

Job Purpose:
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.

Job Responsibilities:

  • Manage and support AWS services, ensuring uptime, performance, and security compliance.
  • Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
  • Maintain operating systems, patches, and certificates across Linux and Windows environments.
  • Document processes, create runbooks, and ensure adherence to security and compliance standards.
  • Mentor junior engineers while resolving incidents and supporting production-critical systems.

Job Requirements:

  • 8+ years’ experience in cloud operations, SRE, or DevOps environments.
  • Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
  • Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
  • Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
  • Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.

The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.

Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at

PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)

By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.


Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted 13 days ago

Job Description

Site Reliability Engineer (SRE) (GovTech)

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities

As a Site Reliability Engineer, you will be responsible for:

Toil Reduction & Automation

  • Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
  • Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
  • Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
  • Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
  • Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
  • Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
  • Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
  • Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
  • Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
  • Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
  • Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
  • Strong documentation skills and experience in knowledge sharing across teams.
  • Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
  • Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
  • Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
  • Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
  • Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
  • Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
  • Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
  • Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
  • Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft Skills
  • Proactive in identifying problems and recommending strategic solutions.
  • Excellent problem-solving skills with a robust analytical mindset.
  • Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
  • Ability to remain calm and effective under pressure, especially during incident response.
  • Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
  • Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
  • Leadership and mentoring capabilities that contribute to the development of a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Data Security Software Products


Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted 24 days ago

Job Description

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:

• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft skills:

• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities that contribute to the development of a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.


Senior Site Reliability Engineer (SRE)

Singapore, Singapore HCLTech

Posted today

Job Description

The following responsibilities and requirements describe the role of a Senior Site Reliability Engineer (SRE) with 10–15 years of experience. The candidate will focus on building, managing, and optimizing reliable, scalable, and secure systems across multi-cloud, hybrid cloud, and on-premises data center environments.
Job Summary
We are seeking a highly experienced Senior Site Reliability Engineer (SRE) with 10–15 years of expertise in building, managing, and optimizing reliable, scalable, and secure systems. This role requires strong proficiency in end-to-end SRE practices across multi-cloud, hybrid cloud, and on-premises data center environments. The ideal candidate will drive automation, observability, and resiliency while working closely with development, infrastructure, and operations teams to ensure seamless system performance and availability.
Responsibilities
Lead the design, implementation, and management of SRE practices across cloud, hybrid, and on-premises data center environments.
Build and optimize scalable, highly available, and secure infrastructure supporting critical enterprise applications.
Develop automation frameworks to streamline deployment, monitoring, incident response, and system recovery.
Define and enforce SLAs, SLOs, and SLIs to ensure service reliability and performance.
Implement observability solutions, including monitoring, logging, tracing, and alerting for proactive issue detection and resolution.
Partner with development teams to design and deliver resilient systems, ensuring reliability is integrated into every stage of the lifecycle.
Perform root cause analysis (RCA) and drive post-incident reviews to ensure continuous improvement.
Support capacity planning, performance tuning, and cost optimization across hybrid and multi-cloud environments.
Mentor junior engineers and lead best practices in automation, security, and operational excellence.
Collaborate with security and compliance teams to ensure infrastructure and operations align with organizational and regulatory standards.
Requirements
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related discipline.
10–15 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Proven expertise in managing multi-cloud (AWS, Azure, GCP), hybrid cloud, and data center environments.
Strong knowledge of Linux/Unix and Windows systems administration.
Proficiency in automation and configuration management tools (Terraform, Ansible, Puppet, Chef, SaltStack).
Hands-on experience with CI/CD pipelines, containerization (Docker, Kubernetes, OpenShift), and orchestration.
Deep knowledge of observability tools (Prometheus, Grafana, ELK, Splunk, Datadog, New Relic).
Strong understanding of networking, load balancing, storage, and security in enterprise-scale environments.
Experience defining and managing SLA/SLO/SLI frameworks.
Excellent problem-solving, incident management, and troubleshooting skills in complex distributed systems.
Strong communication and leadership skills, with experience mentoring and guiding teams.
Knowledge of compliance and governance frameworks (ISO, SOC, GDPR) is a plus.
Preferred Skills
Experience in chaos engineering and resilience testing.
Knowledge of cloud-native security practices and Zero Trust architecture.
Background in financial services, government, or large-scale enterprise IT operations.
Seniority level
Senior
Employment type
Full-time
Job function
Information Technology
Industries: IT Services and IT Consulting

Senior Site Reliability Engineer (SRE)

Singapore, Singapore Dada Consultants

Posted today

Job Description

Responsibilities
Design, implement, and maintain highly available, scalable, and secure infrastructure
Develop and improve observability (monitoring, logging, alerting) across all services
Own incident response lifecycle: detection, mitigation, root cause analysis, and postmortems
Collaborate with software teams to implement SLOs, SLIs, and improve system reliability
Build and maintain CI/CD pipelines to support fast, safe deployments
Manage cloud infrastructure using infrastructure-as-code (e.g., Terraform, Pulumi)
Automate operational tasks using scripting or configuration management tools
Ensure robust backup, disaster recovery, and security controls are in place
Participate in on-call rotations and continuously improve incident response processes
Job Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or relevant Infrastructure roles
Strong hands-on experience with cloud platforms (AWS, GCP, or Azure)
Proficient in infrastructure-as-code tools (e.g., Terraform, CloudFormation)
Solid knowledge of Linux systems administration and networking fundamentals
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
Familiar with container orchestration tools like Kubernetes and Docker
Experience working with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD)
Solid understanding of SRE concepts such as SLAs, SLOs, and error budgets
Strong problem-solving skills, proactive mindset, and attention to detail
Excellent communication and collaboration skills, especially in cross-functional teams
If you are passionate about technology and meet the above requirements, please don’t hesitate to apply. Please note that only shortlisted candidates will be contacted. Appreciate your understanding. Data provided is for recruitment purposes only.
Dada Consultants Pte Ltd
EA License No.: 18S9037 | EA Registration No. R | Business Registration Number: W
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology
Industries
Technology, Information and Media

Tech Lead (SRE) - Cloud Infrastructure

Singapore, Singapore Refine Group

Posted 8 days ago

Job Description

Responsibilities

ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.

About ByteDance

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.

Why Join Us

Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.

Team Introduction

The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.

Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.

The Role

As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.

Responsibilities
  • Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
  • Oversee software system development and organizational unit integration.
  • Develop long-term technical strategies with clear milestones to enhance team capabilities.
  • Guide Proof-of-Concept and solution development, ensuring security and risk considerations are addressed.
  • Establish protocols for access management, configuration, disaster recovery, and fault handling.
  • Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
  • Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
  • Improve communication and collaboration with business teams, refining processes and business architecture.
Qualifications

What you should have:

  • Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
  • Proficiency in Linux systems, networking, and managing large-scale distributed systems.
  • Strong planning, summarization, and project management skills.
  • Responsibility, proactive attitude, and problem-solving skills.
  • Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.

ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.


Tech Lead (SRE) - Cloud Infrastructure

Singapore, Singapore Refine Group

Posted today

Job Description

Responsibilities
ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.
Why Join Us
Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.
Team Introduction
The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.
Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.
The Role
As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.
Responsibilities
Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
Oversee software system development and organizational unit integration.
Develop long-term technical strategies with clear milestones to enhance team capabilities.
Guide Proof-of-Concept and solution development, ensuring security and risk considerations are addressed.
Establish protocols for access management, configuration, disaster recovery, and fault handling.
Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
Improve communication and collaboration with business teams, refining processes and business architecture.
Qualifications
What you should have:
Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
Proficiency in Linux systems, networking, and managing large-scale distributed systems.
Strong planning, summarization, and project management skills.
Responsibility, proactive attitude, and problem-solving skills.
Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.
ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.

Senior Cloud Infrastructure Engineer (SRE)

Singapore, Singapore Assurity Trusted Solutions Pte Ltd

Posted today

Job Description

Base pay range
SGD100,000.00/yr - SGD120,000.00/yr
In Digital Resiliency Engineering (DRE), we combine software and systems engineering to build and operate large-scale and distributed systems designed and/or built by the Singapore Government. We ensure Government services are reliable, meet expected performance, and satisfy customer needs.
If you are someone with a strong DevOps, infrastructure engineering and/or SRE background, have experience operating mission-critical production technology infrastructure at scale, and are looking for opportunities to work with a team of practitioners and leading industry experts, we welcome you to join us.
In this role, you will build central services for observability and automation of infrastructure services. You will be part of a rotation with other engineers in providing rapid response to major incidents impacting critical Government Services. You will provide technical leadership for the team and work closely with technical leads to operate highly available solutions. You will also provide guidance to other team members on managing availability and performance of mission-critical services, building automation and monitoring solutions to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
You will also manage execution of project priorities, deadlines and deliverables. You will also lead designs of major components, systems and features to improve availability, scalability, latency and efficiency of services designed and built by the Government.
Key Responsibilities
Build Service Level Indicators (SLI), Service Level Objectives (SLO), Error Budgets, and Post-mortem Incident processes.
As part of an on-call roster, ensure reliability and performance of critical Government Services. Provide operational support and engineering for large-scale and distributed systems to drive incident resolution effectively.
Gather and analyse metrics and logs from Operating Systems and/or applications for capacity planning, performance tuning and fault isolation.
Build automation to manage services, infrastructure, and/or applications.
Improve reliability and quality of services using proactive monitoring.
Measure and optimize system performance, continuously improving and advancing SRE practice.
Build an SRE playbook for the Whole-of-Government to use as a reference.
Identify potential and emerging technologies relevant to innovation for the Government.
Work in a cross-functional service team consisting of software engineers, infrastructure engineers, DevOps, and other specialists.
Requirements
10+ years of experience in technology operations as an Infrastructure Engineer or Site Reliability Engineer, with experience operating large-scale, mission-critical production systems.
Expertise in building and operating automated monitoring and incident detection systems, creating runbooks and running incident management processes.
Expertise in designing automation solutions using provisioning tools, continuous integration and delivery (CI/CD) tools, and scripting languages.
Experience leading highly complex technical projects with multiple dependencies and stakeholders.
Knowledgeable and experienced in working within an Agile development environment, focusing on dynamic and rapid quality delivery.
Proficient in building and managing highly available and scalable IT infrastructure and/or applications, with knowledge of container and virtualization technologies.
Proficiency in Python, PowerShell, or Ruby.
Proficiency with Infrastructure as Code (IaC) tools such as SaltStack, Puppet, Terraform, or Ansible.
Able to work independently and deliver results within specified deadlines.
Ability to prioritize work, with strong problem-solving skills.
Good verbal and written communication skills with users, vendors and management.
Ability to communicate complex interaction concepts clearly and persuasively across different audiences and GovTech.
Join us and discover a meaningful and exciting career with Assurity Trusted Solutions!
The remuneration package will be commensurate with your qualifications and experience. Interested applicants, please click "Apply Now".
We thank you for your interest and please note that only shortlisted candidates will be notified.
By submitting your application, you agree that your personal data may be collected, used and disclosed by Assurity Trusted Solutions Pte. Ltd. (ATS), GovTech and their service providers and agents in accordance with ATS's privacy statement which can be found at: or such other successor site.
Benefits
A wholly-owned subsidiary of GovTech
We promote a learning culture and encourage you to grow and learn
Annual Leave Benefits with additional perks such as Family Care and Birthday Leave
Contract staff enjoy the same benefits as permanent employees