1,411 Senior SRE jobs in Singapore

Site Reliability Engineer (SRE)

Singapore, Singapore PERSOLKELLY SINGAPORE PTE. LTD.

Posted 1 day ago

Job Description

Overview

Site Reliability Engineer (SRE) — An excellent opportunity in a cutting-edge, fast-growing cloud environment.

Job Purpose

Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.

Responsibilities

Manage and support AWS services, ensuring uptime, performance, and security compliance.
Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
Maintain operating systems, patches, and certificates across Linux and Windows environments.
Document processes, create runbooks, and ensure adherence to security and compliance standards.
Mentor junior engineers while resolving incidents and supporting production-critical systems.

Qualifications

8+ years’ experience in cloud operations, SRE, or DevOps environments.
Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.

The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.

Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at

PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)

By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.

Site Reliability Engineer (SRE)

Singapore, Singapore PERSOLKELLY SINGAPORE PTE. LTD.

Posted today

Job Description

Overview
Site Reliability Engineer (SRE): an excellent opportunity in a cutting-edge, fast-growing cloud environment.
Job Purpose
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.
Responsibilities
Manage and support AWS services, ensuring uptime, performance, and security compliance.
Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
Maintain operating systems, patches, and certificates across Linux and Windows environments.
Document processes, create runbooks, and ensure adherence to security and compliance standards.
Mentor junior engineers while resolving incidents and supporting production-critical systems.
Qualifications
8+ years’ experience in cloud operations, SRE, or DevOps environments.
Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.
The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.
Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at
PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)
By sending us your personal data and curriculum vitae (CV), you are deemed to consent to PERSOLKELLY Singapore Pte Ltd and its affiliates to collect, use and disclose your personal data for the purposes set out in the Privacy Policy available at You acknowledge that you have read, understood, and agree with the Privacy Policy.

Site Reliability Engineer (SRE)

$9000 Monthly PERSOLKELLY SINGAPORE PTE. LTD.

Posted 5 days ago

Job Description

Site Reliability Engineer (SRE)

An excellent Site Reliability Engineer (SRE) opportunity is available in a cutting-edge, fast-growing cloud environment.

Job Purpose:
Deliver reliable, secure, and scalable cloud services by managing and optimizing AWS infrastructure.

Job Responsibilities:

Manage and support AWS services, ensuring uptime, performance, and security compliance.
Automate deployments and infrastructure tasks using Terraform, CloudFormation, and Ansible tools.
Maintain operating systems, patches, and certificates across Linux and Windows environments.
Document processes, create runbooks, and ensure adherence to security and compliance standards.
Mentor junior engineers while resolving incidents and supporting production-critical systems.

Job Requirements:

8+ years’ experience in cloud operations, SRE, or DevOps environments.
Strong AWS knowledge, including Lambda, ECS, EKS, and cloud-native services.
Proficiency with Infrastructure-as-Code tools like Terraform, Ansible, and CloudFormation.
Experience with Linux/Windows server management, patching, and SSL lifecycle maintenance.
Excellent communication, problem-solving, and teamwork skills, with a compliance-focused mindset.

The successful Site Reliability Engineer (SRE) must possess deep AWS expertise and proven operational excellence.

Curious about this opportunity? Reach out now and discover how you can be part of a global technology leader—let’s talk at

PERSOLKELLY Singapore Pte Ltd • EA License No.01C4394 • EA Registration No. R (Naveen Vasudevan)

Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted 13 days ago

Job Description

Site Reliability Engineer (SRE) (GovTech)

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities

As a Site Reliability Engineer, you will be responsible for:

Toil Reduction & Automation

Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.

Observability & System Health

Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.

Production Support & Incident Management

Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.

Security & Compliance

Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance

Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.

Strategic Customer Engagement

Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.

Knowledge Sharing & Documentation

Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.

Continuous Learning & Innovation

Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements

Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
Strong documentation skills and experience in knowledge sharing across teams.
Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft Skills

Proactive in identifying problems and recommending strategic solutions.
Excellent problem-solving skills with a robust analytical mindset.
Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
Ability to remain calm and effective under pressure, especially during incident response.
Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Data Security Software Products

Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted 24 days ago

Job Description

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:

• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft skills:

• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted today

Job Description

Site Reliability Engineer (SRE) (GovTech)
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
Strong documentation skills and experience in knowledge sharing across teams.
Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft Skills
Proactive in identifying problems and recommending strategic solutions.
Excellent problem-solving skills with a robust analytical mindset.
Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
Ability to remain calm and effective under pressure, especially during incident response.
Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
Leadership and mentoring capabilities that contribute to a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Engineering and Information Technology
Industries
Data Security Software Products

Senior Site Reliability Engineer (SRE)

Singapore, Singapore HCLTech

Posted today

Job Description

The following responsibilities and requirements describe the role of a Senior Site Reliability Engineer (SRE) with 10–15 years of experience. The candidate will focus on building, managing, and optimizing reliable, scalable, and secure systems across multi-cloud, hybrid cloud, and on-premises data center environments.
Job Summary
We are seeking a highly experienced Senior Site Reliability Engineer (SRE) with 10–15 years of expertise in building, managing, and optimizing reliable, scalable, and secure systems. This role requires strong proficiency in end-to-end SRE practices across multi-cloud, hybrid cloud, and on-premises data center environments. The ideal candidate will drive automation, observability, and resiliency while working closely with development, infrastructure, and operations teams to ensure seamless system performance and availability.
Responsibilities
Lead the design, implementation, and management of SRE practices across cloud, hybrid, and on-premises data center environments.
Build and optimize scalable, highly available, and secure infrastructure supporting critical enterprise applications.
Develop automation frameworks to streamline deployment, monitoring, incident response, and system recovery.
Define and enforce SLAs, SLOs, and SLIs to ensure service reliability and performance.
Implement observability solutions, including monitoring, logging, tracing, and alerting for proactive issue detection and resolution.
Partner with development teams to design and deliver resilient systems, ensuring reliability is integrated into every stage of the lifecycle.
Perform root cause analysis (RCA) and drive post-incident reviews to ensure continuous improvement.
Support capacity planning, performance tuning, and cost optimization across hybrid and multi-cloud environments.
Mentor junior engineers and lead best practices in automation, security, and operational excellence.
Collaborate with security and compliance teams to ensure infrastructure and operations align with organizational and regulatory standards.
Requirements
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related discipline.
10–15 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Proven expertise in managing multi-cloud (AWS, Azure, GCP), hybrid cloud, and data center environments.
Strong knowledge of Linux/Unix and Windows systems administration.
Proficiency in automation and configuration management tools (Terraform, Ansible, Puppet, Chef, SaltStack).
Hands-on experience with CI/CD pipelines, containerization (Docker, Kubernetes, OpenShift), and orchestration.
Deep knowledge of observability tools (Prometheus, Grafana, ELK, Splunk, Datadog, New Relic).
Strong understanding of networking, load balancing, storage, and security in enterprise-scale environments.
Experience defining and managing SLA/SLO/SLI frameworks.
Excellent problem-solving, incident management, and troubleshooting skills in complex distributed systems.
Strong communication and leadership skills, with experience mentoring and guiding teams.
Knowledge of compliance and governance frameworks (ISO, SOC, GDPR) is a plus.
Preferred Skills
Experience in chaos engineering and resilience testing.
Knowledge of cloud-native security practices and Zero Trust architecture.
Background in financial services, government, or large-scale enterprise IT operations.
Seniority level
Senior
Employment type
Full-time
Job function
Information Technology
Industries: IT Services and IT Consulting

Senior Site Reliability Engineer (SRE)

Singapore, Singapore Dada Consultants

Posted today

Job Description

Responsibilities
Design, implement, and maintain highly available, scalable, and secure infrastructure
Develop and improve observability (monitoring, logging, alerting) across all services
Own incident response lifecycle: detection, mitigation, root cause analysis, and postmortems
Collaborate with software teams to implement SLOs, SLIs, and improve system reliability
Build and maintain CI/CD pipelines to support fast, safe deployments
Manage cloud infrastructure using infrastructure-as-code (e.g., Terraform, Pulumi)
Automate operational tasks using scripting or configuration management tools
Ensure robust backup, disaster recovery, and security controls are in place
Participate in on-call rotations and continuously improve incident response processes
Job Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or relevant Infrastructure roles
Strong hands-on experience with cloud platforms (AWS, GCP, or Azure)
Proficient in infrastructure-as-code tools (e.g., Terraform, CloudFormation)
Solid knowledge of Linux systems administration and networking fundamentals
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
Familiar with container orchestration tools like Kubernetes and Docker
Experience working with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD)
Solid understanding of SRE concepts such as SLAs, SLOs, and error budgets
Strong problem-solving skills, proactive mindset, and attention to detail
Excellent communication and collaboration skills, especially in cross-functional teams
If you are passionate about technology and meet the above requirements, please don’t hesitate to apply. Please note that only shortlisted candidates will be contacted. Appreciate your understanding. Data provided is for recruitment purposes only.
Dada Consultants Pte Ltd | EA License No.: 18S9037 | EA Registration No. R | Business Registration Number: W
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Information Technology
Industries
Technology, Information and Media

Tech Lead (SRE) - Cloud Infrastructure

Singapore, Singapore Refine Group

Posted 8 days ago

Job Description

ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.

About ByteDance

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.

Why Join Us

Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.

Team Introduction

The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.

Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.

The Role

As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.

Responsibilities

Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
Oversee software system development and organizational unit integration.
Develop long-term technical strategies with clear milestones to enhance team capabilities.
Guide Proof-of-Concept and solution development, ensuring security and risk considerations.
Establish protocols for access management, configuration, disaster recovery, and fault handling.
Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
Improve communication and collaboration with business teams, refining processes and business architecture.

Qualifications

What you should have:

Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
Proficiency in Linux systems, networking, and managing large-scale distributed systems.
Strong planning, summarization, and project management skills.
Responsibility, proactive attitude, and problem-solving skills.
Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.

ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.

Tech Lead (SRE) - Cloud Infrastructure

Singapore, Singapore Refine Group

Posted today

Job Description

ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.
Why Join Us
Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.
Team Introduction
The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.
Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.
The Role
As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.
Responsibilities
Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
Oversee software system development and organizational unit integration.
Develop long-term technical strategies with clear milestones to enhance team capabilities.
Guide Proof-of-Concept and solution development, ensuring security and risk considerations.
Establish protocols for access management, configuration, disaster recovery, and fault handling.
Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
Improve communication and collaboration with business teams, refining processes and business architecture.
Qualifications
What you should have:
Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
Proficiency in Linux systems, networking, and managing large-scale distributed systems.
Strong planning, summarization, and project management skills.
Responsibility, proactive attitude, and problem-solving skills.
Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.
ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.