124 Site Reliability Engineer jobs in Singapore
Site Reliability Engineer
Posted today
Job Description
Imagine what you could accomplish here. Bring your passion, creativity, and dedication, and there will be no limit to what you can achieve. This is not just another SRE role - it's a chance to help redefine how reliability engineering is practiced at hyper-scale. Our team is building the platforms that will autonomously operate Apple's core information security systems, setting a new bar for how critical services are managed.
Description
We are seeking exceptional engineers who thrive at the intersection of reliability, software development and automation - individuals driven to push the boundaries of what's possible. The ideal candidate has a strong foundation in modern SRE practices and a proven ability to design and implement software that solves operational challenges. You'll break new ground using the most advanced tools and approaches available, developing automation that doesn't just keep pace with scale but anticipates, reacts and stays ahead of it. You will work closely with Security Engineering, Threat Detection, Incident Response and other internal functions to ensure the scalability, availability and security of the tools and infrastructure that support Apple's cybersecurity mission. Join us, and help build the future of self-managing systems at one of the most innovative companies in the world.
Responsibilities
Our team is highly collaborative, working closely with partner teams to deliver the best results for Apple. We strive to find the best solution while also considering the need to get things done efficiently for each engineering challenge we face. Good ideas are valued and rewarded.
As an SRE in Apple Information Security, you will:
- Operate, monitor, and triage all aspects of our production and non-production environments
- Pioneer and implement the next-generation telemetry system for AIS services
- Establish alert handling procedures and run-books, and collaborate with our global security team
- Automate deployment and orchestration of services into the cloud environment as well as other routine processes
- Actively participate in capacity planning and disaster recovery exercises
- Interact with and support partner teams across the enterprise
- Cultivate and maintain relationships with internal and external third party vendors
Minimum Qualifications
- Bachelor's degree in Computer Science, or a related field, or equivalent practical experience
- Proven experience in Site Reliability Engineering or a related field
- Strong programming skills: Python, Go or Swift
- Experience working with cloud compute environments like AWS, GCP or Azure
- Experience with infrastructure as code (IaC), configuration management, CI/CD, and automation, e.g., Terraform, Pulumi, CloudFormation, Ansible, Chef, Puppet, Jenkins
- Experience diagnosing and troubleshooting cloud deployment and CI/CD problems
Preferred Qualifications
- Experience or experimentation building systems that leverage Agentic AI principles, tools, platforms and frameworks
- Strong understanding of, and experience implementing, monitoring and observability tools such as Splunk, Grafana and Prometheus
- Experience building and operating container orchestration systems (Docker, Kubernetes, Vagrant and micro-services)
- Experience administering and troubleshooting Linux systems including the usage of standard Linux utilities
- Experience in shell scripting (e.g., bash/zsh) and system administration
- Experience with measuring, analyzing, and optimizing system performance
- Passion for high-quality code, tests, documentation and production services
- Participation in an on-call rotation
Site Reliability Engineer
Posted today
Job Description
Technology
Site Reliability Engineer (Global) - TikTok Server Arch
Location: Singapore
Employment Type: Regular
Job Code: A
Responsibilities
This position is with TikTok's Stability Assurance Team, which is responsible for ensuring that TikTok's services are highly reliable and low-latency. Reliability assurance for any massive application system is complex and systematic work. The team focuses on optimizing the application architecture end to end, driven by data analysis and supported by automatic, intelligent failure recovery.
Job Responsibilities:
1. Ensure the online stability of TikTok and improve product SLA through systematic disaster recovery abilities, standardized emergency mechanisms, and intelligent analysis.
2. Identify system risks and promote governance through comprehensive and multi-perspective quality data.
3. Establish TikTok's unified standards and specifications, design and develop a one-stop operation platform, and enhance efficiency across multiple fields.
4. Collaborate closely with developers to implement best practices in SRE.
Qualifications
Minimum Qualifications:
1. Bachelor's degree or above in a computer-related field
2. Solid foundational knowledge of computer software; understanding of Linux operating systems, storage, network IO, and related principles.
3. Ability to solve problems systematically, strong communication skills, and a sense of ownership.
Preferred Qualifications:
1. At least 3-5 years of relevant work experience at a large-scale internet business
Job Information
About TikTok
TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.
Why Join Us
Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy - a mission we work towards every day.
We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
Site Reliability Engineer
Posted today
Job Description
Explore your next opportunity at a Fortune Global 500 organization. Envision innovative possibilities, experience our rewarding culture, and work with talented teams that help you become better every day. We know what it takes to lead UPS into tomorrow—people with a unique combination of skill + passion. If you have the qualities and drive to lead yourself or teams, there are roles ready to cultivate your skills and take you to the next level.
Job Summary:
We are seeking a skilled and proactive Site Reliability Engineer (SRE) with 5–8 years of experience and deep expertise in Google Cloud Platform (GCP). The ideal candidate will be responsible for the reliability, availability, and performance of cloud-based applications and infrastructure. You will collaborate with development, operations, and security teams to build and maintain scalable, secure, and highly available systems.
Key Responsibilities:
- Design, develop, and maintain reliable, scalable, and highly available systems on GCP.
- Build and manage CI/CD pipelines, infrastructure as code (IaC), and monitoring solutions.
- Proactively monitor and manage system performance, uptime, and capacity using observability tools.
- Troubleshoot and resolve infrastructure and application-level issues in real time.
- Implement and maintain disaster recovery, failover mechanisms, and backup strategies.
- Automate repetitive tasks and processes to improve efficiency and reduce toil.
- Participate in on-call rotations, incident management, and root cause analysis (RCA).
- Ensure compliance with security standards, privacy regulations, and governance policies.
- Collaborate with cross-functional teams to support DevOps and SRE best practices.
- Drive improvements in SLAs, SLOs, and error budgets through data-driven insights.
Required Qualifications:
- 5–8 years of relevant experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
- Strong hands-on experience with Google Cloud Platform (GCP) – Compute Engine, GKE, Cloud Functions, Cloud Storage, IAM, BigQuery, etc.
- Proficiency in Infrastructure as Code tools like Terraform, Deployment Manager, or CloudFormation.
- Experience with Kubernetes, Docker, and container orchestration.
- Proficiency in scripting languages like Python, Shell, or Go.
- Deep understanding of monitoring and logging tools such as Prometheus, Grafana, Stackdriver, or Datadog.
- Knowledge of CI/CD tools such as Jenkins, GitLab CI, or Cloud Build.
- Experience with incident response, postmortem analysis, and site reliability principles.
- Strong problem-solving and communication skills.
Preferred Qualifications:
- GCP certifications (e.g., Professional Cloud DevOps Engineer, Cloud Architect).
- Exposure to multi-cloud environments or hybrid cloud infrastructure.
- Familiarity with Agile and ITIL frameworks.
- Experience working in regulated environments with compliance standards (e.g., ISO, SOC2).
Employee Type: Permanent
UPS is committed to providing a workplace free of discrimination, harassment, and retaliation.
Site Reliability Engineer
Posted today
Job Description
Responsibilities
This position is with TikTok's Stability Assurance Team, which is responsible for ensuring that TikTok's services are highly reliable and low-latency. Reliability assurance for any massive application system is complex and systematic work. The team focuses on optimizing the application architecture end to end, driven by data analysis and supported by automatic, intelligent failure recovery.
Job Responsibilities:
1. Ensure the online stability of TikTok and improve product SLA through systematic disaster recovery abilities, standardized emergency mechanisms, and intelligent analysis.
2. Identify system risks and promote governance through comprehensive and multi-perspective quality data.
3. Establish TikTok's unified standards and specifications, design and develop a one-stop operation platform, and enhance efficiency across multiple fields.
4. Collaborate closely with developers to implement best practices in SRE.
Qualifications
Minimum Qualifications:
1. Bachelor's degree or above in a computer-related field
2. Solid foundational knowledge of computer software; understanding of Linux operating systems, storage, network IO, and related principles.
3. Ability to solve problems systematically, strong communication skills, and a sense of ownership.
Preferred Qualifications:
1. At least 3-5 years of relevant work experience at a large-scale internet business
Site Reliability Engineer
Posted today
Job Description
Please search profiles using key words "PRE" or "SRE" or "platform reliability engineer" or "System Reliability Engineer".
Responsibilities and Requirements:
- At least 9 years of experience
- Should come from an engineering background
- Typically suited to candidates hired for R&D roles
- Experience in using Infrastructure as Code (IaC) tools
- Strong foundation in Operating Systems, Kernels, and Systems Programming.
- Proven track record in designing and managing automated infrastructure solutions in complex environments.
- Educational qualification: Engineering or Master's degree in Computer Science from IIT, VIT, BITS Pilani or Anna University
Must-Have Skills
- Experience with multiple operating systems (Windows, Linux, macOS, etc.) and a strong foundation in operating systems, kernels and systems programming.
- Proficiency in programming languages such as C, C++, and Python
- Background in hardware architecture and embedded systems
- Strong understanding of operating system concepts, including memory management, process scheduling, and file systems
- Experience with virtualization and cloud computing technologies
- Understanding of cybersecurity principles and best practices
- Proven experience in designing and developing operating systems or similar low-level software
- Ability to work collaboratively in a team environment
- Strong problem-solving and analytical skills
- Excellent analytical and troubleshooting skills
- Excellent written and verbal communication skills
- Familiarity with development tools and debugging techniques
- Strong attention to detail and commitment to quality.
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Develop and oversee performance-critical infrastructure for financial markets, ensuring maximum throughput, high resiliency, and minimal operational risk.
- Leverage deep Linux kernel expertise to fine-tune scheduling policies, interrupt routing, and NUMA resource allocation, ensuring predictable performance at scale.
- Build and maintain high-availability containerized environments using Kubernetes, Docker, and advanced orchestration tools with a strong focus on scalability and security.
- Lead automation initiatives with Ansible, Bash, and Python, eliminating manual intervention and improving system efficiency.
- Manage hybrid cloud infrastructure (AWS, Azure, GCP) with strict performance SLAs, security compliance, and cost-optimized deployments.
- Oversee infrastructure monitoring and observability using ELK Stack, Grafana, Site24x7, Splunk, and other enterprise-grade tools, ensuring proactive incident detection and resolution.
- Administer and troubleshoot enterprise storage and networking stacks such as RAID, NFS, SAN/NAS, TCP/IP networking, VMware/vCenter, and BigIP load balancers.
- Collaborate with development, DevOps, and security teams to design fault-tolerant systems and enforce infrastructure governance policies.
- Execute predictive capacity modeling, OS hardening and patch compliance, coupled with benchmark-driven performance optimization for trading and real-time compute platforms.
- Provide expert-level outage resolution, coordinating cross-functional teams to deliver sustainable remediation and operational resilience.
Requirements
- 8+ years of progressive experience in system administration, performance engineering, and reliability operations across enterprise and financial domains.
- Advanced proficiency in Linux internals with specialization in kernel performance tuning, NUMA-aware optimizations, and real-time workload handling.
- Proven hands-on experience with Kubernetes, Docker, and Ansible for large-scale automation and orchestration.
- Strong scripting/programming in Bash, Python, and experience with perf/eBPF for system analysis.
- Demonstrated expertise in cloud operations across AWS, Azure, and GCP.
- Strong background in networking protocols (TCP/IP, FIX) and high-performance trading environments.
- Familiarity with storage systems (SAN, NAS, RAID) and database tuning (MySQL optimization).
- Experience implementing observability and monitoring solutions like ELK, Grafana, Splunk, Corvil.
Site Reliability Engineer
Posted today
Job Description
We are a global dating app created to give everyone a chance at love. The sense of belonging and connectedness we get from relationships helps us survive and thrive, and we're working to make it a little easier for people to find that. We're inspired by the stories we hear from employees, friends, and family who have used our app to transform their lives, and you, too, can make a difference by joining us.
We are looking for a talented Senior Site Reliability Engineer to help design the future of dating. This individual will bring extensive experience in running large-scale data sources in the cloud and will be responsible for modernizing our data source handling and maintaining our core infrastructure and services on AWS.
This role will be based in Singapore and report directly to the CTO.
Responsibilities:
- Architect, develop, and maintain our core infrastructure and services on AWS, focusing on high availability, performance, and scalability.
- Specific AWS services of interest include EC2, RDS, S3, ElastiCache, CloudWatch, RedShift, OpenSearch, and VPC.
- Implement and manage continuous deployment processes to achieve seamless deployment of services with minimal downtime.
- Monitor system performance, identify bottlenecks, and apply necessary optimizations to ensure the smooth operation of our services.
- Develop and maintain automated tools for infrastructure provisioning, configuration, and deployment.
- Work closely with development teams to integrate infrastructure builds and operational best practices into the software development lifecycle.
- Conduct root cause analysis for production errors and implement strategies to prevent future occurrences.
- Manage and optimize network configurations to ensure secure and efficient data flow and access.
- Administer and maintain databases, ensuring their reliability, performance, and security.
- Lead capacity planning efforts to ensure that our infrastructure scales in line with demand while optimizing costs and maintaining performance.
- Modernize data source handling (Redshift, Postgres, RDS, etc.).
- Manage Kubernetes workloads.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of industry experience.
- Proven experience as an SRE, DevOps Engineer, or similar role in a cloud-based environment.
- Strong expertise in AWS services and tools.
- Proficient understanding of networking principles, transport, and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
- Experience with database administration, including performance tuning, backup and recovery processes, and security management.
- Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform).
- Excellent problem-solving skills and the ability to work independently or as part of a team.
- Strong written and verbal communication: fluent in English (both written and verbal); proficiency in Chinese is a must.
- Significant experience in capacity planning and cost management within cloud environments.
- Experience with Kubernetes.
- Familiarity with Terraform for general systems maintenance.
- Experience with data sources like Redshift, Postgres (Citus, Patroni), and RDS.
- AWS SysOps Administrator Associate or AWS Solutions Architect Professional (SAP) certification.
- Experience with Spotinst for cost optimization.
- Familiarity with additional scripting languages such as Go or JavaScript.
If you're passionate about tackling big challenges and have the skills to help us shape the future of online dating, we want to hear from you.
Site Reliability Engineer
Posted today
Job Description
Technology
Site Reliability Engineer (Component Platform) - TikTok Shop
Location: Singapore
Employment Type: Regular
Job Code: A88507A
Responsibilities
About the team
TikTok Shop is a content e-commerce business utilising international short video products as carriers. Our aim is to become the preferred choice for users seeking to discover and purchase affordable, high-quality products. We provide users with tailored, vibrant, and efficient consumption experiences while enabling merchants to access robust and dependable platform services in various scenarios, such as live e-commerce and short video content e-commerce. Our vision is to make affordable and high-quality products easily accessible, enhancing the quality of life for all. We are looking for passionate and talented people to join our product and operations team, to build an e-commerce ecosystem that is innovative, secure and intuitive for our users and brands.
Our roles combine software and systems engineering disciplines to run high-performance, large-scale distributed infrastructure. This means you will be deeply involved in the development lifecycle of critical software services, collaborating closely with product engineers to combine software code and systems knowledge to ensure that TikTok Shop's services are reliable, fault-tolerant, efficiently scalable and cost-effective. You will also leverage your software engineering expertise to develop software platforms and tools that optimise the operational and engineering efficiency of complex systems at scale, with particular focus on improving the systems' observability, performance and maintainability.
Responsibilities:
- Provide component stability solutions tailored to real-world business scenarios for TikTok Shop, based on collaboration mechanisms across teams, time zones, and regions.
- Continuously build component metadata and observability capabilities, and improve multi-dimensional observability solutions.
- Develop platform-based, data visualisation, and automated monitoring processes to enhance the efficiency of component operations and maintenance for TikTok Shop platforms.
- Gain deep understanding of e-commerce business to enable risk awareness and governance of components.
- Continuously follow up on the management and optimisation of components in international e-commerce.
Qualifications
Minimum Qualifications:
- Bachelor's or higher degree in Computer Science, Information Technology, Programming & System Analysis, Science (Computer Studies) or related discipline.
- Candidate should have at least 5 years of experience in one or more programming languages (such as Java, C++, Go), or scripting experience with Shell/Python.
- Familiar with component O&M (Operations and Maintenance) processes, and knowledgeable about trends in foundational component technologies.
- Familiar with the architecture of storage and computing components such as MySQL, Redis, MQ, RocksDB, MongoDB, Kubernetes (K8s), Docker, and service mesh technologies.
- Expertise in operating, deploying, and ensuring the high availability and quality assurance of large-scale distributed systems, with a strong focus on stability and performance.
- Strong sense of responsibility, proactive team spirit, and excellent analytical and problem-solving skills.
Preferred Qualifications
- 3+ years of experience in component O&M or platform development.
- Experience in e-commerce and cloud computing.
Site Reliability Engineer
Posted today
Job Description
Data Automation & Integration
- Develop and maintain scripts to retrieve and process data from Google Workspace and Zoom, including but not limited to users, groups, meeting rooms, licenses, activity logs, and configurations
- Normalize and structure data for analysis, reporting, and alerting
Monitoring & Alerting
- Build automated alerting systems to identify anomalies, policy violations, or operational issues in Workspace or Zoom environments
Automation of Administrative Actions
- Design workflows to automate common tasks such as account cleanup, license management, and configuration enforcement
Platform Development
- Build a secure internal web-based platform to standardize and simplify administrative actions currently performed using tools like GAM or Zoom admin portal
- Include reporting, dashboards, and visualizations to support operational visibility
Collaboration & Input Gathering
- Work closely with the IT Services and Collaboration Platforms Lead to prioritize features and workflows
- Gather automation and reporting requirements from the broader support team
Code Management & Workflow
- Implement Git workflows for team collaboration and change management
- Ensure well-documented and maintainable code to support contributions from less-technical teammates
Infrastructure & Deployment
- Deploy tooling and platforms in containerized environments (e.g., Docker)
- Propose or implement supporting infrastructure such as databases, schedulers, and authentication mechanisms
Requirements
- Minimum 3 years of software development experience, ideally in automation, platform tooling, or internal systems
- Proficiency in at least one scripting or backend development language (e.g., Python, Go)
- Experience with APIs (Google Workspace Admin SDK, Zoom API, GAM)
- Experience developing web interfaces (e.g., Flask, FastAPI, React)
- Familiarity with Git and collaborative development workflows
- Strong problem-solving skills and the ability to work independently with broad responsibilities
Site Reliability Engineer
Posted today
Job Description
- Design and implement hybrid cloud observability and monitoring solutions across multiple environments.
- Develop and manage alerting systems, metrics, and dashboards for proactive issue detection.
- Integrate logging pipelines for structured and unstructured data sources.
- Implement log archiving strategies (e.g., S3) for compliance and cost optimization.
- Perform advanced log analysis and correlation to support root cause investigation and performance tuning.
- Collaborate with infrastructure and development teams to define and track SLIs, SLOs, and SLAs.
- Strong experience with Splunk, Prometheus, Grafana, and Amazon CloudWatch.
- Proficiency with ELK/EFK stacks (Elasticsearch, Logstash/Fluentd, Kibana).
- Hands-on experience creating custom dashboards, alerts, and metrics visualizations.
- Experience building and managing centralized logging pipelines across distributed systems.
- Familiarity with S3 log archiving and multi-environment data integrations.