54 Site Reliability Engineer jobs in Singapore
Site Reliability Engineer
Posted today
Job Description
Technology
Site Reliability Engineer (Global) - TikTok Server Arch
Location: Singapore
Employment Type: Regular
Job Code: A
Responsibilities
This position is with TikTok's Stability Assurance Team, which is responsible for ensuring that TikTok's services are highly reliable and low-latency. Reliability assurance for any massive application system is complex and must be approached systematically. The team focuses on optimizing the application architecture end to end, driven by data analysis and supported by automatic, intelligent failure recovery.
Job Responsibilities:
1. Ensure the online stability of TikTok and improve product SLA through systematic disaster recovery abilities, standardized emergency mechanisms, and intelligent analysis.
2. Identify system risks and promote governance through comprehensive and multi-perspective quality data.
3. Establish TikTok's unified standards and specifications, design and develop a one-stop operation platform, and enhance efficiency across multiple fields.
4. Collaborate closely with developers to implement best practices in SRE.
Qualifications
Minimum Qualifications:
1. Bachelor's degree or above in a computer-related field.
2. Solid foundational knowledge of computer software; understanding of Linux operating systems, storage, network IO, and related principles.
3. Ability to solve problems systematically, strong communication skills, and a sense of ownership.
Preferred Qualifications:
- At least 3-5 years of relevant work experience at a large-scale internet business
Job Information
About TikTok
TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.
Why Join Us
Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy - a mission we work towards every day.
We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
Site Reliability Engineer
Posted today
Job Description
Explore your next opportunity at a Fortune Global 500 organization. Envision innovative possibilities, experience our rewarding culture, and work with talented teams that help you become better every day. We know what it takes to lead UPS into tomorrow—people with a unique combination of skill + passion. If you have the qualities and drive to lead yourself or teams, there are roles ready to cultivate your skills and take you to the next level.
Job Summary:
We are seeking a skilled and proactive Site Reliability Engineer (SRE) with 5–8 years of experience and deep expertise in Google Cloud Platform (GCP). The ideal candidate will be responsible for the reliability, availability, and performance of cloud-based applications and infrastructure. You will collaborate with development, operations, and security teams to build and maintain scalable, secure, and highly available systems.
Key Responsibilities:
- Design, develop, and maintain reliable, scalable, and highly available systems on GCP.
- Build and manage CI/CD pipelines, infrastructure as code (IaC), and monitoring solutions.
- Proactively monitor and manage system performance, uptime, and capacity using observability tools.
- Troubleshoot and resolve infrastructure and application-level issues in real time.
- Implement and maintain disaster recovery, failover mechanisms, and backup strategies.
- Automate repetitive tasks and processes to improve efficiency and reduce toil.
- Participate in on-call rotations, incident management, and root cause analysis (RCA).
- Ensure compliance with security standards, privacy regulations, and governance policies.
- Collaborate with cross-functional teams to support DevOps and SRE best practices.
- Drive improvements in SLAs, SLOs, and error budgets through data-driven insights.
Required Qualifications:
- 5–8 years of relevant experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
- Strong hands-on experience with Google Cloud Platform (GCP) – Compute Engine, GKE, Cloud Functions, Cloud Storage, IAM, BigQuery, etc.
- Proficiency in Infrastructure as Code tools like Terraform, Deployment Manager, or CloudFormation.
- Experience with Kubernetes, Docker, and container orchestration.
- Proficiency in scripting languages like Python, Shell, or Go.
- Deep understanding of monitoring and logging tools such as Prometheus, Grafana, Stackdriver, or Datadog.
- Knowledge of CI/CD tools such as Jenkins, GitLab CI, or Cloud Build.
- Experience with incident response, postmortem analysis, and site reliability principles.
- Strong problem-solving and communication skills.
Preferred Qualifications:
- GCP certifications (e.g., Professional Cloud DevOps Engineer, Cloud Architect).
- Exposure to multi-cloud environments or hybrid cloud infrastructure.
- Familiarity with Agile and ITIL frameworks.
- Experience working in regulated environments with compliance standards (e.g., ISO, SOC2).
Employee Type:
Permanent
UPS is committed to providing a workplace free of discrimination, harassment, and retaliation.
Site Reliability Engineer
Posted today
Job Description
About the Team:
Our team treats infrastructure and operations as software engineering problems. We are responsible for building and progressing software platforms that enable the provisioning and management of all Digibank services in safe, reliable, and scalable ways. We consistently challenge the status quo and use new technologies to build platforms and tooling for engineering teams. Join us and make significant decisions with a huge impact on building modern banking technology.
About the Role:
We treat infrastructure and operations as software engineering problems. Our mission is to build and progress software platforms that enable the provisioning and management of all Digibank services in safe, reliable, and scalable ways. We consistently challenge the status quo and use new technologies to build platforms and tooling for engineering teams. In this role you will make significant decisions with a huge impact on building modern banking technology. You will be part of a team responsible for designing and architecting new solutions and for finding creative ways to optimize existing ones, improving agility in managing hundreds of microservice infrastructures in a stable and reliable way.
If you are:
- A strong believer in automating DevOps and SRE aspects such as infrastructure provisioning, deployment, observability, incident lifecycle, uptime SLA, etc.
- Bold enough to challenge, open to being challenged, and curious to learn and grow
This is the right place for you.
This role also requires Linux networking skills (TCP/IP, HTTP, firewalls, switches, CloudFront, AWS security groups).
Roles and Responsibilities:
- Using Infrastructure as Code tooling such as Terraform and Ansible to manage AWS, Azure, and Kubernetes resources
- Configuring and installing various network devices and services (e.g., routers, switches, firewalls, load balancers, VPN).
- Support IT network infrastructure-related work, such as installing Internet connections, WiFi APs, network upgrades, office builds, expansions, and relocations
- Actively participate in engaging with Business Stakeholders, internal IT Teams, and Vendors to manage the outcome of the projects.
- Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents
- Perform analytics on previous incidents and usage patterns to better predict issues and take proactive actions
- Build and drive adoption for greater self-healing and resiliency patterns
- Performance and cost optimization for infrastructure
- Be part of an on-call rotation for the team's tooling and 24x7 support coverage as needed
- Succeed, fail, and learn together with other talented people. We believe in an environment that provides an opportunity for growth and see education as an outcome of failure that gets us closer to the next breakthrough
Qualifications:
- Bachelor's degree in information systems, information technology, computer science, or similar.
- 3-5 years of professional experience.
- Extensive routing, switching, security, and wireless LAN design, implementation, and troubleshooting experience
- Cloud (AWS/Azure) network configuration and integration with on-premises network equipment.
- Network Automation experience using any scripting language (Python, Go, Perl, Bash).
- Experience with managing Infrastructure as code using Terraform
- Direct production operations experience in a cloud environment.
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Develop and oversee performance-critical infrastructure for financial markets, ensuring maximum throughput, high resiliency, and minimal operational risk.
- Leverage deep Linux kernel expertise to fine-tune scheduling policies, interrupt routing, and NUMA resource allocation, ensuring predictable performance at scale.
- Build and maintain high-availability containerized environments using Kubernetes, Docker, and advanced orchestration tools with a strong focus on scalability and security.
- Lead automation initiatives with Ansible, Bash, and Python, eliminating manual intervention and improving system efficiency.
- Manage hybrid cloud infrastructure (AWS, Azure, GCP) with strict performance SLAs, security compliance, and cost-optimized deployments.
- Oversee infrastructure monitoring and observability using ELK Stack, Grafana, Site24x7, Splunk, and other enterprise-grade tools, ensuring proactive incident detection and resolution.
- Administer and troubleshoot enterprise storage and networking stacks such as RAID, NFS, SAN/NAS, TCP/IP networking, VMware/vCenter, and BigIP load balancers.
- Collaborate with development, DevOps, and security teams to design fault-tolerant systems and enforce infrastructure governance policies.
- Execute predictive capacity modeling, OS hardening and patch compliance, coupled with benchmark-driven performance optimization for trading and real-time compute platforms.
- Provide expert-level outage resolution, coordinating cross-functional teams to deliver sustainable remediation and operational resilience.
Requirements
- 8+ years of progressive experience in system administration, performance engineering, and reliability operations across enterprise and financial domains.
- Advanced proficiency in Linux internals with specialization in kernel performance tuning, NUMA-aware optimizations, and real-time workload handling.
- Proven hands-on experience with Kubernetes, Docker, and Ansible for large-scale automation and orchestration.
- Strong scripting/programming in Bash, Python, and experience with perf/eBPF for system analysis.
- Demonstrated expertise in cloud operations across AWS, Azure, and GCP.
- Strong background in networking protocols (TCP/IP, FIX) and high-performance trading environments.
- Familiarity with storage systems (SAN, NAS, RAID) and database tuning (MySQL optimization).
- Experience implementing observability and monitoring solutions like ELK, Grafana, Splunk, Corvil.
Site Reliability Engineer
Posted today
Job Description
We are looking for a skilled Site Reliability Engineer to join our client's global SRE Team in Singapore.
Responsibilities:
- Overseeing and ensuring the continuous operation of the firm's Linux-based trading infrastructure, addressing day-to-day operational needs
- Providing second-level support, including:
  - Rapid response to emergencies
  - Implementing scheduled updates and deployments
  - In-depth analysis and resolution of performance issues
- Engage in a rotational on call schedule, including early morning and weekend shifts, to provide timely support
- Contributing towards the development of automated solutions for server provisioning, configuration, and monitoring, targeting a scalable management of thousands of servers
- Engaging in interactions with the Trading and Core Engineering teams
- Managing essential Core services such as DHCP, LDAP, DNS, and NFS for on prem and hosted data centers as well as public clouds
- Participating in an on call rotation and occasional weekend shifts
Qualifications:
- Sound expertise in Linux production environments
- Basic knowledge of Python and Bash scripting
- Engagement with automation and monitoring tool sets
- Comprehensive knowledge of operating system principles, with a particular focus on Linux internals
- Familiarity with Intel based server hardware and components
- Competence in server side networking, including understanding network protocols and configurations
- Familiarity with cloud services and architectural solutions
- Experience in designing, building, and troubleshooting complex systems
- Good problem solving skills, underpinned by a methodical approach to technical challenges. This includes an ability to communicate effectively, demonstrating strong interpersonal skills, a sense of responsibility, and a commitment to driving projects to completion.
- Sense of ownership and drive
Site Reliability Engineer
Posted today
Job Description
Please search profiles using key words "PRE" or "SRE" or "platform reliability engineer" or "System Reliability Engineer".
Responsibilities and Requirements:
- At least 9 years of experience
- Should have an engineering background
- Candidates are typically hired for R&D
- Experience in using Infrastructure as Code (IaC) tools
- Strong foundation in Operating Systems, Kernels, and Systems Programming.
- Proven track record in designing and managing automated infrastructure solutions in complex environments.
- Education Qualification - Engineering / Master's Degree in Computer Science from IIT/VIT/BITS Pilani/Anna University
Must have Skills
- Experience with multiple operating systems (Windows, Linux, macOS, etc.) and a strong foundation in operating systems, kernels, and systems programming.
- Proficiency in programming languages such as C, C++, and Python
- Background in hardware architecture and embedded systems
- Strong understanding of operating system concepts, including memory management, process scheduling, and file systems
- Experience with virtualization and cloud computing technologies
- Understanding of cybersecurity principles and best practices
- Proven experience in designing and developing operating systems or similar low-level software
- Ability to work collaboratively in a team environment
- Strong problem-solving and analytical skills
- Excellent analytical and troubleshooting skills
- Excellent written and verbal communication skills
- Familiarity with development tools and debugging techniques
- Strong attention to detail and commitment to quality.
Site Reliability Engineer
Posted today
Job Description
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect - and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
Job highlights
Industry experts, Meals provided, Competitive compensation, Flexible hours
Responsibilities
About the Team
The Data Management Suite team is building products that cover the whole lifecycle of the data pipeline, including data ingestion and integration, data development, data catalog, data security, and data governance. These products support various businesses so that data engineers and data scientists can greatly boost their productivity.
As a software engineer in the data management suite team, you will have the opportunity to build, optimize and grow one of the largest data platforms in the world. You'll have the opportunity to gain hands-on experience on core systems in the data platform ecosystem. Your work will have a direct and huge impact on the company's core products as well as hundreds of millions of users.
Responsibilities:
- Be responsible for the production stability of big data development and governance systems.
- Engage in and improve the whole lifecycle of service, from inception and design, through to deployment, operation and refinement.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health. Practice sustainable incident response and blameless postmortems.
- Establish best engineering practice for engineers as well as non-technical people.
- Design and implement reliable, scalable, robust and extensible big data systems that support core products and business.
Qualifications
Minimum Qualifications
- Bachelor's degree in Computer Science, a related technical field involving software or systems engineering, or equivalent practical experience.
- Experience with site reliability engineering, monitoring, alerting for big data related systems.
- Experience writing code in Java, Go, Python or a similar language.
Preferred Qualifications
- Knowledge about a variety of strategies for ingesting, modeling, processing, and persisting data, ETL design, job scheduling and dimensional modeling.
- Familiarity with running production grade services at scale and understanding cloud native technologies and networking.
- Experience developing tools and APIs to reduce human interaction with systems and applications using a variety of coding and scripting standards.
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems is a plus (Hadoop, M/R, Hive, Spark, Presto, Flume, Kafka, ClickHouse, Flink or comparable solutions).
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
Site Reliability Engineer
Posted today
Job Description
There is a lot that goes into building the most secure yet user-friendly devices in the world. We are a unique Software Development group with a charter to secure our platforms, which include iOS software, iOS devices, and Mac. We build solutions that are used by our customers, engineering teams, and manufacturing environments.
We are looking for a Site Reliability Engineer (SRE) who will be responsible for deploying, monitoring, troubleshooting, and developing tools for all of the team's solutions. The SRE position requires a mix of strategic engineering and design along with hands-on, technical work. You will have experience as a Systems Administrator or a Programmer who has moved on to DevOps/Automation in their career. You will configure, tune, and tackle multi-tiered systems to achieve optimal application performance, stability, and availability. You will work closely with the systems engineers, network engineers, database administrators, monitoring team, and information security team. For this position, strict application security and high availability requirements need to be consistently met to achieve optimal solutions.
This hiring team is a rare team focused on security initiatives that provides critical IT solutions across most of Apple's product lines. These solutions are used from the manufacturing space all the way to customer-facing solutions. We are looking for a hardworking individual who can excel in a dynamic environment, be a self-starter, and bring their passion to ensure the quality and reliability of the solutions we maintain.
Description
- Review hardware, software infrastructure, and application functionality for optimization.
- Identify performance bottlenecks.
- Own the full system lifecycle, including configuration and code deployment in user acceptance test and production environments.
- Monitor infrastructure and application services and drive incident management.
- Collaborate with Apple's production support team, application engineers, project managers, systems engineers, network engineers, database administrators, and QA team to ensure the availability and reliability of solutions.
Minimum Qualifications
- Unix or Linux administration and performance tuning skills; 0 ~ 5 years of experience leading services in a large-scale *nix environment
- Java and JVM runtime configuration and troubleshooting, or proficiency in Python, Go, or another scripting language
- Experience with DevOps tools, processes, and culture
- Proven automation experience using Ansible, Chef, Jenkins, or Puppet
- Oracle DB knowledge and troubleshooting skills
- Infrastructure knowledge of networks, load balancers, firewalls, and WAF
- SDLC and release engineering, including source code repositories and build tools such as SVN and Git
- Network, system, and application security knowledge
- Application design, development, API programming, and improvement using Java, JavaScript, HTML, CSS, Spring, and Hibernate, along with object-oriented analysis and design experience, will be a plus
- Experience with Kafka or other message queueing technology a plus
Site Reliability Engineer
Posted today
Job Description
- You will work in a DevOps team managing ODC products on GCP, following the SRE approach.
- You will develop and maintain IaC code and automation tools.
- You will be responsible for providing technical direction when the CAB requires Tier II input, expertise, or changes with high-risk impacts on customer SLAs.
- You will be responsible for support operations tasks to shape the product roadmap and establish strong operational readiness across teams.
- You will extend and acknowledge completion of handover milestones to Tiers I, II to comply with contractual SLAs.
- You will be responsible for System monitoring with real-time monitoring tools.
- You will participate in the preparation and review of technical product and customer-specific documentation.
- You will ensure the integrity of the solution functional baseline and architecture.
- You will provide technical guidance for new or evolution of services and for consolidated technical analyses.
- You will be responsible for deployment of Thales products in cloud.
- You will be responsible for performing regular performance tuning, technology watch, and updates on the service platform.
- You will perform onboarding tests, communicate technical risk concerns, and help prepare mitigation plans.
- You will provide 24/7 on-call support in shifts.
- Degree in Computer Science (or a related discipline).
- Hands on in deployment with Kubernetes and GCP administration and support in production grade environment.
- 4+ years of experience in design, development and implementation of infrastructure and applications.
- Knowledge of Agile methodology and Service Delivery best practices.
- Knowledge of a cloud service provider (i.e. GCP), monitoring tools, networking, infrastructure, and Linux.
- Strong knowledge of system integration, operation, maintenance and proven experience with automation tools including Gitlab.
- Hands on experience in Continuous Integration and Delivery tools like Gitlab, Terraform & Helm.
- Strong working experience with at least one scripting language (Shell or Python) is required.
- Proficient in Linux, TCP/IP, and HTTP(S) protocols.
- Knowledge of Docker and Kubernetes.
- Knowledge of public cloud (Google Cloud) will be highly preferred.
- Kubernetes certification will be highly preferred.
- Experience in Telecom domain will be highly preferred.
- Experience in agile methodology will be highly preferred.
- Scrum certification or equivalent will be highly preferred.
- Working Location: One North
- Working Hours: Monday - Friday, 9am - 6pm
- 24/7 on-call support in shift rotation (on average, one shift per team member every 2 months)