414 Site Reliability Engineer jobs in Singapore
Site Reliability Engineer
Posted today
Job Description
Company Description
Higogame is a trailblazer in the mobile gaming and entertainment industry. Since our inception in late 2020, we have been dedicated to transforming the gaming landscape in Southeast Asia and beyond, delivering innovative and immersive experiences that engage millions of players around the globe.
- Our revenue has seen remarkable growth year after year, with operations extending across multiple regions worldwide.
- In just three years, we've risen to become one of the top two games of our kind in the local market.
- We proudly serve around 2 million active users daily and have a total monthly active user base of 5 million worldwide.
- Our team consists of over 200 talented employees, including a robust R&D division of more than 100 experts.
- We offer exceptional career development opportunities and foster a multinational culture that empowers everyone to reach their full potential.
Join us as we continue to push the boundaries of mobile gaming.
Job Responsibilities:
Responsible for the full lifecycle management of the company's global, multi-region infrastructure. Lead the setup of the Singapore physical data center and the in-depth operation of the Google Cloud Platform (GCP). Drive automation and intelligent operations systems to ensure high availability, low cost, and scalable business operations. The role requires both traditional data center operations experience and cloud-native technical vision, acting as the key technical link between physical resources and cloud capabilities.
I. Core Responsibilities
1. Physical Data Center Planning & Implementation
- Lead end-to-end management of self-built/hosted data centers: requirements analysis, architecture design (network/power/cooling/cabling), equipment selection (servers/switches/UPS), construction acceptance, and post-operations optimization.
- Design multi–data center disaster recovery architectures (e.g., active-active, or three centers across two sites), including cross-site synchronization and failover strategies to ensure business continuity.
- Manage internal resource backup/disaster recovery, including art assets, code, and other data assets.
2. Google Cloud (GCP) Deep Operations & Optimization
- Design and manage GCP architecture (Compute Engine, VPC, Cloud Storage, GKE, BigQuery, etc.), supporting cloud migration and hybrid cloud deployment of core business systems.
- Lead full lifecycle management of cloud resources, including cost optimization (reserved instances, autoscaling, idle resource reclamation), performance tuning (network latency, storage IOPS, compute utilization), and security hardening (IAM governance, encryption policies, vulnerability scanning).
- Build cloud-native ops systems using Cloud Monitoring/Logging for real-time alerting and fault detection.
3. Automation & Intelligent Operations Systems
- Lead development and integration of operations toolchains (e.g., Ansible/Puppet automation, Prometheus+Grafana monitoring, ELK logging) to shift operations from manual to platform-based and intelligent.
- Integrate CI/CD pipelines with cloud platforms, optimizing deployment efficiency and stability of containerized (K8s) and serverless (Cloud Functions) workloads.
- Lead root cause analysis (RCA) and postmortems of major incidents, deliver improvement plans, and strengthen contingency planning and drills (e.g., data center power outage, cloud region failure).
4. Cross-Team Collaboration & Technical Enablement
- Collaborate with R&D, QA, and Product teams to provide infrastructure support for rapid business delivery.
- Develop operations standards and technical documentation, drive team knowledge sharing, and mentor junior engineers.
II. Requirements
Basic Qualifications
- Bachelor's degree or higher in Computer Science, Network Engineering, Cloud Computing, or related fields.
- 5+ years in IT operations, including 3+ years in physical data center build/ops, and 2+ years of hands-on GCP experience (must provide project examples).
- Experience in large-scale distributed systems, with solid knowledge of Linux, network protocols (TCP/IP, SDN), and high availability database architectures (MySQL/Redis).
- Must be able to converse in Mandarin, as the role involves travel to China and communication with Chinese-speaking stakeholders.
- Must be able to travel (up to 50% of the time).
Technical Skills
Data Center
- Familiar with infrastructure (power/cooling/fire safety/cabling), and optimization metrics like PUE/CUE.
- Experience in IDC hosting, custom data center builds, or third-party acceptance audits. Knowledge of industry standards (e.g., GB50174 Data Center Design Standard).
Google Cloud (GCP)
- Proficient in GCP core services: GCE, VPC, Cloud SQL/Spanner, GKE.
- Skilled in GCP cost management (Budgets & Alerts, preemptible VMs, storage tiers).
- Strong in GCP security: IAM, VPC Service Controls, Cloud Firewall, KMS.
Automation & Toolchains
- Skilled with Terraform/Ansible for IaC, scripting in Shell/Python/Go for ops tooling.
- Experienced in Prometheus+Grafana monitoring, ELK/OpenTelemetry for logging & tracing.
- Hands-on Kubernetes operations (scaling, node management, Helm) and CI/CD pipeline integration (Jenkins/GitLab CI).
Soft Skills
- Strong troubleshooting and resilience, able to quickly resolve complex incidents (e.g., data center outage, regional cloud failure).
- Excellent cross-team communication and project leadership skills.
- Fast learner, stays updated on cloud-native (CNCF), AIOps, and industry trends.
III. Nice-to-Haves
- GCP certifications (e.g., Professional Cloud Architect, Associate Cloud Engineer) or ITIL/ISO 20000.
- Led/participated in large-scale data center builds (multi-million) or GCP ops projects with million+ annual cloud spend.
- Experience in hybrid cloud (GCP + on-premises) or edge computing ops.
- Published blogs, open-source contributions, or active participation in tech communities (GitHub, CNCF events).
IV. What We Offer
- Competitive salary
- Global platform: Participate in building multi-region intelligent operations infrastructure.
- Growth: Internal tech sharing, external conferences, certification & training support.
- Work environment: Flat management, flexible hours, free snacks, comprehensive medical, hospital and dental coverage
Site Reliability Engineer
Posted today
Job Description
Imagine what you could accomplish here. Bring your passion, creativity, and dedication, and there will be no limit to what you can achieve. This is not just another SRE role - it's a chance to help redefine how reliability engineering is practiced at hyper-scale. Our team is building the platforms that will autonomously operate Apple's core information security systems, setting a new bar for how critical services are managed.
Description
We are seeking exceptional engineers who thrive at the intersection of reliability, software development and automation - individuals driven to push the boundaries of what's possible. The ideal candidate has a strong foundation in modern SRE practices and a proven ability to design and implement software that solves operational challenges. You'll break new ground using the most advanced tools and approaches available, developing automation that doesn't just keep pace with scale but anticipates, reacts and stays ahead of it. You will work closely with Security Engineering, Threat Detection, Incident Response and other internal functions to ensure the scalability, availability and security of the tools and infrastructure that support Apple's cybersecurity mission. Join us, and help build the future of self-managing systems at one of the most innovative companies in the world.
Responsibilities
Our team is highly collaborative, working closely with partner teams to deliver the best results for Apple. We strive to find the best solution while also considering the need to get things done efficiently for each engineering challenge we face. Good ideas are valued and rewarded.
As an SRE in Apple Information Security, you will:
- Operate, monitor, and triage all aspects of our production and non-production environments
- Pioneer and implement the next-generation telemetry system for AIS services
- Establish alert-handling procedures and runbooks, and collaborate with our global security team
- Automate deployment and orchestration of services into the cloud environment as well as other routine processes
- Actively participate in capacity planning and disaster recovery exercises
- Interact with and support partner teams across the enterprise
- Cultivate and maintain relationships with internal and external third-party vendors
Minimum Qualifications
- Bachelor's degree in Computer Science, or a related field, or equivalent practical experience
- Proven experience in Site Reliability Engineering or a related field
- Strong programming skills: Python, Go or Swift
- Experience working with cloud compute environments like AWS, GCP or Azure
- Experience with infrastructure as code (IaC), configuration management, CI/CD, and automation, e.g., Terraform, Pulumi, CloudFormation, Ansible, Chef, Puppet, Jenkins
- Cloud deployment and CI/CD problem diagnosis and troubleshooting
Preferred Qualifications
- Experience or experimentation building systems that leverage Agentic AI principles, tools, platforms and frameworks
- Strong understanding and experience in implementing monitoring and observability tools like Splunk, Grafana, Prometheus
- Experience building and operating container orchestration systems (Docker, Kubernetes, Vagrant) and microservices
- Experience administering and troubleshooting Linux systems including the usage of standard Linux utilities
- Experience in shell scripting (e.g., bash/zsh) and system administration
- Experience with measuring, analyzing, and optimizing system performance
- Passion for high-quality code, tests, documentation and production services
- Participation in an on-call rotation
Site Reliability Engineer
Posted today
Job Description
Responsibilities:
• Maintain platforms or products after go-live by measuring and monitoring their availability, performance, and overall system health.
• Recover platforms or products during production incidents to meet targeted service-level agreements.
• Set up, enhance and maintain observability tools.
• Assist in incident response, perform root cause analysis, and postmortem documentation.
• Develop tools/applications/scripts to improve operational efficiency.
• Maintain and enhance CI/CD pipelines.
• Collaborate with software engineers to design scalable and resilient systems.
• Participate in on-call and on-site rotations and contribute to reducing alert fatigue.
• Document processes, configurations, and best practices.
• Support other software efficiency improvement initiatives.
Requirements:
• 1-3 years' experience in software development, DevOps, or SRE.
• Curious, a strong communicator, ready to work in a fast-paced environment, and willing to pick up new skills and technologies as necessary.
• Degree in Electrical / Electronics / Computer Engineering / Computer Science or a relevant discipline
• Basic understanding of Linux/Unix systems and shell scripting.
• Familiarity with cloud platforms (e.g., AWS, Azure, GCP).
• Exposure to containerization tools (e.g., Docker, Kubernetes).
• Experience with monitoring tools (e.g., Prometheus, Grafana, ELK).
• Knowledge of CI/CD tools (e.g., Jenkins, GitLab, Bitbucket, Jira).
• Programming/scripting skills in Python, Java, or Bash.
• Understanding of networking fundamentals and system security.
• Self-motivated, independent and a good team player
• Able to work under pressure in a fast-paced environment
• Innovative, proactive mindset and with a focus on continuous improvement
• Strong analytical and problem-solving skills
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Be responsible for the reliability, availability, user experience, capacity planning, toil reduction, process enhancement, and digitalization of cloud-based internet services.
- Serve as the SRE for assigned cloud services, owning the KPIs for reliability, issue-to-resolution, service deployment, business continuity management, security policy planning, capacity planning, and toil reduction through automation.
- Introduce service governance initiatives based on the latest technologies to continually improve the reliability and user experience of Huawei mobile services on the cloud, delivering a world-class, highly reliable user experience.
- Use our world-class AIOps and autonomous service governance platform to find new ways to streamline processes and improve alert accuracy, time-series trend analysis, anomaly detection, and risk identification.
- Support platform/service expansions, migrations to new architectures, upgrades and drill activities across different technology domains.
- Incorporate mature chaos engineering for risk identification, IPDRR for security, and comprehensive automation frameworks to reduce ops effort to the lowest possible level, freeing up the team's time for engineering work.
Requirements and Qualifications
- Bachelor's or Master's degree in computer science engineering or a related major.
- Knowledge of Linux, networking, databases, containers, container management systems, etc.
- Knowledge of at least one programming or scripting language or automation tool, such as Java, Python, Shell, Ansible, or Terraform.
- Knowledge of big data analytics.
- Have explored new technology trends, open-source technologies, and methodologies in the internet services domain.
Site Reliability Engineer
Posted today
Job Description
Join a global leader in gaming to manage the reliability of game-related platforms and infrastructure across both cloud and on-premise environments.
Responsibilities:
- Responsible for deployment, change management, issue triage, and infrastructure management for overseas games and related components and systems, e.g., the game monitoring system and login services.
- Responsible for monitoring and dashboarding for game observability, ensuring the game is reliable, scalable, and secure.
- Understand the game architecture; analyze, evaluate, and respond to potential risks such as latent issues and performance bottlenecks.
- Responsible for daily communication and coordination between various teams.
Requirements:
- Bachelor's Degree or above in Computer Science or comparable field.
- More than 3 years of operations experience with Linux and Windows operating systems.
- Have a high sense of responsibility and teamwork spirit.
- Proficiency in scripting languages such as Bash, Python, and SQL.
- Good understanding of cloud environment, such as AWS or Azure.
- Experience with containerization technologies such as Docker and orchestration platforms like Kubernetes is a plus.
- Experience with worldwide online game live operations is a plus.
Rajasekar Shirley Monisha License No.: 02C3423 Personnel Registration No.: R
Please note that your response to this advertisement and communications with us pursuant to this advertisement will constitute informed consent to the collection, use and/or disclosure of personal data by ManpowerGroup Singapore for the purpose of carrying out its business, in compliance with the relevant provisions of the Personal Data Protection Act 2012. To learn more about ManpowerGroup's Global Privacy Policy, please visit
Site Reliability Engineer
Posted today
Job Description
Technology
Site Reliability Engineer (Global) - TikTok Server Arch
Location: Singapore
Employment Type: Regular
Job Code: A
Responsibilities
This position is with TikTok's Stability Assurance Team, which is responsible for ensuring that TikTok's services are highly reliable and low-latency. Reliability assurance for any massive application system is complex and must be approached systematically. The team focuses on optimizing the application architecture end to end, driven by data analysis and supported by automatic, intelligent failure recovery.
Job Responsibilities:
1. Ensure the online stability of TikTok and improve product SLA through systematic disaster recovery abilities, standardized emergency mechanisms, and intelligent analysis.
2. Identify system risks and promote governance through comprehensive and multi-perspective quality data.
3. Establish TikTok's unified standards and specifications, design and develop a one-stop operation platform, and enhance efficiency across multiple fields.
4. Collaborate closely with developers to implement best practices in SRE.
Qualifications
Minimum Qualifications:
1. Bachelor's degree or above in a computer-related field
2. Solid foundational knowledge of computer software; understanding of Linux operating systems, storage, network IO, and related principles.
3. Ability to solve problems systematically, strong communication skills, and a sense of ownership.
Preferred Qualifications:
- 3-5 years of relevant work experience at a large-scale internet business
Job Information
About TikTok
TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.
Why Join Us
Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy - a mission we work towards every day.
We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
Site Reliability Engineer
Posted today
Job Description
Explore your next opportunity at a Fortune Global 500 organization. Envision innovative possibilities, experience our rewarding culture, and work with talented teams that help you become better every day. We know what it takes to lead UPS into tomorrow—people with a unique combination of skill + passion. If you have the qualities and drive to lead yourself or teams, there are roles ready to cultivate your skills and take you to the next level.
Job Description:
Job Summary:
We are seeking a skilled and proactive Site Reliability Engineer (SRE) with 5–8 years of experience and deep expertise in Google Cloud Platform (GCP). The ideal candidate will be responsible for the reliability, availability, and performance of cloud-based applications and infrastructure. You will collaborate with development, operations, and security teams to build and maintain scalable, secure, and highly available systems.
Key Responsibilities:
- Design, develop, and maintain reliable, scalable, and highly available systems on GCP.
- Build and manage CI/CD pipelines, infrastructure as code (IaC), and monitoring solutions.
- Proactively monitor and manage system performance, uptime, and capacity using observability tools.
- Troubleshoot and resolve infrastructure and application-level issues in real time.
- Implement and maintain disaster recovery, failover mechanisms, and backup strategies.
- Automate repetitive tasks and processes to improve efficiency and reduce toil.
- Participate in on-call rotations, incident management, and root cause analysis (RCA).
- Ensure compliance with security standards, privacy regulations, and governance policies.
- Collaborate with cross-functional teams to support DevOps and SRE best practices.
- Drive improvements in SLAs, SLOs, and error budgets through data-driven insights.
Required Qualifications:
- 5–8 years of relevant experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
- Strong hands-on experience with Google Cloud Platform (GCP) – Compute Engine, GKE, Cloud Functions, Cloud Storage, IAM, BigQuery, etc.
- Proficiency in Infrastructure as Code tools like Terraform, Deployment Manager, or CloudFormation.
- Experience with Kubernetes, Docker, and container orchestration.
- Proficiency in scripting languages like Python, Shell, or Go.
- Deep understanding of monitoring and logging tools such as Prometheus, Grafana, Stackdriver, or Datadog.
- Knowledge of CI/CD tools such as Jenkins, GitLab CI, or Cloud Build.
- Experience with incident response, postmortem analysis, and site reliability principles.
- Strong problem-solving and communication skills.
Preferred Qualifications:
- GCP certifications (e.g., Professional Cloud DevOps Engineer, Cloud Architect).
- Exposure to multi-cloud environments or hybrid cloud infrastructure.
- Familiarity with Agile and ITIL frameworks.
- Experience working in regulated environments with compliance standards (e.g., ISO, SOC2).
Employee Type:
Permanent
UPS is committed to providing a workplace free of discrimination, harassment, and retaliation.
Site Reliability Engineer
Posted today
Job Description
Responsibilities
This position is with TikTok's Stability Assurance Team, which is responsible for ensuring that TikTok's services are highly reliable and low-latency. Reliability assurance for any massive application system is complex and must be approached systematically. The team focuses on optimizing the application architecture end to end, driven by data analysis and supported by automatic, intelligent failure recovery.
Job Responsibilities:
1. Ensure the online stability of TikTok and improve product SLA through systematic disaster recovery abilities, standardized emergency mechanisms, and intelligent analysis.
2. Identify system risks and promote governance through comprehensive and multi-perspective quality data.
3. Establish TikTok's unified standards and specifications, design and develop a one-stop operation platform, and enhance efficiency across multiple fields.
4. Collaborate closely with developers to implement best practices in SRE.
Qualifications
Minimum Qualifications:
1. Bachelor's degree or above in a computer-related field
2. Solid foundational knowledge of computer software; understanding of Linux operating systems, storage, network IO, and related principles.
3. Ability to solve problems systematically, strong communication skills, and a sense of ownership.
Preferred Qualifications:
1. 3-5 years of relevant work experience at a large-scale internet business
Site Reliability Engineer
Posted today
Job Description
Please search profiles using key words "PRE" or "SRE" or "platform reliability engineer" or "System Reliability Engineer".
Responsibilities and Requirements:
- At least 9 years of experience
- Should have an engineering background
- Typically a profile hired for R&D roles
- Experience in using Infrastructure as Code (IaC) tools
- Strong foundation in Operating Systems, Kernels, and Systems Programming.
- Proven track record in designing and managing automated infrastructure solutions in complex environments.
- Educational qualification: Engineering / Master's degree in Computer Science from IIT/VIT/BITS Pilani/Anna University
Must have Skills
- Experience with multiple operating systems (Windows, Linux, macOS, etc.) and a strong foundation in operating systems, kernels, and systems programming.
- Proficiency in programming languages such as C, C++, and Python
- Background in hardware architecture and embedded systems
- Strong understanding of operating system concepts, including memory management, process scheduling, and file systems
- Experience with virtualization and cloud computing technologies
- Understanding of cybersecurity principles and best practices
- Proven experience in designing and developing operating systems or similar low-level software
- Ability to work collaboratively in a team environment
- Strong problem-solving and analytical skills
- Excellent analytical and troubleshooting skills
- Excellent written and verbal communication skills
- Familiarity with development tools and debugging techniques
- Strong attention to detail and commitment to quality.
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Develop and oversee performance-critical infrastructure for financial markets, ensuring maximum throughput, high resiliency, and minimal operational risk.
- Leverage deep Linux kernel expertise to fine-tune scheduling policies, interrupt routing, and NUMA resource allocation, ensuring predictable performance at scale.
- Build and maintain high-availability containerized environments using Kubernetes, Docker, and advanced orchestration tools with a strong focus on scalability and security.
- Lead automation initiatives with Ansible, Bash, and Python, eliminating manual intervention and improving system efficiency.
- Manage hybrid cloud infrastructure (AWS, Azure, GCP) with strict performance SLAs, security compliance, and cost-optimized deployments.
- Oversee infrastructure monitoring and observability using ELK Stack, Grafana, Site24x7, Splunk, and other enterprise-grade tools, ensuring proactive incident detection and resolution.
- Administer and troubleshoot enterprise storage and networking stacks such as RAID, NFS, SAN/NAS, TCP/IP networking, VMware/vCenter, and BIG-IP load balancers.
- Collaborate with development, DevOps, and security teams to design fault-tolerant systems and enforce infrastructure governance policies.
- Execute predictive capacity modeling, OS hardening and patch compliance, coupled with benchmark-driven performance optimization for trading and real-time compute platforms.
- Provide expert-level outage resolution, coordinating cross-functional teams to deliver sustainable remediation and operational resilience.
Requirements
- 8+ years of progressive experience in system administration, performance engineering, and reliability operations across enterprise and financial domains.
- Advanced proficiency in Linux internals with specialization in kernel performance tuning, NUMA-aware optimizations, and real-time workload handling.
- Proven hands-on experience with Kubernetes, Docker, and Ansible for large-scale automation and orchestration.
- Strong scripting/programming skills in Bash and Python, and experience with perf/eBPF for system analysis.
- Demonstrated expertise in cloud operations across AWS, Azure, and GCP.
- Strong background in networking protocols (TCP/IP, FIX) and high-performance trading environments.
- Familiarity with storage systems (SAN, NAS, RAID) and database tuning (MySQL optimization).
- Experience implementing observability and monitoring solutions like ELK, Grafana, Splunk, Corvil.