492 SRE Manager jobs in Singapore

Senior Manager – Site Reliability Engineering (SRE)

Singapore, Singapore DROPMYSITE PTE. LTD.

Posted today

Job Description

Roles & Responsibilities

Nice to Meet You! We are Dropsuite, a NinjaOne Company!

Site Ops teams are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our operating environments.

We are seeking a seasoned Senior Manager – Site Reliability Engineering (SRE) to lead a high-impact team focused on building resilient, scalable infrastructure and ensuring platform reliability across our cloud environments. This role combines strategic leadership with deep technical expertise in automation, observability, and modern DevOps practices to drive operational excellence and service uptime.

Work Arrangement

  • Full-time position
  • Hybrid work model (2 days per week in the office)
  • Monday to Friday, 5-day work week (flexible work schedule)
  • Eligible to reside and work in Singapore (Singapore Citizens / PRs preferred)

This position is open exclusively to candidates who reside in and are authorised to work in Singapore. Only shortlisted candidates will be contacted.

Key Accountabilities

  • Define and implement SRE roadmaps aligned with business objectives and SLAs.
  • Collaborate with service owners to define SLOs supporting SLA commitments.
  • Deliver platform SLI insights through reports and observability tools.
  • Integrate reliability best practices into engineering and product workflows.
  • Lead initiatives on uptime, monitoring, incident response, and optimization.
  • Manage incident response processes, on-call rotations, and playbooks.
  • Set infrastructure resiliency standards for cloud-native environments.
  • Optimize architecture for scalability, fault tolerance, and cost efficiency.
  • Ensure production systems meet security and compliance requirements.
  • Provide strategic leadership and mentorship to drive team growth and performance.
  • Design scalable and resilient systems architecture.
  • Recruit, mentor, and retain high-performing SRE talent.
  • Develop growth and training plans for SRE team members.
  • Foster a reliability-focused, customer-centric team culture.

Qualifications and Competencies

  • Bachelor's degree in Computer Science or a related field.
  • Cloud certification in AWS, Azure, or GCP preferred.
  • 8+ years in Software Engineering or Site Reliability Engineering.
  • 3+ years in team management or technical leadership.
  • Expert-level Linux administration, scripting, and troubleshooting.
  • Strong hands-on experience with CI/CD and SDLC practices.
  • Deep passion for automation, security, and self-service.
  • Proficient in AWS, GCP, and/or Azure cloud platforms.
  • Skilled in infrastructure-as-code tools like Terraform, CloudFormation, Helm, and Ansible.
  • Experienced with containers, Kubernetes, and microservice architectures.
  • Excellent verbal and written communication skills.

Why Join Us

At Dropsuite, now proudly part of NinjaOne, we are on a mission to safeguard business information and help businesses stay in business. We are a global, fast-growing, partner-centric company building secure, scalable, and highly usable cloud backup technologies for businesses of all sizes. Today, we perform billions of backups daily for organizations across more than 100 countries.

As we enter an exciting new chapter with NinjaOne—a leader in endpoint management, security, and IT automation—our combined strengths enable us to drive even greater impact, innovation, and global scale. Together, we are building a world-class platform that empowers IT teams with simplicity, performance, and reliability.

At our core, we are a team of hungry owners: we are tenacious in our pursuit of excellence and take full ownership in everything we do. We are deeply customer-focused, collaborative, and solutions-driven. We play as a team—respecting, supporting, and elevating one another every step of the way.

Join us as we shape the future of IT and data protection—powered by passion, purpose, and the spirit of ownership.

Rewards That Go Beyond

  • Competitive compensation
  • Hybrid work model
  • 18 days of annual leave (with accrual up to 20 days)
  • Entitled to Singapore Public Holidays
  • Other leave benefits, such as Wedding leave
  • Health Insurance for you and your dependents
  • Growth opportunities
  • Work in a global company with meaningful work, highly skilled colleagues, and an amazing culture

Diversity and Inclusion Statement

Dropsuite is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.

As part of our recruitment process, we may collect personal data to support hiring-related activities such as screening, assessment, and communication. This information is collected solely for recruitment purposes and handled in accordance with applicable data protection and privacy regulations. Your data will be treated with strict confidentiality and used only to facilitate your application with us.

Your Career Growth Starts Here. Apply Now!

Skills

Troubleshooting
Scalability
Operational Excellence
Kubernetes
Azure
Ubuntu
Software Engineering
Scripting
Reliability
Administration Management
Reliability Engineering
Technical Consultation
GCP
Ansible
Linux

Site Reliability Engineer (Cloud), Infrastructure Engineering

Singapore, Singapore ByteDance

Posted today

Job Description

Overview

The Infrastructure Engineering team supports the company's fast growth by building and operating hyperscale datacenters. The team manages the end-to-end lifecycle of the server fleet, providing cloud solutions and various infrastructure services and ensuring that they are scalable and reliable.

Responsibilities

  • Build, expand, and operate ByteDance’s global infrastructure, including large-scale systems in public and private clouds, data centers, and content delivery networks.
  • Build tools, automation, visualizations, and monitors to facilitate the operation and optimization of the global infrastructure.
  • Work in a fast-paced environment. Participate in technical operations and rotations in response to performance and reliability issues.
  • Help improve the whole lifecycle of infrastructure services, from inception and design through development to deployment, user support, and refinement.
  • Deploy and configure solutions in the cloud.
  • Automate cloud operations, develop infrastructure automation scripts, and participate in the continuous improvement of cloud solutions.
  • Participate in specifying, setting up, and running proofs of concept and demonstrations of cloud solutions.
  • Administer and maintain servers across virtual platforms.

Qualifications

Minimum Qualifications:

  • At least a Bachelor’s degree in Computer Science, Information Technology, Programming & Systems Analysis, or Science (Computer Studies).
  • 3+ years of experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols.
  • 3+ years of experience with essential system-level apps and tools, such as DNS, APT, LDAP, Nginx, CI/CD, Ansible, Packer, etc.
  • 2+ years of experience in one or more programming languages such as Java, C++, or Go, or scripting experience in Shell and Python.
  • Strong analytical skills and the ability to solve real-world problems in a fast-moving environment.
  • Experience in designing, analyzing, and building automation and tools for large-scale systems.
  • Experience in building solutions with AWS, Google, OCI, and other cloud services.

Preferred Qualifications:

  • Strong communication and collaboration skills.
  • Self-driven and capable of coping with ambiguity and moving projects from concept to delivery.

About Us

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. We strive to create value for our communities, inspire creativity and enrich life on a daily basis. Join us.

Diversity & Inclusion: ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. We are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach.

Site Reliability Engineering Expert

Singapore, Singapore beBeeReliability

Posted today

Job Description

Job Title: Site Reliability Engineering Professional

As a Site Reliability Engineering (SRE) professional, you will play a critical role in ensuring the smooth operation of our large-scale websites and web application platforms. Your primary responsibility will be to manage servers, networks, and applications to ensure their optimal performance and availability.

You will work closely with cross-functional teams to design, develop, and implement solutions that meet business needs while ensuring the security and integrity of our systems. This includes deploying and monitoring business systems for normal operation and emergency response.

Key areas of focus include:

  • Designing and implementing scalable architectures to ensure high availability and reliability
  • Developing and maintaining scripts and tools to automate deployment and monitoring tasks
  • Collaborating with development teams to ensure seamless integration of new features and services
  • Analyzing system logs and performance metrics to identify bottlenecks and optimize system performance

Required Skills and Qualifications

Successful candidates will possess:

  • Expertise in Linux systems, network technologies, and cloud computing platforms (AWS, Azure, Google Cloud)
  • Proficiency in scripting languages such as Python, Bash, or Perl
  • Familiarity with popular monitoring systems, such as Prometheus, Grafana, or Zabbix
  • Knowledge of DevOps practices, including continuous integration and delivery (CI/CD)
  • A bachelor's degree in Computer Science or a related field, or equivalent experience
  • Excellent problem-solving skills, attention to detail, and ability to work under pressure

Benefits

We offer a competitive compensation package, including:

  • Long-term service staff rewards
  • Friendly working environment, flexible hours, and quarterly events
  • Medical insurance

Others

Join our team of passionate professionals who share your commitment to delivering exceptional results. We value diversity, inclusion, and creativity, and are committed to fostering a positive and supportive work environment.

Site Reliability Engineering Manager

Singapore, Singapore beBeeSRE

Posted today

Job Description

Site Reliability Engineering Manager Job Description

We are seeking an experienced Site Reliability Engineering Manager to join our team. The ideal candidate will have hands-on knowledge of managing and troubleshooting servers, networks, and applications.

The successful candidate will have a deep understanding of operation and maintenance architecture, with expertise in architecture design, performance optimization, and platform security.

Key Responsibilities:

  • Manage and troubleshoot server, network, and application issues.
  • Deploy and monitor business systems to ensure normal operation and emergency response.
  • Improve the skills of the operation and maintenance team to ensure professional level standards.
  • Establish and improve standardized operation systems for the DevOps team.

Required Skills and Qualifications:

  • Familiarity with automation deployment and proficiency in popular monitoring systems.
  • Experience with SVN/Git version control systems and CI/CD configuration.
  • Hands-on knowledge of Nginx configuration and Load Balancer (SLB) configuration.
  • Skilled in managing Linux systems and security, with experience in operating large websites or web application platforms.
  • Understanding of common network technologies, including firewall, VPN, DHCP, DNS, and high availability/load balancing technology.
  • Ability to analyze and eliminate faults, with a strong sense of responsibility and good learning skills.
  • Experience in safe operation and maintenance, with working experience in cloud platforms (Azure, AWS, or Google Cloud).
  • Working Environments: Hyper-Cloud, Cloudform/AWX, ELK, Zabbix, Jenkins/VSTS, GitLab etc.

Benefits:

  • Proper training for senior positions.
  • Attractive rewards for long-term service staff.
  • Energetic and friendly working environment (flexible working hours, quarterly events, staff birthday celebrations, etc.).
  • Medical insurance.

Education and Experience:

  • Candidate must possess at least Bachelor's Degree/Post Graduate Diploma/Professional Degree in Engineering (Computer/Telecommunication), Engineering (Electrical/Electronic), Computer Science/Information Technology or equivalent.
  • At least 3 years of working experience in a related field is required for senior positions.

Preferred Qualifications:

  • Senior Executive specialized in IT/Computer - Network/System/Database Admin or equivalent.

Site Reliability Engineering Manager

Singapore, Singapore Canonical

Posted today

Job Description

Canonical is a leading provider of open-source software and operating systems for global enterprise and technology markets. Our platform, Ubuntu, is very widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation and IoT. Our customers include the world's leading public cloud and silicon providers, and industry leaders in many sectors. The company is a pioneer of global distributed collaboration, with 1200+ colleagues in more than 80 countries and very few office-based roles. Teams meet two to four times yearly in person, in interesting locations around the world, to align on strategy and execution.
The company is founder led, profitable and growing.
We are hiring a Site Reliability Engineering Manager aspiring for a world-class devops and gitops engineering management challenge, bringing together operations management, software engineering and product development, and team leadership in a single high-value role. You will need to be a Linux and operations expert, as well as a great manager capable of leading a high-performance team, to excel in this role.
The Information Systems team at Canonical runs services used by over 60 million Ubuntu users. Our mission is to pioneer and prove new and better approaches to large-scale IS. We support Canonical and Ubuntu operations, but we also help shape Canonical's managed application service offerings, raising the bar on devops and cloud-native operations. We take infra-as-code to the next level, blazing a trail to next-generation model-driven operations. We not only aim to automate every process that underpins our business, we also share that automation as open source packages which others use to drive their own operations. From Kubernetes to the kernel and everything in-between, you will be working with the latest technologies in a fast-paced engineering environment.
We have fully distributed, home-based teams in EMEA, APAC and the Americas. You will lead a team in your time zone, and report to a global director who may not be in your time zone.
Location: This role will be based remotely in the APAC region.

The role entails

  • Lead your team in daily agile devops practices
  • Represent the IS team to stakeholders, customers, and internal teams
  • Organize, coordinate and drive internal projects
  • Mentor engineers to improve their skills
  • Identify and measure team health indicators
  • Implement structured engineering and operations processes
  • Ensure proper team focus on priorities, milestones, and deliverables
  • Work to meet service level agreements with customer deployments around the globe
  • Deliver quality managed services in a consistent, timely manner

What we are looking for in you

  • Drive and a track record of going above-and-beyond expectations
  • Proven experience of software delivery using infrastructure as code
  • Proven experience managing devops teams for SaaS or similar offerings
  • Understanding of testing methodologies and maintainable code quality
  • Technical aptitude for understanding complex distributed systems
  • Experience with cloud topologies and technologies
  • Ability to travel twice a year, for company events up to two weeks long
  • An exceptional academic track record from both high school and university

Nice-to-have skills

  • Experience with Ubuntu system administration
  • Experience with agile software development methodologies
  • Experience working in and managing distributed teams

What we offer colleagues

We consider geographical location, experience, and performance in shaping compensation worldwide. We revisit compensation annually (and more often for graduates and associates) to ensure we recognize outstanding performance. In addition to base pay, we offer a performance-driven annual bonus or commission. We provide all team members with additional benefits, which reflect our values and ideals. We balance our programs to meet local needs and ensure fairness globally.

  • Distributed work environment with twice-yearly team sprints in person
  • Personal learning and development budget of USD 2,000 per year
  • Annual compensation review
  • Recognition rewards
  • Annual holiday leave
  • Maternity and paternity leave
  • Employee Assistance Program
  • Opportunity to travel to new locations to meet colleagues
  • Priority Pass, and travel upgrades for long haul company events

About Canonical

Canonical is a pioneering tech firm at the forefront of the global move to open source. As the company that publishes Ubuntu, one of the most important open source projects and the platform for AI, IoT and the cloud, we are changing the world of software. We recruit on a global basis and set a very high standard for people joining the company. We expect excellence - in order to succeed, we need to be the best at what we do. Most colleagues at Canonical have worked from home since its inception in 2004. Working here is a step into the future, and will challenge you to think differently, work smarter, learn new skills, and raise your game.

Canonical is an equal opportunity employer

We are proud to foster a workplace free from discrimination. Diversity of experience, perspectives, and background create a better work environment and better products. Whatever your identity, we will give your application fair consideration.

  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industries: Software Development

Senior Manager - Site Reliability Engineering [SRE] (Ref: 25-061)

Singapore, Singapore Dropsuite Limited

Posted today

Job Description

Nice to Meet You! We are Dropsuite, a NinjaOne Company!

Site Ops teams are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our operating environments.

We are seeking a seasoned Senior Manager – Site Reliability Engineering (SRE) to lead a high-impact team focused on building resilient, scalable infrastructure and ensuring platform reliability across our cloud environments. This role combines strategic leadership with deep technical expertise in automation, observability, and modern DevOps practices to drive operational excellence and service uptime.

Work Arrangement

  • Full-time position
  • Hybrid work model (2 days per week in the office)
  • Monday to Friday, 5-day work week (flexible work schedule)
  • Eligible to reside and work in Singapore (Singapore Citizens / PRs preferred)

This position is open exclusively to candidates who reside in and are authorised to work in Singapore. Only shortlisted candidates will be contacted.

Key Accountabilities

  • Define and implement SRE roadmaps aligned with business objectives and SLAs.
  • Collaborate with service owners to define SLOs supporting SLA commitments.
  • Deliver platform SLI insights through reports and observability tools.
  • Integrate reliability best practices into engineering and product workflows.
  • Lead initiatives on uptime, monitoring, incident response, and optimization.
  • Manage incident response processes, on-call rotations, and playbooks.
  • Set infrastructure resiliency standards for cloud-native environments.
  • Optimize architecture for scalability, fault tolerance, and cost efficiency.
  • Ensure production systems meet security and compliance requirements.
  • Provide strategic leadership and mentorship to drive team growth and performance.
  • Design scalable and resilient systems architecture.
  • Recruit, mentor, and retain high-performing SRE talent.
  • Develop growth and training plans for SRE team members.
  • Foster a reliability-focused, customer-centric team culture.

Qualifications and Competencies

  • Bachelor's degree in Computer Science or a related field.
  • Cloud certification in AWS, Azure, or GCP preferred.
  • 8+ years in Software Engineering or Site Reliability Engineering.
  • 3+ years in team management or technical leadership.
  • Expert-level Linux administration, scripting, and troubleshooting.
  • Strong hands-on experience with CI/CD and SDLC practices.
  • Deep passion for automation, security, and self-service.
  • Proficient in AWS, GCP, and/or Azure cloud platforms.
  • Skilled in infrastructure-as-code tools like Terraform, CloudFormation, Helm, and Ansible.
  • Experienced with containers, Kubernetes, and microservice architectures.
  • Excellent verbal and written communication skills.

Why Join Us

At Dropsuite, now proudly part of NinjaOne, we are on a mission to safeguard business information and help businesses stay in business. We are a global, fast-growing, partner-centric company building secure, scalable, and highly usable cloud backup technologies for businesses of all sizes. Today, we perform billions of backups daily for organizations across more than 100 countries.
As we enter an exciting new chapter with NinjaOne—a leader in endpoint management, security, and IT automation—our combined strengths enable us to drive even greater impact, innovation, and global scale. Together, we are building a world-class platform that empowers IT teams with simplicity, performance, and reliability.
At our core, we are a team of hungry owners: we are tenacious in our pursuit of excellence and take full ownership in everything we do. We are deeply customer-focused, collaborative, and solutions-driven. We play as a team—respecting, supporting, and elevating one another every step of the way.
Join us as we shape the future of IT and data protection—powered by passion, purpose, and the spirit of ownership.

Rewards That Go Beyond

  • Competitive compensation
  • Hybrid work model
  • 18 days of annual leave (with accrual up to 20 days)
  • Entitled to Singapore Public Holidays
  • Other leave benefits, such as Wedding leave
  • Health Insurance for you and your dependents
  • Growth opportunities
  • Work in a global company with meaningful work, highly skilled colleagues, and an amazing culture

Diversity and Inclusion Statement

Dropsuite is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.

As part of our recruitment process, we may collect personal data to support hiring-related activities such as screening, assessment, and communication. This information is collected solely for recruitment purposes and handled in accordance with applicable data protection and privacy regulations. Your data will be treated with strict confidentiality and used only to facilitate your application with us.

Your Career Growth Starts Here. Apply Now!

Senior Software Engineer, Site Reliability Engineering

Singapore, Singapore Crypto.com

Posted 4 days ago

Job Description

We are a team that designs, develops, maintains, and improves software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure.

What you’ll be doing
  • Ensure entire stack is healthy: hardware, software, application and network are operating at optimal performance
  • Perform deep dives into both systemic and latent reliability issues; partnering with other software and DevOps engineers across the organization to design, implement and roll out fixes
  • Continuously improve availability, reliability, and observability and reduce the burden of human toil with tooling and automation
  • Lead and drive SRE initiatives to improve operation efficiencies
  • Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
What you need
  • Experience coding in Ruby and/or Go
  • Familiar with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems
  • Curiosity about finding root causes in incidents and outages
  • Ability to build alignment, cultivate relationships, and drive impact
  • Mindset for designing fault-tolerant system architecture
  • Comfort with being uncomfortable in ambiguous situations
  • Involvement with incident management and response
  • Desire to grow expertise, inform, and educate others
  • Able to pick up various technologies, a fast learner with a “get things done” mentality
  • Humble enough to embrace better ideas from others, eager to make things better, open to challenges and possibilities
Desirable
  • Familiar with cloud platforms and microservice-based architecture (AWS is a big plus)
  • Familiar with monitoring tools (e.g. Datadog, OpenTelemetry)
  • Familiar with CI/CD tools (e.g. GitHub Actions)
  • Familiar with IaC tools (e.g. Terraform, Spacelift)
  • Experience in designing resilient system architecture
  • Experience in optimizing performance of large-scale production systems
Life @ Crypto.com

Empowered to think big. Try new opportunities while working with a talented, ambitious and supportive team.

Transformational and proactive working environment. Empower employees to find thoughtful and innovative solutions.

Growth from within. We help to develop new skill-sets that would impact the shaping of your personal and professional growth.

Work Culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.

One cohesive team. Engage stakeholders to achieve our ultimate goal - Cryptocurrency in every wallet.

Work Flexibility Adoption. Flexible working hours and hybrid or remote set-up.

Aspire career alternatives through us - our internal mobility program offers employees a new scope.

Work Perks: crypto.com visa card provided upon joining.

Benefits

Competitive salary.

Attractive annual leave entitlement including: birthday, work anniversary.

Work Flexibility Adoption. Flexible working hours and hybrid or remote set-up.

Aspire career alternatives through us. Our internal mobility program can offer employees a diverse scope.

Work Perks: crypto.com visa card provided upon joining.

Our Crypto.com benefits packages vary depending on region requirements; you can learn more from our talent acquisition team.

About Crypto.com:

Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.

Learn more at

Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.

Personal data provided by applicants will be used for recruitment purposes only.

Please note that only shortlisted candidates will be contacted.

Senior Software Engineer, Site Reliability Engineering

Singapore, Singapore Crypto.com

Posted today

Job Description

full-time

We are a team to design, develop, maintain, and improve software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure.

What you’ll be doing
  • Ensure the entire stack is healthy, with hardware, software, application, and network operating at optimal performance
  • Perform deep dives into both systemic and latent reliability issues, partnering with other software and DevOps engineers across the organization to design, implement, and roll out fixes
  • Continuously improve availability, reliability, and observability, and reduce the burden of human toil with tooling and automation
  • Lead and drive SRE initiatives to improve operational efficiency
  • Represent the SRE team in system design reviews and operational readiness exercises for new and existing services

What you need
  • Experience coding in Ruby and/or Go
  • Familiarity with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems
  • Curiosity about finding root causes in incidents and outages
  • Ability to develop alignment, cultivate relationships, and drive impact
  • A mindset for designing fault-tolerant system architecture
  • Comfort with being uncomfortable in ambiguous situations
  • Involvement with incident management and response
  • Desire to grow expertise, inform, and educate others
  • Ability to pick up various technologies quickly, with a “get things done” mentality
  • Humility to embrace better ideas from others, eagerness to make things better, and openness to challenges and possibilities

Desirable
  • Familiarity with cloud platforms and microservice-based architecture (AWS is a big plus)
  • Familiarity with monitoring tools (e.g. Datadog, OpenTelemetry)
  • Familiarity with CI/CD tools (e.g. GitHub Actions)
  • Familiarity with IaC tools (e.g. Terraform, Spacelift)
  • Experience in designing resilient system architecture
  • Experience in optimizing the performance of large-scale production systems

Life @ Crypto.com
  • Empowered to think big. Try new opportunities while working with a talented, ambitious, and supportive team.
  • Transformational and proactive working environment. Employees are empowered to find thoughtful and innovative solutions.
  • Growth from within. We help you develop new skill sets that shape your personal and professional growth.
  • Work culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.
  • One cohesive team. Engage stakeholders to achieve our ultimate goal: Cryptocurrency in Every Wallet.
  • Work flexibility. Flexible working hours and a hybrid or remote set-up.
  • Career alternatives. Our internal mobility program offers employees a new scope.
  • Work perks. Crypto.com Visa card provided upon joining.

Benefits
  • Competitive salary.
  • Attractive annual leave entitlement, including birthday and work anniversary leave.
  • Crypto.com benefits packages vary depending on regional requirements; you can learn more from our talent acquisition team.

About Crypto.com:
Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.
Learn more at
Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.
Personal data provided by applicants will be used for recruitment purposes only.
Please note that only shortlisted candidates will be contacted.

Cloud Infrastructure Engineer

Singapore, Singapore INFINITY LINKS PTE. LTD.

Posted today

Job Description

Overview

IXL Cloud enables businesses, start-ups, researchers, and developers to train, deploy, and scale their AI systems with unmatched performance and flexibility.

We accelerate their AI journey by delivering leading GPU infrastructure, seamless scalability, and AI-first operational support—helping bring advanced AI applications to fruition without the complexity of managing underlying compute architecture.

Responsibilities

As a Cloud Infrastructure Engineer, you will:

  • Design, deploy, and maintain scalable cloud infrastructure for GPU workloads using tools like Terraform, Ansible, and Kubernetes.
  • Automate provisioning of compute resources across bare-metal and cloud environments.
  • Manage container orchestration platforms (Kubernetes, Docker) for multi-tenant GPU cluster environments.
  • Monitor infrastructure performance, uptime, and system health using observability tools (Prometheus, Grafana, ELK, etc.).
  • Maintain and optimize storage, networking, and load balancing layers for high-throughput AI workloads.
  • Implement CI/CD pipelines for both infrastructure and application-level changes.
  • Collaborate with software engineers, platform teams, and AI researchers to understand workload needs and optimize system performance accordingly.
  • Ensure infrastructure security, including secrets management, RBAC, and compliance with best practices.
  • Troubleshoot and resolve infrastructure incidents, scaling issues, and performance bottlenecks.
  • Support hardware provisioning, firmware updates, and GPU driver/CUDA installations.

Qualifications

  • 3–7 years of experience in DevOps, Site Reliability, or Infrastructure Engineering roles.
  • Deep experience managing Linux systems in production environments.
  • Experience deploying and managing Kubernetes clusters at scale (bare metal or cloud-native).
  • Familiarity with GPU drivers (NVIDIA, CUDA) and workload optimization is a plus.
  • Proficiency in scripting languages (Bash, Python, Go, etc.).
  • Strong understanding of networking, firewalls, and storage systems in distributed compute environments.
  • Experience with CI/CD tools such as GitLab CI, ArgoCD, Jenkins, or Flux.
  • Excellent communication and documentation skills.


 
