790 Software Reliability jobs in Singapore
Senior Software Engineer, Site Reliability Engineering
Posted 10 days ago
Job Description
We are a team that designs, develops, maintains, and improves software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure.
What you’ll be doing
- Ensure the entire stack is healthy: hardware, software, application, and network are operating at optimal performance
- Perform deep dives into both systemic and latent reliability issues, partnering with other software and DevOps engineers across the organization to design, implement, and roll out fixes
- Continuously improve availability, reliability, and observability and reduce the burden of human toil with tooling and automation
- Lead and drive SRE initiatives to improve operational efficiency
- Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
- Experience coding in Ruby and/or Go
- Familiar with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Curiosity about finding root causes in incidents and outages
- Ability to build alignment, cultivate relationships, and drive impact
- Mindset for designing fault-tolerant system architecture
- Comfort with being uncomfortable in ambiguous situations
- Involvement with incident management and response
- Desire to grow expertise, inform, and educate others
- Able to pick up various technologies; a fast learner with a “get things done” mentality
- Humble enough to embrace better ideas from others, eager to make things better, open to challenges and possibilities
- Familiar with cloud platforms and microservice-based architecture (AWS is a big plus)
- Familiar with monitoring tools (e.g. Datadog, OpenTelemetry)
- Familiar with CI/CD tools (e.g. GitHub Actions)
- Familiar with IaC tools (e.g. Terraform, Spacelift)
- Experience in designing resilient system architecture
- Experience in optimizing performance of large-scale production systems
Empowered to think big. Try new opportunities while working with a talented, ambitious and supportive team.
Transformational and proactive working environment. Empower employees to find thoughtful and innovative solutions.
Growth from within. We help to develop new skill-sets that would impact the shaping of your personal and professional growth.
Work Culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.
One cohesive team. Engage stakeholders to achieve our ultimate goal - Cryptocurrency in every wallet.
Work Flexibility Adoption. Flexi-work hour and hybrid or remote set-up.
Aspire career alternatives through us - our internal mobility program offers employees a new scope.
Work Perks: Crypto.com Visa card provided upon joining.
Benefits
Competitive salary.
Attractive annual leave entitlement including: birthday, work anniversary.
Our Crypto.com benefits packages vary depending on regional requirements; you can learn more from our talent acquisition team.
About Crypto.com:
Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.
Learn more at
Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.
Personal data provided by applicants will be used for recruitment purposes only.
Please note that only shortlisted candidates will be contacted.
Reliability Engineering Specialist
Posted 26 days ago
Job Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
THE ROLE:
Join a dynamic global team dedicated to advanced reliability testing of module and system boards of AMD's cutting-edge products. Collaborate closely with cross-functional teams across AMD Global Operations & Quality, and Data Center organizations on accelerator-product system setup and reliability testing.
KEY RESPONSIBILITIES:
- System-level setup and testing:
  - Plan, execute, and optimize system-level setups for accelerator products, including server rack and system configurations.
  - Ensure seamless integration and functionality of server systems with advanced cooling solutions and environmental management systems.
  - Validate and maintain reliability test scripts for automated and manual testing processes.
- Reliability assessment and testing:
  - Conduct comprehensive reliability assessments of accelerator systems, focusing on mechanical, thermal, and electrical stress factors.
  - Design and implement environmental stress tests to simulate data center conditions, including operational stress, thermal cycling, signal, and power integrity.
  - Evaluate material interactions and their impact on product reliability, ensuring robustness in diverse operating environments.
  - Analyze results to identify potential reliability risks and areas for design improvement.
- Functional testing and fault isolation:
  - Perform detailed functional testing to evaluate system performance under various operational conditions.
  - Identify, isolate, and troubleshoot faults using advanced diagnostic tools and methodologies.
- Failure analysis and reporting:
  - Perform root cause analysis for identified reliability failures and develop corrective actions for design and process enhancement.
  - Collaborate with cross-functional teams to conduct root cause analysis of reliability testing failures.
- Collaboration and documentation:
  - Work closely with design, manufacturing, and quality teams to align reliability goals with overall product requirements.
  - Generate comprehensive reports detailing reliability test results, analysis, and recommendations.
  - Maintain meticulous records of testing methodologies and outcomes for future reference and continuous improvement initiatives.
- Mentorship:
  - Effectively mentor junior engineers, providing guidance in both technical domains and professional skill development to foster growth and team success.
PREFERRED EXPERIENCE:
- Knowledge of reliability engineering principles, product lifecycle, and standards in high-performance computing environments.
- Proven experience in system-level setup and testing for accelerator products or similar technologies.
- Proficiency in developing and executing reliability test scripts and protocols.
- Familiarity with reliability standards and best practices in high-performance computing environments.
- Familiarity with data center environmental management, server rack/system configurations, and integrated cooling solutions.
- Strong understanding of environmental stress factors, including thermal, mechanical, and electrical stresses, in server systems (L6–L10).
- Expertise in failure analysis techniques, including root cause analysis and fault isolation methodologies.
- Excellent written and verbal communication skills for clear reporting and collaboration.
- Strong analytical, problem-solving, and communication skills.
- Experience with reliability testing tools, simulation software and statistical tools is an added advantage.
- Knowledge in project and risk management is an added advantage.
- Self-starter and able to independently drive tasks to completion.
- Ability to structure and execute complex analysis, draw insights, and communicate summary conclusions/recommendations to senior management and AMD customers/partners.
- Ability to network, build relationships, and collaborate to drive effective decision-making across multiple functions and levels within AMD.
ACADEMIC CREDENTIALS:
- Bachelor’s or Master’s degree in Electrical/Electronics Engineering (EE) or a related field.
LOCATION:
Singapore
#LI-JV1
Benefits offered are described: AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
Senior Manager Reliability Engineering
Posted today
Job Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
The Role
Join a dynamic global team dedicated to advanced reliability testing of module and system boards of AMD's cutting-edge products. Collaborate closely with cross-functional teams across AMD Global Operations & Quality, business units, and a worldwide supplier network.
The Person
This experienced manager would lead a high-performance, dynamic team and oversee a state-of-the-art reliability engineering lab in Singapore, driving module and system-board reliability initiatives to deliver programs that ensure exceptional product quality and reliability for AMD.
The successful candidate will also direct thermal/mechanical characterization development of advanced packages and circuit board assemblies. This role requires a strategic leader who can align lab capabilities with technology roadmaps, grow technical competencies, and nurture talent to tackle emerging reliability challenges.
Key Responsibilities
- Leadership & Team Development: Lead, mentor, and grow a high-performing reliability engineering team, fostering innovation and technical excellence.
- Lab Management: Oversee operations of a reliability engineering lab in Singapore, ensuring best-in-class testing and metrology capabilities. Manage operations and associated headcount, capital planning, budgeting, and expenditures.
- Technical Strategy: Drive reliability engineering development in areas such as the following:
  - Advanced metrology
  - Simulation & predictive modeling (FEA, thermal-mechanical stress analysis)
  - Board-level reliability (BLR) & PCBA reliability (shock, vibration, thermal cycling)
  - AI accelerator system reliability (high-power, thermal management, signal integrity)
- Capability Roadmap: Develop and enhance lab capabilities to align with emerging technology trends.
- Cross-Functional Collaboration: Work closely with R&D, product engineering, and manufacturing teams to ensure reliability is embedded in product design and process development.
- Industry Engagement: Collaborate with academic/industry organizations to ensure that AMD reliability test methodologies and tools remain at the forefront of innovation.
- Master’s/PhD in Materials Science, Mechanical Engineering, Electrical Engineering, or related field.
- 10+ years in reliability engineering, with 5+ years in leadership roles managing teams and labs.
- Deep expertise in semiconductor packaging, PCBA reliability, system-level reliability, and failure analysis techniques.
- Strong background in reliability testing standards (JEDEC, IPC, AEC, EIA, Telcordia, IEC, etc.) and statistical methods (Weibull analysis, accelerated life testing).
- Experience with simulation tools (ANSYS, COMSOL, Cadence, etc.) and metrology/characterization techniques.
- Proven track record in lab operation and technical talent development.
Singapore
Benefits offered are described: AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
Seniority level
- Seniority level: Not Applicable
- Employment type: Full-time
- Industries: Semiconductor Manufacturing
VP of Site Reliability Engineering
Posted today
Job Description
- Technology is key to enabling the DBS vision of being the leading bank in Asia. To meet the challenges arising from the ever-evolving technological advancements and increasing sophistication and demands of customers, there is a need for deft Technology Risk Managers to ensure robust risk governance.
- As a member of the Technology Risk Management team, you will oversee a global portfolio of technology risk management activities (including participation in any technology risk management related initiatives), with a focus on:
  - Targeted Risk Reviews
  - Policy/Standard/Guide enforcement validation
  - Thematic risk analysis for IT risks
- This role ensures that DBS Bank's technology risk framework aligns with global regulatory requirements (MAS, HKMA, RBI, GDPR, etc.), industry best practices (NIST, ISO 27001, COBIT), and internal policies while identifying vulnerabilities and recommending mitigation strategies.
- The position requires a strategic leader who can identify systemic risks, drive audit remediation, and enhance governance across all regions where DBS operates.
- Accountable for managing internal and external reviews/audits from audit planning (such as request for information (RFI), opening meeting, etc.), fieldwork (such as RFI, issue discussion, etc.), to reporting and closing meeting.
- Responsible for monitoring and validating the closure of management actions, arising from internal and external reviews/audits, including regulator inspection reviews.
- Perform reviews of new / revised processes, provide risk opinions, and ensure proper approvals and documentation.
- Collaborate with the different technology teams to conduct post implementation review of new / revised processes to provide assurance.
- Drive automation (e.g., data analytics, AI/ML) for continuous auditing.
- Prepare and develop technology risk insights (such as IT audit thematic and trend analysis) to be presented at forums (such as technology risk forums, etc.).
- Engage and collaborate with technology stakeholders to proactively identify risks at a detailed and technical level and ensure that IT is effectively driving remediation activities and to continuously improve IT risk posture.
- Proactive in forging effective engagement with key stakeholders relating to risk & control matters.
- Provide risk assessment and advisory as required:
  - Evaluate the effectiveness of IT risk governance, security policies, and control frameworks.
  - Provide actionable recommendations to senior management for risk mitigation.
- Subject matter expert in Site Reliability Engineering.
- Manage technology risk initiatives and target reviews.
Required Experience
- At least 12 years (SVP) / 8 years (VP) of experience, preferably with exposure to risk management (in control functions, including technology).
- Demonstrated experience in identifying, assessing, and advising on technology risks.
- Excellent organizational, problem solving, interpersonal and operating skills to effectively drive the IT Risk agenda with IT functions.
- Strong communication skills at all levels -- able to effectively communicate with IT and senior management, as well as line staff to drive IT risk mitigation initiatives and other IT risk management related areas.
- Experience in driving IT risk management in the digital age is a plus.
- Knowledge of Information Security, System Resiliency & Availability & Software development practices and frameworks and regulatory requirements preferred.
- Subject matter expertise in Site Reliability Engineering, including but not limited to the following areas:
  - SDLC governance (including CI/CD and SQA)
  - DevOps, release & deployment
  - Change management
  - Problem/Incident management
  - Disaster recovery
- Good technical competencies and exposure to IT application or infrastructure development, support and management.
- Demonstrated experience of leveraging data and analytics to get stakeholder buy-in is a plus.
- Strong executive communication (for Technology EXCO-level reporting).
- Ability to translate technical risks into business impact.
- Leadership in driving cultural change toward risk awareness.
- Bachelor's/Master's in Computer Science or a related field.
- Certifications (Required): CISA, CISSP, CRISC, CISM, or equivalent.
- Preferred: ISO 27001 Lead Auditor, AWS/Azure Security, CCSP.
Senior Manager – Site Reliability Engineering [SRE]
Posted today
Job Description
Nice to Meet You! We are Dropsuite, a NinjaOne Company!
Site Ops teams are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our operating environments.
We are seeking a seasoned Senior Manager – Site Reliability Engineering (SRE) to lead a high-impact team focused on building resilient, scalable infrastructure and ensuring platform reliability across our cloud environments. This role combines strategic leadership with deep technical expertise in automation, observability, and modern DevOps practices to drive operational excellence and service uptime.
Work Arrangement
- Full-time position
- Hybrid work model (2 days per week in the office)
- Monday to Friday, 5-day work week (flexible work schedule)
- Eligible to reside and work in Singapore (Singapore Citizens / PRs preferred)
This position is open exclusively to candidates who reside in and are authorised to work in Singapore. Only shortlisted candidates will be contacted.
Key Accountabilities
- Define and implement SRE roadmaps aligned with business objectives and SLAs.
- Collaborate with service owners to define SLOs supporting SLA commitments.
- Deliver platform SLI insights through reports and observability tools.
- Integrate reliability best practices into engineering and product workflows.
- Lead initiatives on uptime, monitoring, incident response, and optimization.
- Manage incident response processes, on-call rotations, and playbooks.
- Set infrastructure resiliency standards for cloud-native environments.
- Optimize architecture for scalability, fault tolerance, and cost efficiency.
- Ensure production systems meet security and compliance requirements.
- Provide strategic leadership and mentorship to drive team growth and performance.
- Design scalable and resilient systems architecture.
- Recruit, mentor, and retain high-performing SRE talent.
- Develop growth and training plans for SRE team members.
- Foster a reliability-focused, customer-centric team culture.
Qualifications and Competencies
- Bachelor's degree in Computer Science or a related field.
- Cloud certification in AWS, Azure, or GCP preferred.
- 8+ years in Software Engineering or Site Reliability Engineering.
- 3+ years in team management or technical leadership.
- Expert-level Linux administration, scripting, and troubleshooting.
- Strong hands-on experience with CI/CD and SDLC practices.
- Deep passion for automation, security, and self-service.
- Proficient in AWS, GCP, and/or Azure cloud platforms.
- Skilled in infrastructure-as-code tools like Terraform, CloudFormation, Helm, and Ansible.
- Experienced with containers, Kubernetes, and microservice architectures.
- Excellent verbal and written communication skills.
Why Join Us
At Dropsuite, now proudly part of NinjaOne, we are on a mission to safeguard business information and help businesses stay in business. We are a global, fast-growing, partner-centric company building secure, scalable, and highly usable cloud backup technologies for businesses of all sizes. Today, we perform billions of backups daily for organizations across more than 100 countries.
As we enter an exciting new chapter with NinjaOne—a leader in endpoint management, security, and IT automation—our combined strengths enable us to drive even greater impact, innovation, and global scale. Together, we are building a world-class platform that empowers IT teams with simplicity, performance, and reliability.
At our core, we are a team of hungry owners: we are tenacious in our pursuit of excellence and take full ownership in everything we do. We are deeply customer-focused, collaborative, and solutions-driven. We play as a team—respecting, supporting, and elevating one another every step of the way.
Join us as we shape the future of IT and data protection—powered by passion, purpose, and the spirit of ownership.
Rewards That Go Beyond
- Competitive compensation
- 18 days of annual leave (with accrual up to 20 days)
- Entitled to Singapore Public Holidays
- Other leave benefits, such as Wedding leave
- Health Insurance for you and your dependents
- Growth opportunities
- Work in a global company with meaningful work, highly skilled colleagues, and an amazing culture
Diversity and Inclusion Statement
Dropsuite is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.
As part of our recruitment process, we may collect personal data to support hiring-related activities such as screening, assessment, and communication. This information is collected solely for recruitment purposes and handled in accordance with applicable data protection and privacy regulations. Your data will be treated with strict confidentiality and used only to facilitate your application with us.
Your Career Growth Starts Here. Apply Now!
Seniority level
- Seniority level: Mid-Senior level
- Employment type: Full-time
- Industries: Computer and Network Security
Senior Manager – Site Reliability Engineering SRE
Posted today
Job Description
Nice to Meet You! We are Dropsuite, a NinjaOne Company!
Site Ops teams are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our operating environments.
We are seeking a seasoned Senior Manager – Site Reliability Engineering (SRE) to lead a high-impact team focused on building resilient, scalable infrastructure and ensuring platform reliability across our cloud environments. This role combines strategic leadership with deep technical expertise in automation, observability, and modern DevOps practices to drive operational excellence and service uptime.
Work Arrangement
- Full-time position
- Hybrid work model (2 days per week in the office)
- Monday to Friday, 5-day work week (flexible work schedule)
- Eligible to reside and work in Singapore (Singapore Citizens / PRs preferred)
This position is open exclusively to candidates who reside in and are authorised to work in Singapore. Only shortlisted candidates will be contacted.
Key Accountabilities
- Define and implement SRE roadmaps aligned with business objectives and SLAs.
- Collaborate with service owners to define SLOs supporting SLA commitments.
- Deliver platform SLI insights through reports and observability tools.
- Integrate reliability best practices into engineering and product workflows.
- Lead initiatives on uptime, monitoring, incident response, and optimization.
- Manage incident response processes, on-call rotations, and playbooks.
- Set infrastructure resiliency standards for cloud-native environments.
- Optimize architecture for scalability, fault tolerance, and cost efficiency.
- Ensure production systems meet security and compliance requirements.
- Provide strategic leadership and mentorship to drive team growth and performance.
- Design scalable and resilient systems architecture.
- Recruit, mentor, and retain high-performing SRE talent.
- Develop growth and training plans for SRE team members.
- Foster a reliability-focused, customer-centric team culture.
Qualifications and Competencies
- Bachelor's degree in Computer Science or a related field.
- Cloud certification in AWS, Azure, or GCP preferred.
- 8+ years in Software Engineering or Site Reliability Engineering.
- 3+ years in team management or technical leadership.
- Expert-level Linux administration, scripting, and troubleshooting.
- Strong hands-on experience with CI/CD and SDLC practices.
- Deep passion for automation, security, and self-service.
- Proficient in AWS, GCP, and/or Azure cloud platforms.
- Skilled in infrastructure-as-code tools like Terraform, CloudFormation, Helm, and Ansible.
- Experienced with containers, Kubernetes, and microservice architectures.
- Excellent verbal and written communication skills.
Why Join Us
At Dropsuite, now proudly part of NinjaOne, we are on a mission to safeguard business information and help businesses stay in business. We are a global, fast-growing, partner-centric company building secure, scalable, and highly usable cloud backup technologies for businesses of all sizes. Today, we perform billions of backups daily for organizations across more than 100 countries.
As we enter an exciting new chapter with NinjaOne—a leader in endpoint management, security, and IT automation—our combined strengths enable us to drive even greater impact, innovation, and global scale. Together, we are building a world-class platform that empowers IT teams with simplicity, performance, and reliability.
At our core, we are a team of hungry owners: we are tenacious in our pursuit of excellence and take full ownership in everything we do. We are deeply customer-focused, collaborative, and solutions-driven. We play as a team—respecting, supporting, and elevating one another every step of the way.
Join us as we shape the future of IT and data protection—powered by passion, purpose, and the spirit of ownership.
Rewards That Go Beyond
- Competitive compensation
- Hybrid work model
- 18 days of annual leave (with accrual up to 20 days)
- Entitled to Singapore Public Holidays
- Other leave benefits, such as Wedding leave
- Health Insurance for you and your dependents
- Growth opportunities
- Work in a global company with meaningful work, highly skilled colleagues, and an amazing culture
Diversity and Inclusion Statement
Dropsuite is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.
As part of our recruitment process, we may collect personal data to support hiring-related activities such as screening, assessment, and communication. This information is collected solely for recruitment purposes and handled in accordance with applicable data protection and privacy regulations. Your data will be treated with strict confidentiality and used only to facilitate your application with us.
Your Career Growth Starts Here. Apply Now
Troubleshooting
Scalability
Operational Excellence
Kubernetes
Azure
Ubuntu
Software Engineering
Scripting
Reliability
Administration Management
Reliability Engineering
Technical Consultation
GCP
Ansible
Linux
Principal Network Development Engineer - Network Reliability Engineering
Posted today
Job Description
About the Role:
As a Principal Engineer within NRE, you will be responsible for ensuring the reliability, scalability, and security of OCI's network infrastructure. You will apply engineering principles to measure and automate the network’s reliability, aligning it with Oracle’s service-level objectives. This role will involve resolving complex network issues, collaborating across teams, and driving automation efforts that enhance the overall operational efficiency of the OCI network. You'll work with a team dedicated to proactively preventing network disruptions, performing root-cause analysis, and delivering innovative solutions that ensure the smooth operation of a global network environment.
What You'll Do:
- Lead Network Reliability Efforts: Develop, automate, and optimize network services that ensure high availability and performance across OCI’s global infrastructure.
- Network Lifecycle Management: Drive key programs to manage and maintain the network lifecycle, defining objectives and coordinating delivery milestones to meet organizational goals.
- Troubleshoot and Resolve Complex Network Issues: Serve as the technical expert for network events, providing Tier 2 support and leading efforts to quickly restore services.
- Drive Automation: Develop scripts and automation tools to improve operational efficiency, reduce manual interventions, and support a rapidly evolving network environment.
- Collaborate Across Teams: Work closely with cross-functional teams (including engineering, product, and vendor partners) to design, implement, and optimize network solutions that meet the needs of both the business and end-users.
- Mentor and Lead: Provide technical leadership and mentorship to junior engineers, helping them develop their skills and grow within the organization.
- Innovate and Influence: Contribute to the roadmap for new network technologies, tools, and methodologies that enhance OCI’s network performance and reliability.
Career Level - IC4
Principal Network Development Engineer - Network Reliability Engineering
Posted 26 days ago
Job Description
**About the Role:**
As a Principal Engineer within NRE, you will be responsible for ensuring the reliability, scalability, and security of OCI's network infrastructure. You will apply engineering principles to measure and automate the network's reliability, aligning it with Oracle's service-level objectives. This role will involve resolving complex network issues, collaborating across teams, and driving automation efforts that enhance the overall operational efficiency of the OCI network. You'll work with a team dedicated to proactively preventing network disruptions, performing root-cause analysis, and delivering innovative solutions that ensure the smooth operation of a global network environment.
**What You'll Do:**
+ **Lead Network Reliability Efforts**: Develop, automate, and optimize network services that ensure high availability and performance across OCI's global infrastructure.
+ **Network Lifecycle Management**: Drive key programs to manage and maintain the network lifecycle, defining objectives and coordinating delivery milestones to meet organizational goals.
+ **Troubleshoot and Resolve Complex Network Issues**: Serve as the technical expert for network events, providing Tier 2 support and leading efforts to quickly restore services.
+ **Drive Automation**: Develop scripts and automation tools to improve operational efficiency, reduce manual interventions, and support a rapidly evolving network environment.
+ **Collaborate Across Teams**: Work closely with cross-functional teams (including engineering, product, and vendor partners) to design, implement, and optimize network solutions that meet the needs of both the business and end-users.
+ **Mentor and Lead**: Provide technical leadership and mentorship to junior engineers, helping them develop their skills and grow within the organization.
+ **Innovate and Influence**: Contribute to the roadmap for new network technologies, tools, and methodologies that enhance OCI's network performance and reliability.
Career Level - IC4
**Responsibilities**
**What You'll Need to Succeed:**
+ **Technical Expertise**: Extensive experience in network engineering, with a strong background in protocols like **MPLS, BGP, OSPF, IS-IS, TCP/IP, IPv4, IPv6, DNS**, and **DHCP**. Experience with **VxLAN**, **EVPN**, and **SDN technologies** is a plus.
+ **Automation Skills**: Proficiency in scripting or programming, ideally with **Python**, to develop solutions that automate network operations and troubleshooting.
+ **Deep Understanding of Networking**: Strong knowledge of networking protocols, monitoring tools, telemetry solutions, and network modeling techniques (e.g., **YANG, OpenConfig, NETCONF**).
+ **Experience in Cloud or ISP Environments**: Proven track record in large-scale cloud or ISP network environments, ideally supporting complex, multi-cloud infrastructures.
+ **Problem-Solving Mindset**: Excellent analytical and troubleshooting skills, with a focus on proactive identification and resolution of network issues.
+ **Collaboration and Leadership**: Ability to work effectively in a fast-paced, cross-functional team environment. Experience leading technical teams or projects is highly desirable.
**Preferred Experience:**
+ Experience with **network modeling** and **automation frameworks** for large-scale networks.
+ Familiarity with **cloud-native network architectures** and modern network management tools.
+ Experience with **network monitoring**, **telemetry** systems, and **telemetry-based decision-making**.
**Additional Information:**
+ This role requires participation in an **on-call rotation** to provide 24/7 support for critical network events and incidents.
+ You will work in a **high-impact, high-visibility role** with opportunities for technical leadership and career advancement.
+ This role is open to Singaporeans and PRs only.
+ This role will involve the successful applicant working on government projects which may require security clearance being obtained and maintained as a condition of employment.
**What We Offer:**
+ **Impact at Scale**: Work on projects that support millions of users and some of the largest organizations in the world.
+ **Global Reach**: Collaborate with engineers, leaders, and vendors across the globe to build and operate Oracle Cloud's network.
+ **Innovation and Growth**: Opportunity to work with cutting-edge technologies and drive innovation in a fast-evolving field.
+ **Supportive Culture**: A culture of collaboration, continuous learning, and growth, where your contributions matter.
**About Us**
As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's challenges. We've partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity.
We know that true innovation starts when everyone is empowered to contribute. That's why we're committed to growing an inclusive workforce that promotes opportunities for all.
Oracle careers open the door to global opportunities where work-life balance flourishes. We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing or by calling +1 in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
Senior/Expert Engineer, Site Reliability Engineering (Garena)
Posted 4 days ago
Job Description
- Dive deep into the development lines to learn how every application component works, and promote product scalability, stability, and performance.
- Set up, manage, and maintain product, middleware, and big-data applications and services.
- Perform regular and ad-hoc server-side deployments, performance fine-tuning, and troubleshooting.
- Design and develop automation for our workflows.
- Manage capacity and resources.
- Run full-chain stress tests to improve application performance and remove redundancy.
- Prepare routine operations documentation.
Job Requirements
- Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields.
- Minimum 3 years of relevant full-time working experience in Site Reliability Engineering roles.
- Extensive hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.).
- Extensive hands-on knowledge of Kubernetes and its ecosystem.
- Knowledge of computer networking (TCP/IP, DNS, etc.) and operating systems.
- Hands-on experience with at least one of the following programming languages: Bash, Python, Go.
- Strong analytical and problem-solving skills, with the ability to thrive in high-pressure situations.
- Fast learner and a good team player.
- Detail-oriented, cautious, and prudent.