636 Software Reliability jobs in Singapore

Senior Software Engineer, Site Reliability Engineering

Singapore, Singapore Crypto.com

Posted today

Job Description

full-time

We are a team that designs, develops, maintains, and improves software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped quickly with a lean team. You will be actively involved in designing the components behind scalable applications, from frontend UI to backend infrastructure.
What you’ll be doing
Ensure the entire stack is healthy, with hardware, software, applications, and network operating at optimal performance
Perform deep dives into both systemic and latent reliability issues, partnering with other software and DevOps engineers across the organization to design, implement, and roll out fixes
Continuously improve availability, reliability, and observability, and reduce the burden of human toil with tooling and automation
Lead and drive SRE initiatives to improve operational efficiency
Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
What you need
Experience coding in Ruby and/or Go
Familiarity with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
Experience in designing, analyzing, and troubleshooting large-scale distributed systems
Curiosity about finding root causes in incidents and outages
Ability to build alignment, cultivate relationships, and drive impact
A mindset for designing fault-tolerant system architecture
Comfort with being uncomfortable in ambiguous situations
Involvement with incident management and response
Desire to grow expertise, inform, and educate others
Ability to pick up various technologies quickly, with a “get things done” mentality
Humility to embrace better ideas from others, eagerness to make things better, and openness to challenges and possibilities
Desirable
Familiar with cloud platforms and microservice-based architecture (AWS is a big plus)
Familiar with monitoring tools (e.g. Datadog, OpenTelemetry)
Familiar with CI/CD tools (e.g. GitHub Actions)
Familiar with IaC tools (e.g. Terraform, Spacelift)
Experience in designing resilient system architecture
Experience in optimizing the performance of large-scale production systems
Life @ Crypto.com
Empowered to think big. Try new opportunities while working with a talented, ambitious and supportive team.
Transformational and proactive working environment. Empower employees to find thoughtful and innovative solutions.
Growth from within. We help you develop new skill sets that shape your personal and professional growth.
Work Culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.
One cohesive team. Engage stakeholders to achieve our ultimate goal - Cryptocurrency in every wallet.
Work Flexibility Adoption. Flexible working hours and a hybrid or remote set-up.
Explore career alternatives through us - our internal mobility program offers employees a new scope.
Work Perks: Crypto.com Visa card provided upon joining.
Benefits
Competitive salary.
Attractive annual leave entitlement, including birthday and work anniversary leave.
Work Flexibility Adoption. Flexible working hours and a hybrid or remote set-up.
Explore career alternatives through us. Our internal mobility program can offer employees a diverse scope.
Work Perks: Crypto.com Visa card provided upon joining.
Our Crypto.com benefits packages vary depending on region requirements; you can learn more from our talent acquisition team.
About Crypto.com:
Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.
Learn more at
Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.
Personal data provided by applicants will be used for recruitment purposes only.
Please note that only shortlisted candidates will be contacted.

Reliability Engineering Manager - Information Security

Singapore, Singapore $180000 - $250000 Y Apple

Posted today

Job Description

Imagine what you could accomplish here. Bring your passion, creativity, and dedication, and there will be no limit to what you can achieve. This is not just another SRE role; it's a chance to help redefine how reliability engineering is practiced at hyper-scale. Our team is building the platforms that will autonomously operate Apple's core information security systems, setting a new bar for how critical services are managed.

Description

We are seeking exceptional engineers who thrive at the intersection of reliability, software development and automation - individuals driven to push the boundaries of what's possible. The ideal candidate has a strong foundation in modern SRE practices and a proven ability to design and implement software that solves operational challenges. You'll break new ground using the most advanced tools and approaches available, developing automation that doesn't just keep pace with scale but anticipates, reacts and stays ahead of it. You will work closely with Security Engineering, Threat Detection, Incident Response and other internal functions to ensure the scalability, availability and security of the tools and infrastructure that support our cybersecurity mission. Join us, and help build the future of self-managing systems at one of the most innovative companies in the world.

Responsibilities

  • Inspire, mentor, and grow a high-performing team of SREs dedicated to automating and scaling Apple's core security platforms.
  • Champion operational excellence by building resilient monitoring, alerting, and automated remediation practices that minimize downtime and manual effort.
  • Advance infrastructure-as-code and automation to eliminate toil, improve consistency, and accelerate delivery of secure, reliable services.
  • Partner closely with InfoSec stakeholders to translate security requirements into scalable, supportable, and performant solutions.
  • Own the reliability of critical security systems (including SIEM, SOAR, telemetry, and vulnerability management), ensuring availability, performance, and capacity keep pace with business demand.
  • Lead incident response with confidence, driving resolution of outages and infrastructure issues while fostering a blameless, learning-oriented culture.
  • Define and enforce SLOs/SLIs for InfoSec services, using data to measure success and continuously improve.
  • Collaborate across engineering and IT to embed best practices in CI/CD, containerization, and service orchestration.
  • Uphold strong security hygiene and compliance, aligning with both internal standards and external regulatory requirements.
  • Set direction and priorities for the team, managing resources, timelines, and initiatives to maximize impact.

Minimum Qualifications

  • 5+ years of experience in SRE or Service Infrastructure roles, including 2+ years in a leadership or managerial role
  • Strong understanding of modern SRE practices, including observability, automation, and reliability engineering
  • Experience with cloud platforms (AWS, GCP) and infrastructure-as-code tools (Pulumi, Terraform, Ansible, etc.)
  • Familiarity with container technologies (Docker, Kubernetes) and CI/CD pipelines
  • Excellent communication skills with an ability to collaborate across technical and non-technical teams

Preferred Qualifications

  • Bachelor's degree in Computer Science or a related field, or equivalent practical experience
  • Prior experience working in or closely with Information Security teams
  • The ability to contribute to and review code in Python, Go, Swift, or other scripting languages
  • Experience operating with Scrum/Agile development methodologies
  • Ability to cultivate an environment that emphasizes collaboration, accountability, and excellence
  • Experience managing systems that support InfoSec functions (e.g., security monitoring, log aggregation, scanning tools)
  • Ability to work under pressure and manage difficult situations in a dynamic work environment
  • Passion for high-quality code, unit-tests, documentation, and production services
  • Previous experience working on a global team with a 24/7 support model


Reliability Engineering Senior Manager /MTS

Singapore, Singapore Systems on Silicon Manufacturing

Posted today

Job Description

SSMC (Systems on Silicon Manufacturing Company Pte. Ltd.) is a joint venture between NXP and TSMC. We offer flexible and cost-effective semiconductor fabrication solutions by maintaining a fully equipped SMIF cleanroom environment, 100% equipment automation, and proven wafer-manufacturing processes.
We're looking for innovative, passionate, and talented people like you to join our team.
We’re searching for a Manager /Senior MTS to be part of our QRE Department diverse team of talent, to support Reliability Laboratory Operations and Manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). Lead High Voltage (HV) Process Technologies Reliability Tests & Support for Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations.
What you will be working on:
Lead and Setup New Process Technology Reliability Qualification
Define and Execute New Process Technology Reliability Qualification Plan Requirements to meet Technology Milestones requirements
Lead and Setup New Process Technology Reliability Monitoring
Conduct Process/Wafer Level Reliability (WLR) Tests and Analysis
Conduct Product Level Reliability (PLR) Tests and Analysis
Support Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations and Perform Reliability Risk Assessments
Develop and Setup New or Enhanced Process and Product Reliability Tests / Analysis / Methodologies / Capabilities / Techniques
Schedule & Prioritize Reliability Tests Requests (Manpower, Skills, Tool resources)
Keep in line with Industry and Mother-fabs’ Reliability Tests & Requirement Trends / Development
Support Reliability Laboratory Operations and Manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). Maintain Day-to-Day Reliability Laboratory Operations, Equipment Uptime
Drive Continuous Improvement in Safety, Quality, Productivity of work processes and environment to achieve assigned department targets
Training, Coaching and Development of Reliability Engineers
More about you:
Master / Degree in Science or Engineering in Mechanical, Chemical Engineering or equivalent
Extensive Experience: >10 years in Wafer Fab / Semiconductor Environment and Leading Role in WLR / PLR Reliability.
In-depth understanding of Technologies, Trends and Needs
Experience with major Process Technologies like Automotive, Logic, High Voltage, FLASH / EE / Non-Volatile-Memory (NVM), General Purpose Processes.
In-depth Knowledge of Front-End / Back-End Reliability Mechanisms and Test Methodology (GOI, TDDB, HCI, NBTI, BTS, JS, PID, ESD, LU, EM, SV, Low-K IMD) (HTOL, EFR, IFR, THB, HAST, TMCL, TH, HTS, Pre-Con, Reflow)
Good knowledge of International Standards & Requirements on Process & Product Reliability (AEC-Q100, JEDEC, JEP001)
SSMC is firmly committed to upholding equal employment opportunities for all individuals. We strictly adhere to the Tripartite Guidelines on Fair Employment Practices (TGFEP), the Singapore Food Safety and Security Act 2025 (FSSA 2025), and the Singapore Code of Advertising Practice. All qualified applicants will receive non-discriminatory consideration for employment on the basis of merit and regardless of age, race, gender, religion, marital status and family responsibilities, or disability, or any other attributes as protected by the relevant laws.

Reliability Engineering Senior Manager /MTS

519527 $11000 Monthly SYSTEMS ON SILICON MANUFACTURING COMPANY PTE LTD

Posted 10 days ago

Job Description

SSMC (Systems on Silicon Manufacturing Company Pte. Ltd.) is a joint venture between NXP and TSMC. We offer flexible and cost-effective semiconductor fabrication solutions by maintaining a fully equipped SMIF cleanroom environment, 100% equipment automation, and proven wafer-manufacturing processes.


We're looking for innovative, passionate, and talented people like you to join our team.


We’re searching for a Manager /Senior MTS to be part of our QRE Department diverse team of talent, to support Reliability Laboratory Operations and Manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). Lead High Voltage (HV) Process Technologies Reliability Tests & Support for Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations.


What you will be working on:

  • Lead and Setup New Process Technology Reliability Qualification
  • Define and Execute New Process Technology Reliability Qualification Plan Requirements to meet Technology Milestones requirements
  • Lead and Setup New Process Technology Reliability Monitoring
  • Conduct Process/Wafer Level Reliability (WLR) Tests and Analysis
  • Conduct Product Level Reliability (PLR) Tests and Analysis
  • Support Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations and Perform Reliability Risk Assessments
  • Develop and Setup New or Enhanced Process and Product Reliability Tests / Analysis / Methodologies / Capabilities / Techniques
  • Schedule & Prioritize Reliability Tests Requests (Manpower, Skills, Tool resources)
  • Keep in line with Industry and Mother-fabs’ Reliability Tests & Requirement Trends / Development
  • Support Reliability Laboratory Operations and Manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). Maintain Day-to-Day Reliability Laboratory Operations, Equipment Uptime
  • Drive Continuous Improvement in Safety, Quality, Productivity of work processes and environment to achieve assigned department targets
  • Training, Coaching and Development of Reliability Engineers

More about you:

  • Master / Degree in Science or Engineering in Mechanical, Chemical Engineering or equivalent
  • Extensive Experience: >10 years in Wafer Fab / Semiconductor Environment and Leading Role in WLR / PLR Reliability.
  • In-depth understanding of Technologies, Trends and Needs
  • Experience with major Process Technologies like Automotive, Logic, High Voltage, FLASH / EE / Non-Volatile-Memory (NVM), General Purpose Processes.
  • In-depth Knowledge of Front-End / Back-End Reliability Mechanisms and Test Methodology (GOI, TDDB, HCI, NBTI, BTS, JS, PID, ESD, LU, EM, SV, Low-K IMD) (HTOL, EFR, IFR, THB, HAST, TMCL, TH, HTS, Pre-Con, Reflow)
  • Good knowledge of International Standards & Requirements on Process & Product Reliability (AEC-Q100, JEDEC, JEP001)


SSMC is firmly committed to upholding equal employment opportunities for all individuals. We strictly adhere to the Tripartite Guidelines on Fair Employment Practices (TGFEP), the Singapore Food Safety and Security Act 2025 (FSSA 2025), and the Singapore Code of Advertising Practice. All qualified applicants will receive non-discriminatory consideration for employment on the basis of merit and regardless of age, race, gender, religion, marital status and family responsibilities, or disability, or any other attributes as protected by the relevant laws.

Principal Specialist, Platforms Reliability Engineering (Networks)

Singapore, Singapore Singtel

Posted today

Job Description

Overview
Singtel Networks is transforming to enable the digital generation of tomorrow. We are introducing new capabilities in 5G, Cloud, Analytics, Digital Commerce, Software Engineering, and Cyber Security to enhance our core competencies and deliver innovative and differentiated services for our customers. We are committed to inclusion and diversity and upskilling all individuals. We build Singtel’s Networks of tomorrow and empower every generation to live, work and play in new ways.
We are an Employer of Choice and strive for a vibrant, diverse and inclusive workforce with a fair, performance-based culture that is collaborative.
Vaccination policy:
We are committed to a safe and healthy environment for our employees and customers and will require all prospective employees to be fully vaccinated.
Responsibilities
Lead the design, development, and operation of scalable platforms including NSB, SO, NDB and NDC to enable Telco APIs and digital services across Consumer and Enterprise domains.
Institutionalize DevOps methodologies infused with AI to optimize planning, coding, testing, and deployment cycles; deliver secure, scalable microservices infrastructure through AI-enhanced CI/CD pipelines and cloud-native technologies, enabling self-optimizing, resilient, and adaptive systems.
Research and explore new technologies in platform engineering, automation, cybersecurity, and cloud computing (hybrid/multi/edge) for incorporation into platform architecture and solutions.
Manage and align various teams and stakeholders, including top management, to ensure timely and secure delivery of key building blocks for Autonomous Networks.
Collaborate with business units to gather and prioritize requirements for platform delivery.
Oversee service management for production platforms, ensuring reliable operations in accordance with network SLAs and IMDA regulations through proactive monitoring and reporting.
Manage change processes to ensure software releases align with internal change management and deployment protocols.
Lead incident management efforts, including troubleshooting and root cause analysis of platform issues.
Qualifications
Bachelor's Degree in IT/Computer Science/Computer Engineering or relevant discipline.
Minimum 12 years of working experience in DevOps automation, containerization, platform engineering and site reliability engineering.
Experience in platform engineering with a strong understanding of containerization, API gateways, and enterprise integration.
Strong knowledge of software development automation tools (e.g. Ansible, Terraform, Nexus, Jenkins, SoapUI, SonarQube).
Strong scripting skills (e.g. Python, Bash, JavaScript, Ruby).
Strong understanding and experience in virtualization and networking in a container environment, such as OpenShift/Kubernetes.
Strong understanding of cloud computing/container deployment and management (AWS/Azure/OpenStack, etc.).
Breadth of knowledge – OS, system administration, networking, infrastructure, storage, distributed computing, cloud computing.
Strong understanding of Agile projects (SCRUM/KANBAN) and tools (e.g., JIRA).
Experience in project planning and management activities, including finance and procurement, and in translating business requirements into actionable deliverables.
Rewards and Benefits
Full suite of health and wellness benefits
Ongoing training and development programs
Internal mobility opportunities

Senior Manager – Site Reliability Engineering [SRE]

079903 Anson Road, Singapore $15000 Monthly DROPMYSITE PTE. LTD.

Posted 10 days ago

Job Description

Nice to Meet You! We are Dropsuite, a NinjaOne Company!


Site Ops teams are a blend of pragmatic operators and software craftspeople that apply sound engineering principles, operational discipline, and mature automation to our operating environments.


We are seeking a seasoned Senior Manager – Site Reliability Engineering (SRE) to lead a high-impact team focused on building resilient, scalable infrastructure and ensuring platform reliability across our cloud environments. This role combines strategic leadership with deep technical expertise in automation, observability, and modern DevOps practices to drive operational excellence and service uptime.


Work Arrangement

  • Full-time position
  • Hybrid work model (2 days per week in the office)
  • Monday to Friday, 5-day work week (flexible work schedule)
  • Eligible to reside and work in Singapore (Singapore Citizens / PRs preferred)


This position is open exclusively to candidates who reside in and are authorised to work in Singapore. Only shortlisted candidates will be contacted.


Key Accountabilities

  • Define and implement SRE roadmaps aligned with business objectives and SLAs.
  • Collaborate with service owners to define SLOs supporting SLA commitments.
  • Deliver platform SLI insights through reports and observability tools.
  • Integrate reliability best practices into engineering and product workflows.
  • Lead initiatives on uptime, monitoring, incident response, and optimization.
  • Manage incident response processes, on-call rotations, and playbooks.
  • Set infrastructure resiliency standards for cloud-native environments.
  • Optimize architecture for scalability, fault tolerance, and cost efficiency.
  • Ensure production systems meet security and compliance requirements.
  • Provide strategic leadership and mentorship to drive team growth and performance.
  • Design scalable and resilient systems architecture.
  • Recruit, mentor, and retain high-performing SRE talent.
  • Develop growth and training plans for SRE team members.
  • Foster a reliability-focused, customer-centric team culture.

Qualifications and Competencies

  • Bachelor's degree in Computer Science or a related field.
  • Cloud certification in AWS, Azure, or GCP preferred.
  • 8+ years in Software Engineering or Site Reliability Engineering.
  • 3+ years in team management or technical leadership.
  • Expert-level Linux administration, scripting, and troubleshooting.
  • Strong hands-on experience with CI/CD and SDLC practices.
  • Deep passion for automation, security, and self-service.
  • Proficient in AWS, GCP, and/or Azure cloud platforms.
  • Skilled in infrastructure-as-code tools like Terraform, CloudFormation, Helm, and Ansible.
  • Experienced with containers, Kubernetes, and microservice architectures.
  • Excellent verbal and written communication skills.

Why Join Us
At Dropsuite, now proudly part of NinjaOne, we are on a mission to safeguard business information and help businesses stay in business. We are a global, fast-growing, partner-centric company building secure, scalable, and highly usable cloud backup technologies for businesses of all sizes. Today, we perform billions of backups daily for organizations across more than 100 countries.


As we enter an exciting new chapter with NinjaOne—a leader in endpoint management, security, and IT automation—our combined strengths enable us to drive even greater impact, innovation, and global scale. Together, we are building a world-class platform that empowers IT teams with simplicity, performance, and reliability.


At our core, we are a team of hungry owners: we are tenacious in our pursuit of excellence and take full ownership in everything we do. We are deeply customer-focused, collaborative, and solutions-driven. We play as a team—respecting, supporting, and elevating one another every step of the way.


Join us as we shape the future of IT and data protection—powered by passion, purpose, and the spirit of ownership.


Rewards That Go Beyond

  • Competitive compensation
  • Hybrid work model
  • 18 days of annual leave (with accrual up to 20 days)
  • Entitled to Singapore Public Holidays
  • Other leave benefits, such as Wedding leave
  • Health Insurance for you and your dependents
  • Growth opportunities
  • Work in a global company with meaningful work, highly skilled colleagues, and an amazing culture

Diversity and Inclusion Statement


Dropsuite is an Equal Employment Opportunity and Affirmative Action Employer. Qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status.


As part of our recruitment process, we may collect personal data to support hiring-related activities such as screening, assessment, and communication. This information is collected solely for recruitment purposes and handled in accordance with applicable data protection and privacy regulations. Your data will be treated with strict confidentiality and used only to facilitate your application with us.


Your Career Growth Starts Here. Apply Now!

Backend Software Engineer (Architect) - Reliability - Singapore

048583 Raffles Quay, Singapore $20 Monthly TIKTOK PTE. LTD.

Posted 7 days ago

Job Description

About TikTok

TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.


Why Join Us

Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy - a mission we work towards every day.

We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.

Diversity & Inclusion

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.


Job highlights

Positive team atmosphere, Career growth opportunity, Meals provided


Responsibilities

Team introduction:

Build Reliability at Global Scale

Every time a short video is posted or viewed on TikTok, our team is working behind the scenes to make sure it happens instantly and reliably. The Short Video Reliability team blends deep systems expertise with large-scale architecture design to keep TikTok running smoothly for billions of users. We don’t just keep things stable — we design for the unexpected. Whether it’s a viral trend flooding the platform, a major global event, a cross-region migration, or disaster recovery, our systems are built to adapt and thrive.

We’re now looking for experienced engineers and architects to join our Singapore team. In this role, you’ll design, build, and scale the core reliability infrastructure that underpins TikTok’s short video ecosystem. Your work will directly shape the performance, resilience, and evolution of one of the most-used platforms in the world.


Responsibilities:

- Architect and build self-healing systems that adapt to infrastructure changes, migrations, and global-scale challenges

- Design smart traffic and load management to keep performance steady during viral spikes, large events, and global campaigns

- Develop monitoring, alerting, and automation that spots and fixes issues before they affect users

- Lead the creation of reliability frameworks for topology mapping, capacity planning, automated recovery, and disaster readiness

- Continuously refine system architecture for better performance, fault tolerance, and maintainability

- Apply chaos engineering, fault injection, and failure simulations to stress-test our systems

- Use A/B testing to measure the real-world impact of your improvements

- Mentor junior engineers and help set the team’s technical direction


Qualifications

Minimum Qualifications:

- 5+ years in backend, infrastructure, or reliability engineering

- Strong coding skills in Python, Go, Java, C++, or similar

- Solid grasp of distributed systems, networking, and fault-tolerant design

- Experience with Linux/Unix and large-scale infrastructure (cloud or on-prem)

- Proven track record delivering high-availability systems in production

- Strong debugging, analysis, and problem-solving skills

- Strong communication and writing skills.


Preferred Qualifications:

- Experience with video platforms, streaming, or CDN optimization

- Background in highly reliable production systems

- Knowledge of service mesh, edge routing, or traffic shaping at scale

- Hands-on experience with chaos engineering and incident response

- Strong system design and technical leadership skills

- Excellent communication and ability to work across global teams


Principal Network Development Engineer - Network Reliability Engineering

Oracle

Posted 4 days ago

Job Description

**About the Role:**
As a Principal Engineer within NRE, you will be responsible for ensuring the reliability, scalability, and security of OCI's network infrastructure. You will apply engineering principles to measure and automate the network's reliability, aligning it with Oracle's service-level objectives. This role will involve resolving complex network issues, collaborating across teams, and driving automation efforts that enhance the overall operational efficiency of the OCI network. You'll work with a team dedicated to proactively preventing network disruptions, performing root-cause analysis, and delivering innovative solutions that ensure the smooth operation of a global network environment.
**What You'll Do:**
+ **Lead Network Reliability Efforts**: Develop, automate, and optimize network services that ensure high availability and performance across OCI's global infrastructure.
+ **Network Lifecycle Management**: Drive key programs to manage and maintain the network lifecycle, defining objectives and coordinating delivery milestones to meet organizational goals.
+ **Troubleshoot and Resolve Complex Network Issues**: Serve as the technical expert for network events, providing Tier 2 support and leading efforts to quickly restore services.
+ **Drive Automation**: Develop scripts and automation tools to improve operational efficiency, reduce manual interventions, and support a rapidly evolving network environment.
+ **Collaborate Across Teams**: Work closely with cross-functional teams (including engineering, product, and vendor partners) to design, implement, and optimize network solutions that meet the needs of both the business and end-users.
+ **Mentor and Lead**: Provide technical leadership and mentorship to junior engineers, helping them develop their skills and grow within the organization.
+ **Innovate and Influence**: Contribute to the roadmap for new network technologies, tools, and methodologies that enhance OCI's network performance and reliability.
Career Level - IC4
**Responsibilities**
**What You'll Need to Succeed:**
+ **Technical Expertise**: Extensive experience in network engineering, with a strong background in protocols like **MPLS, BGP, OSPF, IS-IS, TCP/IP, IPv4, IPv6, DNS**, and **DHCP**. Experience with **VxLAN**, **EVPN**, and **SDN technologies** is a plus.
+ **Automation Skills**: Proficiency in scripting or programming, ideally with **Python**, to develop solutions that automate network operations and troubleshooting.
+ **Deep Understanding of Networking**: Strong knowledge of networking protocols, monitoring tools, telemetry solutions, and network modeling techniques (e.g., **YANG, OpenConfig, NETCONF**).
+ **Experience in Cloud or ISP Environments**: Proven track record in large-scale cloud or ISP network environments, ideally supporting complex, multi-cloud infrastructures.
+ **Problem-Solving Mindset**: Excellent analytical and troubleshooting skills, with a focus on proactive identification and resolution of network issues.
+ **Collaboration and Leadership**: Ability to work effectively in a fast-paced, cross-functional team environment. Experience leading technical teams or projects is highly desirable.
**Preferred Experience:**
+ Experience with **network modeling** and **automation frameworks** for large-scale networks.
+ Familiarity with **cloud-native network architectures** and modern network management tools.
+ Experience with **network monitoring**, **telemetry** systems, and **telemetry-based decision-making**.
**Additional Information:**
+ This role requires participation in an **on-call rotation** to provide 24/7 support for critical network events and incidents.
+ You will work in a **high-impact, high-visibility role** with opportunities for technical leadership and career advancement.
+ This role is open to Singaporeans and PRs only.
+ This role will involve the successful applicant working on government projects which may require security clearance being obtained and maintained as a condition of employment.
**What We Offer:**
+ **Impact at Scale**: Work on projects that support millions of users and some of the largest organizations in the world.
+ **Global Reach**: Collaborate with engineers, leaders, and vendors across the globe to build and operate Oracle Cloud's network.
+ **Innovation and Growth**: Opportunity to work with cutting-edge technologies and drive innovation in a fast-evolving field.
+ **Supportive Culture**: A culture of collaboration, continuous learning, and growth, where your contributions matter.
**About Us**
As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's challenges. We've partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity.
We know that true innovation starts when everyone is empowered to contribute. That's why we're committed to growing an inclusive workforce that promotes opportunities for all.
Oracle careers open the door to global opportunities where work-life balance flourishes. We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing or by calling in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

Senior/Expert Engineer, Site Reliability Engineering (SRE)

Singapore, Singapore Garena

Posted today

Singapore | Engineering and Technology | Experienced (Individual Contributor)
Job Description
Deep dive into development lines, learning and understanding the mechanism of every application component, and promoting product scalability, stability and performance.
Set up, manage and maintain product/middleware/big-data applications and services.
Perform regular and ad-hoc server-side deployments, performance fine-tuning and troubleshooting.
Design and develop automations for our workflow.
Capacity and Resource management.
Responsible for the full-chain stress test to enhance the performance and remove redundancy of applications.
Prepare routine operation documentation.
Job Requirements
Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields.
Minimum 3 years of relevant full-time working experience in Site Reliability Engineer roles.
Extensive, hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.).
Extensive, hands-on knowledge of Kubernetes and its ecosystem.
Knowledge of computer networking (TCP/IP, DNS, etc.) and operating systems.
Hands-on experience with at least one of the following programming languages: Bash, Python, Go.
Strong analytical and problem-solving skills with the ability to thrive under high-pressure situations.
Fast learner and a good team player.
Detail-oriented, cautious and prudent.
Kindly note that you can only be considered for one role at a time with any of the companies within our Group. If you have applied for other jobs with the Group, you will be considered for roles in the order of your application.

Site Reliability Engineering, Edge Services - Traffic Infrastructure Singapore Regular

Singapore, Singapore ByteDance

Posted today

Job Description

Overview
Site Reliability Engineering, Edge Services - Traffic Infrastructure
Location:
Team: Technology
Employment Type: Regular
Job Code: A03452
Responsibilities
Architect and implement solutions that enable both internal and external customers to harness the power of ByteDance’s globally scaled content delivery network.
Build metrics, tools, automations, visualizations and monitors to facilitate the operation and optimization of the edge services.
Develop procedures and workflows that improve efficiency, foster trust, and ensure compliance in operational processes.
Run vulnerability and capacity assessment and develop disaster recovery strategies to ensure high availability of our global CDN services.
Work in a fast-paced environment. Participate in technical operations and rotations in response to performance and reliability issues.
Qualifications
Minimum qualifications
Master’s degree (or Bachelor's degree with 2+ years of experience) in Computer Engineering, Electrical Engineering, Computer Science or a related major.
2+ years working experience in the field of CDN performance engineering, solution architecting or site reliability engineering roles.
2+ years experience in one or more programming languages such as Java, C++, Go, or scripting experience in Shell and Python.
Preferred qualifications
Self-driven and capable of coping with ambiguity and moving projects from concept to delivery.
Experience in operating in a multi-CDN environment.
Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment. Past experience with CDN technologies is a plus.
Strong analytical skills and the ability to solve real-world problems in a fast-moving environment.
Experience in designing, analyzing and building automation and tools for large-scale systems.
Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.