1,217 Software Reliability jobs in Singapore
Senior Software Reliability Engineer
Posted today
Job Description
Job Overview
We are seeking a seasoned professional to enhance our software development lifecycle through automation, security integration, and infrastructure reliability.
In this role, you will collaborate with cross-functional teams to streamline CI/CD workflows, embed security controls into every stage of the software delivery process, and ensure compliance, especially for solutions tailored to the public sector.
Key Responsibilities:
- Design and manage automated CI/CD pipelines that include security, quality assurance, and compliance gates throughout the delivery process.
- Automate the provisioning and configuration of infrastructure and applications using modern tools and practices.
- Integrate code analysis tools such as SAST, DAST, and SCA into the development pipeline to identify and remediate vulnerabilities early.
- Monitor and fine-tune deployment processes, system performance, and security posture across environments.
- Promote secure coding standards and support the adoption of policies to improve code integrity and system resilience.
- Work closely with developers, QA, and system administrators to ensure efficient and secure software deployment.
- Operate and support container platforms and cloud-native technologies (e.g., Docker, Kubernetes, public cloud services).
- Maintain technical documentation and continuously improve DevSecOps toolchains and practices.
Requirements:
- Minimum 2-3 years in DevOps or DevSecOps roles, with a strong focus on automation and security integration.
- Proficiency with CI/CD tools (e.g., GitLab CI/CD, Jenkins) and security tools such as Sonatype or AquaSec.
- Solid hands-on experience with container technologies and orchestration platforms (e.g., Docker, Kubernetes).
- Familiarity with cloud environments (AWS or Azure) and infrastructure automation using tools like Terraform or Ansible.
- Strong understanding of DevSecOps principles and best practices for secure software delivery.
- Experience working in Agile or Scrum-based development teams.
Skills:
- Kubernetes
- Azure
- Pipelines
- IT Infrastructure Design
- Scripting
- Administration
- Cloud Infrastructure
- Networking
- Python
- Containerization
- Cloud Services
- Network Infrastructure
- Docker
- Infrastructure
- Ansible
- System Architecture
- Linux
Senior Software Engineer, Site Reliability Engineering
Posted 7 days ago
Job Description
Our team designs, develops, maintains, and improves software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure.
What you’ll be doing
- Ensure the entire stack is healthy: hardware, software, application, and network all operate at optimal performance
- Perform deep dives into both systemic and latent reliability issues; partnering with other software and DevOps engineers across the organization to design, implement and roll out fixes
- Continuously improve availability, reliability, and observability and reduce the burden of human toil with tooling and automation
- Lead and drive SRE initiatives to improve operational efficiency
- Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
What you need
- Experience coding in Ruby and/or Go
- Familiar with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Curiosity about finding root causes in incidents and outages
- Ability to build alignment, cultivate relationships, and drive impact
- Mindset for designing fault-tolerant system architecture
- Comfort with being uncomfortable in ambiguous situations
- Involvement with incident management and response
- Desire to grow expertise, inform, and educate others
- Able to pick up various technologies quickly, with a “get things done” mentality
- Humble to embrace better ideas from others, eager to make things better, open to challenges and possibilities
Desirable
- Familiar with cloud platforms and microservice-based architecture (AWS is a big plus)
- Familiar with monitoring tools (e.g. Datadog, OpenTelemetry)
- Familiar with CI/CD tools (e.g. GitHub Actions)
- Familiar with IaC tools (e.g. Terraform, Spacelift)
- Experience in designing resilient system architecture
- Experience in optimizing the performance of large-scale production systems
Life @ Crypto.com
Empowered to think big. Try new opportunities while working with a talented, ambitious and supportive team.
Transformational and proactive working environment. Empower employees to find thoughtful and innovative solutions.
Growth from within. We help to develop new skill-sets that would impact the shaping of your personal and professional growth.
Work Culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.
One cohesive team. Engage stakeholders to achieve our ultimate goal - Cryptocurrency in every wallet.
Work Flexibility Adoption. Flexi-work hours and hybrid or remote set-up.
Aspire career alternatives through us - our internal mobility program offers employees a new scope.
Work Perks: crypto.com visa card provided upon joining.
Benefits
Competitive salary.
Attractive annual leave entitlement including: birthday, work anniversary.
Work Flexibility Adoption. Flexi-work hours and hybrid or remote set-up.
Aspire career alternatives through us. Our internal mobility program can offer employees a diverse scope.
Work Perks: crypto.com visa card provided upon joining.
Our Crypto.com benefits packages vary depending on region requirements; you can learn more from our talent acquisition team.
About Crypto.com:
Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest-growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.
Learn more at
Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.
Personal data provided by applicants will be used for recruitment purposes only.
Please note that only shortlisted candidates will be contacted.
Senior Software Engineer, Site Reliability Engineering
Posted today
Job Description
Our team designs, develops, maintains, and improves software for various ventures projects, i.e., projects that are adjacent to our core businesses and are bootstrapped fast with a lean team. You will be actively involved in the design of various components behind scalable applications, from frontend UI to backend infrastructure.
What you’ll be doing
- Ensure the entire stack is healthy: hardware, software, application, and network all operate at optimal performance
- Perform deep dives into both systemic and latent reliability issues; partnering with other software and DevOps engineers across the organization to design, implement and roll out fixes
- Continuously improve availability, reliability, and observability and reduce the burden of human toil with tooling and automation
- Lead and drive SRE initiatives to improve operational efficiency
- Represent the SRE team in system design reviews and operational readiness exercises for new and existing services
What you need
- Experience coding in Ruby and/or Go
- Familiar with GitOps principles and tools (GitHub Actions, Docker, Kubernetes)
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems
- Curiosity about finding root causes in incidents and outages
- Ability to build alignment, cultivate relationships, and drive impact
- Mindset for designing fault-tolerant system architecture
- Comfort with being uncomfortable in ambiguous situations
- Involvement with incident management and response
- Desire to grow expertise, inform, and educate others
- Able to pick up various technologies quickly, with a “get things done” mentality
- Humble to embrace better ideas from others, eager to make things better, open to challenges and possibilities
Desirable
- Familiar with cloud platforms and microservice-based architecture (AWS is a big plus)
- Familiar with monitoring tools (e.g. Datadog, OpenTelemetry)
- Familiar with CI/CD tools (e.g. GitHub Actions)
- Familiar with IaC tools (e.g. Terraform, Spacelift)
- Experience in designing resilient system architecture
- Experience in optimizing the performance of large-scale production systems
Life @ Crypto.com
Empowered to think big. Try new opportunities while working with a talented, ambitious and supportive team.
Transformational and proactive working environment. Empower employees to find thoughtful and innovative solutions.
Growth from within. We help to develop new skill-sets that would impact the shaping of your personal and professional growth.
Work Culture. Our colleagues are some of the best in the industry; we are all here to help and support one another.
One cohesive team. Engage stakeholders to achieve our ultimate goal - Cryptocurrency in every wallet.
Work Flexibility Adoption. Flexi-work hours and hybrid or remote set-up.
Aspire career alternatives through us - our internal mobility program offers employees a new scope.
Work Perks: crypto.com visa card provided upon joining.
Benefits
Competitive salary.
Attractive annual leave entitlement including: birthday, work anniversary.
Work Flexibility Adoption. Flexi-work hours and hybrid or remote set-up.
Aspire career alternatives through us. Our internal mobility program can offer employees a diverse scope.
Work Perks: crypto.com visa card provided upon joining.
Our Crypto.com benefits packages vary depending on region requirements; you can learn more from our talent acquisition team.
About Crypto.com:
Founded in 2016, Crypto.com serves more than 80 million customers and is the world's fastest-growing global cryptocurrency platform. Our vision is simple: Cryptocurrency in Every Wallet. Built on a foundation of security, privacy, and compliance, Crypto.com is committed to accelerating the adoption of cryptocurrency through innovation and empowering the next generation of builders, creators, and entrepreneurs to develop a fairer and more equitable digital ecosystem.
Learn more at
Crypto.com is an equal opportunities employer and we are committed to creating an environment where opportunities are presented to everyone in a fair and transparent way. Crypto.com values diversity and inclusion, seeking candidates with a variety of backgrounds, perspectives, and skills that complement and strengthen our team.
Personal data provided by applicants will be used for recruitment purposes only.
Please note that only shortlisted candidates will be contacted.
Reliability Engineering Specialist
Posted today
Job Description
Overview
Join to apply for the Reliability Engineering Specialist role at AMD.
Role
Join a dynamic global team dedicated to advanced reliability testing of module and system boards of AMD's cutting-edge products. Collaborate closely with cross-functional teams across AMD Global Operations & Quality, and Data Center organizations on accelerator-product system setup and reliability testing.
Key Responsibilities
- System-level setup and testing
- Plan, execute, and optimize system-level setups for accelerator products, including server rack and system configurations.
- Ensure seamless integration and functionality of server systems with advanced cooling solutions and environmental management systems.
- Validate and maintain reliability test scripts for automated and manual testing processes.
- Reliability assessment and testing
- Conduct comprehensive reliability assessments of accelerator systems, focusing on mechanical, thermal, and electrical stress factors.
- Design and implement environmental stress tests to simulate data center conditions, including operational stress, thermal cycling, signal, and power integrity.
- Evaluate material interactions and their impact on product reliability, ensuring robustness in diverse operating environments.
- Analyze results to identify potential reliability risks and areas for design improvement.
- Functional testing and fault isolation
- Perform detailed functional testing to evaluate system performance under various operational conditions.
- Identify, isolate, and troubleshoot faults using advanced diagnostic tools and methodologies.
- Failure analysis and reporting
- Perform root cause analysis for identified reliability failures and develop corrective actions for design and process enhancement.
- Collaborate with cross-functional teams to conduct root cause analysis of reliability testing failures.
- Collaboration and documentation
- Work closely with design, manufacturing, and quality teams to align reliability goals with overall product requirements.
- Generate comprehensive reports detailing reliability test results, analysis, and recommendations.
- Maintain meticulous records of testing methodologies and outcomes for future reference and continuous improvement initiatives.
- Mentorship
- Effectively mentor junior engineers, providing guidance in both technical domains and professional skill development to foster growth and team success.
Preferred Experience
- Knowledge of reliability engineering principles, product lifecycle, and standards in high-performance computing environments.
- Proven experience in system-level setup and testing for accelerator products or similar technologies.
- Proficiency in developing and executing reliability test scripts and protocols.
- Familiarity with reliability standards and best practices in high-performance computing environments.
- Familiarity with data center environmental management, server rack/system configurations, and integrated cooling solutions.
- Strong understanding of environmental stress factors, including thermal, mechanical, and electrical stresses, in server systems (L6-L10).
- Expertise in failure analysis techniques, including root cause analysis and fault isolation methodologies.
- Excellent written and verbal communication skills for clear reporting and collaboration.
- Strong analytical, problem-solving, and communication skills.
- Experience with reliability testing tools, simulation software and statistical tools is an added advantage.
- Knowledge in project and risk management is an added advantage.
- Self-starter and able to independently drive tasks to completion.
- Ability to structure and execute complex analysis, draw insights, and communicate summary conclusions/recommendations to senior management and AMD customers/partners.
- Ability to network, build relationships, and collaborate to drive effective decision-making across multiple functions and levels within AMD.
Academic Credentials
- Bachelor’s or Master’s degree in Electrical/Electronics Engineering (EE) or a related field.
Location
Singapore
Benefits offered are described: AMD benefits at a glance. AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
Seniority level
- Not Applicable
- Full-time
Reliability Engineering MTS
Posted 1 day ago
Job Description
Overview
SSMC (Systems on Silicon Manufacturing Company Pte. Ltd.), a joint venture between NXP and TSMC, offers flexible and cost-effective semiconductor fabrication solutions by maintaining a fully equipped SMIF cleanroom environment, 100% equipment automation, and proven wafer-manufacturing processes. We are looking for innovative, passionate, and talented people to join our team.
We’re searching for a Principal Engineer / MTS to join the diverse team in our QRE Department, support Reliability Laboratory Operations, and manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). The role leads High Voltage (HV) Process Technologies Reliability Tests and supports Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations.
What you will be working on
- Lead and Set Up New Process Technology Reliability Qualification
- Define and Execute New Process Technology Reliability Qualification Plan Requirements to meet Technology Milestones
- Lead and Set Up New Process Technology Reliability Monitoring
- Conduct Process/Wafer Level Reliability (WLR) Tests and Analysis
- Conduct Product Level Reliability (PLR) Tests and Analysis
- Support Fab Monitoring / Qualification / Customer Issues / Engineering Change Evaluations and Perform Reliability Risk Assessments
- Develop and Set Up New or Enhanced Process and Product Reliability Tests / Analysis / Methodologies / Capabilities / Techniques
- Schedule & Prioritize Reliability Test Requests (Manpower, Skills, Tool resources)
- Keep in line with Industry and Mother-fabs’ Reliability Tests & Requirement Trends / Development
- Support Reliability Laboratory Operations and Manage PLR and WLR Reliability Test Equipment (Preventive Maintenance, Calibration). Maintain Day-to-Day Reliability Laboratory Operations, Equipment Uptime
- Drive Continuous Improvement in Safety, Quality, Productivity of work processes and environment to achieve assigned department targets
- Training, Coaching and Development of Reliability Engineers
Requirements
- Master's / Degree in Science or Engineering (Mechanical, Chemical Engineering, or equivalent)
- Extensive Experience: >10 years in Wafer Fab / Semiconductor Environment and Leading Role in WLR / PLR Reliability.
- In-depth understanding of Technologies, Trends and Needs
- Experience with major Process Technologies like Automotive, Logic, High Voltage, FLASH / EE / Non-Volatile-Memory (NVM), General Purpose Processes.
- In-depth Knowledge of Front-End / Back-End Reliability Mechanisms and Test Methodology (GOI, TDDB, HCI, NBTI, BTS, JS, PID, ESD, LU, EM, SV, Low-K IMD) (HTOL, EFR, IFR, THB, HAST, TMCL, TH, HTS, Pre-Con, Reflow)
- Good knowledge of International Standards & Requirements on Process & Product Reliability (AEC-Q100, JEDEC, JEP001)
SSMC is firmly committed to upholding equal employment opportunities for all individuals. We strictly adhere to the Tripartite Guidelines on Fair Employment Practices (TGFEP), the Singapore Workplace Fairness Act 2025 (WFA 2025), and the Singapore Code of Advertising Practice. All qualified applicants will receive non-discriminatory consideration for employment on the basis of merit and regardless of age, race, gender, religion, marital status and family responsibilities, or disability, or any other attributes as protected by the relevant laws.
Seniority level
- Mid-Senior level
- Full-time
- Manufacturing, Project Management, and Engineering
- Semiconductor Manufacturing and Industrial Machinery Manufacturing
Reliability Engineering Lead
Posted today
Job Description
Reliability Engineering Lead
We are seeking a Reliability Engineering Lead to drive initiatives within the Quality and Reliability Engineering team.
The successful candidate will oversee laboratory operations, guide reliability testing for advanced process technologies, and ensure equipment and methods meet the highest standards. This leadership role is at the intersection of technology, operations, and mentoring.
Key Responsibilities:
- Oversight of laboratory operations to ensure efficiency and effectiveness
- Guidance of reliability testing for advanced process technologies
- Ensuring equipment and methods meet the highest standards
Requirements:
- Demonstrated experience in reliability engineering and leadership
- Strong understanding of laboratory operations and testing protocols
- Ability to mentor and guide cross-functional teams
Benefits:
- Opportunity to work on cutting-edge technologies
- Chance to develop leadership skills and mentor others
- Collaborative work environment with experienced professionals
About Us:
This is an exciting opportunity to join a dynamic team and contribute to the development of innovative solutions. If you are a motivated individual with a passion for reliability engineering, we encourage you to apply.
Reliability Engineering Expert
Posted today
Job Description
Job Summary:
We are seeking a highly skilled Reliability Engineer to join our team. The successful candidate will be responsible for leading the development and implementation of reliability qualification plans, conducting process and product reliability tests, and providing technical support to ensure the highest quality standards.
Main Responsibilities:
- Lead and develop new process technology reliability qualification plans to meet technology milestones.
- Define and execute new process technology reliability monitoring plans.
- Conduct process and wafer level reliability tests and analysis.
- Support fab monitoring, qualification, customer issues, and engineering change evaluations.
- Develop and implement new or enhanced process and product reliability tests, analysis, methodologies, capabilities, and techniques.
- Schedule and prioritize reliability test requests.
- Stay up-to-date with industry and mother-fab reliability trends and requirements.
Requirements:
- Master's or other degree in science or engineering (mechanical, chemical, or equivalent).
- More than 10 years of experience in wafer fab/semiconductor environment and leading role in WLR/PLR reliability.
- In-depth understanding of technologies, trends, and needs.
- Experience with major process technologies like automotive, logic, high voltage, Flash/EE/non-volatile memory (NVM), and general-purpose processes.
- In-depth knowledge of front-end/back-end reliability mechanisms and test methodology (GOI, TDDB, HCI, NBTI, BTS, JS, PID, ESD, LU, EM, SV, Low-K IMD) (HTOL, EFR, IFR, THB, HAST, TMCL, TH, HTS, Pre-Con, Reflow).
- Good knowledge of international standards & requirements on process & product reliability (AEC-Q100, JEDEC, JEP001).
About Us:
We are an equal opportunities employer and welcome applications from all qualified candidates. We are committed to providing a diverse and inclusive work environment and strive to create a workplace where everyone feels valued and respected.
Reliability Engineering Specialist
Posted today
Job Description
THE ROLE:
Join a dynamic global team dedicated to advanced reliability testing of module and system boards of AMD's cutting-edge products. Collaborate closely with cross-functional teams across AMD Global Operations & Quality, and Data Center organizations on accelerator-product system setup and reliability testing.
KEY RESPONSIBILITIES:
- System-level setup and testing:
- Plan, execute, and optimize system-level setups for accelerator products, including server rack and system configurations.
- Ensure seamless integration and functionality of server systems with advanced cooling solutions and environmental management systems.
- Validate and maintain reliability test scripts for automated and manual testing processes.
- Reliability assessment and testing:
- Conduct comprehensive reliability assessments of accelerator systems, focusing on mechanical, thermal, and electrical stress factors.
- Design and implement environmental stress tests to simulate data center conditions, including operational stress, thermal cycling, signal, and power integrity.
- Evaluate material interactions and their impact on product reliability, ensuring robustness in diverse operating environments.
- Analyze results to identify potential reliability risks and areas for design improvement.
- Functional testing and fault isolation:
- Perform detailed functional testing to evaluate system performance under various operational conditions.
- Identify, isolate, and troubleshoot faults using advanced diagnostic tools and methodologies.
- Failure analysis and reporting:
- Perform root cause analysis for identified reliability failures and develop corrective actions for design and process enhancement.
- Collaborate with cross-functional teams to conduct root cause analysis of reliability testing failures.
- Collaboration and documentation:
- Work closely with design, manufacturing, and quality teams to align reliability goals with overall product requirements.
- Generate comprehensive reports detailing reliability test results, analysis, and recommendations.
- Maintain meticulous records of testing methodologies and outcomes for future reference and continuous improvement initiatives.
- Mentorship:
- Effectively mentor junior engineers, providing guidance in both technical domains and professional skill development to foster growth and team success.
PREFERRED EXPERIENCE:
- Knowledge of reliability engineering principles, product lifecycle, and standards in high-performance computing environments.
- Proven experience in system-level setup and testing for accelerator products or similar technologies.
- Proficiency in developing and executing reliability test scripts and protocols.
- Familiarity with reliability standards and best practices in high-performance computing environments.
- Familiarity with data center environmental management, server rack/system configurations, and integrated cooling solutions.
- Strong understanding of environmental stress factors, including thermal, mechanical, and electrical stresses, in server systems (L6–L10).
- Expertise in failure analysis techniques, including root cause analysis and fault isolation methodologies.
- Excellent written and verbal communication skills for clear reporting and collaboration.
- Strong analytical, problem-solving, and communication skills.
- Experience with reliability testing tools, simulation software and statistical tools is an added advantage.
- Knowledge in project and risk management is an added advantage.
- Self-starter and able to independently drive tasks to completion.
- Ability to structure and execute complex analysis, draw insights, and communicate summary conclusions/recommendations to senior management and AMD customers/partners.
- Ability to network, build relationships, and collaborate to drive effective decision-making across multiple functions and levels within AMD.
ACADEMIC CREDENTIALS:
- Bachelor's or Master's degree in Electrical/Electronics Engineering (EE) or a related field.
LOCATION:
Singapore
Cycling
Manual Testing
Budget Management
Ubuntu
Root Cause Analysis
Reliability
Administration Management
Reliability Engineering
Infrastructure Architecture
RedHat
Technical Consultation
Environmental Management Systems
Technical Engineering
Failure Analysis
Reliability Engineering Leadership
Posted today
Job Description
Lead Reliability Expert:
- Develop and oversee technology qualification programs.
- Direct wafer-level and product-level reliability testing and lab operations.
- Implement test method improvements and optimize lab capabilities to meet evolving needs.
- Support fab monitoring, customer requests, and technical evaluations.
- Evaluate day-to-day lab performance and maintain high equipment uptime.
- Promote a safe working environment with focus on quality and continuous improvement.
- Mentor team members and develop technical skills within the group.
Requirements:
- Degree in Engineering or Science (Mechanical, Chemical, or related).
- At least 10 years of experience in semiconductor/wafer fab, with proven leadership in reliability engineering.
- Deep knowledge of WLR/PLR methods, reliability mechanisms, and industry standards.
- Experience across various process technologies such as automotive, logic, HV, Flash/NVM.
- Familiarity with global standards including AEC-Q100, JEDEC, JEP001.
Reliability Engineering Specialist
Posted 11 days ago
Job Description
THE ROLE:
Join a dynamic global team dedicated to advanced reliability testing of module and system boards of AMD's cutting-edge products. Collaborate closely with cross-functional teams across AMD Global Operations & Quality, and Data Center organizations on accelerator-product system setup and reliability testing.
KEY RESPONSIBILITIES:
- System-level setup and testing:
- Plan, execute, and optimize system-level setups for accelerator products, including server rack and system configurations.
- Ensure seamless integration and functionality of server systems with advanced cooling solutions and environmental management systems.
- Validate and maintain reliability test scripts for automated and manual testing processes.
- Reliability assessment and testing:
- Conduct comprehensive reliability assessments of accelerator systems, focusing on mechanical, thermal, and electrical stress factors.
- Design and implement environmental stress tests to simulate data center conditions, including operational stress, thermal cycling, signal, and power integrity.
- Evaluate material interactions and their impact on product reliability, ensuring robustness in diverse operating environments.
- Analyze results to identify potential reliability risks and areas for design improvement.
- Functional testing and fault isolation:
- Perform detailed functional testing to evaluate system performance under various operational conditions.
- Identify, isolate, and troubleshoot faults using advanced diagnostic tools and methodologies.
- Failure analysis and reporting:
- Perform root cause analysis for identified reliability failures and develop corrective actions for design and process enhancement.
- Collaborate with cross-functional teams to conduct root cause analysis of reliability testing failures.
- Collaboration and documentation:
- Work closely with design, manufacturing, and quality teams to align reliability goals with overall product requirements.
- Generate comprehensive reports detailing reliability test results, analysis, and recommendations.
- Maintain meticulous records of testing methodologies and outcomes for future reference and continuous improvement initiatives.
- Mentorship:
- Effectively mentor junior engineers, providing guidance in both technical domains and professional skill development to foster growth and team success.
PREFERRED EXPERIENCE:
- Knowledge of reliability engineering principles, product lifecycle, and standards in high-performance computing environments.
- Proven experience in system-level setup and testing for accelerator products or similar technologies.
- Proficiency in developing and executing reliability test scripts and protocols.
- Familiarity with reliability standards and best practices in high-performance computing environments.
- Familiarity with data center environmental management, server rack/system configurations, and integrated cooling solutions.
- Strong understanding of environmental stress factors, including thermal, mechanical, and electrical stresses, in server systems (L6–L10).
- Expertise in failure analysis techniques, including root cause analysis and fault isolation methodologies.
- Excellent written and verbal communication skills for clear reporting and collaboration.
- Strong analytical, problem-solving, and communication skills.
- Experience with reliability testing tools, simulation software and statistical tools is an added advantage.
- Knowledge in project and risk management is an added advantage.
- Self-starter and able to independently drive tasks to completion.
- Ability to structure and execute complex analysis, draw insights, and communicate summary conclusions/recommendations to senior management and AMD customers/partners.
- Ability to network, build relationships, and collaborate to drive effective decision-making across multiple functions and levels within AMD.
ACADEMIC CREDENTIALS:
- Bachelor’s or Master’s degree in Electrical/Electronics Engineering (EE) or a related field.
LOCATION:
Singapore