310 Incident Management jobs in Singapore
Incident Management Specialist
Posted today
Job Description
Job Description: We are seeking a highly skilled Incident Manager to join our team. The successful candidate will be responsible for managing technology incidents that impact our business operations.
Key Responsibilities:
- Manage technology incidents that impact our business operations
- Work with relevant business and technology groups to comply with incident and problem management processes and procedures
- End-to-end ownership of major incidents to minimize downtime
- Establish strong command and control of an incident, establishing clear accountability and precise evaluation of complex issue scenarios
- Participate in all incident resolution calls to facilitate incident determination, recovery, and resolution
- Timely incident recognition, logging, assignment, and resolution with proper documentation
- Incident progression coordination and monitoring of incidents and potential areas through symptoms, trends, or deviations from standards
- Escalation of critical and unresolved incidents to appropriate levels of management
- Ensure accurate capture and documentation of incident data in the incident reporting tool
- Post-incident activity to ensure highest levels of service quality and improve service levels through identification of problem trends and causes
- Ability to communicate well and manage highly stressful situations during an incident
Required Skills and Qualifications:
- Bachelor's degree in Business, Computer Science, or related discipline
- ITIL certification
- 8-10 years of experience managing complex IT initiatives in a matrix environment, or operational line management experience
- Excellent English communication skills (written and oral)
- Experience in application support; knowledge of EOD batch processing, infrastructure (storage, network, Unix/Linux), and web/application/middleware services; knowledge of payments flow is good to have
Benefits:
- Competitive salary and benefits package
- Opportunity to work with a dynamic team
- Professional development and growth opportunities
Others:
- Good knowledge of Macro, Excel, PowerPoint, ticketing tools, and data analysis
- Ability to work in a fast-paced environment and prioritize tasks effectively
- Strong analytical and problem-solving skills
Incident Management Analyst
Posted 10 days ago
Job Description
GENERAL DESCRIPTION
The Incident and Service Level Analyst is the primary IT resource to monitor incidents, track problems and ensure SLAs are correctly in place and followed. The Incident Analyst will work closely with counterparts in other locations.
KEY FEATURES OF THE POSITION
- Ensure major incidents are resolved in the shortest period of time.
- Responsible for incident, problem and service level management.
- Initiate and coordinate incidents solving/resolution activities.
- Perform incident review and make recommendations for improvement.
- Part of the Major Incident Management Team (MIM).
- Take joint responsibility for governing the end-to-end Incident and Problem Management process with cross-technology teams, ensuring all KPIs are met.
- Facilitate post-mortem and RCA tasks for high-priority incidents.
- Produce comprehensive incident and problem reports to all required audiences.
- Co-own Problem Management activities for all managed incidents.
- Identify emerging problems, both individual and at scale, and escalate them into the Problem Management queue.
- Conduct Root Cause Analysis for all escalated incidents and Problem Management tickets.
- Triage high-priority/major incidents, work with Support teams on resolution and escalate notable incidents.
- Produce management information, including KPIs and reports.
- Follow up, analyse and track Incidents and SLA breaches.
- Drive and monitor the effectiveness of the incident, problem and service level management processes.
- Perform incident trend analysis and propose recommendations to improve incident trends.
Job Requirements:
Personal and Social
- Ability to work independently or as part of a team.
- Ability to drive both teams and individuals.
- Conscientious in ensuring defined SLAs are met.
- Ability to interact and coordinate with various IT teams within a financial institution.
- Good communication and organization skills.
- Ability to analyse problems, troubleshoot, and provide short-term and long-term solutions.
- Experienced in problem-solving and troubleshooting in an IT environment.
- Ability to work under fire; handle stressful situations in a calm manner.
Professional and Technical
- At least 4 years of experience working in a bank or financial institution.
- Hands on experience in managing Service, Change and Incident management.
- Experience in managing End user Support, Applications and Infrastructure services.
- Possess good people management skills across all levels, with the ability to manage multiple support pillars to identify the root cause of an incident.
- Ability to prioritise and multitask when managing incidents with multiple layers. Candidate should have in-depth knowledge of regulatory TRM guidelines and of ITIL concepts.
- Candidates with ServiceNow tool experience preferred but not mandatory.
Interested candidates please email your latest resume to
Event & Incident Management Engineer
Posted today
Job Description
Event & Incident Management Engineer
Location: Singapore | Shift: 24x7 Rotational | Team: Application Production Support & Reliability Engineering
Join our Event & Incident Management (EIM) team and play a critical role in keeping Bank of America’s core systems running smoothly. As part of our global APSRE team, you’ll be on the frontline of reliability, monitoring high-impact systems, proactively identifying issues, and driving rapid resolution to protect millions of customers worldwide.
Why Join Us
- High Impact → You’ll safeguard mission-critical banking systems used by millions globally.
- Cutting-Edge Tech → Work with leading monitoring tools like Splunk, Dynatrace, Netcool, and SiteScope.
- Global Exposure → Collaborate with worldwide teams across infrastructure, applications, and business domains.
- Career Growth → Build expertise in ITIL/ITSM, SRE practices, AI frameworks, and cloud technologies.
- Real-Time Monitoring → Monitor infrastructure, databases, middleware, storage, backups, and application performance across critical banking services.
- Proactive Issue Resolution → Detect, triage, and resolve performance issues before they impact customers.
- Incident Response & Escalation → Lead technical escalations, coordinate across global teams, and drive rapid restoration of services.
- Stakeholder Communication → Keep partners informed on significant progress, risks, and challenges during incident resolution.
- Tooling & Automation → Leverage monitoring platforms (Splunk, Dynatrace, Netcool, SiteScope) and contribute ideas to improve detection and response automation.
- Shift-Based Operations → Work in a 24x7 rotational environment to ensure continuous availability of banking services.
- Experience: 3+ years in IT production support, infrastructure monitoring, or incident management roles.
- Tech Skills:
  - Monitoring tools: Splunk, Dynatrace, SiteScope, Tivoli Netcool.
  - OS experience: Unix/Linux, Windows Server.
  - Understanding of Java Virtual Machine behavior and troubleshooting.
- Frameworks: Familiarity with ITIL/ITSM practices (certification is a plus).
- Cloud & Automation: Exposure to cloud technologies and ServiceNow workflows is a plus.
- Soft Skills:
  - Strong communication and stakeholder management abilities.
  - Proactive, detail-oriented, and comfortable working in a high-pressure, global environment.
- Familiarity with AI frameworks, models, and tools for incident correlation and prediction.
- Exposure to large-scale financial environments or mission-critical banking systems.
In this role, you are the first line of defense ensuring our customers, partners, and businesses stay connected 24/7. You’ll gain hands-on experience with some of the most complex IT ecosystems in banking while shaping next-generation operational reliability.
Don't miss out on this chance to be a part of a dynamic and growing team. Take the Next Step in your career journey with us!
- To apply, please submit your updated resume along with your notice period, current salary package details, including base salary, incentives, annual wage supplement, and expected salary.
- Click on the 'Apply here' button to drop your resume directly or email it to .
- Our team will review all applications and contact shortlisted candidates for further steps in the selection process.
Susmita Sahu
EA License No: 91C2918
Personnel Registration Number: R23114076
IT Incident Management Analyst (Shift)
Posted today
Job Description
Assurity Trusted Solutions (ATS) is a wholly owned subsidiary of the Government Technology Agency (GovTech). As a Trusted Partner over the last decade, ATS offers a comprehensive suite of products and services ranging from infrastructure and operational services, authentication services, governance and assurance services as well as managed processes. In a dynamic digital and cyber landscape, where trust & collaboration are key, ATS continues to drive mutually beneficial business outcomes through collaboration with GovTech, government agencies and commercial partners to mitigate cyber risks and bolster security postures.
We are looking for an IT Incident Management Analyst to join us!
A brief summary of your job responsibility:
Perform 24/7 threats and events monitoring for various domains and notify relevant stakeholders if needed
Support operation and emergency planning and preparedness with relevant authorities
Conduct fact finding on incident occurrence, impact assessment and severity rating, and triage with relevant agencies if needed
Correlate information from multiple sources to detect any anomaly of incident reporting and response
Monitor and log the incident fact finding and investigation, and the status progress
Inform relevant internal and external stakeholders, manage timely investigation and updates, and escalate investigation and post-incident reporting if needed
Prepare investigation reports and periodic updates with relevant stakeholders
Requirements
Preferably 2-5 years of relevant experience in incident management, investigation and report writing
Preferably familiar with IT/Info/Data/Cyber security or ICT incident management best practices
Able to work independently and contribute as a team player
Possess good communication and interpersonal skills
Join us and discover a meaningful and exciting career with Assurity Trusted Solutions!
The remuneration package will be commensurate with your qualifications and experience. Interested applicants, please click "Apply Now".
We thank you for your interest and please note that only shortlisted candidates will be notified.
By submitting your application, you agree that your personal data may be collected, used and disclosed by Assurity Trusted Solutions Pte. Ltd. (ATS), GovTech and their service providers and agents in accordance with ATS's privacy statement which can be found at: or such other successor site.
Benefits
A wholly-owned subsidiary of GovTech
We promote a learning culture and encourage you to grow and learn
Assistant Director - Crisis & Incident Management
Posted 10 days ago
Job Description
Roles and Responsibilities:
- Lead a team of Major Incident Managers, Problem Managers and Change Managers.
- Lead and oversee major incidents (severity 1 & 2) for all IT systems to ensure timely recovery of services.
- Ensure closure of incident & problem tickets, meeting the agreed SLAs.
- Manage major incident communications: activate the War Room and Conference Bridge for triage, send incident broadcast communications to all Synapxe stakeholders, and provide regular updates via the incident tracking dashboard until incident closure.
- Collaborate across multiple internal teams to restore services when incidents occur, and gather the required experts to perform root cause analysis for problem resolution.
- Oversee the Problem Management process to produce reports on Root Cause Analysis, SLA measurement and/or performance of incident & problem management and present to Synapxe management when required.
- Oversee the Change Management process and ensure flawless execution of the process.
- Drive the performance of the IT IPC contract/service provider for IT Operations, ensure Service Level Agreements are fulfilled, and establish improvement plans.
- Drive the Major Change Review meetings to deconflict major changes.
- Work closely with the monitoring teams to ensure potential major incidents are arrested at an early stage prior to impact.
- Work closely with the Disaster Recovery team during yearly exercises and any real time invocation of DR services.
- Maintain processes, templates and SOP, website and support information related to Incident, Problem & Change Management and manage relevant ad-hoc duties.
Requirements / Qualifications
- B.S. in Computer Science or related diploma/degree with min 10 years’ experience
- Familiarity with ITIL framework & methodologies.
- Demonstrable experience and capability to interact with senior management.
- Welcome new challenges, understand the sense of urgency and be able to manage different priorities.
- Uses best practices and knowledge of internal or external business issues to improve products or services
- Exercises judgment within defined procedures, practices and policies to obtain solutions
- Experience working in an infrastructure technology environment highly desirable
---
Morgan McKinley Pte Ltd
May Thinzar Khine
EA Licence No: 11C5502
EA Registration No: R22110157
Manager, Incident Response & Management
Posted 2 days ago
Job Description
Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Response team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s API reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is highly skilled in incident troubleshooting, program management, incident classifications, incident communications, incident escalation and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
This position entails leading and optimizing Stripe's incident management processes and automation, ensuring efficiency and adherence to stringent incident response metrics. As the head of the incident response team, you will establish and maintain a best-in-class incident response framework, upholding the reliability standards expected of Stripe. Responsibilities include but are not limited to incident classification, escalation, and notification management, along with accountability for key incident response metrics (TTx). You will generate actionable insights to drive continuous improvement, collaborating with engineering leadership to refine incident detection, response, user communication, and tooling efficacy. Leadership and development of a highly effective 24/7 global incident response management team, characterized by urgency, programmatic ownership of incidents and communications, and the capacity to engage engineering teams, are crucial. Additionally, you will manage incident communications across multiple channels for executive and end-user audiences, and identify automation opportunities to streamline incident response workflows, thereby safeguarding users and minimizing disruption to their operations.
Responsibilities
- Lead the global 24/7 team of regional managers and incident response managers with ability to be hands-on and support frontline on-call with speed, cross-functional collaboration and escalation
- Develop and own Stripe's incident response and management strategy and cross-functional roadmap, ensuring it aligns with the company's reputation for reliability.
- Spearhead and manage Stripe's AI-First strategy for automation of incident response workflows, partnering with the engineering team to implement required tooling enhancements.
- Enhance Stripe's incident response by leading and implementing improvements derived from analyzing user-facing incidents and extracting actionable insights and learnings.
- Collaborate closely with executive leadership, engineering, and operations teams to lead significant programs and reshape workflows and metrics concerning reliability and incident operations.
- Manage relevant TTx metrics, particularly those related to communication and escalation. Collaborate with engineering leadership to implement necessary improvements for each metric.
- Develop user-focused metrics and data to guide Stripe's incident response, reliability strategy, and user communications (including RCAs), ensuring impactful decision-making.
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
- 5+ years of management experience, including 2+ years of experience managing managers with a proven record in building, growing and transforming teams.
- Extensive experience (4+ years) leading incident response for complex, large-scale distributed services with high SLOs/SLAs, coupled with deep expertise in crisis management.
- Demonstrated ability to lead, influence other leaders and deliver complex strategic projects involving multiple stakeholders
- Strong analytical skills, and the ability to use data to drive business decisions
- Possesses proficiency in basic incident troubleshooting and a reasonable understanding of system architecture. Fluent in using SQL, Splunk, or similar query languages.
- Exceptional communication abilities, capable of adapting incident updates for diverse audiences (executives, external users, internal teams).
- Affinity for a fast paced work environment, crafting strategic and rapid fixes to high intensity problems with a keen eye for detail and a high bar for quality
- Comfort navigating ambiguity, while identifying areas for process improvement and establishing best practices
- Experience managing geographically dispersed teams
- Experience using infrastructure and application monitoring tools such as Prometheus, Sentry and others
- Experience in incident response at a high-growth technology company, preferably within the payments or e-commerce sectors.
- Proven ability to apply Agentic and Generative AI to revolutionize incident response, coupled with a strong grasp of current industry trends in the incident response domain.
- Demonstrated history of driving engineering and process enhancements to improve incident response efficiency within a rapidly expanding technology organization.
Office-assigned Stripes spend at least 50% of the time in a given month in their local office or with users. This hits a balance between bringing people together for in-person collaboration and learning from each other, while supporting flexibility about how to do this in a way that makes sense for individuals and their teams.
The annual salary range for this role in the primary location is S$208,000 - S$312,000. This range may change if you are hired in another location. For sales roles, the range provided is the role’s On Target Earnings (“OTE”) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range may be inclusive of several career levels at Stripe and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and specific location. Applicants interested in this role and who are not located in the primary location may request the annual salary range for their location during the interview process.
Specific benefits and details about what compensation is included in the salary range listed above will vary depending on the applicant’s location and can be discussed in more detail during the interview process. Benefits/additional compensation for this role may include: equity, company bonus or sales commissions/bonuses; retirement plans; health benefits; and wellness stipends.
Office locations
Singapore
Team
Infrastructure & Corporate Tech
Job type
Full time
Manager, Incident Response & Management
Posted 7 days ago
Job Description
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Ops team is a global 24/7 team responsible for driving incident response and management of incidents from detection to resolution. Stripe is proud of its five 9s reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is skilled in program management, communications, incident handling and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
As the Manager of Incident Response Managers, you’ll evolve a world class incident response team in APAC to maintain a high bar of reliability expected of Stripe and by Stripe’s users. You’ll work hand-in-hand with regional IRM teams in AMER and EMEA to ensure solid 24/7 coverage for how we detect, respond to incidents, communicate to users, improve related tooling and measure impact. You will lead and nurture a high-performing IRM team based in APAC who has a strong sense of urgency, focused on identifying incident impact, rapidly assembling incident responders, driving incident communications, and mitigating impact as quickly as possible. As a result, you’ll be seen as the protector of our users - in minimizing the impact of incidents on their business and ensuring that Stripe is always thinking of our users.
Responsibilities
- Manage a team of frontline incident response managers
- Provide coaching and development to each team member
- Coordinate and manage incident resolution with speed, cross-functional collaboration, and accuracy, with a global and broad set of stakeholders.
- Facilitate post incident reviews to identify technical or process problems which need to be remediated
- Contribute to incident root cause analysis, identifying remediation opportunities for Incident Operations, partner teams on operations and engineering to execute upon.
- Formulate strategy and deliver on communications to both internal stakeholders and Stripe’s users.
- Collaborate with engineering and operations teams to align on and execute upon on-going improvements to processes, tooling, metrics, and the Incident Management framework.
- Influence and make decisions through interpretation of data and consolidation of input from multiple stakeholders.
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
- Have 5+ years of direct people management experience, an excellent coach
- Have 3+ years of experience within a Major Incident Management team
- Demonstrated employee and team development
- Enjoy a fast paced work environment, crafting strategic and rapid fixes to high intensity problems with a keen eye for detail and a high bar for quality
- Comfortable navigating ambiguity, while identifying areas for process improvement and establishing best practices
- Strong written and verbal communication skills, able to deliver effective messaging to all levels of a technical organization
- Can problem solve and translate complicated technical issues into solutions, while keeping a users-first mindset
- Have an ability to execute on and deliver complex operational projects involving multiple stakeholders especially in partnering with engineering
- Have technical background, are proficient in SQL, Splunk, or equivalent query languages and the ability to use data to drive business decisions based on analytical research
- Experience using infrastructure and application monitoring tools such as Signalfx, Prometheus, Sentry, Grafana and others
- Experience at a high-growth technology company, especially within the payments or e-commerce space in particular for incident response
- Experience working with both cloud and third-party solution providers
- Experience with managing user-facing communications strategy during sensitive situations such as outages
Hybrid work at Stripe
Office-assigned Stripes spend at least 50% of the time in a given month in their local office or with users. This hits a balance between bringing people together for in-person collaboration and learning from each other, while supporting flexibility about how to do this in a way that makes sense for individuals and their teams.
Manager, Incident Response & Management
Posted 11 days ago
Job Viewed
Job Description
Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the teamThe Incident Response team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s API reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is highly skilled in incident troubleshooting, program management, incident classifications, incident communications, incident escalation and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll doThis position entails leading and optimizing Stripe's incident management processes and automation, ensuring efficiency and adherence to stringent incident response metrics. As the head of the incident response team, you will establish and maintain a best-in-class incident response framework, upholding the reliability standards expected of Stripe. Responsibilities include but are not limited to incident classification, escalation, and notification management, along with accountability for key incident response metrics (TTx). You will generate actionable insights to drive continuous improvement, collaborating with engineering leadership to refine incident detection, response, user communication, and tooling efficacy. Leadership and development of a highly effective 24/7 global incident response management team, characterized by urgency, programmatic ownership of incidents and communications, and the capacity to engage engineering teams, are crucial. Additionally, you will manage incident communications across multiple channels for executive and end-user audiences, and identify automation opportunities to streamline incident response workflows, thereby safeguarding users and minimizing disruption to their operations.
Responsibilities- Lead the global 24/7 team of regional managers and incident response managers with ability to be hands-on and support frontline on-call with speed, cross-functional collaboration and escalation
- Develop and own Stripe's incident response and management strategy and cross-functional roadmap, ensuring it aligns with the company's reputation for reliability.
- Spearhead and manage Stripe's AI-First strategy for automation of incident response workflows, partnering with the engineering team to implement required tooling enhancements.
- Enhance Stripe's incident response by leading and implementing improvements derived from analyzing user-facing incidents and extracting actionable insights and learnings.
- Collaborate closely with executive leadership, engineering, and operations teams to lead significant programs and reshape workflows and metrics concerning reliability and incident operations.
- Manage relevant TTx metrics, particularly those related to communication and escalation. Collaborate with engineering leadership to implement necessary improvements for each metric.
- Develop user-focused metrics and data to guide Stripe's incident response, reliability strategy, and user communications (including RCAs), ensuring impactful decision-making.
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements- 5+ years of management experience, including 2+ years of experience managing managers with a proven record in building, growing and transforming teams.
- Extensive experience (4+ years) leading incident response for complex, large-scale distributed services with high SLOs/SLAs, coupled with deep expertise in crisis management.
- Demonstrated ability to lead, influence other leaders and deliver complex strategic projects involving multiple stakeholders
- Strong analytical skills, and the ability to use data to drive business decisions
- Possesses proficiency in basic incident troubleshooting and a reasonable understanding of system architecture. Fluent in using SQL, Splunk, or similar query languages.
- Exceptional communication abilities, capable of adapting incident updates for diverse audiences (executives, external users, internal teams).
- Affinity for a fast-paced work environment, crafting strategic and rapid fixes to high-intensity problems with a keen eye for detail and a high bar for quality
- Comfort navigating ambiguity, while identifying areas for process improvement and establishing best practices
- Experience managing geographically dispersed teams
- Experience using infrastructure and application monitoring tools such as Prometheus, Sentry and others
- Experience in incident response at a high-growth technology company, preferably within the payments or e-commerce sectors.
- Proven ability to apply Agentic and Generative AI to revolutionize incident response, coupled with a strong grasp of current industry trends in the incident response domain.
- Demonstrated history of driving engineering and process enhancements to improve incident response efficiency within a rapidly expanding technology organization.
Manager, Incident Response & Management
Posted 11 days ago
Job Description
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Response team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s API reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is highly skilled in incident troubleshooting, program management, incident classifications, incident communications, incident escalation and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
This position entails leading and optimizing Stripe's incident management processes and automation, ensuring efficiency and adherence to stringent incident response metrics. As the head of the incident response team, you will establish and maintain a best-in-class incident response framework, upholding the reliability standards expected of Stripe. Responsibilities include but are not limited to incident classification, escalation, and notification management, along with accountability for key incident response metrics (TTx). You will generate actionable insights to drive continuous improvement, collaborating with engineering leadership to refine incident detection, response, user communication, and tooling efficacy. Leadership and development of a highly effective 24/7 global incident response management team, characterized by urgency, programmatic ownership of incidents and communications, and the capacity to engage engineering teams, are crucial. Additionally, you will manage incident communications across multiple channels for executive and end-user audiences, and identify automation opportunities to streamline incident response workflows, thereby safeguarding users and minimizing disruption to their operations.
Responsibilities
- Lead the global 24/7 team of regional managers and incident response managers, with the ability to be hands-on and support frontline on-call with speed, cross-functional collaboration and escalation.
- Develop and own Stripe's incident response and management strategy and cross-functional roadmap, ensuring it aligns with the company's reputation for reliability.
- Spearhead and manage Stripe's AI-First strategy for automation of incident response workflows, partnering with the engineering team to implement required tooling enhancements.
- Enhance Stripe's incident response by leading and implementing improvements derived from analyzing user-facing incidents and extracting actionable insights and learnings.
- Collaborate closely with executive leadership, engineering, and operations teams to lead significant programs and reshape workflows and metrics concerning reliability and incident operations.
- Manage relevant TTx metrics, particularly those related to communication and escalation. Collaborate with engineering leadership to implement necessary improvements for each metric.
- Develop user-focused metrics and data to guide Stripe's incident response, reliability strategy, and user communications (including RCAs), ensuring impactful decision-making.
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
- 5+ years of management experience, including 2+ years of experience managing managers with a proven record in building, growing and transforming teams.
- Extensive experience (4+ years) leading incident response for complex, large-scale distributed services with high SLOs/SLAs, coupled with deep expertise in crisis management.
- Demonstrated ability to lead, influence other leaders and deliver complex strategic projects involving multiple stakeholders
- Strong analytical skills, and the ability to use data to drive business decisions
- Possesses proficiency in basic incident troubleshooting and a reasonable understanding of system architecture. Fluent in using SQL, Splunk, or similar query languages.
- Exceptional communication abilities, capable of adapting incident updates for diverse audiences (executives, external users, internal teams).
- Affinity for a fast-paced work environment, crafting strategic and rapid fixes to high-intensity problems with a keen eye for detail and a high bar for quality
- Comfort navigating ambiguity, while identifying areas for process improvement and establishing best practices
- Experience managing geographically dispersed teams
- Experience using infrastructure and application monitoring tools such as Prometheus, Sentry and others
- Experience in incident response at a high-growth technology company, preferably within the payments or e-commerce sectors.
- Proven ability to apply Agentic and Generative AI to revolutionize incident response, coupled with a strong grasp of current industry trends in the incident response domain.
- Demonstrated history of driving engineering and process enhancements to improve incident response efficiency within a rapidly expanding technology organization.
The annual salary range for this role in the primary location is S$208,000 - S$312,000. This range may change if you are hired in another location. For sales roles, the range provided is the role’s On Target Earnings (“OTE”) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range may be inclusive of several career levels at Stripe and will be narrowed during the interview process based on a number of factors, including the candidate’s experience, qualifications, and specific location. Applicants interested in this role and who are not located in the primary location may request the annual salary range for their location during the interview process.
Specific benefits and details about what compensation is included in the salary range listed above will vary depending on the applicant’s location and can be discussed in more detail during the interview process. Benefits/additional compensation for this role may include: equity, company bonus or sales commissions/bonuses; retirement plans; health benefits; and wellness stipends.
At Stripe, we're looking for people with passion, grit, and integrity. You're encouraged to apply even if your experience doesn't precisely match the job description. Your skills and passion will stand out—and set you apart—especially if your career has taken some extraordinary twists and turns. At Stripe, we welcome diverse perspectives and people who think rigorously and aren't afraid to challenge assumptions. Join us.
Manager, Incident Response & Management
Posted today
Job Description
Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Ops team is a global 24/7 team responsible for driving incident response and management of incidents from detection to resolution. Stripe is proud of its five 9s reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRM) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is skilled in program management, communications, incident handling and technical adeptness as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
As the Manager of Incident Response Managers, you’ll evolve a world-class incident response team in APAC to maintain the high bar of reliability expected of Stripe and by Stripe’s users. You’ll work hand-in-hand with regional IRM teams in AMER and EMEA to ensure solid 24/7 coverage for how we detect and respond to incidents, communicate to users, improve related tooling and measure impact. You will lead and nurture a high-performing IRM team based in APAC that has a strong sense of urgency and is focused on identifying incident impact, rapidly assembling incident responders, driving incident communications, and mitigating impact as quickly as possible. As a result, you’ll be seen as the protector of our users, minimizing the impact of incidents on their business and ensuring that Stripe is always thinking of our users.
Responsibilities
- Manage a team of frontline incident response managers
- Provide coaching and development to each team member
- Coordinate and manage incident resolution with speed, cross-functional collaboration, and accuracy, with a global and broad set of stakeholders.
- Facilitate post incident reviews to identify technical or process problems which need to be remediated
- Contribute to incident root cause analysis, identifying remediation opportunities for Incident Operations and for partner teams in operations and engineering to execute.
- Formulate strategy and deliver on communications to both internal stakeholders and Stripe’s users.
- Collaborate with engineering and operations teams to align on and execute upon on-going improvements to processes, tooling, metrics, and the Incident Management framework.
- Influence and make decisions through interpretation of data and consolidation of input from multiple stakeholders.
Who you are
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
- Have 5+ years of direct people management experience and are an excellent coach
- Have 3+ years of experience within a Major Incident Management team
- Demonstrated employee and team development
- Enjoy a fast-paced work environment, crafting strategic and rapid fixes to high-intensity problems with a keen eye for detail and a high bar for quality
- Comfortable navigating ambiguity, while identifying areas for process improvement and establishing best practices
- Strong written and verbal communication skills, able to deliver effective messaging to all levels of a technical organization
- Can problem solve and translate complicated technical issues into solutions, while keeping a users-first mindset
- Have an ability to execute on and deliver complex operational projects involving multiple stakeholders especially in partnering with engineering
Preferred qualifications
- Have a technical background, are proficient in SQL, Splunk, or equivalent query languages, and can use data to drive business decisions based on analytical research
- Experience using infrastructure and application monitoring tools such as Signalfx, Prometheus, Sentry, Grafana and others
- Experience at a high-growth technology company, especially in incident response within the payments or e-commerce space
- Experience working with both cloud and third-party solution providers
- Experience with managing user-facing communications strategy during sensitive situations such as outages
Hybrid work at Stripe
Office-assigned Stripes spend at least 50% of the time in a given month in their local office or with users. This strikes a balance between bringing people together for in-person collaboration and learning from each other, while supporting flexibility about how to do this in a way that makes sense for individuals and their teams.