1,208 Senior SRE jobs in Singapore
Site Reliability Engineer (SRE)
Posted 23 days ago
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Sea.
Responsibilities
- Develop and maintain scripts to retrieve and process data from Google Workspace and Zoom, including users, groups, meeting rooms, licenses, activity logs, and configurations.
- Normalize and structure data for analysis, reporting, and alerting.
- Build automated alerting systems to identify anomalies, policy violations, or operational issues in Workspace or Zoom environments.
- Design workflows to automate tasks such as account cleanup, license management, and configuration enforcement.
- Build a secure internal web platform to standardize administrative actions, including reporting, dashboards, and visualizations.
- Collaborate with IT Services and Support teams to prioritize features and gather automation requirements.
- Implement Git workflows for collaboration and maintain well-documented code.
- Deploy tools in containerized environments like Docker and support infrastructure such as databases and authentication mechanisms.
Requirements
- 3–5 years of experience in software development, automation, or internal systems.
- Proficiency in scripting/backend languages like Python, Node.js, or Go.
- Experience with APIs such as Google Workspace Admin SDK, Zoom API, GAM.
- Familiarity with Git and collaborative workflows.
- Strong problem-solving skills and ability to work independently.
Additional Details
- Seniority level: Entry level
- Employment type: Full-time
- Job function: Information Technology
- Industries: Technology, Internet
Site Reliability Engineer (SRE)
Posted 23 days ago
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Percept Solutions
Design and implement new solutions, and enhance and integrate existing ones, to ensure proactive monitoring
Work with internal teams and vendors to identify, monitor and improve Service Level Objectives and Indicators
Support for incident management, investigation, resolution and post-mortem
Performance monitoring and capacity management
Automate manual operational tasks for self-healing
Administration to provide operational support for monitoring tools
Deployment and patching
System configuration
User access management
Incident management and investigation
Report and Dashboard generation
Job Requirements
SRE and automation tools like Ansible, Jenkins
Monitoring solutions such as Zabbix, Dynatrace, CloudWatch, eG, SolarWinds
Dashboard visualization such as Grafana
Proficient in SQL Scripting for data analytics
Familiar with database technology such as Oracle, MySQL, MS SQL
Familiar with Windows, Unix, Linux OS environments
EA Licence No.:18S9405 / EA Reg. No.:R
- Seniority level: Mid-Senior level
- Employment type: Full-time
- Job function: Engineering and Information Technology
- Industries: IT Services and IT Consulting
Site Reliability Engineer (SRE)
Posted today
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Sea.
Responsibilities
Develop and maintain scripts to retrieve and process data from Google Workspace and Zoom, including users, groups, meeting rooms, licenses, activity logs, and configurations.
Normalize and structure data for analysis, reporting, and alerting.
Build automated alerting systems to identify anomalies, policy violations, or operational issues in Workspace or Zoom environments.
Design workflows to automate tasks such as account cleanup, license management, and configuration enforcement.
Build a secure internal web platform to standardize administrative actions, including reporting, dashboards, and visualizations.
Collaborate with IT Services and Support teams to prioritize features and gather automation requirements.
Implement Git workflows for collaboration and maintain well-documented code.
Deploy tools in containerized environments like Docker and support infrastructure such as databases and authentication mechanisms.
Requirements
3–5 years of experience in software development, automation, or internal systems.
Proficiency in scripting/backend languages like Python, Node.js, or Go.
Experience with APIs such as Google Workspace Admin SDK, Zoom API, GAM.
Familiarity with Git and collaborative workflows.
Strong problem-solving skills and ability to work independently.
Additional Details
Seniority level: Entry level
Employment type: Full-time
Job function: Information Technology
Industries: Technology, Internet
Site Reliability Engineer (SRE)
Posted today
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Percept Solutions
Design and implement new solutions, and enhance and integrate existing ones, to ensure proactive monitoring
Work with internal teams and vendors to identify, monitor and improve Service Level Objectives and Indicators
Support for incident management, investigation, resolution and post-mortem
Performance monitoring and capacity management
Automate manual operational tasks for self-healing
Administration to provide operational support for monitoring tools
Deployment and patching
System configuration
User access management
Incident management and investigation
Report and Dashboard generation
Job Requirements
SRE and automation tools like Ansible, Jenkins
Monitoring solutions such as Zabbix, Dynatrace, CloudWatch, eG, SolarWinds
Dashboard visualization such as Grafana
Proficient in SQL Scripting for data analytics
Familiar with database technology such as Oracle, MySQL, MS SQL
Familiar with Windows, Unix, Linux OS environments
EA Licence No.:18S9405 / EA Reg. No.:R
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting
Site Reliability Engineer (SRE)
Posted 2 days ago
Job Description
We are a global dating app created to give everyone a chance at love. The sense of belonging and connectedness we get from relationships helps us survive and thrive, and we’re working to make it a little easier for people to find that. We’re inspired by the stories we hear from employees, friends, and family who have used our app to transform their lives, and you, too, can make a difference by joining us!
We are looking for a talented Senior Site Reliability Engineer to help design the future of dating. This individual will bring extensive experience in running large-scale data sources in the cloud and will be responsible for modernizing our data source handling and maintaining our core infrastructure and services on AWS.
This role will be based in Singapore and report directly to the CTO.
Responsibilities:
- Architect, develop, and maintain our core infrastructure and services on AWS, focusing on high availability, performance, and scalability.
- Specific AWS services of interest include EC2, RDS, S3, ElastiCache, CloudWatch, Redshift, OpenSearch, and VPC.
- Implement and manage continuous deployment processes to achieve seamless deployment of services with minimal downtime.
- Monitor system performance, identify bottlenecks, and apply necessary optimizations to ensure the smooth operation of our services.
- Develop and maintain automated tools for infrastructure provisioning, configuration, and deployment.
- Work closely with development teams to integrate infrastructure builds and operational best practices into the software development lifecycle.
- Conduct root cause analysis for production errors and implement strategies to prevent future occurrences.
- Manage and optimize network configurations to ensure secure and efficient data flow and access.
- Administer and maintain databases, ensuring their reliability, performance, and security.
- Lead capacity planning efforts to ensure that our infrastructure scales in line with demand while optimizing costs and maintaining performance.
- Modernize data source handling (Redshift, Postgres, RDS, etc.).
- Manage Kubernetes workloads.
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of industry experience.
- Proven experience as an SRE, DevOps Engineer, or similar role in a cloud-based environment.
- Strong expertise in AWS services and tools.
- Proficient understanding of networking principles, transport, and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
- Experience with database administration, including performance tuning, backup and recovery processes, and security management.
- Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform).
- Excellent problem-solving skills and the ability to work independently or as part of a team.
- Strong written and verbal communication: fluent in English (both written and verbal); proficiency in Chinese is an advantage for working with Chinese stakeholders.
- Significant experience in capacity planning and cost management within cloud environments.
- Experience with Kubernetes.
- Familiarity with Terraform for general systems maintenance.
- Experience with data sources like Redshift, Postgres (Citus, Patroni), and RDS.
- AWS SysOps Administrator Associate or AWS Solutions Architect Professional (SAP) certification.
- Experience with Spotinst for cost optimization.
- Familiarity with additional scripting languages such as Go or JavaScript.
If you're passionate about tackling big challenges and have the skills to help us shape the future of online dating, we want to hear from you!
Site Reliability Engineer (SRE) (GovTech)
Posted 15 days ago
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
- Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
- Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
- Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
- Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
- Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
- Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
- Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
- Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
- Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
- Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
- Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
- Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
- Strong documentation skills and experience in knowledge sharing across teams.
- Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
- Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
- Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
- Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
- Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
- Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
- Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
- Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
- Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft Skills
- Proactive in identifying problems and recommending strategic solutions.
- Excellent problem-solving skills with a robust analytical mindset.
- Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
- Ability to remain calm and effective under pressure, especially during incident response.
- Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
- Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
- Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Data Security Software Products
Site Reliability Engineer (SRE) (GovTech)
Posted 24 days ago
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities:
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements:
• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft skills:
• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Site Reliability Engineer (SRE) (GovTech)
Posted today
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
Strong documentation skills and experience in knowledge sharing across teams.
Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft Skills
Proactive in identifying problems and recommending strategic solutions.
Excellent problem-solving skills with a robust analytical mindset.
Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
Ability to remain calm and effective under pressure, especially during incident response.
Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Data Security Software Products
Senior Site Reliability Engineer (SRE)
Posted today
Job Description
Responsibilities
Design, implement, and maintain highly available, scalable, and secure infrastructure
Develop and improve observability (monitoring, logging, alerting) across all services
Own incident response lifecycle: detection, mitigation, root cause analysis, and postmortems
Collaborate with software teams to implement SLOs, SLIs, and improve system reliability
Build and maintain CI/CD pipelines to support fast, safe deployments
Manage cloud infrastructure using infrastructure-as-code (e.g., Terraform, Pulumi)
Automate operational tasks using scripting or configuration management tools
Ensure robust backup, disaster recovery, and security controls are in place
Participate in on-call rotations and continuously improve incident response processes
Job Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or relevant Infrastructure roles
Strong hands-on experience with cloud platforms (AWS, GCP, or Azure)
Proficient in infrastructure-as-code tools (e.g., Terraform, CloudFormation)
Solid knowledge of Linux systems administration and networking fundamentals
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK)
Familiar with container orchestration tools like Kubernetes and Docker
Experience working with CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD)
Solid understanding of SRE concepts such as SLAs, SLOs, and error budgets
Strong problem-solving skills, proactive mindset, and attention to detail
Excellent communication and collaboration skills, especially in cross-functional teams
If you are passionate about technology and meet the above requirements, please don’t hesitate to apply. Please note that only shortlisted candidates will be contacted. Appreciate your understanding. Data provided is for recruitment purposes only.
Dada Consultants Pte Ltd
EA License No.: 18S9037 | EA Registration No. R | Business Registration Number: W
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Technology, Information and Media
Tech Lead (SRE) - Cloud Infrastructure
Posted 10 days ago
Job Description
ByteDance will prioritize applicants who have the right to work in Singapore without requiring sponsorship.
About ByteDance
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. Our products include TikTok, Toutiao, Douyin, and Xigua, making it easier and more fun for people to connect, consume, and create content.
Why Join Us
Creation is at the core of ByteDance's purpose. Our teams drive innovation and growth, turning challenges into opportunities to learn and improve. We foster a culture of courage, collaboration, and impact.
Team Introduction
The Site Reliability Engineering (SRE) team combines software and systems engineering to design and operate large-scale, distributed, and resilient systems.
Within TikTok's Infrastructure SRE, our focus is on ensuring the reliability and uptime of our infrastructure services, supporting rapid improvements through automation and system optimization.
The Role
As a Tech Lead, you will guide and build a team of software and system engineers, establishing efficient processes and promoting best engineering practices. You will coordinate with other teams and the user community.
Responsibilities
- Build and lead the SRE team, including recruitment, training, system operation, and fostering a strong team culture.
- Oversee software system development and organizational unit integration.
- Develop long-term technical strategies with clear milestones to enhance team capabilities.
- Guide Proof-of-Concept and solution development, ensuring security and risk considerations.
- Establish protocols for access management, configuration, disaster recovery, and fault handling.
- Create monitoring frameworks and promote automated, intelligent governance within a service-oriented architecture.
- Collaborate with development teams to ensure system reliability from design to launch, advancing automation in operations and maintenance.
- Improve communication and collaboration with business teams, refining processes and business architecture.
What you should have:
- Bachelor's Degree in Computer Science or related field, with over 5 years of professional experience, including at least 3 in R&D.
- Proficiency in Linux systems, networking, and managing large-scale distributed systems.
- Strong planning, summarization, and project management skills.
- Responsibility, proactive attitude, and problem-solving skills.
- Experience with cloud platforms is a plus; experience in large-scale storage, scheduling, big data, or intelligent operations is preferred.
ByteDance values diversity and is committed to creating an inclusive environment where employees are valued for their unique perspectives. We aim to reflect the communities we serve and foster a workplace of creativity and innovation.