159 SRE jobs in Singapore
Site Reliability Engineer (SRE)
Posted today
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Percept Solutions
Design and implementation of new solutions as well as enhancement and integration of existing ones to ensure proactive monitoring
Collaboration with internal teams and vendors to identify, monitor and improve Service Level Objectives and Indicators
Support for incident management, investigation, resolution and post-mortem
Performance monitoring and capacity management
Automate manual operational tasks for self-healing
Administration to provide operational support for monitoring tools
Deployment and patching
System configuration
User access management
Incident management and investigation
Report and Dashboard generation
Job Requirements
SRE and automation tools like Ansible, Jenkins
Monitoring solutions such as Zabbix, Dynatrace, CloudWatch, eG, SolarWinds
Dashboard visualization such as Grafana
Proficient in SQL scripting for data analytics
Familiar with database technology such as Oracle, MySQL, MS SQL
Familiar with Windows, Unix, Linux OS environments
EA Licence No.:18S9405 / EA Reg. No.:R1330864
- Seniority level Mid-Senior level
- Employment type Full-time
- Job function Engineering and Information Technology
- Industries IT Services and IT Consulting
Site Reliability Engineer (SRE)
Posted 1 day ago
Job Description
We are a global dating app created to give everyone a chance at love. The sense of belonging and connectedness we get from relationships helps us survive and thrive, and we’re working to make it a little easier for people to find that. We’re inspired by the stories we hear from employees, friends, and family who have used our app to transform their lives, and you, too, can make a difference by joining us!
We are looking for a talented Senior Site Reliability Engineer to help design the future of dating. This individual will bring extensive experience in running large-scale data sources in the cloud and will be responsible for modernizing our data source handling and maintaining our core infrastructure and services on AWS.
This role will be based in Singapore and report directly to the CTO.
Responsibilities:
- Architect, develop, and maintain our core infrastructure and services on AWS, focusing on high availability, performance, and scalability.
- Specific AWS services of interest include EC2, RDS, S3, ElastiCache, CloudWatch, Redshift, OpenSearch, and VPC.
- Implement and manage continuous deployment processes to achieve seamless deployment of services with minimal downtime.
- Monitor system performance, identify bottlenecks, and apply necessary optimizations to ensure the smooth operation of our services.
- Develop and maintain automated tools for infrastructure provisioning, configuration, and deployment.
- Work closely with development teams to integrate infrastructure builds and operational best practices into the software development lifecycle.
- Conduct root cause analysis for production errors and implement strategies to prevent future occurrences.
- Manage and optimize network configurations to ensure secure and efficient data flow and access.
- Administer and maintain databases, ensuring their reliability, performance, and security.
- Lead capacity planning efforts to ensure that our infrastructure scales in line with demand while optimizing costs and maintaining performance.
- Modernize data source handling (Redshift, Postgres, RDS, etc.).
- Manage Kubernetes workloads.
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of industry experience.
- Proven experience as an SRE, DevOps Engineer, or similar role in a cloud-based environment.
- Strong expertise in AWS services and tools.
- Proficient understanding of networking principles, transport, and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
- Experience with database administration, including performance tuning, backup and recovery processes, and security management.
- Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform).
- Excellent problem-solving skills and the ability to work independently or as part of a team.
- Strong written and verbal communication: fluent in English (both written and verbal); proficiency in Chinese is a must.
- Significant experience in capacity planning and cost management within cloud environments.
- Experience with Kubernetes.
- Familiarity with Terraform for general systems maintenance.
- Experience with data sources like Redshift, Postgres (Citus, Patroni), and RDS.
- AWS SysOps Administrator Associate or AWS Solutions Architect Professional (SAP) certification.
- Experience with Spotinst for cost optimization.
- Familiarity with additional scripting languages such as Go or JavaScript.
If you're passionate about tackling big challenges and have the skills to help us shape the future of online dating, we want to hear from you!
Site Reliability Engineer (SRE) (GovTech)
Posted today
Job Description
We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.
Responsibilities:
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements:
• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft skills:
• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities that contribute to the development of a resilient and collaborative team environment are a plus.
Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.
- Seniority level Mid-Senior level
- Employment type Full-time
- Job function Engineering and Information Technology
- Industries Data Security Software Products
SRE, Leading Tech
Posted today
Job Description
Kerry Consulting is currently partnering with a leading tech organisation to hire an SRE.
As an SRE, you are responsible for designing and maintaining scalable, cloud-agnostic platforms while supporting resiliency, security, and compliance goals. You are responsible for managing operations, shared services, and on-call duties, enabling developers to move faster. Collaboration with cross-functional teams and cost optimization will be key to your day-to-day work.
Requirements
- Minimum 8 years' experience in software engineering.
- Proven experience in developing shared service platforms and infrastructure solutions.
- Working experience in running a PaaS product, Infrastructure Operations, or SRE.
- Ability to work with legacy and open-source code bases to fork and patch problems.
- Proficiency in Kubernetes, Pulumi, AWS, Golang, TypeScript, and Python.
To Apply
To apply, click on the "Quick Apply" button above. Alternatively, you could also write in with your CV to Grace Lim at quoting the above job title and reference code 34110.
Registration No: R1988923
License No: 16S8060
- Seniority level Entry level
- Employment type Full-time
- Job function Information Technology
- Industries Technology, Information and Internet
Junior/Senior SRE
Posted 11 days ago
Job Description
Site Reliability Engineer (SRE)
Responsibilities:
Assist in deploying and managing microservices on Kubernetes cloud platforms.
Work with Cloud and DevOps teams to deploy services across multiple cloud providers (AWS, OCI, Azure, GCP).
Conduct load and chaos testing to ensure system scalability and reliability.
Support disaster recovery planning and troubleshoot production issues.
Automate processes using Python, Go, or Bash.
Define and maintain KPIs (SLA/SLO/SLI) for cloud microservices.
Maintain technical documentation and ensure compliance with security standards (ISO27001, SOC2, GDPR).
Participate in incident response and post-incident analysis.
Assist in technology selection and proof-of-concept implementation.
Provide on-call support as needed.
Requirements:
Bachelor's degree in Computer Science, IT, or related field.
Minimum 3 years of experience in SRE, DevOps, or cloud operations.
Proficiency in a backend language.
Understanding of cloud security and best practices.
Strong problem-solving and teamwork skills.
Cloud certifications (AWS, Azure, GCP).
Experience with Kubernetes and container orchestration.
About Us
Dada Consultants was established in 2017 with a commitment to providing the best recruitment services in Singapore. We are a dynamic head-hunting team dedicated to sourcing highly competent professionals in the IT industry. We provide enterprises with customized talent solutions and help talent advance their careers.
EA Registration Number: R25128548
Staff Engineer - Infrastructure/SRE
Posted today
Job Description
OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.
Who We Are
At OKX, we believe that the future will be reshaped by crypto, and ultimately contribute to every individual's freedom. OKX is a leading crypto exchange, and the developer of OKX Wallet, giving millions access to crypto trading and decentralized crypto applications (dApps). OKX is also a trusted brand by hundreds of large institutions seeking access to crypto markets. We are safe and reliable, backed by our Proof of Reserves. Across our multiple offices globally, we are united by our core principles: We Before Me, Do the Right Thing, and Get Things Done. These shared values drive our culture, shape our processes, and foster a friendly, rewarding, and diverse environment for every OK-er. OKX is part of OKG, a group that brings the value of Blockchain to users around the world, through our leading products OKX, OKX Wallet, OKLink and more.
About the Team
The Service Reliability Engineering team aims to make service stability one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to preemptively address more stability issues, improving user experience.
What You’ll Be Doing
- Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
- Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.
- Ensure the stability of, and optimize, big data platforms (Alibaba Cloud DataWorks, AWS EMR, Databricks on AWS, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, ClickHouse, StarRocks, etc.).
- Comprehend network architecture and security, providing guidance on infrastructure stability based on network architecture and security layers, ensuring secure, stable, and efficient network communications.
- Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
- Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
- Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
- Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.
- Bachelor's degree or above in Computer Science or related field, with 8+ years of experience in large-scale internet or cloud computing platform development/SRE/operations.
- In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.
- Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.
- Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.
- Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like tcpdump, traceroute, and netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.
- Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.
- Practitioners with experience in service governance system construction, architecture optimization, stability assurance construction, capacity management, activity support, and chaos engineering are preferred.
- Proficiency in both Mandarin and English is preferred for communication with local and global stakeholders.
- Competitive total compensation package
- L&D programs and Education subsidy for employees' growth and development
- Various team building programs and company events
- Wellness and meal allowances
- Comprehensive healthcare schemes for employees and dependants
- More that we would love to tell you about during the process!
Staff Engineer - Infrastructure/SRE
Posted today
Job Viewed
Job Description
OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.
Who We AreAt OKX, we believe that the future will be reshaped by crypto, and ultimately contribute to every individual's freedom. OKX is a leading crypto exchange, and the developer of OKX Wallet, giving millions access to crypto trading and decentralized crypto applications (dApps). OKX is also a trusted brand by hundreds of large institutions seeking access to crypto markets. We are safe and reliable, backed by our Proof of Reserves. Across our multiple offices globally, we are united by our core principles: We Before Me , Do the Right Thing , and Get Things Done . These shared values drive our culture, shape our processes, and foster a friendly, rewarding, and diverse environment for every OK-er. OKX is part of OKG, a group that brings the value of Blockchain to users around the world, through our leading products OKX, OKX Wallet, OKLink and more.
About the TeamThe Service Reliability Engineering team envisions ensuring service stability as one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to preemptively address more stability issues, improving user experience.
What You’ll Be Doing- Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
- Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.
- Ensure stability and optimize big data platforms (Alibaba Cloud DataWorks, AWS EMR, AWS DataBricks, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, Clickhouse, StarRocks, etc.).
- Comprehend network architecture and security, providing guidance on infrastructure stability based on network architecture and security layers, ensuring secure, stable, and efficient network communications.
- Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
- Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
- Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
- Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.
Bachelor's degree or above in Computer Science or related field, with 8+ years of experience in large-scale internet or cloud computing platform development/SRE/operations.
In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.
Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.
Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.
Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like TcpDump, TraceRoute, Netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.
Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.
Practitioners with experience in service governance system construction, architecture optimization, stability assurance construction, capacity management, activity support, and chaos engineering are preferred.
Proficiency in both Mandarin and English is preferred for communication with local and global stakeholders.
L&D programs and Education subsidy for employees' growth and development.
Various team building programs and company events.
Wellness and meal allowances.
Comprehensive healthcare schemes for employees and dependants.
And more that we would love to tell you about during the process!
Cloud SRE Engineer - Linux
Posted today
Job Description
As Singapore’s longest established bank, we have been dedicated to enabling individuals and businesses to achieve their aspirations since 1932. How? By taking the time to truly understand people. From there, we provide support, services, solutions, and career paths that meet their individual needs and desires.
Today, we’re on a journey of transformation. Leveraging technology and creativity to become a future-ready learning organisation. But for all that change, our strategic ambition is consistently clear and bold, which is to be Asia’s leading financial services partner for a sustainable future.
We invite you to build the bank of the future. Innovate the way we deliver financial services. Work in friendly, supportive teams. Build lasting value in your community. Help people grow their assets, business, and investments. Take your learning as far as you can. Or simply enjoy a vibrant, future-ready career.
Your Opportunity Starts Here.
Why Join
Imagine being part of a team that designs and delivers a reliable, scalable, secure, and performant Red Hat Linux Platform. As a Cloud SRE Engineer, you'll have the opportunity to work with various technology teams to build and maintain a cutting-edge platform that supports the Bank's operations. You'll stay up-to-date with the latest technical trends and share your expertise with the entire Engineering/Operations organization.
How you succeed
To excel in this role, you'll need to be passionate about problem-solving and have strong analytical capabilities. You'll work collaboratively with other teams to identify and resolve issues, and drive improvements using SRE practices. You'll also be expected to participate in a 24/7 on-call rotation and contribute to toil elimination, observability, and monitoring improvements.
What you do
Design and deliver a reliable, scalable, secure, and performant Red Hat Linux Platform
Stay current on technical trends and suggest innovative tools and approaches to interesting problems
Share expertise with the entire Engineering/Operations organization
Participate in a 24/7 on-call rotation and drive improvements using SRE practices
Eliminate toil, improve observability and monitoring, and manage knowledge
Ensure error budget compliance, and contribute to deployment design and testing
Who you work with
Group Operations & Technology co-creates products and solutions. We build the underlying technology applications and services. And manage the Group's IT operations & cyber defence. 24/7. 365. End-to-end transaction fulfillment services. For the whole Group. With singular focus. Delivering exceptional customer experience. At the forefront of our digital transformation journey. Relentless innovation. Pushing boundaries. And it's all powered by serious investment in your development.
Who you are
5+ years of experience in Linux system administration and engineering
Strong analytical and problem-solving skills
Excellent communication and collaboration skills
Experience with:
Installing, maintaining, upgrading, and patching UNIX servers
Troubleshooting and fixing system and software/hardware issues
Supporting high availability of systems using clustering software
Securing systems by following published hardening guidelines
Assisting in audit and compliance tasks
Writing APIs and developing web applications
Querying relational and NoSQL databases
Automating releases, continuous integration/delivery systems, and relevant tools
Infrastructure as code (Terraform or CloudFormation)
Configuration management systems (SCCM, Ansible, Puppet, Chef)
Programming skills in at least one of the following languages: Python, PowerShell, Ruby, Java, C++, C#, Go
Who we are
Singapore's longest established bank, we've been helping people and businesses get what they want from life since 1932. How? By taking the time to truly understand people. From there, we provide support, services, solutions, and career paths that meet their individual needs and desires. Today, we're on a journey of transformation. Embracing technology and creativity to become a future-ready learning organisation. But for all that change, our purpose remains: to enable people and communities to realise their aspirations.
What we offer
Competitive base salary. A suite of holistic, flexible benefits to suit every lifestyle. Community initiatives. Industry-leading learning and professional development opportunities. Equal opportunity. Fair employment. Selection based on ability and fit with our culture and values. Your wellbeing, growth, and aspirations are every bit as cared for as the needs of our customers.
If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!