159 SRE jobs in Singapore

Site Reliability Engineer (SRE)

Singapore, Singapore Percept Solutions

Posted today

Job Description

Design and implement new solutions, and enhance and integrate existing ones, to ensure proactive monitoring

Collaborate with internal teams and vendors to identify, monitor and improve Service Level Objectives and Indicators

Support for incident management, investigation, resolution and post-mortem

Performance monitoring and capacity management

Automate manual operational tasks for self-healing

Administration to provide operational support for monitoring tools

Deployment and patching

System configuration

User access management

Incident management and investigation

Report and Dashboard generation

Job Requirements

SRE and automation tools like Ansible, Jenkins

Monitoring solutions such as Zabbix, Dynatrace, CloudWatch, eG, SolarWinds

Dashboard visualization such as Grafana

Proficient in SQL Scripting for data analytics

Familiar with database technology such as Oracle, MySQL, MS SQL

Familiar with Windows, Unix, Linux OS environments

EA Licence No.:18S9405 / EA Reg. No.:R1330864

Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Engineering and Information Technology
Industries
  • IT Services and IT Consulting

Site Reliability Engineer (SRE)

068895 $7500 Monthly HELLO PLANET PTE. LTD.

Posted 1 day ago

Job Description

We are a global dating app created to give everyone a chance at love. The sense of belonging and connectedness we get from relationships helps us survive and thrive, and we’re working to make it a little easier for people to find that. We’re inspired by the stories we hear from employees, friends, and family who have used our app to transform their lives, and you, too, can make a difference by joining us!


We are looking for a talented Senior Site Reliability Engineer to help design the future of dating. This individual will bring extensive experience in running large-scale data sources in the cloud and will be responsible for modernizing our data source handling and maintaining our core infrastructure and services on AWS.

This role will be based in Singapore and report directly to the CTO.

Responsibilities:
  • Architect, develop, and maintain our core infrastructure and services on AWS, focusing on high availability, performance, and scalability.
  • Specific AWS services of interest include EC2, RDS, S3, ElastiCache, CloudWatch, Redshift, OpenSearch, and VPC.
  • Implement and manage continuous deployment processes to achieve seamless deployment of services with minimal downtime.
  • Monitor system performance, identify bottlenecks, and apply necessary optimizations to ensure the smooth operation of our services.
  • Develop and maintain automated tools for infrastructure provisioning, configuration, and deployment.
  • Work closely with development teams to integrate infrastructure builds and operational best practices into the software development lifecycle.
  • Conduct root cause analysis for production errors and implement strategies to prevent future occurrences.
  • Manage and optimize network configurations to ensure secure and efficient data flow and access.
  • Administer and maintain databases, ensuring their reliability, performance, and security.
  • Lead capacity planning efforts to ensure that our infrastructure scales in line with demand while optimizing costs and maintaining performance.
  • Modernize data source handling (Redshift, Postgres, RDS, etc.).
  • Manage Kubernetes workloads.
Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5+ years of industry experience.
  • Proven experience as an SRE, DevOps Engineer, or similar role in a cloud-based environment.
  • Strong expertise in AWS services and tools.
  • Proficient understanding of networking principles, transport, and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
  • Experience with database administration, including performance tuning, backup and recovery processes, and security management.
  • Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform).
  • Excellent problem-solving skills and the ability to work independently or as part of a team.
  • Strong Written and Verbal Communication: Fluent in English (both written and verbal); proficiency in Chinese is a must.
  • Significant experience in capacity planning and cost management within cloud environments.
  • Experience with Kubernetes.
  • Familiarity with Terraform for general systems maintenance.
  • Experience with data sources like Redshift, Postgres (Citus, Patroni), and RDS.
Preferred Qualifications:
  • AWS SysOps Administrator Associate or AWS Solutions Architect Professional (SAP) certification.
  • Experience with Spotinst for cost optimization.
  • Familiarity with additional scripting languages such as Go or JavaScript.

If you're passionate about tackling big challenges and have the skills to help us shape the future of online dating, we want to hear from you!

Site Reliability Engineer (SRE)

068895 $8500 Monthly COFFEE MEETS BAGEL PTE. LTD.

Posted 1 day ago

Job Description

We are a global dating app created to give everyone a chance at love. The sense of belonging and connectedness we get from relationships helps us survive and thrive, and we’re working to make it a little easier for people to find that. We’re inspired by the stories we hear from employees, friends, and family who have used our app to transform their lives, and you, too, can make a difference by joining us!

We are looking for a talented Senior Site Reliability Engineer to help design the future of dating. This individual will bring extensive experience in running large-scale data sources in the cloud and will be responsible for modernizing our data source handling and maintaining our core infrastructure and services on AWS.

This role will be based in Singapore and report directly to the CTO.

Responsibilities:
  • Architect, develop, and maintain our core infrastructure and services on AWS, focusing on high availability, performance, and scalability.
  • Specific AWS services of interest include EC2, RDS, S3, ElastiCache, CloudWatch, Redshift, OpenSearch, and VPC.
  • Implement and manage continuous deployment processes to achieve seamless deployment of services with minimal downtime.
  • Monitor system performance, identify bottlenecks, and apply necessary optimizations to ensure the smooth operation of our services.
  • Develop and maintain automated tools for infrastructure provisioning, configuration, and deployment.
  • Work closely with development teams to integrate infrastructure builds and operational best practices into the software development lifecycle.
  • Conduct root cause analysis for production errors and implement strategies to prevent future occurrences.
  • Manage and optimize network configurations to ensure secure and efficient data flow and access.
  • Administer and maintain databases, ensuring their reliability, performance, and security.
  • Lead capacity planning efforts to ensure that our infrastructure scales in line with demand while optimizing costs and maintaining performance.
  • Modernize data source handling (Redshift, Postgres, RDS, etc.).
  • Manage Kubernetes workloads.
Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 5+ years of industry experience.
  • Proven experience as an SRE, DevOps Engineer, or similar role in a cloud-based environment.
  • Strong expertise in AWS services and tools.
  • Proficient understanding of networking principles, transport, and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S.
  • Experience with database administration, including performance tuning, backup and recovery processes, and security management.
  • Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Terraform).
  • Excellent problem-solving skills and the ability to work independently or as part of a team.
  • Strong Written and Verbal Communication: Fluent in English (both written and verbal); proficiency in Chinese is an advantage to work with Chinese stakeholders.
  • Significant experience in capacity planning and cost management within cloud environments.
  • Experience with Kubernetes.
  • Familiarity with Terraform for general systems maintenance.
  • Experience with data sources like Redshift, Postgres (Citus, Patroni), and RDS.
Preferred Qualifications:
  • AWS SysOps Administrator Associate or AWS Solutions Architect Professional (SAP) certification.
  • Experience with Spotinst for cost optimization.
  • Familiarity with additional scripting languages such as Go or JavaScript.

If you're passionate about tackling big challenges and have the skills to help us shape the future of online dating, we want to hear from you!

Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore Avepoint

Posted today

Job Description

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities:

As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
• Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
• Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
• Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
• Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.

Maintenance, Optimisation & Performance
• Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
• Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
• Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
• Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.

Requirements:

• Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
• Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
• Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
• Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
• Strong documentation skills and experience in knowledge sharing across teams.
• Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
• Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
• Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
• Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
• Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
• Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
• Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
• Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
• Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.

Soft skills:

• Proactive in identifying problems and recommending strategic solutions.
• Excellent problem-solving skills with a robust analytical mindset.
• Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
• Ability to remain calm and effective under pressure, especially during incident response.
• Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
• Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
• Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

Site Reliability Engineer (SRE) (GovTech)

Singapore, Singapore AvePoint

Posted today

Job Description

We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.

As a Site Reliability Engineer, you will be responsible for designing and operating GitLab, AWS and Kubernetes-based infrastructure and solutions that power our platform, to ensure the stability, scalability, and performance of our runtime platform.

Responsibilities

As a Site Reliability Engineer, you will be responsible for:

Toil Reduction & Automation

  • Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
  • Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
  • Participate in on-call rotations, promptly respond to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
  • Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
  • Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g., MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
  • Act as a technical advisor for tenants, guiding them on containerization, and best practices for cloud-native deployments, and participating in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
  • Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
  • Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements
  • Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proven experience as a Site Reliability Engineer or similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
  • Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
  • Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
  • Strong documentation skills and experience in knowledge sharing across teams.
  • Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
  • Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
  • Experience integrating Kubernetes with AWS cloud technologies (e.g., Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
  • Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
  • Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
  • Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
  • Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
  • Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
  • Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft Skills
  • Proactive in identifying problems and recommending strategic solutions.
  • Excellent problem-solving skills with a robust analytical mindset.
  • Clear, concise, and effective communication skills; adept at collaborating across cross-functional teams, including development, security, and customer-facing groups.
  • Ability to remain calm and effective under pressure, especially during incident response.
  • Adaptability to rapid change with a continuous learning mindset, sharing knowledge to foster team growth.
  • Customer-focused with the ability to translate technical insights into understandable, actionable guidance.
  • Leadership and mentoring capabilities, contributing to the development of a resilient and collaborative team environment are a plus.

Any personal data you share with us during the application process will be processed strictly in compliance with applicable data protection laws and our Privacy Notice.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Data Security Software Products

SRE, Leading Tech

Singapore, Singapore Kerry Consulting

Posted today

Job Description

Kerry Consulting is currently partnering with a leading tech organisation to hire an SRE.

As an SRE, you are responsible for designing and maintaining scalable, cloud-agnostic platforms while supporting resiliency, security, and compliance goals. You are responsible for managing operations, shared services, and on-call duties, enabling developers to move faster. Collaboration with cross-functional teams and cost optimization will be key to your day-to-day work.


Requirements

  • Minimum of 8 years of experience in software engineering.
  • Proven experience in developing shared service platforms and infrastructure solutions.
  • Working experience running a PaaS product, infrastructure operations, or SRE.
  • Ability to work with legacy and open-source code bases to fork and patch problems.
  • Proficiency in Kubernetes, Pulumi, AWS, Golang, TypeScript, and Python.


To Apply

To apply, click on the "Quick Apply" button above. Alternatively, you could also write in with your CV to Grace Lim at quoting the above job title and reference code 34110.

Registration No: R1988923
License No: 16S8060

Seniority level
  • Entry level
Employment type
  • Full-time
Job function
  • Information Technology
Industries
  • Technology, Information and Internet

Junior/Senior SRE

119971 $15000 Monthly DADACONSULTANTS PTE. LTD.

Posted 11 days ago

Job Description

Site Reliability Engineer (SRE)

Responsibilities:

Assist in deploying and managing microservices on Kubernetes cloud platforms.

Work with Cloud and DevOps teams to deploy services across multiple cloud providers (AWS, OCI, Azure, GCP).

Conduct load and chaos testing to ensure system scalability and reliability.

Support disaster recovery planning and troubleshoot production issues.

Automate processes using Python, Go, or Bash.

Define and maintain KPIs (SLA/SLO/SLI) for cloud microservices.

Maintain technical documentation and ensure compliance with security standards (ISO27001, SOC2, GDPR).

Participate in incident response and post-incident analysis.

Assist in technology selection and proof-of-concept implementation.

Provide on-call support as needed.

Requirements:

Bachelor's degree in Computer Science, IT, or related field.

Minimum 3 years of experience in SRE, DevOps, or cloud operations.

Proficiency in a backend language.

Understanding of cloud security and best practices.

Strong problem-solving and teamwork skills.

Cloud certifications (AWS, Azure, GCP).

Experience with Kubernetes and container orchestration.



About Us

Dada Consultants was established in 2017 with a commitment to providing the best recruitment services in Singapore. We are a dynamic head-hunting team dedicated to sourcing highly competent professionals in the IT industry. We provide enterprises with customized talent solutions and help talent advance their careers.

EA Registration Number: R25128548

Staff Engineer - Infrastructure/SRE

Singapore, Singapore NEAR

Posted today

Job Description

OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.

Who We Are

At OKX, we believe that the future will be reshaped by crypto, and ultimately contribute to every individual's freedom. OKX is a leading crypto exchange, and the developer of OKX Wallet, giving millions access to crypto trading and decentralized crypto applications (dApps). OKX is also a trusted brand by hundreds of large institutions seeking access to crypto markets. We are safe and reliable, backed by our Proof of Reserves. Across our multiple offices globally, we are united by our core principles: We Before Me, Do the Right Thing, and Get Things Done. These shared values drive our culture, shape our processes, and foster a friendly, rewarding, and diverse environment for every OK-er. OKX is part of OKG, a group that brings the value of Blockchain to users around the world, through our leading products OKX, OKX Wallet, OKLink and more.

About the Team

The Service Reliability Engineering team envisions ensuring service stability as one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to preemptively address more stability issues, improving user experience.

What You’ll Be Doing

  • Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
  • Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.
  • Ensure stability and optimize big data platforms (Alibaba Cloud DataWorks, AWS EMR, AWS DataBricks, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, Clickhouse, StarRocks, etc.).
  • Comprehend network architecture and security, providing guidance on infrastructure stability based on network architecture and security layers, ensuring secure, stable, and efficient network communications.
  • Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
  • Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
  • Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
  • Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.

What We Look For In You

  • Bachelor's degree or above in Computer Science or related field, with 8+ years of experience in large-scale internet or cloud computing platform development/SRE/operations.
  • In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.
  • Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.
  • Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.
  • Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like TcpDump, TraceRoute, Netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.
  • Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.
  • Practitioners with experience in service governance system construction, architecture optimization, stability assurance construction, capacity management, activity support, and chaos engineering are preferred.
  • Proficiency in both the Mandarin and English language is preferred for communication with local and global stakeholders.

Perks & Benefits
  • Competitive total compensation package
  • L&D programs and Education subsidy for employees' growth and development
  • Various team building programs and company events
  • Wellness and meal allowances
  • Comprehensive healthcare schemes for employees and dependants
  • More that we would love to tell you about along the way!

Staff Engineer - Infrastructure/SRE

Singapore, Singapore OKX

Posted today

Job Description

OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.

Who We Are

At OKX, we believe that the future will be reshaped by crypto, and ultimately contribute to every individual's freedom. OKX is a leading crypto exchange, and the developer of OKX Wallet, giving millions access to crypto trading and decentralized crypto applications (dApps). OKX is also a trusted brand by hundreds of large institutions seeking access to crypto markets. We are safe and reliable, backed by our Proof of Reserves. Across our multiple offices globally, we are united by our core principles: We Before Me, Do the Right Thing, and Get Things Done. These shared values drive our culture, shape our processes, and foster a friendly, rewarding, and diverse environment for every OK-er. OKX is part of OKG, a group that brings the value of Blockchain to users around the world, through our leading products OKX, OKX Wallet, OKLink and more.

About the Team

The Service Reliability Engineering team envisions ensuring service stability as one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to preemptively address more stability issues, improving user experience.

What You’ll Be Doing
  1. Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
  2. Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.
  3. Ensure stability and optimize big data platforms (Alibaba Cloud DataWorks, AWS EMR, AWS DataBricks, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, Clickhouse, StarRocks, etc.).
  4. Comprehend network architecture and security, providing guidance on infrastructure stability based on network architecture and security layers, ensuring secure, stable, and efficient network communications.
  5. Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
  6. Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
  7. Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
  8. Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.

What We Look For In You

Bachelor's degree or above in Computer Science or a related field, with 8+ years of experience in large-scale internet or cloud computing platform development, SRE, or operations.

In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.

Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.

Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.

Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like tcpdump, traceroute, and netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.

Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.

Practitioners with experience in service governance system construction, architecture optimization, stability assurance construction, capacity management, activity support, and chaos engineering are preferred.

Proficiency in both Mandarin and English is preferred for communication with local and global stakeholders.

L&D programs and Education subsidy for employees' growth and development.

Various team building programs and company events.

Wellness and meal allowances.

Comprehensive healthcare schemes for employees and dependants.

More that we would love to tell you along the way!


Cloud SRE Engineer - Linux

OCBC Bank Berhad OCBC Al Amin Bank Berhad

Posted today

Job Description

As Singapore’s longest established bank, we have been dedicated to enabling individuals and businesses to achieve their aspirations since 1932. How? By taking the time to truly understand people. From there, we provide support, services, solutions, and career paths that meet their individual needs and desires.

Today, we’re on a journey of transformation. Leveraging technology and creativity to become a future-ready learning organisation. But for all that change, our strategic ambition is consistently clear and bold, which is to be Asia’s leading financial services partner for a sustainable future.

We invite you to build the bank of the future. Innovate the way we deliver financial services. Work in friendly, supportive teams. Build lasting value in your community. Help people grow their assets, business, and investments. Take your learning as far as you can. Or simply enjoy a vibrant, future-ready career.

Your Opportunity Starts Here.

Why Join

Imagine being part of a team that designs and delivers a reliable, scalable, secure, and performant Red Hat Linux Platform. As a Cloud SRE Engineer, you'll have the opportunity to work with various technology teams to build and maintain a cutting-edge platform that supports the Bank's operations. You'll stay up-to-date with the latest technical trends and share your expertise with the entire Engineering/Operations organization.

How you succeed

To excel in this role, you'll need to be passionate about problem-solving and have strong analytical capabilities. You'll work collaboratively with other teams to identify and resolve issues, and drive improvements using SRE practices. You'll also be expected to participate in a 24/7 on-call rotation and contribute to toil elimination, observability, and monitoring improvements.

What you do

Design and deliver a reliable, scalable, secure, and performant Red Hat Linux Platform

Stay current on technical trends and suggest innovative tools and approaches to interesting problems

Share expertise with the entire Engineering/Operations organization

Participate in a 24/7 on-call rotation and drive improvements using SRE practices

Eliminate toil, improve observability and monitoring, and manage knowledge

Ensure error budget compliance, deployment designs, and testing

Who you work with

Group Operations & Technology co-creates products and solutions. We build the underlying technology applications and services. And manage the Group's IT operations & cyber defence. 24/7. 365. End-to-end transaction fulfillment services. For the whole Group. With singular focus. Delivering exceptional customer experience. At the forefront of our digital transformation journey. Relentless innovation. Pushing boundaries. And it's all powered by serious investment in your development.

Who you are

5+ years of experience in Linux system administration and engineering

Strong analytical and problem-solving skills

Excellent communication and collaboration skills

Experience with:

Installing, maintaining, upgrading, and patching UNIX servers

Troubleshooting and fixing system and software/hardware issues

Supporting high availability of systems using clustering software

Securing systems by following published hardening guidelines

Assisting in audit and compliance tasks

Writing APIs and developing web applications

Querying relational and NoSQL databases

Automating releases, continuous integration/delivery systems, and relevant tools

Infrastructure as code (Terraform or CloudFormation)

Configuration management systems (SCCM, Ansible, Puppet, Chef)

Programming skills in at least one of the following languages: Python, PowerShell, Ruby, Java, C++, C#, Go

Who we are

Singapore's longest established bank, we've been helping people and businesses get what they want from life since 1932. How? By taking the time to truly understand people. From there, we provide support, services, solutions, and career paths that meet their individual needs and desires. Today, we're on a journey of transformation. Embracing technology and creativity to become a future-ready learning organisation. But for all that change, our purpose remains: to enable people and communities to realise their aspirations.

What we offer

Competitive base salary. A suite of holistic, flexible benefits to suit every lifestyle. Community initiatives. Industry-leading learning and professional development opportunities. Equal opportunity. Fair employment. Selection based on ability and fit with our culture and values. Your wellbeing, growth, and aspirations are every bit as cared for as the needs of our customers.

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!
