103 Site Reliability Engineer jobs in Singapore

Site Reliability Engineer

Singapore, Singapore Viasat

Posted 1 day ago

Job Description

About us

One team. Global challenges. Infinite opportunities. At Viasat, we’re on a mission to deliver connections with the capacity to change the world. For more than 35 years, Viasat has helped shape how consumers, businesses, governments and militaries around the globe communicate. We’re looking for people to join our team who think big, act fearlessly, and create an inclusive environment that drives positive impact.


What you'll do

The Customer Engineering team is a group of highly technical engineers tasked with maintaining and improving the reliability, scalability, and performance of the Service for our Enterprise customers. The team is empowered to drive technical resolutions across the whole technology stack, from hardware through to application and every layer in between. It is also responsible for building and maintaining alerts that proactively monitor the Service, and it acts as the technical liaison between customer-facing teams and the Engineering teams.


The day-to-day

As a Site Reliability Engineer, you will:

  • Identify and investigate potential and actual customer performance problems; recommend and prioritize remediation; and assess the effectiveness of remediation actions
  • Participate in and provide feedback on product design, especially regarding reliability and availability
  • Drive initiatives with partner teams to improve the reliability and performance of the Service through improved system design
  • Drive a culture that does not tolerate manual toil, resulting in a highly automated environment that delivers scalable solutions
  • Work closely with customer-facing teams (Technical Account Managers and Program Teams) to understand and prioritize customer issues
  • Drive monitoring and automation initiatives
  • Create and present performance reports for technical and management stakeholders
  • Work closely with Engineering teams to communicate and prioritize service-impacting issues
  • Reproduce and test customer issues in the lab
  • Develop automated scripts and tools to enable monitoring of the Service
  • Be part of on-call rotations

What you'll need

Requirements

  • 5+ years of experience troubleshooting and triaging technical issues for customers in a fast-paced environment
  • 5+ years of experience in Network Operations or Product Support
  • Advanced knowledge of modern programming languages, especially Python
  • The ability to understand large, complex systems and a passion for constantly improving environments
  • Strong networking knowledge: TCP/IP, IPsec, VPN, NAT, routing protocols, AAA
  • Ability to set priorities and work efficiently in a fast-paced environment
  • Demonstrated ability to deliver results on time with high quality and attention to detail
  • Demonstrated ability to work with ambiguous requirements, adapt, and learn
  • Experience with data analytics tools (Splunk, Kibana)
  • Keen, data-driven decision-making skills under incomplete information
  • Excellent face-to-face and remote customer rapport
  • Bachelor’s degree in Electrical Engineering, Computer Science, or Computer Engineering
  • Up to 10% travel

What will help you on the job

  • Experience analyzing data and trends to gain operational efficiencies
  • Telecom or related operational service experience, especially with wireless networks
  • Previous technical role in a DevOps/SRE workflow
  • Experience with Satcom technology
  • Experience with or knowledge of GCP, AWS, and BigQuery

EEO Statement

Viasat is proud to be an equal opportunity employer, seeking to create a welcoming and diverse environment. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, ancestry, physical or mental disability, medical condition, marital status, genetics, age, or veteran status or any other applicable legally protected status or characteristic. If you would like to request an accommodation on the basis of disability for completing this online application, please click here.

Site Reliability Engineer

Singapore, Singapore Ipsator Analytics Pvt Ltd

Posted 1 day ago

Job Description

Work from home

Polygon is looking for a Site Reliability Engineer to join our talented engineering team.

Requirements

  • 2+ years of experience in DevOps/SRE
  • Experience designing and deploying application monitors and alarms to maximize application uptime
  • Good exposure to public clouds (GCP/AWS)
  • Proficiency in shell scripting or Perl/Python
  • Strong ability to troubleshoot, analyze, and identify root causes of issues
  • Experience supporting software deployment in staging and production environments
  • Understanding of load balancing technologies
  • Ability to report system and support status to stakeholders and maintain effective communication
  • Availability to be on-call for incident escalations and infrastructure tasks
  • Knowledge of configuration management tools like Ansible, Puppet, or Chef
  • Experience with DevOps practices
  • Quick learner with the ability to apply new technologies
  • Willing to work US shifts

About Company / Benefits

Polygon (formerly Matic Network) is the first well-structured, user-friendly platform for Ethereum scaling and infrastructure development. Its core component is Polygon SDK, a modular, flexible framework supporting various application types. Polygon enables the creation of Optimistic Rollup chains, ZK Rollup chains, standalone chains, or other infrastructure as needed. It transforms Ethereum into a multi-chain system similar to Polkadot, Cosmos, and Avalanche, with the benefits of Ethereum’s security, vibrant ecosystem, and openness.

Benefits include:

  • Work from anywhere
  • Flexible working hours
  • Great working environment
  • Fast career progression

Site Reliability Engineer

Singapore, Singapore TP-LINK CORPORATION PTE. LTD.

Posted 1 day ago

Job Description

Responsibilities:

  • Serve as technical SME for implementing and operating Microservices on Kubernetes cloud-based platforms.
  • Collaborate with the Cloud Technical Development and DevOps teams to deploy services to the Multi-Cloud Platform.
  • Perform load tests and chaos tests to ensure the scalability and reliability of microservices.
  • Build observability for microservices and cloud platforms such as AWS, OCI, Azure, and GCP.
  • Write and execute disaster recovery plans in collaboration with the development and DevOps teams.
  • Analyze and resolve production risks caused by insufficient resources, such as node groups, CPU, memory, HPA scheduling, and JVM pre-warming.
  • Write and maintain automation scripts in languages such as Python, Go, or Bash.
  • Define and maintain KPIs (SLAs/SLOs/SLIs) for all cloud microservices together with development teams, to better understand business needs.
  • Create and maintain technical documentation, including architecture diagrams, design documents, and standard operating procedures.
  • Ensure adherence to security and compliance standards, including ISO 27001, SOC 2, and GDPR.
  • Lead incident response efforts to troubleshoot and resolve production issues quickly.
  • Perform post-incident analysis to identify root causes and potential workarounds/solutions.
  • Assist with product/technology selection, including implementing POCs.
  • Be flexible and open to changing processes and tools.
  • Help mentor and train less senior members of the team.
  • Be part of the on-call rotation and provide support after hours and on weekends.
  • Other duties as assigned

Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 1+ year of experience as a Site Reliability Engineer.
  • Proficiency in programming and scripting languages like Java, Python, Bash, or PowerShell.
  • Hands-on experience in SRE, DevOps, cloud operations, and cloud security best practices.
  • Strong knowledge of security technologies, including identity and access management, network security, application security, and data protection.
  • Strong problem-solving and analytical skills, with the ability to work independently and as part of a team.
  • Experience developing and maintaining technical documentation and implementing compliance requirements.

Additional Skills (Preferred):

  • Expert-level cloud certifications, such as AWS Solutions Architect Professional, Azure Solutions Architect Expert, or GCP Professional Cloud Architect.
  • Experience with container orchestration technologies (e.g., Kubernetes).

Site Reliability Engineer

Singapore, Singapore Apple Inc.

Posted 1 day ago

Job Description

There is a lot that goes into building the most secure yet user-friendly devices in the world. We are a unique Software Development group with a charter to secure our platforms, which include iOS software, iOS devices, and Mac. We build solutions that are used by our customers, engineering teams, and manufacturing environments. We are looking for a Site Reliability Engineer (SRE) who will be responsible for deploying, monitoring, troubleshooting, and developing tools for all of the team's solutions. The SRE position requires a mix of strategic engineering and design along with hands-on technical work. You should have experience as a systems administrator or programmer who has since moved into DevOps/automation. You will configure, tune, and troubleshoot multi-tiered systems to achieve optimal application performance, stability, and availability. You will work closely with the systems engineers, network engineers, database administrators, monitoring team, and information security team. This position must consistently meet strict application security and high-availability requirements. The hiring team is one of the few focused on security initiatives, providing critical IT solutions across most of Apple’s product lines. These solutions are used everywhere from the manufacturing space to customer-facing solutions. We are looking for a hardworking individual who can excel in a dynamic environment, be a self-starter, and bring their passion to ensuring the quality and reliability of the solutions we maintain.

Description

Review hardware, software infrastructure, and application functionality for optimization. Identify performance bottlenecks. Own the full system lifecycle, including configuration and code deployment in user acceptance test (UAT) and production environments. Monitor infrastructure and application services and drive incident management. Collaborate with Apple's production support team, application engineers, project managers, systems engineers, network engineers, database administrators, and QA team to ensure the availability and reliability of solutions.

Minimum Qualifications
  • Unix or Linux administration and performance-tuning skills, with 3 to 5 years of experience leading services in a large-scale *nix environment.
  • Runtime configuration and troubleshooting of Java and JVM technologies, or proficiency in Python, Go, or another scripting language.
  • Experience with DevOps tools, processes, and culture.
  • Proven experience with automation tools such as Ansible, Chef, Jenkins, and Puppet.
Preferred Qualifications
  • Oracle DB knowledge and troubleshooting skills.
  • Infrastructure knowledge of networks, load balancers, firewalls, and WAFs.
  • SDLC and release engineering, including source code repositories and build tools such as SVN and Git.
  • Network, system, and application security knowledge.
  • Experience with Kafka or other message queueing technology is a plus.

Site Reliability Engineer

Singapore, Singapore Point72

Posted 1 day ago

Job Description

About the role

As part of Point72’s Technology Team, you will focus on developing and maintaining complex, distributed, real-time systems that support our Global Macro business. Your responsibilities will include optimizing operations through automation, building foundational SRE components, and collaborating with development teams to meet system performance standards.

Key responsibilities
  1. Build and enhance SRE programs across multiple systems.
  2. Monitor system capacity and performance to prevent bottlenecks.
  3. Review automation code for quality and efficiency.
  4. Troubleshoot and resolve system issues.
  5. Participate in design reviews and technology evaluations.
Qualifications
  • Experience as a site reliability engineer or similar role.
  • Strong coding skills in Python and PowerShell; basic C# knowledge.
  • Experience with Windows/Linux OS, AWS, Terraform, Ansible.
  • Knowledge of Docker, Kubernetes, AWS EKS/ECS.
  • Proactive, communicative, with a strong sense of ownership.
  • Commitment to ethical standards.
Benefits
  • Health care, parental leave, wellness programs.
  • Volunteer and matching gift programs.
  • Support for affinity groups and tuition assistance.
About Point72

Point72 Asset Management, led by Steven Cohen, is a global firm investing across asset classes. We prioritize ethical standards, innovation, and talent development. Visit our website for more information.

Site Reliability Engineer

Singapore, Singapore Razer Inc.

Posted 1 day ago

Job Description

Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team located across five continents. Razer is also a great place to work, providing a unique, gamer-centric experience that fosters accelerated growth both personally and professionally.

Job Responsibilities
  1. Administer, monitor, and manage cloud-scale production environments for web services and APIs for global users.
  2. Design and implement cloud architectures and automated self-recovery solutions to ensure service reliability and cost-efficiency.
  3. Implement monitoring and alerts to improve observability of services and decrease response times.
  4. Implement and manage CI/CD pipelines for automated testing and deployments.
  5. Collaborate with engineering and release management to document, enhance, and improve operational procedures and processes.
  6. Conduct application performance testing.
Prerequisites
  1. 2+ years of relevant operational experience.
  2. Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX).
  3. Comfortable with Linux and Docker administration.
  4. Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), Container Orchestration (AWS ECS or K8s), Version Control (Git), Database (MySQL, NoSQL).
  5. Strong coding and scripting skills (preferably Bash and Python).
  6. Ability to quickly learn and use a variety of open source technologies and automation tools.
  7. Comfort with frequent, incremental code testing and deployment.
  8. Good analytical skills to debug deployment problems independently.
  9. Deep hands-on technical expertise and problem-solving skills.
  10. Ability to work collaboratively in a dynamic environment with rapidly changing requirements.
Additional Information
  • Seniority level: Entry level
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industry: Computers and Electronics Manufacturing

Site Reliability Engineer

Singapore, Singapore Apple Inc.

Posted 1 day ago

Job Description

There is a lot that goes into building the most secure yet user-friendly devices in the world. We are a unique Software Development group with a charter to secure our platforms, which include iOS software, iOS devices, and Mac. We build solutions that are used by our customers, engineering teams, and manufacturing environments. We are looking for a Site Reliability Engineer (SRE) who will be responsible for deploying, monitoring, troubleshooting, and developing tools for all of the team's solutions. The SRE position requires a mix of strategic engineering and design along with hands-on technical work. You should have experience as a systems administrator or programmer who has since moved into DevOps/automation. You will configure, tune, and troubleshoot multi-tiered systems to achieve optimal application performance, stability, and availability. You will work closely with the systems engineers, network engineers, database administrators, monitoring team, and information security team. This position must consistently meet strict application security and high-availability requirements. The hiring team is one of the few focused on security initiatives, providing critical IT solutions across most of Apple’s product lines. These solutions are used everywhere from the manufacturing space to customer-facing solutions. We are looking for a hardworking individual who can excel in a dynamic environment, be a self-starter, and bring their passion to ensuring the quality and reliability of the solutions we maintain.

Description

Review hardware, software infrastructure, and application functionality for optimization. Identify performance bottlenecks. Own the full system lifecycle, including configuration and code deployment in user acceptance test (UAT) and production environments. Monitor infrastructure and application services and drive incident management. Collaborate with Apple's production support team, application engineers, project managers, systems engineers, network engineers, database administrators, and QA team to ensure the availability and reliability of solutions.

Minimum Qualifications
  • Unix or Linux administration and performance-tuning skills, with 0 to 5 years of experience leading services in a large-scale *nix environment.
  • Runtime configuration and troubleshooting of Java and JVM technologies, or proficiency in Python, Go, or another scripting language.
  • Experience with DevOps tools, processes, and culture.
  • Proven experience with automation tools such as Ansible, Chef, Jenkins, and Puppet.
Preferred Qualifications
  • Oracle DB knowledge and troubleshooting skills.
  • Infrastructure knowledge of networks, load balancers, firewalls, and WAFs.
  • SDLC and release engineering, including source code repositories and build tools such as SVN and Git.
  • Network, system, and application security knowledge.
  • Experience with Kafka or other message queueing technology is a plus.

Site Reliability Engineer

Singapore, Singapore Vega Solutions

Posted 1 day ago

Job Description

Tokka Labs | Singapore | Full-Time

Tokka Labs is a proprietary trading firm with a focus on close collaboration, rigorous research, and cutting-edge technology. We are market makers, searchers, and solvers for top protocols on the most popular blockchains in the world. We design and implement our own trading systems and strategies to provide liquidity in the most diverse and challenging environments. At the core of it all lies our unwavering commitment to pushing the boundaries of decentralized finance, and we are always on the lookout for like-minded individuals to join us on this journey. If you think you have what it takes, apply now!

Position Summary

As a Site Reliability Engineer (SRE), you will play a crucial role in maintaining and enhancing the security, stability, scalability, and cost-effectiveness of our systems. You will leverage your expertise in tools like Terraform, Ansible, Kubernetes, and AWS, as well as your networking skills, to build and manage a robust infrastructure.

Key Responsibilities

  • System Monitoring and Incident Response:

○ Continuously monitor the performance, availability, and security of systems.

○ Quickly respond to incidents, conduct root cause analysis, and implement solutions to prevent recurrence.

  • Infrastructure Automation:

○ Automate infrastructure deployment and management using Terraform, Ansible, and related tools.

○ Optimize cloud environments, particularly AWS, to ensure efficient resource use and cost control.

  • Kubernetes and Container Management:

○ Manage containerized applications using Kubernetes, ensuring high availability and scalability.

○ Develop and implement strategies for effective container orchestration and management.

  • Security and Compliance:

○ Implement and maintain security best practices across the infrastructure.

○ Conduct regular security audits and vulnerability assessments to protect against potential threats.

  • Network Management:

○ Design, implement, and manage network infrastructure to support system stability and performance.

○ Troubleshoot and resolve network-related issues, ensuring minimal downtime.

  • Capacity Planning and Performance Optimization:

○ Plan for future infrastructure needs, ensuring the system scales efficiently.

○ Continuously analyze system performance and apply improvements for better stability and cost efficiency.

○ Continuously look for better infrastructure suppliers and benchmark their strengths and weaknesses.

○ Explore and operate blockchain technologies, including blockchain nodes, network optimisation, etc.

  • Collaboration and Knowledge Sharing:

○ Work closely with software development, DevOps, and IT teams to align infrastructure strategies with business needs.

○ Document processes, share knowledge with team members, and mentor junior engineers.

Required Qualifications

  • Education: Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • Experience:

○ 3+ years of experience in Site Reliability Engineering, DevOps, or a related role.

○ Proven experience with Terraform, Ansible, Kubernetes, and AWS.

○ Strong networking skills and experience with cloud networking.

  • Skills:

○ Demonstrated expertise in scripting and automation, with proficiency in Python, Bash, and related tools.

○ Extensive knowledge of Unix/Linux systems, including system administration and troubleshooting.

○ Strong analytical capabilities, with a proven track record in performance tuning and cost optimization.

○ Exceptional communication and interpersonal skills, with the ability to collaborate effectively across cross-functional teams.

○ Consistently meets deadlines and ensures timely completion of tasks through effective time management and attention to detail.

○ Proactive, accountable, and highly self-motivated, with a strong sense of ownership and the ability to work independently with minimal supervision.

○ Continuously strives for improvement and seeks opportunities to enhance processes and outcomes.

Preferred Qualifications

  • Experience with multi-cloud environments.
  • Familiarity with database management and data security.
  • Knowledge of CI/CD pipelines and automation tools.

Additional Information
  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industry: Blockchain Services


Site Reliability Engineer

Singapore, Singapore Tyk

Posted 2 days ago

Job Description

Work from home

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether those systems are internal, external, public, or highly encrypted, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, and media industries (to name just a few!).

If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible. Founded in 2015, with offices in London (UK), London (Ontario), Atlanta, and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, and T-Mobile to RBS, Capital One, and Vinci. We have a varied user base hailing from every continent, even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team, because location and working hours are no barrier.

If this sounds like an environment that you believe could work for you then read on to find out more.

The role:

We’re looking for a Site Reliability Engineer to manage, maintain, improve and support our platform. You will be curious by nature and always looking for ways to improve, as we will look to you for new ideas, solutions and metrics on how we can improve the platform. You will also be our clients’ first line of incident management and will help define our incident response going forward. This is a great opportunity to become an integral part of Tyk as we continue on our journey.

As a remote-first company, Tyk offers you the opportunity to work with an industry-leading distributed team. Having access to expertise from across the globe will give you both the support and the opportunity to help shape not only Tyk’s Cloud platform but also Tyk as a whole as we continue to grow.

Here’s what you’ll be responsible for:

  • Maintaining the global Tyk Cloud platform within the SLAs, SLIs and SLOs you will help to define
  • Identifying reliability issues and working together with your squad to solve them
  • Identifying and introducing new metrics and building relevant dashboards
  • Participating in the on-call rotation
  • Working with your squad to expand multi-region and multi-cloud reach of the platform
  • Documenting operational knowledge
  • Conducting post-incident analysis
  • Being a key shaper of and contributor to our continuous improvement agenda, whether that concerns the clarity of our user stories, how we estimate, or how we communicate with other teams and customers; we expect this role to be an advocate of continuous improvement
  • Reliability of our new global Tyk Cloud platform
  • Automation of operations and support
  • Writing and maintaining documentation on SRE processes and policies
  • Recommending and implementing ways of driving operational efficiency and driving down our cost to run, without impacting service
  • Assisting in penetration testing for Cloud through liaising with our provider, providing technical details, and environment setup

Here’s what we’re looking for:

Experience

  • Launching and operating production-scale Kubernetes clusters
  • Designing and operating infrastructure on AWS and other providers
  • Operating MongoDB (or other document database) clusters
  • Operating Redis (or other key-value storage) clusters
  • Operating Prometheus and Grafana
  • Operating logging collection and analysis systems
  • Participating in the on-call rotation (04:00 to 16:00 UTC)

Skills:

  • AWS / EKS (advanced)
  • Terraform and IaC in general (proficient)
  • Helm (proficient)
  • Go and/or Python (familiar)
  • MongoDB (or similar)
  • Redis (or similar)
  • Monitoring – Prometheus, Grafana, Thanos (familiar)
  • Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
  • Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)
  • Proactive, energetic, innovative and change oriented

Nice to have:

  • Bare metal infrastructure engineering
  • Familiarity with Rancher
  • CKA/CKAD/CKS
  • Creating and delivering production software in Go

Here’s why you should join us:

  • Everyone has unlimited paid holiday.
  • We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
  • Employee share scheme
  • Generous maternity and paternity leave
  • Company retreats

We all share the same vision: we value authenticity, respect, responsibility, independence, honesty, diversity and inclusion, and most importantly treating others how you wish to be treated. We look for like-minded people who bring their personalities to work every day, strive to achieve their personal goals and are willing to challenge the way we do things. Why? To make what we do even better!

Our values tell the story of Tyk – here’s how:

  • It’s ok to screw up!

We’ve found that it’s often the ‘stupid’ or unexpected ideas that turn out to be the successful ones, so try it: at least then we can say we have!

  • The only stupid idea is the untested one!

It’s in our DNA – starting a business with founders 12 hours apart, giving our gateway away for free – sure, we did that, and we’d do it again!

  • Trust starts with you – make it count!

Trust is a two-way street – instill it from day one!

We have each other’s back – we’re all on the same team. Think before you speak or act.

  • Make things better!

Always try to leave things better than you found them. Change is constant, inevitable and embraced! Be the change we want to see.

What’s it like to work here? Check it out:

Tyk is an equal opportunities employer and we are determined to ensure that no applicant or employee receives less favourable treatment on the grounds of gender, age, disability, religion, belief, sexual orientation, marital status, or race, or is disadvantaged by conditions or requirements which cannot be shown to be justifiable.

You can see more about us here

Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Industries
  • Software Development

