Site Reliability Engineer

1 day ago

undisclosed

Full Time

Permanent

Job Description

The Site Reliability Engineer plays a critical role in monitoring, troubleshooting, and optimizing our production system to ensure the highest levels of performance and stability for our AI and gaming customers worldwide.

Key Responsibilities

  • Monitor, Review, and Respond to Faults: Monitor the production system, review and respond to faults, troubleshoot and resolve them, and optimize the system afterward.
  • System Architecture and Performance: Continuously review the system architecture, process logic, performance, stability, and other technical indicators to ensure they remain sound.
  • Coordination with Business Team: Drive the business team to resolve operations and maintenance issues.
  • Production Failure Response: Respond promptly to production failures, acting as the overall coordinator for resolution.
  • Collaborative Problem-Solving: Organize relevant R&D, operations and maintenance, and product teams to collaboratively investigate and resolve problems.
  • Failure Response Time: Own failure response and resolution times, ensuring issues are resolved promptly.
  • Case Studies and Optimization: Conduct case studies on production issues and follow up with optimizations to improve system performance and stability.
  • Documentation: Maintain comprehensive documentation of system architecture, processes, and troubleshooting procedures.
  • Continuous Improvement: Identify areas for improvement in the operations and maintenance processes and implement necessary changes.

Skills & Experiences

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Experience in operations and maintenance development, preferably in a cloud computing or AI-focused environment.
  • Strong understanding of system architecture, performance monitoring, and troubleshooting methodologies.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, startup environment.
  • Proficiency in Kubernetes (K8S), CI/CD, and Docker.
  • Expertise in either AWS (VPC, S3, EC2, etc.) or Python.
  • Experience building an operations and maintenance infrastructure platform and handling core business operations.
  • Management experience is a plus, but not required.
  • Prior experience working in structured environments such as Huawei, ZTE, or banking institutions is preferred. 

Job Conditions:

Probation Period:

3 months

Age:

28 - 45

Job Type:

Permanent

Allowances:

TBC

Hours:

9am - 6pm

Languages:

English, Bahasa

Days:

Mon - Fri

Malaysia Only:

Yes

Annual Leave:

Included

Benefits:

Training and development opportunities, career progression opportunities

Job Skills

Python

Cloud Computing

AWS

Docker

Kubernetes

AI

Site Reliability

Confidential (IT Infrastructure)

Primary Industry:

IT

Company Confidential
