Job Description
The Site Reliability Engineer plays a critical role in monitoring, troubleshooting, and optimizing our production system to ensure the highest levels of performance and stability for our AI and gaming customers worldwide.
Key Responsibilities
- Monitor, Review, and Respond to Faults: Monitor and review the production system, respond to faults, troubleshoot and resolve them, and optimize the system afterward.
- System Architecture and Performance: Continuously monitor and review the system architecture, process logic, system performance, stability, and other technical areas and indicators to ensure they remain sound.
- Coordination with Business Team: Work with the business team to drive the resolution of any issues related to operations and maintenance.
- Production Failure Response: Respond promptly to production failures, acting as the overall coordinator for resolution.
- Collaborative Problem-Solving: Organize relevant R&D, operations and maintenance, and product teams to collaboratively investigate and resolve problems.
- Failure Response Time: Own failure response and resolution times, ensuring that issues are resolved promptly.
- Case Studies and Optimization: Conduct case studies on production issues and follow up with optimizations to improve system performance and stability.
- Documentation: Maintain comprehensive documentation of system architecture, processes, and troubleshooting procedures.
- Continuous Improvement: Identify areas for improvement in the operations and maintenance processes and implement necessary changes.
Skills & Experience
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Experience in operations and maintenance development, preferably in a cloud computing or AI-focused environment.
- Strong understanding of system architecture, performance monitoring, and troubleshooting methodologies.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced, startup environment.
- Proficiency in Kubernetes (K8s), CI/CD, and Docker.
- Expertise in either AWS (VPC, S3, EC2, etc.) or Python.
- Ability to build the operations and maintenance infrastructure platform and handle core business operations.
- Management experience is a plus, but not required.
- Prior experience working in structured environments such as Huawei, ZTE, or banking institutions is preferred.