Shen Wei

Site Reliability Engineer
Male
Full stack development/Operations Engineer
Lives in Singapore
Nationality: Malaysia

Work experience

  • Site Reliability Engineering (MLOps)

    Huawei International Pte Ltd
    2022.09-Current(4 years)
    Reliability & Performance Optimization
    ● Spearheaded the architecture of multi-region infrastructure for 1,000+ production models, ensuring robust regional fault isolation and high availability across distributed global clusters.
    ● Maintained 99.99% service availability by implementing cross-region disaster recovery and automated failover mechanisms utilizing GSLB and advanced traffic steering (rate limiting & circuit breaking).
    ● Optimized Kubernetes HPA and custom metrics for massive inference workloads, ensuring zero-downtime stability and resolving performance bottlenecks during global traffic spikes.

    Incident Management & Observability
    ● Managed a large-scale global observability stack processing 10B+ daily metrics and logs, providing a unified real-time view for proactive system health monitoring.
    ● Led RCA for critical incidents by deeply analyzing Linux kernel load and troubleshooting Kubernetes OOMKilled events, reducing MTTR by 30% while establishing robust SLI/SLO monitoring.
    ● Integrated automated Chaos Engineering to validate self-healing mechanisms, significantly enhancing platform resilience against unpredictable infrastructure failures.

    Automation & Infrastructure as Code
    ● Standardized full-stack lifecycles using Infrastructure as Code (IaC), achieving 100% environment consistency across global regions and eliminating configuration drift.
    ● Executed production change management for platform upgrades, configuration updates, scaling changes, and rollback plans, ensuring changes followed internal approval, validation, and documentation procedures.
    ● Developed Python/Script automation tools to support software upgrades, service requests, and change execution, reducing manual deployment effort by 40% and improving operational efficiency.

    Security & Compliance
    ● Maintained secure cloud network configurations using VPC, Security Groups, ACLs, and routing controls to support multi-tenant isolation, production access management, and regional compliance requirements.

Projects

  • Certificates

    2022.01-2022.09(9 months)
    ● Developing a Google SRE Culture.
    ● Python 3 Programming by University of Michigan on Coursera.
    ● Python Essentials for MLOps by Duke University.
    ● AI and Machine Learning Competence for Industry 4.0 by Malaysian-Swiss Smart Factory.
    ● Machine Learning by Stanford University & DeepLearning.AI on Coursera.

Educational experience

  • Universiti Malaysia Pahang

    Bachelor of Mechatronic Engineering (Hons)
    2018.01-2021.12(4 years)

Languages

Malay: Native
Chinese (Mandarin): Native
English: Fluent
Chinese (Cantonese): Good
