
Description
Role purpose: An individual who leads a team responsible for running and operating the staging and production environments where our Software Engineering Squads deploy their workloads. This role combines people management and system operation, and is responsible for meeting optimal levels of environment availability while maintaining strong team cohesion. The role deals directly with incidents, vulnerabilities, capacity planning and disaster recovery. It is responsible only for the environments, not for the applications that the different squads deploy into them.
Key accountabilities and decision ownership
• Define team focus & priorities
• Keep stakeholders updated on ongoing issues and incidents, and own Root Cause Analysis documentation
• Report on system availability, stability and defects
• Resolve team impediments by engaging external parties and escalating where needed
• Control costs and budgets regarding Production & Staging Platform
• Ensure that both Production and Staging comply with the cybersecurity posture
• Own contracts & vendors related to Staging & Production environments
• Perform technical activities such as troubleshooting, leading by example
Core competencies, knowledge and experience
• Understanding of SDLC
• Ability to influence through reasoning
• At least 3 years of proven experience in a similar or equivalent role
• Experience deploying and/or operating software in production
• Knowledge of client-server architecture
Must have skills / professional qualifications
• Network Essentials (IP, DNS, TCP/UDP)
• Web architecture
• Leadership and organizational skills
• Outstanding communication skills
• Problem-solving aptitude
• Fluent English (reading and listening)
Desired technical skills
• Experience using AWS Cloud
• Containers & Kubernetes
• Infrastructure as Code (with Terraform)
• Amazon Linux
• Microservices Architecture
Number of Direct reports: 6
Key performance indicators
• Platform availability
• Security compliance of owned environments
• Mean time to detect and mean time to recover from platform incidents, including root cause analysis documentation
Location: Maputo