Site Reliability Engineer

Cape Town
Salary: Market Related
Job Type: Permanent
Sectors: IT
Reference: 113647

Apply before Jan 03 2025

Vacancy Details

Employer: Lesaka Technologies

Kazang – Micro Merchant Division

Senior Site Reliability Engineer

A vacancy exists for a Senior SRE within the Kazang – Micro Merchant Division in Cape Town, South Africa (Hybrid).

We are seeking a Site Reliability Engineer (SRE) with expertise in Linux-based, open-source environments to ensure the reliability, scalability, and performance of our systems. In this role, you will design and implement automated solutions for monitoring and system optimisation while managing and maintaining critical infrastructure. You will work closely with the DevOps team to support deployments and CI/CD pipelines, leveraging open-source tools to address operational challenges and enhance system resilience.

Key Responsibilities include, but are not limited to:

Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
Perform regular system maintenance, including updates, patches, and optimizations.
Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
Identify and mitigate risks to system reliability, scalability, and security.
Ensure compliance with organizational and regulatory standards in system design and operations.
Participate in a rotational on-call schedule to ensure the reliability and availability of critical systems.

To be considered for this position, candidates must meet the following requirements:

Years of Experience:

A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and AWS cloud infrastructure.

Education:

A Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field is required.
Equivalent practical experience in lieu of a formal degree will be considered for highly qualified candidates.

Technical Competencies:

Fault Finding and Debugging: Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
Scripting and Programming: Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
Cloud Infrastructure Management (AWS): Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
Monitoring and Observability: Skilled in implementing monitoring solutions (e.g., Prometheus, Grafana, ELK stack) and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
Automation and Infrastructure as Code (IaC): Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.

Behavioural Competencies:

Problem-Solving and Critical Thinking: Demonstrates a proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
Collaboration and Teamwork: Works effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
Adaptability and Continuous Learning: Embraces change, learns new technologies quickly, and adjusts strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.
