Site Reliability Engineer (SRE) (Remote)

Cape Town
Salary: Market Related
Job Type: Permanent
Sectors: IT
Reference: 25638

Apply before Jan 15 2025 | 25 Days left

Vacancy Details

Employer: Datafin Recruitment

ENVIRONMENT:

An analytical, solutions-driven Site Reliability Engineer is sought to join the remote team of a dynamic provider of a unique and powerful range of LegalTech solutions. Your core role will entail ensuring the reliability, scalability, and performance of our infrastructure, collaborating with Development teams, and driving continuous improvement in system operations. The ideal candidate should preferably have a Master’s or Bachelor’s Degree in Computer Science, Engineering, or a similar qualification with relevant Certifications, and must have at least 5 years’ work experience in Site Reliability Engineering, DevOps, or a related field. You will also require extensive experience with cloud services such as OCI, AWS, Google Cloud, or Azure, and be proficient in Scripting languages (Python, Bash, etc.) and Configuration Management tools (Terraform, Ansible, Chef, Puppet).

DUTIES:

Infrastructure Management –

Design, build, and maintain highly available and scalable infrastructure using cloud platforms (OCI, AWS, GCP, Azure) and on-premises environments.

Monitoring & Incident Response –

Implement and maintain monitoring, logging, and alerting systems to detect and respond to system issues promptly.
Lead incident response efforts and perform root cause analysis.

Automation –

Develop and deploy automation tools to streamline operations, reduce manual intervention, and improve system reliability.

Performance Optimization –

Analyse system performance metrics and make recommendations to improve application and infrastructure performance.

Security & Compliance –

Ensure systems meet security, compliance, and regulatory requirements by implementing best practices and conducting regular audits.

Collaboration –

Work closely with Development teams to ensure new features and services are scalable, reliable, and maintainable.

Disaster Recovery –

Develop and maintain Disaster Recovery plans, including data backups and system redundancy strategies.

Continuous Improvement –

Identify areas for improvement in the existing infrastructure, then propose and implement solutions to enhance system reliability and performance.

Documentation –

Create and maintain detailed documentation for system configurations, procedures, and processes.

REQUIREMENTS:

Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or a related field.
Extensive experience with cloud services such as OCI, AWS, Google Cloud, or Azure.
Proficiency in Scripting languages (Python, Bash, etc.) and Configuration Management tools (Terraform, Ansible, Chef, Puppet).
Experience with monitoring tools (Zabbix, Prometheus, Grafana, Wazuh) and logging systems (ELK stack, Splunk, Elastic).
Strong understanding of networking concepts, including DNS, load balancing, firewalls, and VPNs.
Experience with Containerization (Docker) and Orchestration tools (Kubernetes).
Familiarity with continuous integration/continuous deployment (CI/CD) pipelines and tools like Jenkins, GitLab CI, or CircleCI.
Strong background in Linux/Unix system administration.
Experience with IT Service Management platforms, including optimizing and supporting tools like JIRA and Freshdesk.
Proven ability to handle high-pressure incidents and provide clear communication to stakeholders.

Preferred –

Master’s or Bachelor’s Degree in Computer Science, Engineering, or a related field.
Certifications: Relevant certifications such as Oracle Cloud Infrastructure Architect Associate, AWS Certified Solutions Architect, or Google Cloud Professional DevOps Engineer.
Programming: Experience with software development in languages such as Python, Go, Java, or Ruby.
Database Management: Experience managing and optimizing databases (OracleDB, SQL).
Experience in High-Traffic Environments: Prior experience working in environments with large-scale, high-traffic systems.

ATTRIBUTES:

Excellent communication and problem-solving skills.
An analytical thinker who can work effectively in a fast-paced environment.

While we would really like to respond to every application, should you not be contacted for this position within 10 working days please consider your application unsuccessful.

COMMENTS:

When applying for jobs, ensure that you meet the minimum job requirements. Only SA Citizens will be considered for this role. If you are not in the location mentioned for a job, please note your relocation plans in all applications and correspondence. Apply here https://www.datafin.com/job/site-reliability-engineer-sre-remote/ OR e-mail a Word copy of your CV to amy@datafin.com and mention the reference number of the job.

About Datafin Recruitment

Datafin was established in 1999 due to the need for a specialized IT recruitment solution. We offer a personalized and flexible recruitment service, specializing in providing both client and candidate with the perfect fit. We pride ourselves on the fact that we have established relationships with industry leaders as well as access to some of the most skilled and sought-after candidates in the industry. Our database of over 25 000 candidates, cutting-edge internal IT systems, and extensive PPC marketing have ensured that we are at the top of our game and one of SA’s leading recruitment agencies.