Description of the job
- Permanent Full-Time
- Multi-Award Winning Company
- Sydney City / Hybrid Working
An advanced technology company building cutting-edge software products with a global impact. With a strong engineering culture, the team values innovation, collaboration, and technical excellence. The company embraces flexibility, continuous learning, and a modern tech stack to empower its people and products.
The Position:
The Senior Site Reliability Engineer (SRE) will build and improve the reliability of a modern, distributed SaaS platform. You will work with a supportive team, using Kubernetes, CI/CD, and modern monitoring tools to help scale and stabilise a growing product suite.
Key Responsibilities:
- Lead efforts to improve system reliability, including incident response, traffic planning, and SLOs.
- Maintain and enhance monitoring tools for logs, metrics, and traces.
- Manage and optimise the Kubernetes production environment.
- Use monitoring data and operational experience to improve system performance and reliability.
- Work closely with engineers to support CI/CD processes.
- Clearly explain technical concepts to various stakeholders.
- Contribute to infrastructure projects and related tasks.
- Participate in on-call rotations and contribute to improving incident response playbooks and practices.
- Help define and enforce best practices for service deployment, scalability, and fault tolerance.
- Support the integration of observability and reliability into the development lifecycle.
Skills and Experience:
- Tertiary education, and/or relevant industry qualifications.
- Proven experience in DevOps, Site Reliability Engineering, platform operations, or a similar discipline.
- Strong working knowledge of Kubernetes in production environments.
- Hands-on experience with cloud platforms (AWS) and infrastructure-as-code (IaC).
- Proficiency with CI/CD tools such as GitHub Actions or GitLab.
- Exposure to scripting or coding in Python, Rust, Go, or similar languages is a plus.
- Familiarity with observability platforms such as Grafana, Prometheus, or similar.
- Hands-on experience with observability tools (logs, metrics, and traces) and incident management.
- Understanding of CNCF ecosystem projects such as Linkerd and Prometheus.
- Proactive, improvement-focused mindset with a passion for building reliable systems.
Benefits:
- $160K-$180K Base + Super + Bonus (depending on experience).
- Working from Home allowance.
- Learning and Development allowance.
- Wellness allowance.
Job Ref: 3946537
TO APPLY: please click on the appropriate link.