Intro
The Site Reliability Engineer Lead (SRE Lead) at Screening Eagle will lead a team of SREs to ensure the stability, resilience, and scalability of our services through automation, testing, and engineering. This role involves leveraging expertise from product systems/operations, cloud infrastructure (AWS), build and release engineering, software development, and stress/load testing to guarantee our services are available, cost-efficient, and fit for purpose from the early stages of development.
What will you do
Cloud Infrastructure Management and Networking
* Design, develop, and implement cloud infrastructure using Terraform.
* Optimize resources for cost-efficiency and performance.
* Ensure infrastructure security and implement service control policies (e.g., via AWS Control Tower).
* Configure AWS VPC flow logs, load balancer logging, Direct Connect, AWS VPN, Transit Gateway (TGW), etc.
Monitoring, Support, and Prototyping
* Implement robust monitoring and alerting systems.
* Set up and monitor CI/CD pipelines both on-premises and in the cloud.
* Enhance monitoring, logging, and alerting practices.
* Use tagging and cost categorization for cost analysis.
* Create prototypes and lead development teams in implementing solutions.
Team Leadership, Collaboration, and Documentation
* Lead the SRE team, ensuring technical quality and best practices.
* Guide the team through the software development lifecycle.
* Collaborate with developers and operations to integrate infrastructure changes.
* Document DevOps changes, technical partnerships, design, integration, testing, and deployment.
Innovation, Quality Assurance, and Process Improvement
* Evaluate risks, customize applications, and lead quality practices.
* Focus on agile methodologies, test automation, and continuous integration.
* Simplify and automate complex processes to ensure quality and operational excellence.
* Improve the DevOps toolchain and streamline software delivery processes.
* Stop projects/products if solutions are not technically acceptable.
What do we expect
* 5+ years of experience developing AWS cloud infrastructure and 7+ years of experience leading teams.
* Extensive experience in implementing and evolving DevOps practices across multi-disciplinary teams and business frameworks.
* Strong background in leading technology change programs and managing projects.
* In-depth knowledge and experience with AWS services (EC2, S3, VPC, IAM, etc.).
* Expert-level proficiency in Terraform, including writing reusable modules and leveraging best practices.
* Highly skilled with Kubernetes, Terraform, serverless technologies, and AWS in general.
* Proficient in non-functional testing, including performance, security, and cost optimization.
* Experience working with ARM-based architectures such as AWS Graviton, optimizing for performance, cost-efficiency, and scalability.
* Knowledge of Kubernetes operator programming, including operators for GPU-based architectures.
* Competent with multi-architecture build tools and practices.
* Expertise in Git and GitOps philosophy.
* Expert in logging and monitoring tools like ELK, Prometheus, and Grafana.
* Demonstrable MLOps experience.
* Ability to quickly gain domain knowledge.
* Operational experience in maintaining applications.
Our offer