Key Responsibilities:
• Lead the operational management and continuous development of NNDIP, overseeing strategic planning and execution in line with business needs for rapid AI deployment.
• Install, configure, and manage key platform components, such as the Seldon application, Istio, Elasticsearch, Prometheus, and Grafana. Ensure seamless integration and optimal performance.
• Collaborate closely with the infrastructure team to optimize underlying resources and configurations, ensuring alignment with Seldon and associated ML workloads.
• Develop and enhance application components, contributing to the platform's evolution and extended functionality.
• Monitor, troubleshoot, and resolve platform issues, and document operational procedures, best practices, and guidelines.
• Stay up to date on Kubernetes and MLOps advancements to drive continuous innovation and improvement.
• Write and maintain Python code for platform enhancements and custom solutions. Operate and maintain NNDIP in a production environment, ensuring high availability, performance, and security. Provide hands-on MLOps support to data scientists for model deployment.
Required Skills and Qualifications:
• Bachelor’s or master’s degree in Computer Science, Engineering, or a related field.
• Strong experience in Kubernetes management, with expertise in Seldon being advantageous.
• Proficiency in CI/CD, (Azure) DevOps, application security, and performance monitoring.
• Familiarity with tools like ArgoCD, Istio, Terraform, OpenSearch, Prometheus, and Grafana.
• Solid grasp of Python, FastAPI, general API concepts, and unit testing.
• Proficiency in MLOps practices (MLflow preferred) and experience supporting data scientists in a machine learning deployment environment.
• Solid understanding of cloud services, preferably AWS, along with experience in maintaining large-scale platforms.
To help with the preselection process, the minimum required skills are listed below:
1. Proficiency in Python and FastAPI.
2. Experience with Kubernetes, AWS, and Docker.
3. Understanding of CI/CD processes and tools like ArgoCD.
4. MLOps: Ability to create and manage machine learning models using Seldon and MLflow.
5. Knowledge of defensive programming, unit testing, and authorization flows.