Job brief.
Job Role: Lead MLOps Engineer
Experience: 12-18 years
Location: Bangalore
Work Culture: Hybrid
Key Responsibilities
- MLOps Leadership: Lead the design and implementation of robust MLOps practices, managing the deployment, scaling, and maintenance of machine learning models in production.
- Cloud Infrastructure Strategy: Drive cloud architecture strategy for ML workloads, leveraging Azure or AWS to build scalable, secure, and cost-effective infrastructure for model deployment, monitoring, and scaling.
- CI/CD Pipeline Design & Automation: Lead the development and maintenance of automated CI/CD pipelines for machine learning models, ensuring efficient integration, testing, and deployment workflows.
- Model Monitoring & Optimization: Oversee the monitoring of model performance in production, implement model drift detection, manage retraining schedules, and ensure operational excellence in production ML systems.
- Collaboration & Mentorship: Collaborate with cross-functional teams, including data scientists, software engineers, and infrastructure teams, to ensure successful deployment of ML models. Mentor and guide junior MLOps engineers and team members.
- Continuous Improvement: Advocate and lead the adoption of new tools, techniques, and best practices to improve the efficiency, reliability, and scalability of the MLOps pipeline.
- Automation & Efficiency: Champion the automation of model deployment, monitoring, and testing processes so that repetitive tasks are minimized and model updates can be delivered rapidly and reliably.
- Security & Compliance: Ensure that all ML deployments adhere to internal security protocols, governance, and regulatory compliance requirements, particularly in cloud environments.
- Documentation & Reporting: Create and maintain comprehensive documentation for workflows, pipelines, infrastructure, and models. Provide regular reports to stakeholders on system performance, new features, and model health.
- Risk Management: Identify and mitigate risks related to model performance, infrastructure stability, and deployment failures, ensuring the continuous operation of machine learning models.
Technical Skills
- MLOps Expertise: Deep experience in designing, implementing, and optimizing end-to-end MLOps pipelines, including automation of model deployment, testing, scaling, and monitoring.
- Cloud Platforms: Extensive experience with Azure or AWS services such as Azure Machine Learning, Amazon SageMaker, EC2, and Lambda, along with Kubernetes and related tools for managing cloud infrastructure for ML deployments.
- CI/CD Pipelines: Strong knowledge of CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps, AWS CodePipeline) for building automated pipelines for model deployment, testing, and monitoring.
- Containerization & Orchestration: Expertise in Docker for containerizing machine learning models and in Kubernetes or Azure Kubernetes Service (AKS) for orchestrating containers at scale.
- Programming: Proficiency in Python for building ML workflows, automation scripts, data processing, and integration with ML frameworks and cloud infrastructure.
- ML Lifecycle Management: Experience with tools like MLflow, Kubeflow, or TensorFlow Extended (TFX) for managing the full lifecycle of ML models from experimentation to deployment and monitoring.
- Version Control & Collaboration: Expertise with Git and Git-based workflows for code versioning, collaboration, and deployment in team environments.
- Monitoring & Logging: Experience with monitoring and logging tools such as Prometheus, Grafana, or Azure Monitor to track model performance and health and to detect anomalies in production.
- Data Engineering: Familiarity with data engineering tools such as Apache Kafka (streaming), Apache Airflow (workflow orchestration), or Apache Spark (large-scale processing) for handling the data needs of model training and deployment.