Wadhwani AI
The Linux Systems Administrator shall be fully responsible for managing our cutting-edge High-Performance Computing (HPC) infrastructure used for building machine learning applications.
The engineer will be responsible for hands-on management of all technical aspects of the system, planning improvements, budgeting, and communication with both internal and external stakeholders.
We are looking for a highly motivated, creative problem-solver with a deep understanding of Linux system administration and an automation mindset. You will be constantly challenged to hone your skills and learn while running the infrastructure that powers solutions to problems of global scale.
The Wadhwani Institute for Artificial Intelligence (Wadhwani AI) is an independent nonprofit institute developing AI-based solutions for underserved communities in developing countries.
We work with over 40 partners, including key government agencies, international development organizations, NGOs, and research organizations, to identify real-world challenges and to create and deploy innovative AI solutions that reach their intended users and beneficiaries, creating sustainable, scalable impact.
Our projects are supported by leading philanthropies such as the Bill & Melinda Gates Foundation, USAID, and Google.org. Our accolades include winning the Google AI Impact Challenge (one of 20 winners among 2,600 competitors globally) as well as multiple honors in the 2021 Fast Company World Changing Ideas Awards.
Our current portfolio of initiatives spans tuberculosis; maternal, newborn, and child health; COVID-19; and cotton farming.
ROLES AND RESPONSIBILITIES
- Design and deploy key components of on-premises machine learning infrastructure, including data pipelines, event and monitoring systems, and security solutions.
- Manage, upgrade, and automate the existing high-performance compute, storage, and network infrastructure, the Kubernetes cluster, user authentication and authorization, and backup and disaster recovery.
- Plan capacity and upgrades, and monitor and optimize resource utilization across projects.
- Troubleshoot issues and implement corrective and preventive actions.
- Build and manage relationships with vendors, solution architects, and other external stakeholders.
- Collaborate closely with engineers and researchers to understand their challenges, and recommend/implement appropriate solutions.
REQUIREMENTS
- 5+ years of hands-on experience with Linux operating systems
- Excellent shell/Python scripting skills
- Expertise in Docker and in Kubernetes/Slurm cluster management
- Understanding of storage systems, data backup and restoration, and disk management
- Knowledge of configuration management tools like Ansible is preferred
- Experience with NVIDIA DGX, DeepOps
- Basic understanding of HPC system design and architecture
- Knowledge of Amazon Web Services/Google Cloud Platform
To apply for this job please visit www.wadhwaniai.org.