Senior Software Engineer, Google Cloud, Machine Learning Infrastructure

Google, Bengaluru, Karnataka, India

Minimum qualifications:

  • Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
  • 5 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 3 years of experience testing, maintaining, or launching software products, and 1 year of experience with software design and architecture.

Preferred qualifications:

  • 3 years of experience with ML infrastructure (e.g., model deployment, model evaluation, optimization, data processing, debugging).
  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Google Cloud.
  • Familiarity with the following concepts: LLM Training/Serving Concepts, ML frameworks and concepts, TPU/GPU concepts and architecture.
  • Ability to work in a fast-moving environment and adapt to changing requirements and an iterative development model.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customers' needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.

The ML Performance and Observability team is part of Core ML, the central machine learning organization that provides ML software tools and hardware infrastructure to all Google product areas and drives ML excellence for Google.

Our team is responsible for innovating, designing, and building the required observability, monitoring, tooling and dashboards for the entire fleet of Google's ML resources (TPUs and GPUs). This data is used to determine the performance and efficiency of the fleet, and to help identify solutions to improve fleet-wide efficiency, understand usage patterns, adjust fleet deployment, and explore hardware improvements. We collaborate with many Product Areas across Google to help drive efficiency improvements based on the performance and efficiency data gathered by our team.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet and build solutions and capabilities to improve ML fleet efficiency.
  • Build reporting and analytic solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources. 
  • Drive collaboration with various teams (across different product areas) as needed to accomplish the efficiency improvement goals.
  • Lead junior Software Engineers towards delivering project goals.

Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.

Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.
