Company: AMD

Location: Austin, TX

Career Level: Mid-Senior Level

Industries: Technology, Software, IT, Electronics

Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE:

We are seeking a highly motivated and skilled GPU Cluster Performance Attainment Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The role focuses on the RDMA networks used in AI clusters, including the data flows between the GPU, the NIC, and the cluster network. The ideal candidate will have a strong background in GPU architectures and parallel computing, along with hands-on experience in system-level performance tuning and debug methodologies.

THE PERSON:

The team fosters and encourages continuous technical innovation to showcase successes and to facilitate continuous career development. The ideal candidate is a seasoned professional who enjoys hands-on problem-solving. In this role, you'll shape long-term strategy and jump in to tackle challenges head-on. You'll have a direct impact on performance, automation, and development, while staying ahead of industry trends to provide strategic insights to senior management. The candidate should be experienced in debugging complex HW/FW and drivers.

KEY RESPONSIBILITIES:

NIC & Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.

Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.

Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (InfiniBand and RoCE).

Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.

Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.

Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.

Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.

Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.

PREFERRED EXPERIENCE:

Proven experience in optimizing the performance of GPU clusters.

Understanding of RDMA network drivers.

Strong understanding of GPU architectures, parallel computing concepts, and network protocols.

Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.

Experience with system-level performance analysis tools and methodologies for GPU clusters.

Analytical mindset with excellent problem-solving and debug skills.

Familiarity with cluster management tools and systems.

Excellent communication and collaboration skills for effective teamwork.

Experience with RDMA network configuration, troubleshooting, and performance tuning.

Linux kernel networking expertise.

Experience with machine learning and/or HPC system design.

ACADEMIC CREDENTIALS:

Bachelor's or Master's degree in computer science, or equivalent experience.

LOCATION:

Austin, TX

Benefits offered are described in "AMD benefits at a glance."

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.

This posting is for an existing vacancy.

Senior Cluster Performance Engineer Job Listing at AMD in Austin, TX (Job ID 73361-en-us)
