WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
MTS SOFTWARE SYSTEM DESIGN ENGINEER
THE ROLE:
We are looking for a Staff-level GPU Compute / AI Validation, Debug & Performance Engineer to lead validation, deep debug, and performance optimization for next-generation GPU compute and AI platforms. This role requires strong expertise in GPU architecture, parallel computing, and AI workloads, along with the ability to drive cross-functional technical initiatives in a global, multinational environment.
The ideal candidate will own complex validation areas, act as a technical authority for GPU compute/AI debug and performance, and influence architecture and design decisions through data-driven insights.
KEY RESPONSIBILITIES:
GPU Compute / AI Validation Leadership
- Own end-to-end validation strategy for GPU compute and AI workloads (HPC, ML, DL).
- Define validation scope, coverage, and success metrics for compute pipelines.
- Lead post-silicon validation, silicon bring-up, and feature readiness for GPU compute.
- Ensure functional correctness across drivers, firmware, runtime, and frameworks.
Advanced Debug & Root Cause Analysis
- Act as debug lead for complex GPU compute/AI issues spanning HW, FW, drivers, runtimes, and OS.
- Debug GPU hangs, page faults, ECC errors, memory corruption, and scheduler failures.
- Analyze failures using GPU traces, register dumps, crash dumps, JTAG, logs, WinDbg, counters, and AMD's profiler and debugger tools.
- Work directly with architecture, RTL, and design teams to influence fixes and mitigations.
Performance Analysis & Optimization
- Lead performance characterization and optimization for AI and compute workloads.
- Identify bottlenecks across compute units, memory bandwidth, cache, interconnect, and power.
- Drive workload-aware optimizations for training and inference use cases.
- Validate performance-per-watt and scalability against product and architectural goals.
Automation, Tools & Infrastructure
- Architect and drive automation frameworks for compute/AI validation and performance.
- Develop tooling using Python to improve efficiency and coverage.
- Integrate tests into CI/CD pipelines and regression systems.
- Enable data-driven decision making through dashboards and performance tracking.
Technical Leadership & Cross-Functional Influence
- Drive cross-team alignment with architecture, RTL, firmware, driver, compiler, and AI software teams.
- Influence architectural decisions through early validation and performance feedback.
- Represent the team in global technical forums and design reviews.
REQUIRED QUALIFICATIONS:
Technical Expertise
- 8+ years of experience in GPU compute / AI validation, debug, or performance
- Deep understanding of GPU architecture and parallel compute models
- Strong experience with AI/ML and HPC workloads
- Expertise in GPU drivers, runtimes, and system software (Linux and Windows)
- Hands-on experience with GPU profiling and debug tools
- Proficiency in Python, Groovy, GitHub, Linux, Windows, CI/CD, test development, and performance analysis
Leadership & Soft Skills
- Proven technical leadership at Senior/Staff level
- Ability to lead ambiguous, high-impact problem areas
- Strong communication skills
- Mentoring and design-review experience
PREFERRED EXPERIENCE:
- Product development or systems engineering background with hardware platforms and their software & firmware ecosystems
- Excellent verbal, written, and presentation skills
- Excellent interpersonal, organizational, analytical, planning, and technical leadership skills
- Proven record of accomplishment in delivering large multi-functional product solutions
- Experience working in a fast-paced matrixed technical organization and multi-site environment
- Experience with ROCm or similar compute stacks
- Experience with compiler or runtime optimizations for AI workloads
- Knowledge of power, thermal, and reliability (RAS) aspects of GPUs
- Prior experience in leading GPU or AI accelerator products
ACADEMIC CREDENTIALS:
- Bachelor's or Master's degree in Computer or Electrical Engineering or equivalent
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.