Java Developer
About Candidate
Introduction:
The candidate is a Senior Software Engineer with a strong background in developing high-performance software and hardware stacks, with a focus on deep learning and heterogeneous systems. They have extensive experience in memory planning, performance optimization, and code generation, and a deep understanding of accelerator-based architectures and compiler infrastructure. The candidate has contributed to significant projects, including the deep learning compiler Apache TVM and TornadoVM, and has worked on memory management and data transfer optimization for heterogeneous systems. They have a proven track record in designing and implementing cooperative execution, double-buffering, and performance tracing, all of which are key to optimizing execution on accelerators. Their expertise spans technologies including CUDA, OpenCL, PTX, and Level-Zero, and they have a strong foundation in JVM runtime systems. They also have experience in microservices, system troubleshooting, and migrating monolithic systems to microservices architectures. The candidate has led cross-team efforts to support new hardware generations, improving performance and collaboration. Their work involves deep integration with custom hardware and software, targeting both academic and industry-scale systems.
Responsibilities:
- Building end-to-end software/hardware stacks for offloading deep neural networks to custom-built AI accelerators.
- Working on performance optimizations, memory planning, and code generation across the software stack.
- Leading efforts to support new hardware generations in compilers and ensuring target performance.
- Implementing cooperative and batched execution across multiple instances of the same network that share weights in accelerator memory.
- Adding double-buffering to hide data transfers between host and device.
- Developing performance and compilation tracers for the compiler infrastructure, helping with operator fine-tuning.
- Integrating the TVM compiler by writing host runtime components to support custom hardware implementations.
- Contributing to the development of deep learning compilers like Apache TVM, including porting memory planners and optimizing memory usage.
- Enhancing memory management systems, enabling device memory reuse and buffer caching.
- Implementing PTX compilers alongside OpenCL compilers to improve hardware usage.
- Adding functionality for data batching and overlapping computation and data transfers using double-buffering.
- Improving data transfer speed between host and device by enabling off-heap buffers and page-locked memory.
- Troubleshooting performance issues and bottlenecks in microservices pipelines for public transit systems.
- Implementing business features and maintaining microservices in transit-related applications.
- Dividing monolithic services into microservices, applying performance tests, and measuring overall system performance.