About
👋 Hi, I’m Akash, an applied researcher/engineer with experience in speech, audio (at Microsoft), and most recently multi-modal document understanding and retrieval (at Contextual AI). This incidentally completes the trio of audio, vision & text AI multimodality. :)
I’m currently on a sabbatical. After moving to the USA for grad school ~10 years ago, I decided to take a break to reflect, recharge and tinker before setting sail again. More on this here shortly!
Work
Contextual AI
[2024-25]
Wrangled millions of pages to land the first $ millions in enterprise contracts :)
- Product development (0→1): RAG platform for knowledge agents
- Built core multimodal document understanding system powering ingestion and retrieval
- Critical in landing the company’s first multi-million-dollar enterprise contract, with Qualcomm
- Applied research: Synthesis of long complex documents, eval design
- Combining segmentation models, VLMs, and parsers for high-fidelity OCR with bbox provenance
- Token-efficient cross-referencing via ingest-time document enrichment (à la llm-wiki)
- Demo: Chat with 250-page PDF in Cursor | Blog: Agentic alternative to GraphRAG
- SWE things: Workflow/agent framework architecture, testing, observability, and scalability
- Tech Lead Manager: DRI cross-company; Mentored team of 3, interviewed candidates
Microsoft
[2018-23]
Fun fact: ~6M hours of monthly traffic equals 1 *year* of conversations transcribed per hour!
- Model development: state-of-the-art transcription designed for scale [O(1e7) hrs/month]
- Shipped both batch and streaming models to Azure Batch, Microsoft Word, and Microsoft Teams
- Optimized Conformer batch model (Whisper-comparable) at 50x realtime
- Applied research: diarized multi-speaker multi-mic transcription
- Shipped a diarized conference-room transcription device, covered by The Verge
- Lead contributor: ASR training recipes, evaluation metrics, cross-system error analysis
- Research engineering: data pipelines, optimizing distributed training and inference
- Speeding up O(1e20) FLOP training on low-cost V100 GPUs
- Leveraged NVIDIA/ONNX profiling tools to fix bottlenecks in inference throughput
‘Graduated’ as one of the few non-speech-PhD senior members on the team :)
Misc
Open source
- [2023] 🐥🗣️ Contributed to whisper.cpp (38k+ stars). tinydiarize is a lightweight extension of OpenAI’s Whisper model for speaker diarization, runnable on MacBooks/iPhones.
- [2019-22] 🐋 Co-founded OrcaHello, a real-time alert system listening for endangered orca calls 24/7 at underwater “hydrophones” in the Pacific Northwest. Awarded a $30k AI for Earth grant; covered by Mongabay News.
- [2018] 🗣️ Built Attention, I’m Trying to Speak: speech synthesis with just $75 of compute. Got to fist-bump Richard Socher for a Stanford CS224n project award :).
Other
- [2016/17] Wrote case studies on the music streaming industry while studying business/tech strategy at Stanford MS&E.
- [2014] Organized what was then Chennai’s largest EDM gig, with 5k+ attendees, during my undergrad at IIT Madras/Chennai.