Ask HN: How to increase LLM inference speed?

3 points by InkCanon 2 days ago

Hi HN,

I'm building software that has a very tight feedback loop with the user. One part involves a short (few hundred tokens) response from an LLM. This is by far the biggest UX problem - currently DeepSeek's total response time can reach 10 seconds, which is horrific. Would it be practical to cut that down to maybe ~2 seconds? The LLM is only asked to rephrase a short text (while preserving its meaning), so it doesn't need to be SOTA. On the whole, faster inference matters much more than quality.
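For reference, the call is essentially a stock OpenAI-compatible chat completion. Here's a minimal sketch (the base_url and model name are assumptions based on DeepSeek's OpenAI-compatible API); streaming and a max_tokens cap at least get the first tokens on screen sooner, even if total generation time doesn't change much:

    # Minimal sketch: stream the rephrase so output starts rendering immediately.
    # Assumes DeepSeek's OpenAI-compatible endpoint; base_url / model name may differ.
    from openai import OpenAI

    client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

    stream = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Rephrase the user's text, preserving its meaning."},
            {"role": "user", "content": "Text to rephrase goes here."},
        ],
        max_tokens=300,  # cap the output; a few hundred tokens is all that's needed
        stream=True,     # render tokens as they arrive instead of waiting for the full reply
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)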

cranberryturkey 2 days ago

You need a faster GPU, but that only works for self-hosted LLMs (e.g. Ollama / Hugging Face).
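If you go the self-hosted route, a small local model handles a rephrase task fine and keeps latency low. A minimal sketch against Ollama's local HTTP API (assumes the server is running on the default port and a small model - the name here is just an example - has already been pulled):

    # Minimal sketch: one-shot rephrase against a locally hosted model via Ollama.
    # Assumes `ollama serve` is running and the model has been pulled beforehand.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2",  # example small model; pick whatever fits your GPU
            "prompt": "Rephrase this while preserving the meaning: Text goes here.",
            "stream": False,      # return the whole response in one JSON object
        },
        timeout=30,
    )
    print(resp.json()["response"])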