Anthony Nocentino solves a problem:
When you call out to an external embedding service from T-SQL via REST over HTTPS, you’re limited by the throughput of that backend. If you’re running a single Ollama instance, you’ll quickly hit a ceiling on how fast you can generate embeddings, especially for large datasets. I recently attended an event and discussed this topic. My first attempt at generating embeddings was for a three-million-row table. I had access to some world-class hardware to generate the embeddings. When I arrived at the lab and initiated the embedding generation process for this dataset, I quickly realized it would take approximately 9 days to complete. Upon closer examination, I found that I was not utilizing the GPUs to their full potential; in fact, I was only using about 15% of one GPU’s capacity. So I started to cook up this concept in my head, and here we are, load balancing embedding generation across multiple instances of ollama to more fully utilize the resources.
Click through for the solution.
Leave a Comment