Embedding Datasets
Learn how to define and augment datasets with embedding columns for advanced search capabilities.
Overview
Spice provides three distinct methods for handling embedding columns in datasets:
- Just-in-Time (JIT) Embeddings: Computes embeddings on demand during query execution, with no precomputation step.
- Accelerated Embeddings: Precomputes embeddings by transforming and augmenting the source dataset for faster query and search performance.
- Passthrough Embeddings: Uses embeddings that already exist in the underlying source dataset, so no additional computation is needed.
Configuring Embedding Models
Before configuring dataset embeddings, define the embedding models in spicepod.yaml. For example:
embeddings:
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  - from: openai
    name: remote_service
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
See Embedding components for more information on embedding models.
Vector Searches
Spice uses embeddings to support vector (similarity) searches. Both local and remote embedding models can be used for these searches.
To run a vector search, embeddings must be defined for the relevant columns in your dataset. Once configured, similarity searches can be performed using the defined embeddings.
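As a minimal sketch of that configuration, assuming the column-level `embeddings` syntax and the two models defined in the example above, the following spicepod.yaml fragment attaches an embedding model to a text column. The dataset sources, dataset names, and the `body` column are hypothetical placeholders, and connection parameters are omitted; check the dataset reference for the exact keys your Spice version supports.

```yaml
datasets:
  # JIT embeddings: no acceleration is configured, so embeddings
  # for `body` are computed during query execution.
  - from: postgres:public.support_tickets   # hypothetical source
    name: support_tickets                   # hypothetical dataset name
    columns:
      - name: body                          # hypothetical text column
        embeddings:
          - from: local_embedding_model     # model defined under `embeddings:`

  # Accelerated embeddings: enabling acceleration lets the embeddings
  # be precomputed alongside the accelerated data.
  - from: postgres:public.kb_articles       # hypothetical source
    name: kb_articles                       # hypothetical dataset name
    acceleration:
      enabled: true
    columns:
      - name: body
        embeddings:
          - from: remote_service
```

In this sketch the only difference between the JIT and accelerated variants is the `acceleration` block; the column-level embedding definition stays the same.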
For detailed instructions and examples on running vector searches, refer to the Vector-Based Search documentation.
Generating Embeddings in Queries
The embed() scalar function generates embeddings directly within SQL queries. It accepts either a single text string or an array of text, which makes it useful for ad-hoc embedding generation and comparison operations.
