Graph Sources Overview

Graph sources enable querying specialized indexes and external data sources using the same query interface as regular Fluree ledgers. This document provides a comprehensive overview of graph source architecture and capabilities.

Concept

A graph source is anything you can address by a graph name/IRI and query as part of a single execution. Some graph sources are ledger-backed RDF graphs; others are backed by different systems optimized for specific query patterns.

Regular Ledger:

Stored as RDF triples
Indexed with SPOT, POST, OPST, PSOT
Optimized for graph traversal

Non-ledger Graph Source:

Stored in specialized format
Custom indexing for specific queries
Optimized for particular use cases

Both are queried using the same SPARQL or JSON-LD Query syntax.

Architecture

Components

┌─────────────────────────────────────────┐
│         Fluree Query Engine             │
└─────────────────┬───────────────────────┘
                  │
      ┌───────────┴──────────┐
      │                      │
┌─────▼──────┐      ┌───────▼────────┐
│  Regular   │      │    Graph       │
│  Ledgers   │      │    Sources     │
└─────┬──────┘      └───────┬────────┘
      │                     │
      │             ┌───────┴────────┐
      │             │                │
┌─────▼──────┐ ┌───▼───┐     ┌─────▼──────┐
│ RDF Triple │ │ BM25  │     │  usearch   │
│   Store    │ │ Index │     │  Vector    │
└────────────┘ └───────┘     └────────────┘

Graph Source Registry (Nameservice)

Non-ledger graph sources are registered in the nameservice:

{
  "graph_source_id": "products-search:main",
  "type": "graph-source",
  "backend": "bm25",
  "source": "products:main",
  "config": {
    "fields": [...]
  },
  "status": "ready",
  "last_sync": "2024-01-22T10:30:00Z"
}

Graph Source Types

1. BM25 Full-Text Search

Backend: Inverted text index

Purpose: Keyword search with relevance ranking

Configuration:

{
  "type": "bm25",
  "source": "products:main",
  "fields": [
    { "predicate": "schema:name", "weight": 2.0 },
    { "predicate": "schema:description", "weight": 1.0 }
  ]
}
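
To illustrate what the per-field weights do, here is a minimal sketch of BM25 scoring across weighted fields. It is not Fluree's implementation: the whitespace tokenizer, the k1 and b defaults, and the rule of summing weight × per-field score are assumptions made for this example, and the two sample documents are invented.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real analyzers do more.
    return text.lower().split()

def bm25_field_score(query_terms, doc_tokens, all_docs_tokens, k1=1.2, b=0.75):
    """Classic BM25 score of one document field against the query terms."""
    n_docs = len(all_docs_tokens)
    avg_len = (sum(len(d) for d in all_docs_tokens) / n_docs) or 1.0
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in all_docs_tokens if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score

def weighted_score(query, doc, corpus, weights):
    """Assumed combination rule: sum of weight * BM25 score for each field."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in weights.items():
        field_docs = [tokenize(d.get(field, "")) for d in corpus]
        total += weight * bm25_field_score(terms, tokenize(doc.get(field, "")), field_docs)
    return total

# Invented sample documents, keyed by the predicates from the configuration.
corpus = [
    {"schema:name": "Gaming laptop", "schema:description": "Fast machine with a large screen"},
    {"schema:name": "USB cable", "schema:description": "Charges a laptop or phone"},
]
weights = {"schema:name": 2.0, "schema:description": 1.0}
scores = [weighted_score("laptop", d, corpus, weights) for d in corpus]
```

With these weights, a match in schema:name contributes double its BM25 score, so the first document outranks the second even though both contain the query term once.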

Query:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
    }
  ],
  "select": ["?product", "?score"]
}

2. Vector Similarity

Backend: HNSW index (embedded or remote)

Purpose: Semantic search using embeddings

Configuration:

{
  "type": "vector",
  "source": "products:main",
  "embedding_property": "ex:embedding",
  "dimensions": 384,
  "metric": "cosine"
}
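
For reference, the "cosine" metric named in the configuration can be sketched as below. This shows only the standard formula. Whether f:resultScore carries a similarity or a distance is not stated in this document, and the dimension check is an assumption about how a mismatch might be handled.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        # An index configured with "dimensions": 384 would expect 384 values here.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because cosine ignores vector length, [1.0, 2.0] and [2.0, 4.0] score as identical directions.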

Query:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "mydb:main",
  "where": [
    {
      "f:graphSource": "products-vector:main",
      "f:queryVector": [0.1, 0.2, ...],
      "f:distanceMetric": "cosine",
      "f:searchLimit": 10,
      "f:searchResult": {
        "f:resultId": "?product",
        "f:resultScore": "?score"
      }
    }
  ],
  "select": ["?product", "?score"]
}

3. Apache Iceberg

Backend: Iceberg tables / Parquet files via R2RML mapping

Purpose: Analytics on data lake

Iceberg graph sources require an R2RML mapping that defines how table rows become RDF triples. Two catalog modes select how Iceberg metadata is discovered:

REST catalog: connects to an Iceberg REST catalog API (e.g., Polaris)
Direct S3: reads metadata/version-hint.text from the table’s S3 location (no catalog server required)

See Iceberg / Parquet for full configuration details and examples.
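
As a rough illustration of the direct S3 mode, the sketch below resolves the current metadata file from the contents of version-hint.text. It assumes the file-system table layout used by Iceberg's Hadoop-style tables, where the hint file holds an integer version N and the metadata file is named vN.metadata.json. The function name is hypothetical, and Fluree's own resolution logic is not shown in this document.

```python
def metadata_path_from_hint(table_location, hint_text):
    """Build the metadata file path from a table location and version-hint.text contents."""
    version = int(hint_text.strip())  # the hint file is assumed to hold a bare integer
    return f"{table_location.rstrip('/')}/metadata/v{version}.metadata.json"

path = metadata_path_from_hint("s3://bucket/warehouse/logs/execution_log", "3\n")
```

No catalog server is involved in this lookup: the table location and the hint file are enough to find the current metadata.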

Query:

{
  "from": "warehouse-orders:main",
  "select": ["?orderId", "?total"],
  "where": [
    { "@id": "?order", "ex:orderId": "?orderId" },
    { "@id": "?order", "ex:total": "?total" }
  ]
}

Creating Graph Sources

Via Rust API

Graph sources are created and registered via the fluree-db-api Rust API, which publishes the graph source record into the nameservice.

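use fluree_db_api::{FlureeBuilder, R2rmlCreateConfig};

// Run this inside an async function that returns a Result,
// because the calls below use .await and the ? operator.
let fluree = FlureeBuilder::default().build().await?;

// Describe the graph source: name, table location, and R2RML mapping.
let config = R2rmlCreateConfig::new_direct(
    "execution-log",
    "s3://bucket/warehouse/logs/execution_log",
    "fluree:file://mappings/execution_log.ttl",
)
.with_s3_region("us-east-1");

// Create the graph source and publish its record into the nameservice.
fluree.create_r2rml_graph_source(config).await?;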

Querying Graph Sources

Graph sources come in two flavors with different query models:

Iceberg sources — queried transparently using standard SPARQL/JSON-LD patterns (FROM, GRAPH, or as a direct query target)
Search indexes (BM25, Vector) — queried using the f:graphSource / f:searchText pattern

Iceberg (Transparent)

Iceberg graph sources are queried just like ledgers. No special syntax is needed:

As a direct target:

# Query the graph source directly
SELECT ?s ?p ?o FROM <execution-log:main> WHERE { ?s ?p ?o } LIMIT 10

Via GRAPH pattern (joining with ledger data):

{
  "from": "mydb:main",
  "select": ["?customer", "?orderId", "?total"],
  "where": [
    { "@id": "?customer", "schema:name": "?name" },
    { "@id": "?customer", "ex:customerId": "?custId" },
    {
      "graph": "warehouse-orders:main",
      "where": [
        { "@id": "?order", "ex:customerId": "?custId" },
        { "@id": "?order", "ex:orderId": "?orderId" },
        { "@id": "?order", "ex:total": "?total" }
      ]
    }
  ]
}

Iceberg graph sources use R2RML mappings to define how table rows become RDF triples. See Iceberg / Parquet and R2RML for details.

Search Indexes (BM25, Vector)

Search indexes use the f:graphSource pattern:

Single Graph Source

Query one graph source:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "select": ["?product", "?score"],
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
    }
  ]
}
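
When the search text or limit comes from user input, it can help to build this query in code rather than by hand. The helper below is a hypothetical convenience function, not part of any Fluree client library. It only assembles the JSON shown above.

```python
import json

F_NS = "https://ns.flur.ee/db#"

def bm25_query(ledger, graph_source, text, limit=20):
    """Assemble a JSON-LD query with a single f:graphSource search pattern."""
    return {
        "@context": {"f": F_NS},
        "from": ledger,
        "select": ["?product", "?score"],
        "where": [
            {
                "f:graphSource": graph_source,
                "f:searchText": text,
                "f:searchLimit": limit,
                "f:searchResult": {"f:resultId": "?product", "f:resultScore": "?score"},
            }
        ],
    }

query = bm25_query("products:main", "products-search:main", "laptop")
body = json.dumps(query, indent=2)  # request body to send to the query endpoint
```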

Multiple Graph Sources

Combine multiple graph sources:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "select": ["?product", "?textScore", "?vecScore"],
  "values": [
    ["?queryVec"],
    [{"@value": [0.1, 0.2, 0.3], "@type": "https://ns.flur.ee/db#embeddingVector"}]
  ],
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 100,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?textScore" }
    },
    {
      "f:graphSource": "products-vector:main",
      "f:queryVector": "?queryVec",
      "f:searchLimit": 100,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?vecScore" }
    }
  ]
}
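
The query above returns a text score and a vector score for each product, but it does not say how to merge them into one ranking. One common client-side option is reciprocal rank fusion, sketched below. This is a swapped-in general technique, not something Fluree is documented to do, and it assumes that a higher score is better for both result columns. The sample rows are invented.

```python
def reciprocal_rank_fusion(rows, k=60):
    """rows: (product, text_score, vec_score) tuples. Returns product ids, best first."""
    by_text = sorted(rows, key=lambda r: r[1], reverse=True)
    by_vec = sorted(rows, key=lambda r: r[2], reverse=True)
    fused = {}
    for ranking in (by_text, by_vec):
        for rank, row in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the effect of top ranks.
            fused[row[0]] = fused.get(row[0], 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Invented result rows: (?product, ?textScore, ?vecScore)
rows = [("ex:p1", 7.2, 0.61), ("ex:p2", 5.9, 0.93), ("ex:p3", 3.1, 0.88)]
ranked = reciprocal_rank_fusion(rows)
```

Rank-based fusion avoids comparing BM25 scores and vector scores directly, which matters because the two are on different scales.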

Graph Sources + Regular Graphs

Combine graph sources and regular ledgers:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "select": ["?product", "?name", "?price", "?score"],
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?product", "f:resultScore": "?score" }
    },
    { "@id": "?product", "schema:name": "?name" },
    { "@id": "?product", "schema:price": "?price" }
  ]
}

Synchronization

Source Tracking

Graph sources track their source ledger:

Source: products:main @ t=150
Graph Source: products-search:main @ source_t=150

Update Modes

Real-Time:

Updates immediately as source changes
Low latency
Higher overhead

Batch:

Updates periodically
Higher latency
Lower overhead

Manual:

Updates on demand
Full control
Requires manual triggering

Checking Sync Status

curl http://localhost:8090/graph-source/products-search:main/status

Response:

{
  "name": "products-search:main",
  "source": "products:main",
  "source_t": 150,
  "index_t": 148,
  "lag": 2,
  "last_sync": "2024-01-22T10:30:00Z",
  "status": "syncing"
}
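
In the sample response, lag equals source_t minus index_t (150 - 148 = 2). The sketch below applies that relation, which is inferred from the example rather than from a documented contract. The function names and the max_lag threshold are hypothetical.

```python
def sync_lag(status):
    """Number of source transactions the graph source has not yet indexed."""
    return status["source_t"] - status["index_t"]

def is_stale(status, max_lag=0):
    """True when the graph source trails its source ledger by more than max_lag."""
    return sync_lag(status) > max_lag

status = {
    "name": "products-search:main",
    "source": "products:main",
    "source_t": 150,
    "index_t": 148,
    "lag": 2,
    "last_sync": "2024-01-22T10:30:00Z",
    "status": "syncing",
}
lag = sync_lag(status)
```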

Query Execution

Query Planning

The query planner handles graph sources in five steps:

1. Parse Query: Extract graph patterns
2. Route Subqueries: Identify which graphs handle which patterns
3. Execute Subqueries: Run against appropriate backends
4. Join Results: Combine results from multiple graphs
5. Apply Filters: Final filtering and sorting

Example Execution

Query:

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 50,
      "f:searchResult": { "f:resultId": "?p" }
    },
    { "@id": "?p", "schema:price": "?price" }
  ],
  "filter": "?price < 1000"
}

Execution Plan:

1. Execute BM25 search on products-search:main:
   f:searchText "laptop", f:searchLimit 50
   → Result: ?p = [ex:p1, ex:p2, ex:p3, ...]

2. Execute on products:main:
   SELECT ?p ?price WHERE {
     VALUES ?p { ex:p1 ex:p2 ex:p3 ... }
     ?p schema:price ?price
   }
   → Result: [(ex:p1, 899), (ex:p2, 1200), ...]

3. Join and filter:
   ?price < 1000
   → Result: [(ex:p1, 899)]
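
The three steps of the plan can be mimicked with plain Python data structures. This is a toy simulation, not the engine's code: the search hits and prices are hard-coded stand-ins for the BM25 index and the ledger, and ex:p3 is given no price so that the join has something to drop.

```python
def run_plan(search_hits, price_by_subject, max_price):
    """Simulate steps 2 and 3: join search hits to prices, then filter."""
    # Step 2: look up schema:price for each hit, like the VALUES ?p { ... } join.
    joined = [(p, price_by_subject[p]) for p in search_hits if p in price_by_subject]
    # Step 3: apply the filter ?price < max_price.
    return [(p, price) for p, price in joined if price < max_price]

# Step 1 stand-in: ids returned by the BM25 search for "laptop".
hits = ["ex:p1", "ex:p2", "ex:p3"]
# Toy ledger data; ex:p3 has no schema:price triple here.
prices = {"ex:p1": 899, "ex:p2": 1200}

result = run_plan(hits, prices, 1000)
```

Only ex:p1 survives: ex:p2 fails the price filter and ex:p3 never joins.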

Performance Characteristics

BM25 Graph Sources

Index Build: O(n × avg_doc_length)
Query: O(log n) with inverted index
Space: 2-3× source data
Update: Incremental, O(doc_size)

Vector Graph Sources

Index Build: O(n log n) for HNSW
Query: O(log n) approximate
Space: 1.5× embedding size
Update: Incremental, approximately O(log n) per inserted vector

Iceberg Graph Sources

Index Build: No index (direct file access)
Query: O(partitions scanned)
Space: Zero overhead (uses Parquet files)
Update: Batch-oriented

Best Practices

1. Choose Appropriate Type

Match graph source type to use case:

Keyword search → BM25
Semantic search → Vector
Analytics / data lake → Iceberg (with R2RML mapping)

2. Monitor Synchronization

Check sync lag regularly:

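// getGraphSourceStatus is a hypothetical helper wrapping the status endpoint shown earlier
const getGraphSourceStatus = async (name) =>
  (await fetch(`http://localhost:8090/graph-source/${name}/status`)).json();

// Poll every 60 seconds and warn when the index trails the source by more than 10 transactions
setInterval(async () => {
  const status = await getGraphSourceStatus('products-search:main');
  if (status.lag > 10) {
    console.warn(`Graph source lag: ${status.lag} transactions`);
  }
}, 60000);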

3. Filter in Graph Sources

Push filters to graph sources when possible, for example by placing the graph source pattern ahead of ledger patterns:

Good (graph source pattern first narrows results before graph traversal):

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "where": [
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?p" }
    },
    { "@id": "?p", "schema:name": "?name" }
  ]
}

Bad (graph traversal before graph source means scanning all products first):

{
  "@context": {"f": "https://ns.flur.ee/db#"},
  "from": "products:main",
  "where": [
    { "@id": "?p", "schema:name": "?name" },
    {
      "f:graphSource": "products-search:main",
      "f:searchText": "laptop",
      "f:searchLimit": 20,
      "f:searchResult": { "f:resultId": "?p" }
    }
  ]
}
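
If queries are assembled programmatically, the ordering can be enforced before the query is sent. The helper below is a hypothetical client-side sketch: it moves every pattern containing f:graphSource to the front of the where clause and keeps the relative order of everything else. Whether the planner also reorders patterns on its own is not stated in this document.

```python
def graph_source_first(where):
    """Return the where patterns with f:graphSource patterns moved to the front."""
    def is_search(pattern):
        return isinstance(pattern, dict) and "f:graphSource" in pattern
    searches = [p for p in where if is_search(p)]
    others = [p for p in where if not is_search(p)]
    return searches + others

bad_where = [
    {"@id": "?p", "schema:name": "?name"},
    {
        "f:graphSource": "products-search:main",
        "f:searchText": "laptop",
        "f:searchLimit": 20,
        "f:searchResult": {"f:resultId": "?p"},
    },
]
good_where = graph_source_first(bad_where)
```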

4. Use Explain Plans

Understand query execution:

curl -X POST http://localhost:8090/v1/fluree/explain \
  -H "Content-Type: application/json" \
  -d '{...}'

5. Limit Results

Always use LIMIT with graph sources:

{
  "where": [...],
  "limit": 100
}

Troubleshooting

High Sync Lag

Symptom: Sync lag keeps increasing

Causes:

Source ledger write rate too high
Graph source indexing too slow
Resource constraints

Solutions:

Increase indexing resources
Batch updates
Use manual sync mode

Query Performance Issues

Symptom: Slow queries combining graph sources

Solutions:

Check explain plan
Add filters to reduce intermediate results
Ensure graph source is synced
Consider query rewrite

Missing Results

Symptom: Expected results not returned

Causes:

Graph source not synced
Mapping misconfiguration
Filter too restrictive

Solutions:

Check sync status
Verify mapping configuration
Test subqueries independently
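
One way to test subqueries independently is to split a combined query into one small query per where pattern and run each on its own. The sketch below is a hypothetical debugging helper. It copies @context and from, selects whatever variables appear in the pattern, and adds a small limit so that an isolated ledger pattern does not return the whole ledger.

```python
def variables_in(node):
    """Collect ?variables from a pattern, in order of first appearance."""
    if isinstance(node, str):
        found = [node] if node.startswith("?") else []
    elif isinstance(node, dict):
        found = [v for value in node.values() for v in variables_in(value)]
    elif isinstance(node, list):
        found = [v for item in node for v in variables_in(item)]
    else:
        found = []
    return list(dict.fromkeys(found))  # de-duplicate while keeping order

def isolate_patterns(query, limit=10):
    """Return one standalone query per where pattern of the combined query."""
    base = {key: query[key] for key in ("@context", "from") if key in query}
    return [
        dict(base, select=variables_in(pattern), where=[pattern], limit=limit)
        for pattern in query["where"]
    ]

combined = {
    "@context": {"f": "https://ns.flur.ee/db#"},
    "from": "products:main",
    "select": ["?product", "?name", "?score"],
    "where": [
        {
            "f:graphSource": "products-search:main",
            "f:searchText": "laptop",
            "f:searchLimit": 20,
            "f:searchResult": {"f:resultId": "?product", "f:resultScore": "?score"},
        },
        {"@id": "?product", "schema:name": "?name"},
    ],
}
subqueries = isolate_patterns(combined)
```

If the search subquery returns rows but the combined query does not, the problem is more likely in the join or the ledger patterns than in the graph source.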

Related Documentation

BM25 Graph Source - Full-text search
Iceberg - Data lake integration
R2RML - R2RML mapping reference
BM25 Indexing - BM25 details
Vector Search - Vector details
Query Datasets - Multi-graph queries
