6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -9,9 +9,9 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

### Added

- Added optional `/catalogs` route support to enable federated hierarchical catalog browsing and navigation. [#547](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/547)
- Added DELETE `/catalogs/{catalog_id}/collections/{collection_id}` endpoint to support removing collections from catalogs. When a collection belongs to multiple catalogs, it removes only the specified catalog from the collection's parent_ids. When a collection belongs to only one catalog, the collection is deleted entirely. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554)
- Added `parent_ids` internal field to collections to support multi-catalog hierarchies. Collections can now belong to multiple catalogs, with parent catalog IDs stored in this field for efficient querying and management. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554)
- Environment variable `VALIDATE_QUERYABLES` to enable/disable validation of queryables in search/filter requests. When set to `true`, search requests will be validated against the defined queryables, returning an error for any unsupported fields. Defaults to `false` for backward compatibility. [#532](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/532)

- Environment variable `QUERYABLES_CACHE_TTL` to configure the TTL (in seconds) for caching queryables. Default is `1800` seconds (30 minutes) to balance performance and freshness of queryables data. [#532](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/532)

### Changed

27 changes: 26 additions & 1 deletion README.md
@@ -469,8 +469,10 @@ You can customize additional settings in your `.env` file:
| `STAC_INDEX_ASSETS` | Controls if Assets are indexed when added to Elasticsearch/Opensearch. This allows asset fields to be included in search queries. | `false` | Optional |
| `USE_DATETIME` | Configures the datetime search behavior in SFEOS. When enabled, searches both datetime field and falls back to start_datetime/end_datetime range for items with null datetime. When disabled, searches only by start_datetime/end_datetime range. | `true` | Optional |
| `USE_DATETIME_NANOS` | Enables nanosecond precision handling for `datetime` field searches as per the `date_nanos` type. When `false`, it uses millisecond precision (3 fractional digits) as per the `date` type. | `true` | Optional |
| `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. | None | Optional |
| `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. If `VALIDATE_QUERYABLES` is enabled, these fields will also be considered invalid for filtering. | None | Optional |
| `EXCLUDED_FROM_ITEMS` | Specifies fields to exclude from STAC item responses. Supports comma-separated field names and dot notation for nested fields (e.g., `private_data,properties.confidential,assets.internal`). | `None` | Optional |
| `VALIDATE_QUERYABLES` | Enable validation of query parameters against the collection's queryables. If set to `true`, the API will reject queries containing fields that are not defined in the collection's queryables. | `false` | Optional |
Collaborator

@bountx How does this interact with the EXCLUDED_FROM_QUERYABLES environment variable? Should both these env vars be combined into one?

Collaborator Author

I did miss this in my implementation, so I will fix it. However, these two variables serve different purposes: VALIDATE_QUERYABLES validates requests against the set of all queryables, while EXCLUDED_FROM_QUERYABLES makes that set smaller.

When both are used, their interaction should look something like this:
Cache of all queryables -> remove all excluded queryables from this set -> validate against the resulting set

If you had something else in mind, let me know.

Collaborator Author

@jonhealy1 Fixed this interaction to work the way I mentioned beforehand.

| `QUERYABLES_CACHE_TTL` | Time-to-live (in seconds) for the queryables cache. Used when `VALIDATE_QUERYABLES` is enabled. | `1800` | Optional |


> [!NOTE]
@@ -526,6 +528,29 @@ EXCLUDED_FROM_QUERYABLES="properties.auth:schemes,properties.storage:schemes,pro
- Excluded fields and their nested children will be skipped during field traversal
- Both the field itself and any nested properties will be excluded

## Queryables Validation

SFEOS supports validating query parameters against the collection's defined queryables. This ensures that users only query fields that are explicitly exposed and indexed.

**Configuration:**

To enable queryables validation, set the following environment variables:

```bash
VALIDATE_QUERYABLES=true
QUERYABLES_CACHE_TTL=1800 # Optional, defaults to 1800 seconds (30 minutes)
```
Collaborator

Does this seem like a long default?

Collaborator Author

This is what my team suggested at first, since the query that populates the queryables cache doesn't seem that costly. I'll discuss it with them again in more detail, though.

Collaborator Author

I changed the default to 6 hours.

Collaborator

I was wondering if it should be shorter, maybe 30 minutes.

Collaborator Author

Ahh, I see. In our case, changes to the queryables are pretty rare, so refreshing every 1 to 6 hours wouldn't make much difference, but I think a 30-minute default would work too.

Collaborator

I think if someone is adding new data to the db, they may get confused.

Collaborator Author

Updated the default value to 30 minutes.
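
To make the TTL trade-off from this thread concrete, here is a minimal sketch of a time-based cache guarded by an `asyncio.Lock`. `TTLCache`, `loader`, and `demo` are hypothetical names used only for illustration; this is not the SFEOS implementation. A shorter TTL means newly indexed fields become valid sooner, at the cost of more frequent mapping lookups.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, Set


class TTLCache:
    """Hypothetical time-based cache; not the SFEOS implementation."""

    def __init__(self, loader: Callable[[], Awaitable[Set[str]]], ttl: float):
        self._loader = loader
        self._ttl = ttl
        self._value: Set[str] = set()
        self._loaded_at: Optional[float] = None
        self._lock = asyncio.Lock()
        self.loads = 0  # counts how many times the loader actually ran

    def _is_fresh(self) -> bool:
        return (
            self._loaded_at is not None
            and time.monotonic() - self._loaded_at < self._ttl
        )

    async def get(self) -> Set[str]:
        if self._is_fresh():
            return self._value
        async with self._lock:
            # Re-check after acquiring the lock: another coroutine may
            # already have refreshed the value while we were waiting.
            if not self._is_fresh():
                self._value = await self._loader()
                self._loaded_at = time.monotonic()
                self.loads += 1
        return self._value


async def demo() -> int:
    async def loader() -> Set[str]:
        await asyncio.sleep(0)  # stand-in for a database mapping lookup
        return {"platform", "eo:cloud_cover"}

    cache = TTLCache(loader, ttl=1800)
    # Ten concurrent requests should trigger a single load.
    await asyncio.gather(*(cache.get() for _ in range(10)))
    return cache.loads
```

Calling `asyncio.run(demo())` runs ten concurrent lookups against one cache; the second freshness check inside the lock is what keeps the loader from running once per waiting coroutine.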


**Behavior:**

- When enabled, the API maintains a cache of all queryable fields across all collections.
- Search requests (both GET and POST) are checked against this cache.
- If a request contains a query parameter or filter field that is not in the list of allowed queryables, the API returns a `400 Bad Request` error with a message indicating the invalid field(s).
- The cache is automatically refreshed based on the `QUERYABLES_CACHE_TTL` setting.
- **Interaction with `EXCLUDED_FROM_QUERYABLES`**: If `VALIDATE_QUERYABLES` is enabled, fields listed in `EXCLUDED_FROM_QUERYABLES` will also be considered invalid for filtering. This effectively enforces the exclusion of these fields from search queries.

This feature helps prevent queries on non-queryable fields, which could lead to unnecessary load on the database.
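
The behavior described above can be sketched as two small set operations. The helper names `allowed_queryables` and `invalid_fields` and the sample field names are assumptions made for this illustration; in the PR itself, exclusions are applied while the queryables mapping is built rather than in a separate step.

```python
from typing import Iterable, Set


def allowed_queryables(all_fields: Iterable[str], excluded: Iterable[str]) -> Set[str]:
    """Drop excluded fields and their nested children from the full set."""
    prefixes = set(excluded)
    return {
        f
        for f in all_fields
        if not any(f == p or f.startswith(p + ".") for p in prefixes)
    }


def invalid_fields(requested: Iterable[str], allowed: Set[str]) -> Set[str]:
    """Return the requested fields that are not in the allowed set."""
    return set(requested) - allowed


# Hypothetical sample data.
all_fields = {"platform", "eo:cloud_cover", "auth:schemes", "auth:schemes.s3.type"}
allowed = allowed_queryables(all_fields, excluded={"auth:schemes"})
rejected = invalid_fields({"platform", "auth:schemes", "foo"}, allowed)
```

Here `rejected` contains both the excluded field and the unknown one, which is the set a `400 Bad Request` response would report.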

## Datetime-Based Index Management

### Overview
4 changes: 4 additions & 0 deletions stac_fastapi/core/stac_fastapi/core/base_database_logic.py
@@ -140,6 +140,10 @@ async def delete_collection(
        pass

    @abc.abstractmethod
    async def get_queryables_mapping(self, collection_id: str = "*") -> Dict[str, Any]:
        """Retrieve mapping of Queryables for search."""
        pass

    async def get_all_catalogs(
        self,
        token: Optional[str],
14 changes: 14 additions & 0 deletions stac_fastapi/core/stac_fastapi/core/core.py
@@ -24,6 +24,10 @@
from stac_fastapi.core.base_settings import ApiBaseSettings
from stac_fastapi.core.datetime_utils import format_datetime_range
from stac_fastapi.core.models.links import PagingLinks
from stac_fastapi.core.queryables import (
    QueryablesCache,
    get_properties_from_cql2_filter,
)
from stac_fastapi.core.serializers import (
    CatalogSerializer,
    CollectionSerializer,
@@ -92,6 +96,10 @@ class CoreClient(AsyncBaseCoreClient):
    title: str = attr.ib(default="stac-fastapi")
    description: str = attr.ib(default="stac-fastapi")

    def __attrs_post_init__(self):
        """Initialize the queryables cache."""
        self.queryables_cache = QueryablesCache(self.database)

    def extension_is_enabled(self, extension_name: str) -> bool:
        """Check if an extension is enabled by checking self.extensions.

@@ -844,6 +852,8 @@ async def post_search(
            )

        if hasattr(search_request, "query") and getattr(search_request, "query"):
            query_fields = set(getattr(search_request, "query").keys())
            await self.queryables_cache.validate(query_fields)
            for field_name, expr in getattr(search_request, "query").items():
                field = "properties__" + field_name
                for op, value in expr.items():
@@ -862,7 +872,11 @@ async def post_search(

        if cql2_filter is not None:
            try:
                query_fields = get_properties_from_cql2_filter(cql2_filter)
                await self.queryables_cache.validate(query_fields)
                search = await self.database.apply_cql2_filter(search, cql2_filter)
            except HTTPException:
                raise
            except Exception as e:
                raise HTTPException(
                    status_code=400, detail=f"Error with cql2 filter: {e}"
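
The `except HTTPException: raise` clause added in the last hunk presumably exists so that the validation error is not swallowed by the generic handler and re-wrapped with a different message. A dependency-free sketch of that ordering follows; `HTTPError` is a stand-in for `fastapi.HTTPException`, and `validate`, `apply_filter`, and `detail_for` are hypothetical helpers, not code from the PR.

```python
class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def validate(fields: set, allowed: set) -> None:
    invalid = fields - allowed
    if invalid:
        raise HTTPError(400, f"Invalid query fields: {', '.join(sorted(invalid))}")


def apply_filter(fields: set, allowed: set) -> str:
    try:
        validate(fields, allowed)
        return "ok"
    except HTTPError:
        # Re-raise unchanged so the validation message reaches the caller.
        raise
    except Exception as e:
        # Any other failure is wrapped in a generic filter error.
        raise HTTPError(400, f"Error with cql2 filter: {e}")


def detail_for(fields: set, allowed: set) -> str:
    """Return the error detail a client would see, or 'ok' on success."""
    try:
        return apply_filter(fields, allowed)
    except HTTPError as err:
        return err.detail
```

Because the specific handler is listed first, `detail_for({"foo"}, {"platform"})` surfaces the validation message itself rather than a message prefixed with the generic filter error.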
105 changes: 105 additions & 0 deletions stac_fastapi/core/stac_fastapi/core/queryables.py
@@ -0,0 +1,105 @@
"""A module for managing queryable attributes."""

import asyncio
import os
import time
from typing import Any, Dict, List, Set

from fastapi import HTTPException


class QueryablesCache:
    """A coroutine-safe (asyncio), time-based cache for queryable properties."""

    def __init__(self, database_logic: Any):
        """
        Initialize the QueryablesCache.

        Args:
            database_logic: An instance of a class with a `get_queryables_mapping` method.
        """
        self._db_logic = database_logic
        self._cache: Dict[str, List[str]] = {}
        self._all_queryables: Set[str] = set()
        self._last_updated: float = 0
        self._lock = asyncio.Lock()
        self.validation_enabled: bool = False
        self.cache_ttl: int = 1800  # How often to refresh cache (in seconds)
        self.reload_settings()

    def reload_settings(self):
        """Reload settings from environment variables."""
        self.validation_enabled = (
            os.getenv("VALIDATE_QUERYABLES", "false").lower() == "true"
        )
        self.cache_ttl = int(os.getenv("QUERYABLES_CACHE_TTL", "1800"))

    async def _update_cache(self):
        """Update the cache with the latest queryables from the database."""
        if not self.validation_enabled:
            return

        async with self._lock:
            if (time.time() - self._last_updated < self.cache_ttl) and self._cache:
                return

            queryables_mapping = await self._db_logic.get_queryables_mapping()
            all_queryables_set = set(queryables_mapping.keys())

            self._all_queryables = all_queryables_set

            self._cache = {"*": list(all_queryables_set)}
            self._last_updated = time.time()

    async def get_all_queryables(self) -> Set[str]:
        """
        Return a set of all queryable attributes across all collections.

        This method will update the cache if it's stale or has been cleared.
        """
        if not self.validation_enabled:
            return set()

        if (time.time() - self._last_updated >= self.cache_ttl) or not self._cache:
            await self._update_cache()
        return self._all_queryables

    async def validate(self, fields: Set[str]) -> None:
        """
        Validate if the provided fields are queryable.

        Raises HTTPException if invalid fields are found.
        """
        if not self.validation_enabled:
            return

        allowed_fields = await self.get_all_queryables()
        invalid_fields = fields - allowed_fields
        if invalid_fields:
            raise HTTPException(
                status_code=400,
                detail=f"Invalid query fields: {', '.join(sorted(invalid_fields))}. "
                "These fields are not defined in the collection's queryables. "
                "Use the /queryables endpoint to see available fields.",
            )


def get_properties_from_cql2_filter(cql2_filter: Dict[str, Any]) -> Set[str]:
    """Recursively extract property names from a CQL2 filter.

    Property names are normalized by stripping the 'properties.' prefix
    if present, to match queryables stored without the prefix.
    """
    props: Set[str] = set()
    if "op" in cql2_filter and "args" in cql2_filter:
        for arg in cql2_filter["args"]:
            if isinstance(arg, dict):
                if "op" in arg:
                    props.update(get_properties_from_cql2_filter(arg))
                elif "property" in arg:
                    prop_name = arg["property"]
                    # Strip 'properties.' prefix if present
                    if prop_name.startswith("properties."):
                        prop_name = prop_name[11:]
                    props.add(prop_name)
    return props
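
As an illustration of what the recursive extraction yields, the sketch below runs a condensed copy of the traversal over a small nested CQL2 JSON filter. `extract_properties` and the sample filter are assumptions made for this example; the copy uses `str.removeprefix` (Python 3.9+) instead of slicing but is otherwise intended to behave the same way.

```python
from typing import Any, Dict, Set


def extract_properties(node: Dict[str, Any]) -> Set[str]:
    """Condensed copy of the traversal above, for illustration only."""
    props: Set[str] = set()
    if "op" in node and "args" in node:
        for arg in node["args"]:
            if isinstance(arg, dict):
                if "op" in arg:
                    # Nested expression: recurse into it.
                    props |= extract_properties(arg)
                elif "property" in arg:
                    # Leaf property reference: normalize the name.
                    props.add(arg["property"].removeprefix("properties."))
    return props


# Hypothetical filter: cloud cover <= 10 AND platform = sentinel-2a.
cql2_filter = {
    "op": "and",
    "args": [
        {"op": "<=", "args": [{"property": "properties.eo:cloud_cover"}, 10]},
        {"op": "=", "args": [{"property": "platform"}, "sentinel-2a"]},
    ],
}

fields = extract_properties(cql2_filter)
```

Both property names end up in `fields` without the `properties.` prefix, so they can be compared directly with the keys of the queryables mapping.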
@@ -3,14 +3,62 @@
This module provides functions for working with Elasticsearch/OpenSearch mappings.
"""

from typing import Any, Dict
import os
from collections import deque
from typing import Any, Dict, Set


def _get_excluded_from_queryables() -> Set[str]:
    """Get fields to exclude from queryables endpoint and filtering.

    Reads from EXCLUDED_FROM_QUERYABLES environment variable.
    Supports comma-separated list of field names.

    For each exclusion pattern, both the original and the version with/without
    'properties.' prefix are included. This ensures fields are excluded regardless
    of whether they appear at the top level or under 'properties' in the mapping.

    Example:
        EXCLUDED_FROM_QUERYABLES="properties.auth:schemes,storage:schemes"

        This will exclude:
        - properties.auth:schemes (and children like properties.auth:schemes.s3.type)
        - auth:schemes (and children like auth:schemes.s3.type)
        - storage:schemes (and children)
        - properties.storage:schemes (and children)

    Returns:
        Set[str]: Set of field names to exclude from queryables
    """
    excluded = os.getenv("EXCLUDED_FROM_QUERYABLES", "")
    if not excluded:
        return set()

    result = set()
    for field in excluded.split(","):
        field = field.strip()
        if not field:
            continue

        result.add(field)

        if field.startswith("properties."):
            result.add(field.removeprefix("properties."))
        else:
            result.add(f"properties.{field}")

    return result
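
The expansion rule in the docstring can be checked with a small pure function. `expand_exclusions` is a hypothetical variant that takes the raw string as a parameter instead of reading `EXCLUDED_FROM_QUERYABLES` from the environment, which makes the rule easier to try out in isolation.

```python
from typing import Set


def expand_exclusions(raw: str) -> Set[str]:
    """Expand each pattern to both its prefixed and unprefixed form."""
    result: Set[str] = set()
    for field in (part.strip() for part in raw.split(",")):
        if not field:
            continue  # tolerate empty entries such as a trailing comma
        result.add(field)
        if field.startswith("properties."):
            result.add(field.removeprefix("properties."))
        else:
            result.add(f"properties.{field}")
    return result


expanded = expand_exclusions("properties.auth:schemes, storage:schemes,")
```

With the two patterns above, `expanded` holds four entries: each field with and without the `properties.` prefix.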


async def get_queryables_mapping_shared(
    mappings: Dict[str, Dict[str, Any]], collection_id: str = "*"
    mappings: Dict[str, Dict[str, Any]],
    collection_id: str = "*",
) -> Dict[str, str]:
    """Retrieve mapping of Queryables for search.

    Fields listed in the EXCLUDED_FROM_QUERYABLES environment variable will be
    excluded from the result, along with their children.

    Args:
        mappings (Dict[str, Dict[str, Any]]): The mapping information returned from
            Elasticsearch/OpenSearch client's indices.get_mapping() method.
@@ -20,19 +68,44 @@ async def get_queryables_mapping_shared(

    Returns:
        Dict[str, str]: A dictionary containing the Queryables mappings, where keys are
            field names and values are the corresponding paths in the Elasticsearch/OpenSearch
            document structure.
            field names (with 'properties.' prefix removed) and values are the
            corresponding paths in the Elasticsearch/OpenSearch document structure.
    """
    queryables_mapping = {}
    excluded = _get_excluded_from_queryables()

    def is_excluded(path: str) -> bool:
        """Check if the path starts with any excluded prefix."""
        return any(
            path == prefix or path.startswith(prefix + ".") for prefix in excluded
        )

    for mapping in mappings.values():
        fields = mapping["mappings"].get("properties", {})
        properties = fields.pop("properties", {}).get("properties", {}).keys()
        mapping_properties = mapping["mappings"].get("properties", {})

        stack: deque[tuple[str, Dict[str, Any]]] = deque(mapping_properties.items())

        while stack:
            field_fqn, field_def = stack.popleft()

            nested_properties = field_def.get("properties")
            if nested_properties:
                stack.extend(
                    (f"{field_fqn}.{k}", v)
                    for k, v in nested_properties.items()
                    if v.get("enabled", True) and not is_excluded(f"{field_fqn}.{k}")
                )

            field_type = field_def.get("type")
            if (
                not field_type
                or not field_def.get("enabled", True)
                or is_excluded(field_fqn)
            ):
                continue

        for field_key in fields:
            queryables_mapping[field_key] = field_key
            field_name = field_fqn.removeprefix("properties.")

        for property_key in properties:
            queryables_mapping[property_key] = f"properties.{property_key}"
            queryables_mapping[field_name] = field_fqn

    return queryables_mapping
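
A simplified, self-contained sketch of the breadth-first traversal introduced in this hunk is shown below. `flatten_mapping` and the sample mapping are assumptions made for illustration: the function takes the exclusion set as a parameter instead of reading the environment, and it skips excluded or disabled nodes before expanding their children, which differs slightly in structure from the PR code.

```python
from collections import deque
from typing import Any, Dict, Set


def flatten_mapping(properties: Dict[str, Any], excluded: Set[str]) -> Dict[str, str]:
    """Breadth-first walk collecting typed fields, skipping exclusions."""

    def is_excluded(path: str) -> bool:
        return any(path == p or path.startswith(p + ".") for p in excluded)

    result: Dict[str, str] = {}
    queue = deque(properties.items())
    while queue:
        path, definition = queue.popleft()
        if is_excluded(path) or not definition.get("enabled", True):
            continue  # skip this field and everything beneath it
        for name, child in definition.get("properties", {}).items():
            queue.append((f"{path}.{name}", child))
        if definition.get("type"):
            # Key without the 'properties.' prefix, value with the full path.
            result[path.removeprefix("properties.")] = path
    return result


# Hypothetical index mapping fragment.
sample = {
    "id": {"type": "keyword"},
    "properties": {
        "properties": {
            "platform": {"type": "keyword"},
            "auth:schemes": {
                "properties": {"s3": {"properties": {"type": {"type": "keyword"}}}}
            },
        }
    },
}

queryables = flatten_mapping(sample, excluded={"properties.auth:schemes"})
```

The nested `auth:schemes` subtree never reaches the result, while `platform` is exposed under its short name and mapped to its full document path.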