@enuraju enuraju commented Dec 26, 2025

Description

This change enhances the Dimensional Crawler framework to support minute-level time ranges for historical data ingestion.

Previously, the crawler relied on hour-based granularity when determining whether to run historical or incremental syncs. As a result, sub-hour ranges such as PT15M or PT30M were rounded down to zero, incorrectly triggering incremental sync and skipping historical data pulls.
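The truncation described above can be reproduced with `java.time.Duration` alone. The following is a minimal sketch, not the framework's actual code; the class and method names are hypothetical and exist only for illustration.

```java
import java.time.Duration;

// Hypothetical demo class; not part of the Data Prepper code base.
final class SubHourTruncationDemo {

    // Hour-based view of an ISO-8601 duration: toHours() drops any partial hour.
    static long lookBackHours(String isoDuration) {
        return Duration.parse(isoDuration).toHours();
    }

    // Minute-based view of the same duration: sub-hour ranges keep their value.
    static long lookBackMinutes(String isoDuration) {
        return Duration.parse(isoDuration).toMinutes();
    }

    public static void main(String[] args) {
        for (String range : new String[] {"PT15M", "PT30M", "PT2H30M"}) {
            System.out.println(range
                    + " -> hours=" + lookBackHours(range)
                    + ", minutes=" + lookBackMinutes(range));
        }
    }
}
```

With hour granularity, `PT15M` and `PT30M` both collapse to 0, which is indistinguishable from "no historical range configured"; with minute granularity they remain 15 and 30.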

This update replaces hour-based tracking with minute-based tracking across the framework and the Office365 source plugin, ensuring correct historical ingestion for any ISO-8601 duration expressed in minutes or hours.

How

Framework Updates

We updated the Dimensional Crawler framework to operate on minute-level granularity:
  • Replaced remainingHours (int) with remainingMinutes (long) in DimensionalTimeSliceLeaderProgressState
  • Updated the persisted leader state field from remaining_hours to remaining_minutes
  • Updated crawler decision logic to determine historical vs incremental sync using remaining minutes
  • Added support for sub-hour historical ranges by creating a single partition when the duration is less than 60 minutes
  • Handled edge cases for very small ranges (≤ 5 minutes) by skipping the standard delay window to avoid invalid partitions
  • Ensured mixed ranges (e.g., PT2H30M) do not lose time by folding extra minutes into the first hourly partition
  • Updated internal APIs and method signatures to consistently use long minutes
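The partitioning rules in the list above (a single partition for sub-hour ranges, extra minutes folded into the first hourly partition) can be sketched as follows. This is an illustrative reading of the description, not the code in `DimensionalTimeSliceCrawler`; the class `PartitionPlanSketch` and the method `partitionMinutes` are hypothetical, and the delay-window handling for very small ranges is left out because the description does not spell it out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not the framework's API.
final class PartitionPlanSketch {

    // Returns the length, in minutes, of each historical partition to create.
    static List<Long> partitionMinutes(long remainingMinutes) {
        List<Long> partitions = new ArrayList<>();
        if (remainingMinutes <= 0) {
            return partitions; // no historical range: nothing to partition
        }
        if (remainingMinutes < 60) {
            partitions.add(remainingMinutes); // sub-hour range: one partition
            return partitions;
        }
        long fullHours = remainingMinutes / 60;
        long extraMinutes = remainingMinutes % 60;
        // The first hourly partition absorbs the leftover minutes so no time is lost.
        partitions.add(60 + extraMinutes);
        for (long i = 1; i < fullHours; i++) {
            partitions.add(60L);
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionMinutes(15));  // PT15M
        System.out.println(partitionMinutes(120)); // PT2H
        System.out.println(partitionMinutes(150)); // PT2H30M
    }
}
```

Under these assumptions the partition lengths always sum to the configured range, so a mixed range such as PT2H30M yields one 90-minute partition followed by one 60-minute partition.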

Office365 Plugin Updates

The Office365 source plugin was updated to align with the new minute-based framework:
  • Added getLookBackMinutes() to Office365SourceConfig with proper handling of zero and negative durations
  • Updated leader progress state initialization to pass minute-based lookback values
  • Updated audit log search logic to compute lookback windows using Duration.ofMinutes(...)
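A rough sketch of the two plugin-side pieces listed above follows. It assumes that "proper handling" means a missing, zero, or negative duration maps to 0 minutes; that behavior, the class `LookBackSketch`, and the helper `lookBackStart` are assumptions for illustration and may differ from what `Office365SourceConfig` actually does.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; not the actual Office365SourceConfig implementation.
final class LookBackSketch {

    // Assumed behavior: null, zero, and negative durations all mean "no lookback".
    static long getLookBackMinutes(Duration range) {
        if (range == null || range.isZero() || range.isNegative()) {
            return 0L;
        }
        return range.toMinutes();
    }

    // Start of the lookback window, computed with Duration.ofMinutes(...).
    static Instant lookBackStart(Instant now, long lookBackMinutes) {
        return now.minus(Duration.ofMinutes(lookBackMinutes));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-12-26T12:00:00Z");
        long minutes = getLookBackMinutes(Duration.parse("PT30M"));
        System.out.println(minutes + " minutes back from " + now
                + " is " + lookBackStart(now, minutes));
    }
}
```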

Is this change backward compatible?

Yes.

  • Existing hour-based configurations (PT1H, PT2H, etc.) continue to work unchanged
  • A lookback value of 0 minutes still triggers incremental sync
  • No configuration changes are required for existing users
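The compatibility claims above follow from the fact that an hour-based range converts to a positive minute count while an unset range stays at 0. A minimal sketch of that decision, with a hypothetical `SyncMode` enum and `decide` method that are not taken from the framework:

```java
import java.time.Duration;

// Hypothetical sketch of the historical-vs-incremental decision.
final class SyncModeSketch {

    enum SyncMode { HISTORICAL, INCREMENTAL }

    // Assumption: any positive number of remaining minutes means a historical pull is pending.
    static SyncMode decide(long remainingMinutes) {
        return remainingMinutes > 0 ? SyncMode.HISTORICAL : SyncMode.INCREMENTAL;
    }

    public static void main(String[] args) {
        System.out.println(decide(Duration.parse("PT1H").toMinutes()));  // existing hour-based config
        System.out.println(decide(Duration.parse("PT15M").toMinutes())); // new sub-hour config
        System.out.println(decide(0L));                                  // no range configured
    }
}
```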

Testing

Unit / Functional Validation

Verified correct behavior for:
  • Sub-hour historical ranges (PT5M, PT15M, PT30M)
  • Mixed hour/minute ranges (PT1H30M, PT2H15M)
  • Hour-only ranges (regression coverage)
  • Incremental sync when no range is configured

Integration Verification

  • Successfully executed Office365 source connector end-to-end
  • Confirmed historical ingestion is triggered correctly for minute-based ranges
  • Verified no regression in incremental ingestion behavior

Local pipeline run succeeded:

2025-12-26T12:33:58,162 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.source.microsoft_office365.auth.Office365AuthenticationProvider - Getting new access token for Office 365 Management API
2025-12-26T12:33:58,162 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Retrieving latest secrets in aws:secrets:m365_secret.
2025-12-26T12:33:58,405 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Finished retrieving latest secret in aws:secrets:m365_secret.
2025-12-26T12:33:58,406 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Retrieving latest secrets in aws:secrets:m365_secret.
2025-12-26T12:33:58,651 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Finished retrieving latest secret in aws:secrets:m365_secret.
2025-12-26T12:33:58,869 [pool-7-thread-4] INFO org.opensearch.dataprepper.plugins.source.microsoft_office365.auth.Office365AuthenticationProvider - Received new access token. Expires in 3599 seconds
2025-12-26T12:33:58,869 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.source.microsoft_office365.auth.Office365AuthenticationProvider - Getting new access token for Office 365 Management API
2025-12-26T12:33:58,869 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Retrieving latest secrets in aws:secrets:m365_secret.
2025-12-26T12:33:59,111 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Finished retrieving latest secret in aws:secrets:m365_secret.
2025-12-26T12:33:59,111 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Retrieving latest secrets in aws:secrets:m365_secret.
2025-12-26T12:33:59,354 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.aws.AwsSecretsSupplier - Finished retrieving latest secret in aws:secrets:m365_secret.
2025-12-26T12:33:59,529 [pool-7-thread-5] INFO org.opensearch.dataprepper.plugins.source.microsoft_office365.auth.Office365AuthenticationProvider - Received new access token. Expires in 3599 seconds
{"CreationTime":"2025-12-26T06:24:11","Id":"cdf867b5-5bcc-4a18-9e8e-0f24e17fe798","Operation":"Update user.","OrganizationId":"e822651b-5027-4253-83f5-904854601a3b","RecordType":8,"ResultStatus":"Success","UserKey":"Not Available","UserType":4,"Version":1,"Workload":"AzureActiveDirectory","ObjectId":"demo.m3connector@trianzazuresb.onmicrosoft.com","UserId":"ServicePrincipal_3616d279-e97d-48d3-af3e-74ed7de78faf","AzureActiveDirectoryEventType":1,"ExtendedProperties":[{"Name":"additionalDetails","Value":"{\"UserType\":\"Member\",\"User-Agent\":\"Apache-HttpClient/4.5.13 (Java/17.0.17)\"}"},{"Name":"extendedAuditEventCategory","Value":"User"}],"ModifiedProperties":[{"Name":"JobTitle","NewValue":"[\r\n \"Updated by Canary test at 2025-12-26T06:24:11.153502640Z\"\r\n]","OldValue":"[\r\n \"Updated by Canary test at 2025-12-26T06:22:44.545736667Z\"\r\n]"},{"Name":"Included Updated Properties","NewValue":"JobTitle","OldValue":""},{"Name":"TargetId.UserType","NewValue":"Member","OldValue":""},{"Name":"ActorId.ServicePrincipalNames","NewValue":"fb6b0f13-8f1e-4a28-a772-d32d3133da23","OldValue":""},{"Name":"SPN","NewValue":"fb6b0f13-8f1e-4a28-a772-d32d3133da23","OldValue":""}],"Actor":[{"ID":"entraId_app","Type":1},{"ID":"fb6b0f13-8f1e-4a28-a772-d32d3133da23","Type":2},{"ID":"ServicePrincipal_3616d279-e97d-48d3-af3e-74ed7de78faf","Type":2},{"ID":"3616d279-e97d-48d3-af3e-74ed7de78faf","Type":2},{"ID":"ServicePrincipal","Type":2}],"ActorContextId":"e822651b-5027-4253-83f5-904854601a3b","InterSystemsId":"10d1c797-d154-4726-8dde-4cc559babbed","IntraSystemId":"9401b745-76c4-4518-891d-365a8739882e","SupportTicketId":"","Target":[{"ID":"User_0fba2a0b-2680-45c1-9ae6-d20b74edb3ec","Type":2},{"ID":"0fba2a0b-2680-45c1-9ae6-d20b74edb3ec","Type":2},{"ID":"User","Type":2},{"ID":"demo.m3connector@trianzazuresb.onmicrosoft.com","Type":5},{"ID":"1003200511619FF9","Type":3}],"TargetContextId":"e822651b-5027-4253-83f5-904854601a3b"}
2025-12-26T12:34:46,572 [pool-7-thread-1] INFO org.opensearch.dataprepper.plugins.source.source_crawler.base.DimensionalTimeSliceCrawler - Total partitions created in this crawl: 5.0
2025-12-26T12:35:46,588 [pool-7-thread-1] INFO org.opensearch.dataprepper.plugins.source.source_crawler.base.DimensionalTimeSliceCrawler - Total partitions created in this crawl: 5.0
2025-12-26T12:36:46,608 [pool-7-thread-1] INFO org.opensearch.dataprepper.plugins.source.source_crawler.base.DimensionalTimeSliceCrawler - Total partitions created in this crawl: 5.0

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: enugraju <enugraju@amazon.com>
  DimensionalTimeSliceLeaderProgressState leaderProgressState =
          (DimensionalTimeSliceLeaderProgressState) leaderPartition.getProgressState().get();
- int remainingHours = leaderProgressState.getRemainingHours();
+ long remainingMinutes = leaderProgressState.getRemainingMinutes();
Collaborator

It would have been better to use an Instant type instead of long.

* @deprecated Use {@link #getLookBackMinutes()} for minute-level granularity support
*/
@Deprecated
public int getLookBackHours() {
Collaborator

I would suggest removing this method instead of marking it as @Deprecated if it has no references at all.

Collaborator

I see that some of the state objects still hold hours and that you kept this method for backward compatibility 👍

@san81 san81 self-requested a review January 8, 2026 18:10