@brianna-dardin
Member

There are a few changes here so that ODAP can process an Automated Archive like Unit B, and so that the end result is the same working schema that the eFiction repo outputs, which means it can be fed into steps 3-6.

Conversion to working schema changes

  • Use the working SQL file from the eFiction repo as the starting point for the database. I didn't want to duplicate the file in this repo, so the script fetches it with a GET request. If there's a better way to handle this, I can make updates
  • Insert item_authors rows to record the relationships between stories and authors
  • Insert unique tags into the tags table then insert item_tags rows to record the relationships between stories and tags
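As a rough illustration of the tags step above, here is a minimal sketch and not the code in this PR. It uses an in-memory SQLite database so it runs on its own; the column names (`id`, `name`, `item_id`, `tag_id`) and the `link_tags` helper are hypothetical, and the real working schema may differ. The item_authors rows could presumably be built with the same pattern.

```python
import sqlite3


def link_tags(conn, story_tags):
    """Insert each distinct tag once, then one item_tags row per (story, tag) pair.

    story_tags maps a story id to the list of tag names found for that story.
    Table and column names here are assumptions made for this sketch.
    """
    cur = conn.cursor()
    tag_ids = {}  # tag name -> id of its row in the tags table
    for item_id, names in story_tags.items():
        # dict.fromkeys drops repeated tags on one story while keeping order
        for name in dict.fromkeys(names):
            if name not in tag_ids:
                cur.execute("INSERT INTO tags (name) VALUES (?)", (name,))
                tag_ids[name] = cur.lastrowid
            cur.execute(
                "INSERT INTO item_tags (item_id, tag_id) VALUES (?, ?)",
                (item_id, tag_ids[name]),
            )
    conn.commit()
    return tag_ids


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE item_tags (item_id INTEGER, tag_id INTEGER);
    """
)
tag_ids = link_tags(conn, {1: ["angst", "humor", "angst"], 2: ["humor"]})
```

Keeping the name-to-id mapping in memory avoids a lookup query per tag, at the cost of assuming the tags table starts out empty.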

Changes due to issues processing Unit B

  • Its ARCHIVE_DB.pl file was latin-1 encoded rather than utf-8 encoded, so the script now prompts for the encoding of the file
  • The values in the Date field were Unix timestamps, not datetime strings, so the script now checks whether the date is in the Unix timestamp format
  • The chapter URLs (the Location field) included "/", for example "5/heatenough.html", which caused an issue because I was running the script on Windows and the files were downloaded locally. The script now corrects for this when it's run on Windows.
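The three Unit B fixes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (`decode_archive`, `parse_date`, `to_local_path`) and the `%m/%d/%Y` fallback format are hypothetical, and the all-digits check would misread a compact date such as "20250318" as a timestamp.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath, PureWindowsPath


def decode_archive(raw, encoding):
    """Decode the raw bytes of ARCHIVE_DB.pl with the encoding the user supplied."""
    return raw.decode(encoding)


def parse_date(value, fmt="%m/%d/%Y"):
    """Treat an all-digit value as a Unix timestamp, anything else as a date string.

    The fallback format is an assumption; the archive's real format may differ.
    """
    text = str(value).strip()
    if text.isascii() and text.isdigit():
        # Interpret the timestamp as UTC and return a naive datetime
        return datetime.fromtimestamp(int(text), tz=timezone.utc).replace(tzinfo=None)
    return datetime.strptime(text, fmt)


def to_local_path(location, windows):
    """Rebuild a "/"-separated Location value with the local path separator."""
    parts = PurePosixPath(location).parts
    if windows:
        return str(PureWindowsPath(*parts))
    return str(PurePosixPath(*parts))


latin1_bytes = "café".encode("latin-1")
decoded = decode_archive(latin1_bytes, "latin-1")
timestamp_date = parse_date("86400")
string_date = parse_date("03/18/2025")
windows_path = to_local_path("5/heatenough.html", windows=True)
```

Decoding the same bytes as utf-8 would raise `UnicodeDecodeError`, which is why the encoding has to come from somewhere outside the file.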

Other

  • Since the other PR had an issue with the GitHub workflows, I tried to fix that here.

Collaborator

@ariana-paris left a comment

A couple of comments but some may be because my Python is getting rusty or I'm misreading things!

_extract_date(args, FILES[i], log),
FILES[i].get("Location", "").replace("'", "\\'"),
FILES[i]
.get("LocationURL", FILES[i].get("StoryURL", ""))
Collaborator

I wonder why these lines weren't indented, or is that a Python thing I've forgotten?

Member Author

All the items in this array have the same indentation, though it is kinda confusing that the ruff formatter broke up certain lines but not others (like the Location line vs the LocationURL line). So it may look odd but it is fine, unless you're talking about something else?

Collaborator

Just about it looking odd and more difficult for a human to interpret since continuation lines are usually indented relative to their first line. However, this file isn't a high priority so if ruff is going to be run on the whole repo, we'll have to let it do its thing to avoid noisy diffs!

brianna-dardin and others added 2 commits March 18, 2025 18:15
per Ariana's suggestion

Co-authored-by: Ariana <ariana-paris@users.noreply.github.com>
@brianna-dardin merged commit cdb6bd6 into master Mar 22, 2025
3 checks passed
@brianna-dardin deleted the fix/types branch March 22, 2025 03:29
4 participants