Conservative classification of NYC construction project updates for subcontractors.
This system applies veteran subcontractor judgment to public agency notices, filtering noise from opportunity. It's designed to minimize false positives—not to be smart, but to be useful.
Model-agnostic LLM backend (currently configured for Anthropic Claude Sonnet 4).
NYC construction subcontractors monitor dozens of agency portals (SCA, DDC, HPD, DEP) for supplemental work, change orders, and new scope. Most updates are:
- Administrative revisions with no new work
- Change orders already awarded to incumbents
- Ambiguously worded "under review" scope that never materializes
Real opportunity is rare. Chasing false signals costs estimators hours and damages relationships with GCs.
This system was validated through direct conversations with NYC HVAC and electrical subcontractors who confirmed that change-order and supplemental work is routinely missed due to fragmented portals and buried agency updates.
Classifies project updates into three categories:
- CLOSED – No realistic opportunity for a new subcontractor
- SOFT_OPEN – New scope exists but incumbent or insider advantage likely
- CONTESTABLE – Clearly defined, openly bid-able work
Each classification includes (example sketched after this list):
- Confidence score (0-100)
- One-sentence reasoning
- Risk note (what could make this wrong)
- Recommended next action
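For illustration, a single result might look roughly like this, written as a Python dict. This is a sketch: the exact key names for the risk note and recommended action (`risk_note`, `next_action`) are assumptions, not confirmed output fields.

```python
# Hypothetical output -- values and the risk_note / next_action key names are illustrative.
result = {
    "classification": "SOFT_OPEN",
    "confidence": 55,
    "reasoning": "Supplemental electrical scope is mentioned, but the notice ties it to the incumbent GC's existing contract.",
    "risk_note": "A standalone solicitation with a bid date would make this CONTESTABLE.",
    "next_action": "Watch for a formal RFP before committing estimator time.",
}
```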
- Not a bid automation tool
- Not a scraping farm
- Not a promise of winning work
- Not legal or procurement advice
This system only classifies public signals to reduce wasted attention.
Skeptical by default. When uncertain, downgrade classification.
The prompt engineering embeds 20+ years of field experience (see the prompt sketch after this list):
- Agencies prefer incumbents unless language proves otherwise
- "Under review" usually means politically decided
- Missing context = administrative noise
- Confidence must reflect how a real contractor would bet time and money
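A minimal sketch of how these heuristics could be phrased in the system prompt. The actual wording lives in classifier/agent.py and is not reproduced here; the `SYSTEM_PROMPT` name is an assumption.

```python
# Hypothetical excerpt -- the hardened prompt in classifier/agent.py is longer and field-tested.
SYSTEM_PROMPT = """
You classify NYC agency construction updates for subcontractors.
Default to skepticism:
- Assume the incumbent keeps the work unless the notice explicitly invites new bidders.
- Treat "under review" scope as already decided unless a solicitation date is given.
- Treat updates missing scope, trade, or deadline as administrative noise (CLOSED).
- Set confidence to how a working contractor would bet estimator hours, not model certainty.
When torn between two categories, choose the more conservative one.
Respond only with JSON: classification, confidence, reasoning, risk_note, next_action.
"""
```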
classifier/
    agent.py        # Hardened prompt + retry logic
data/
    simulator.py    # Realistic "ugly" test data
evaluation/
    evaluate.py     # Quality control pipeline
Key technical decisions (a retry-and-validation sketch follows the list):
- Conservative prompt design over model fine-tuning
- Explicit uncertainty handling (downgrade when ambiguous)
- Retry logic with exponential backoff
- JSON schema validation on every response
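A minimal sketch of that retry-and-validation pattern, assuming the model call is passed in as a function; the helper names (`classify_with_retries`, `call_model`, `REQUIRED_KEYS`) are illustrative, not the actual identifiers in classifier/agent.py.

```python
import json
import time

REQUIRED_KEYS = {"classification", "confidence", "reasoning", "risk_note", "next_action"}
VALID_LABELS = {"CLOSED", "SOFT_OPEN", "CONTESTABLE"}

def classify_with_retries(call_model, update_text, trade, max_attempts=4):
    """Call the LLM, validate the JSON response, and back off exponentially on failure."""
    for attempt in range(max_attempts):
        try:
            raw = call_model(update_text, trade)  # returns the model's raw text response
            result = json.loads(raw)
            if not REQUIRED_KEYS.issubset(result):
                raise ValueError(f"missing keys: {REQUIRED_KEYS - result.keys()}")
            if result["classification"] not in VALID_LABELS:
                raise ValueError(f"unknown label: {result['classification']}")
            if not 0 <= int(result["confidence"]) <= 100:
                raise ValueError("confidence out of range")
            return result
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between retries
```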
- Many change orders are politically or contractually decided before public posting
- Agency language is inconsistent and often deliberately vague
- This system intentionally downgrades uncertain signals
- False negatives are preferred over false optimism
- No access to attachments, drawings, or meeting notes that often contain critical context
Accuracy is not the goal. Conservative filtering is.
The classifier logic must not be changed after validation against real data.
You may:
- Document failures
- Note agency quirks
- Flag systematic bias
You may not:
- Patch the prompt to "fix" misses
- Tune confidence to look better
- Add rules retroactively
Why? This project demonstrates judgment under uncertainty, not optimization. A reviewer would rather see "This failed in these 6 cases, and here's why" than "I tweaked it until it looked good."
Do not poison your own evidence.
# Clone repo
git clone https://github.com/yourusername/scopesignal.git
cd scopesignal
# Set up environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure API key (currently uses Anthropic - easily swappable)
cp .env.example .env
# Edit .env and add your API key

Current backend: Anthropic Claude Sonnet 4
Get your API key at: https://console.anthropic.com/settings/keys
To swap providers: Modify classifier/agent.py (single class, ~100 lines).
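As a rough sketch of what that swap could look like (class and method names below are assumptions, not the actual code in classifier/agent.py), the provider can sit behind a thin interface:

```python
from typing import Protocol

class LLMBackend(Protocol):
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        """Return the model's raw text response."""
        ...

class AnthropicBackend:
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):  # model id is a placeholder
        import anthropic  # imported lazily so other backends don't require the package
        self._client = anthropic.Anthropic(api_key=api_key)
        self._model = model

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
        )
        return response.content[0].text
```

Swapping providers then means implementing `complete()` for the new backend and leaving the prompt and validation untouched. The classifier itself is called like this: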
from classifier import classify_update
result = classify_update(
update_text="RFP issued for additional electrical work. Bids due March 15.",
trade="Electrical"
)
print(result["classification"]) # "CONTESTABLE"
print(result["confidence"]) # 85
print(result["reasoning"]) # "Clear RFP with deadline signals open competition."# Full evaluation (50 samples + adversarial cases)
python -m evaluation.evaluate
# Quick test (10 samples, verbose)
python -m evaluation.evaluate --quick

This will (sketched below):
- Generate realistic test data
- Classify each update
- Compare against expected results
- Report accuracy by category
- Analyze mismatches
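A minimal sketch of that loop, assuming the sample format shown in the docstring; the real evaluation/evaluate.py may organize this differently:

```python
from collections import defaultdict

from classifier import classify_update

def evaluate(samples):
    """samples: list of dicts with 'update_text', 'trade', and an 'expected' label."""
    per_label = defaultdict(lambda: {"correct": 0, "total": 0})
    mismatches = []
    for sample in samples:
        result = classify_update(update_text=sample["update_text"], trade=sample["trade"])
        expected = sample["expected"]
        per_label[expected]["total"] += 1
        if result["classification"] == expected:
            per_label[expected]["correct"] += 1
        else:
            mismatches.append((sample, result))  # kept for the mismatch analysis step
    for label, stats in per_label.items():
        accuracy = 100 * stats["correct"] / stats["total"]
        print(f"{label:12s} {accuracy:5.1f}% ({stats['correct']}/{stats['total']})")
    return mismatches
```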
On a simulated set of 100 noisy agency updates, the system classified:
- 78% as CLOSED (administrative noise, incumbent work)
- 19% as SOFT_OPEN (ambiguous scope, likely insider advantage)
- 3% as CONTESTABLE (clearly defined, open bidding)
This distribution aligns with contractor estimates of real opportunity frequency.
Accuracy by category:
- Administrative noise: 94%
- Incumbent-earmarked: 88%
- Soft-open signals: 72% (intentionally conservative)
- Contestable work: 100%
The 72% accuracy on SOFT_OPEN reflects the system's bias toward downgrading ambiguous signals—which is the correct behavior.
Not Planned (intentionally):
- Multi-agency scraping
- Real-time alerting
- User accounts
- Mobile app
Possible Next Steps (if validated by users):
- Integration with one NYC agency portal (SCA or DDC)
- Trade-specific rule overlays (electrical vs. HVAC vs. plumbing)
- Historical analysis of prediction accuracy
- Export to CSV for manual review
This was built as a demonstration of applied AI engineering:
- Real-world problem with economic stakes
- Domain expertise embedded in prompt design
- Conservative classification prioritized over accuracy
- Full evaluation pipeline from simulation to analysis
- Explicit acknowledgment of limitations
It shows how LLMs can provide judgment under uncertainty rather than just pattern matching.
This is a portfolio project, not an open product. If you're interested in the problem space:
- Fork and experiment
- Test with your own agency data
- Share feedback on classification quality
Do not submit PRs that add features. The value is in restraint.
MIT
Questions about the system design or problem space? Open an issue.
Looking to hire? This project demonstrates:
- Prompt engineering for high-stakes domains
- Evaluation methodology for subjective tasks
- Technical judgment (when to use/not use AI)
- Documentation that earns trust
Built by someone who believes AI should be useful, not impressive.