Global Congress Data Collection Goal

Goal MG-CONGRESS-01

Objective: build and maintain a global, evidence-backed corpus of multiday congress-family events for Mi Gente.

Scope includes:

  • Congress
  • Weekender
  • Festival
  • Marathon
  • Retreat
  • Cruise
  • Dance Vacation

Scope and exclusions

  • Include: confirmed or highly probable multiday events with verifiable sources and date ranges.
  • Exclude by default: Bootcamp records and any single-day records, unless product policy changes.
  • Geography: global.
  • Source quality: official websites, ticketing pages, organizer pages, and official social channels.
  • Confidence levels: 95–100 (Level 1), 75–94 (Level 2), 40–74 (Level 3), 0–39 (Level 4).
  • Core-corpus candidate logic: enforced by SQL in 20260530000000_migente_event_data_dictionary_v1.sql.
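
As an illustration of the confidence bands and default exclusions above, here is a minimal Python sketch. It is not the enforced logic, which lives in the SQL migration; the function names are hypothetical, and "multiday" is assumed to mean that the end date falls after the start date.

```python
from datetime import date

# Event types listed under "Scope includes" in this document.
IN_SCOPE_TYPES = {
    "Congress", "Weekender", "Festival", "Marathon",
    "Retreat", "Cruise", "Dance Vacation",
}


def confidence_level(score: int) -> int:
    """Map a 0-100 confidence score to Level 1-4 using the documented bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"confidence score out of range: {score}")
    if score >= 95:
        return 1
    if score >= 75:
        return 2
    if score >= 40:
        return 3
    return 4


def is_core_candidate(event_type: str, start: date, end: date) -> bool:
    """Return True for in-scope event types with a multiday date range.

    Assumption: multiday means the end date is strictly after the start date.
    Bootcamp is rejected because it is not in IN_SCOPE_TYPES.
    """
    return event_type in IN_SCOPE_TYPES and end > start
```

For example, is_core_candidate("Bootcamp", date(2026, 7, 3), date(2026, 7, 5)) is rejected on event type alone, and any event whose start and end dates match is rejected as single-day.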

Multi-agent setup

This program is split into three lanes:

Architecture Agents

  • Own schema validity, migration health, triggers, generated views, docs, and release evidence.
  • Maintain corpus rules (Bootcamp and single-day exclusions), constraints, and FK integrity.
  • Own PR review expectations for architecture changes and maintain release safety notes.

Data-collection Agents

  • Own research, source map development, and batch ingestion.
  • Produce mobilis-research-upload/v2 YAML with canonical fields, raw/rich facts, field-level evidence, source links, media, and exclusion reasons for non-core records.
  • Preserve legacy candidate rows only as reconciliation inputs, not as the ongoing research contract.

Quality Agent

  • Run audit checks against the service-role-only public.migente_worldwide_congress_candidate_view.
  • Detect date, scope, and confidence exceptions.
  • Convert quality gaps into issue-level follow-up tasks.
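
The three exception classes named above (date, scope, confidence) could be detected per row along these lines. This is a sketch only: the column names start_date, end_date, event_type, and confidence are assumptions and may not match the actual columns of the candidate view.

```python
from datetime import date


def find_exceptions(row: dict) -> list[str]:
    """Return a list of audit findings for one candidate row (empty if clean).

    The keys read from `row` are hypothetical column names.
    """
    problems = []

    start, end = row.get("start_date"), row.get("end_date")
    if start is None or end is None:
        problems.append("date: missing start or end date")
    elif end <= start:
        problems.append("date: not a multiday range")

    if row.get("event_type") == "Bootcamp":
        problems.append("scope: Bootcamp is excluded by default")

    score = row.get("confidence")
    if score is None or not 0 <= score <= 100:
        problems.append("confidence: score missing or outside 0-100")

    return problems
```

Each non-empty result maps naturally onto one follow-up issue, with the findings as the issue body.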

Public launch pages must use the unified public.migente_public_routes_view, public.migente_public_listings_view, and public.migente_public_listing_details_view projections.

Use the dedicated issue templates:

Use the execution playbook:

First-pass overnight posture

  • Run only high-confidence, multiday entries with official or high-quality evidence in the first pass.
  • Skip and backlog rows that are expensive to validate (login walls, inconsistent source formats, weak provenance, repeated failures).
  • Record an explicit cause and next action for every backlog_only and needs_review row.
  • Continue the shard until the row-level hard-stop signals trigger.
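
One way to keep causes and next actions explicit is to refuse to write a backlog row without them. The sketch below assumes a four-column layout (event_name, validation_status, cause, next_action); these column names are illustrative and are not a documented schema for exceptions_or_backlog.csv.

```python
import csv
import io

# Hypothetical column layout for the backlog file.
FIELDS = ["event_name", "validation_status", "cause", "next_action"]
NON_PASS_STATUSES = {"needs_review", "backlog_only"}


def backlog_csv(rows: list[dict]) -> str:
    """Render non-passing rows as CSV text, requiring a cause and next action."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["validation_status"] not in NON_PASS_STATUSES:
            continue  # rows that passed do not belong in the backlog file
        if not row.get("cause") or not row.get("next_action"):
            raise ValueError(f"missing cause or next action: {row['event_name']}")
        writer.writerow({field: row[field] for field in FIELDS})
    return buffer.getvalue()
```

Raising on a missing cause makes the gap visible during the run rather than in the morning review.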

Overnight collection writes v2 YAML and review outputs only to:

  • tmp/migente-congress-runs/<shard>/<run-id>/

Required files per shard run:

  • mobilis-research-upload-v2.yaml
  • exceptions_or_backlog.csv
  • validation_summary.md
  • mobilis-research-upload-v2-preview.json
  • mobilis-research-upload-v2-promote-preview.json
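
A shard run can be checked for completeness before review with a small helper like the following. The file names come from the list above; the function name is hypothetical.

```python
from pathlib import Path

# Required outputs per shard run, as listed in this document.
REQUIRED_FILES = (
    "mobilis-research-upload-v2.yaml",
    "exceptions_or_backlog.csv",
    "validation_summary.md",
    "mobilis-research-upload-v2-preview.json",
    "mobilis-research-upload-v2-promote-preview.json",
)


def missing_run_files(run_dir: Path) -> list[str]:
    """Return the required file names that are absent from run_dir."""
    return [name for name in REQUIRED_FILES if not (run_dir / name).is_file()]
```

Calling missing_run_files(Path("tmp/migente-congress-runs/americas/2026-05-30")) returns an empty list when the run is complete. The two preview JSON files will be reported as missing until the preview and promotion dry-run commands have been executed.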

Then run local preview and promotion dry-run:

cd workers/playwright-source-scanner
export SUPABASE_URL=https://local-test-placeholder.supabase.co
export SUPABASE_SERVICE_ROLE_KEY=local_locked_preview_placeholder

npm run dev -- mobilis-research-upload-preview \
--file ../../tmp/migente-congress-runs/americas/2026-05-30/mobilis-research-upload-v2.yaml \
--out-file ../../tmp/migente-congress-runs/americas/2026-05-30/mobilis-research-upload-v2-preview.json

npm run dev -- mobilis-research-upload-promote \
--file ../../tmp/migente-congress-runs/americas/2026-05-30/mobilis-research-upload-v2.yaml \
--out-file ../../tmp/migente-congress-runs/americas/2026-05-30/mobilis-research-upload-v2-promote-preview.json

Model and budget posture

  • Default per-batch model: OPENAI_TEXT_MODEL=gpt-4.1-mini.
  • Keep extraction in low-cost mode by default for overnight bulk collection.
  • Use stronger models only when explicitly approved for a blocked batch or a hard-to-parse source class.
  • Record model and quality thresholds in each batch issue header.

Public launch readiness

  • Supabase remains canonical for public congress data.
  • First-pass v2 YAML feeds review, import preview, media discovery, and public promotion preview before any write.
  • Public publication state is separate from validation state: draft, discovery, published, needs_review, hidden, archived.
  • discovery rows may appear publicly when they have source evidence, multiday dates, location, style, and confidence >= 75.
  • published rows require stricter readiness, no duplicate risk, confidence >= 85, and full public fields.
  • Existing migentedance.com and migentedmv.com URLs are captured as legacy URL mappings for dedupe and future redirects.
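
The discovery and published thresholds above can be read as two gates. The sketch below is illustrative: the Listing fields are hypothetical names, and it assumes that "stricter readiness" means published rows must also satisfy every discovery requirement.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical readiness flags for one public listing."""
    confidence: int
    has_source_evidence: bool
    has_multiday_dates: bool
    has_location: bool
    has_style: bool
    duplicate_risk: bool = False
    has_full_public_fields: bool = False


def can_be_discovery(listing: Listing) -> bool:
    """Discovery gate: evidence, multiday dates, location, style, confidence >= 75."""
    return (
        listing.has_source_evidence
        and listing.has_multiday_dates
        and listing.has_location
        and listing.has_style
        and listing.confidence >= 75
    )


def can_be_published(listing: Listing) -> bool:
    """Published gate: discovery requirements plus the stricter checks.

    Assumption: published builds on the discovery gate.
    """
    return (
        can_be_discovery(listing)
        and not listing.duplicate_risk
        and listing.has_full_public_fields
        and listing.confidence >= 85
    )
```

A row with confidence 80 and complete evidence would pass the discovery gate but not the published gate, which matches the separation between the two states described above.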

Phase 1 (Architecture)

  • Finalize corpus eligibility rules in SQL triggers and documented assumptions.
  • Publish the migration and seed runbook in workers/playwright-source-scanner/supabase/.
  • Add ERD and issue-level validation docs to portal references.
  • Add a milestone check that all architecture tasks are tracked in GitHub issues labeled ready-for-codex by the time Codex execution starts.

Phase 1 (Data collection)

  • Build a country-by-country source list for congress-family events.
  • Prioritize all known annual congresses that are active in 2026–2028.
  • For each event: ingest event brand, edition, dates, primary links, and venue.
  • Add initial ticket and hotel references only for confident matches.
  • Exclude bootcamps and single-day events from corpus eligibility, but preserve them in research history where useful.
  • Add validation_status tagging for first-pass rows (pass, needs_review, backlog_only).
  • Generate v2 YAML review outputs before Supabase import (mobilis-research-upload-v2-preview.json, mobilis-research-upload-v2-promote-preview.json, backlog CSV, validation summary).
  • Review public readiness through the v2 preview/promotion reports before publishing anything on alpha.migentedance.com.

Phase 2 (Quality and backlog)

  • Add periodic consistency checks (core candidates, multiday checks, duplicates).
  • Create issues for missing sources and unresolved event type classification.
  • Set acceptance criteria for “worldwide complete enough” by continent and year.

Labels

  • mg-congress-01
  • migente
  • world-congress
  • phase-1, phase-2, phase-3
  • architecture or data-collection
  • data-quality
  • ready-for-codex

For each batch:

  • include one data-collection issue for a country cluster,
  • include one architecture issue if rule exceptions are required,
  • add a weekly data-quality issue linked to audit snapshots.

GitHub tracking guidance

  • Create one issue per major city region when onboarding 25+ events.
  • Use labels such as migente-world-congress, phase-1, and data-collection to distinguish data-collection work from architecture work.
  • Attach SQL results snapshots to each issue (at minimum: event, edition, and social link rows).