Help: Editorial standards

From WikiLeague, the free baseball governance encyclopedia.

STANDARDS

The archive is research-grade. These are the standards a document must meet at each stage of its life in the collection. They apply uniformly: a 1903 National Agreement and a 2026 commissioner memo are held to the same evidentiary bar.

1. The status ladder

Every document carries a status field. The ladder is:

placeholder  →  needs_review  →  verified
                                   ↓ (problem discovered)
                                 needs_review

1.1 placeholder

We have a credible bibliographic citation that the document exists but do not yet have the file. A placeholder has:

  • A title (as best we know it).
  • An approximate date.
  • A doc_type and category guess.
  • A source.primary_url if we have a likely retrieval target (or source.lead_url, a separate field, for a softer pointer to where we expect to find it).
  • All content fields (key_provisions, quoted_excerpts, abstract, summary body) empty or marked unknown.

Don't create a placeholder for a document you've only vaguely heard about. A placeholder is a commitment to acquire the document, so it requires a bibliographic anchor.

Placeholders are tracked both in their target folder (if we're committed to acquiring them) and in catalog/WANTLIST.md.

1.2 needs_review

We have a copy of the document on disk. The metadata is filled in to the best of our knowledge from a single source. The document has not yet been confirmed against a second independent source. A needs_review document is usable for casual reference but should not be cited as authoritative until promoted.

1.3 verified

The document has been confirmed against at least two independent sources, both recorded in source.confirmation_sources. Date, parties, citation, and all key provisions have been spot-checked against both sources. A Wayback Machine snapshot has been captured for any web-sourced URL. The file's SHA256 hash is recorded in file.sha256. A verified document is the archive's product.

1.4 Demotion

A verified document can be demoted to needs_review if a discrepancy is later discovered, a source is retracted, or a more authoritative version surfaces. Demotion is logged in catalog/PROVENANCE_LOG.md with date and reason.
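
The ladder can be read as a small state machine. The following is a minimal Python sketch, assuming the three status strings above. The names ALLOWED_TRANSITIONS and change_status, and the shape of the returned log entry, are hypothetical and are not part of the archive's tooling. It also assumes that a placeholder cannot jump straight to verified, which the ladder implies but does not state outright.

```python
from datetime import date

# Allowed moves on the status ladder (hypothetical encoding of section 1).
ALLOWED_TRANSITIONS = {
    "placeholder": {"needs_review"},
    "needs_review": {"verified"},
    "verified": {"needs_review"},  # demotion
}


def change_status(doc, new_status, reason="", today=None):
    """Move doc to new_status and return an entry for the provenance log.

    doc is a plain dict with a "status" key and is updated in place.
    A demotion (verified -> needs_review) must carry a reason.
    """
    old_status = doc["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(old_status, set()):
        raise ValueError(f"illegal transition: {old_status} -> {new_status}")
    if old_status == "verified" and not reason:
        raise ValueError("demotion requires a reason")
    doc["status"] = new_status
    return {
        "date": (today or date.today()).isoformat(),
        "from": old_status,
        "to": new_status,
        "reason": reason,
    }
```

The returned entry is the kind of record that would be appended to catalog/PROVENANCE_LOG.md; how it is serialized there is left open.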

2. Sourcing requirements

2.1 Primary source defined

A primary source is the original publisher of the document:

  • Court rulings: the issuing court's own publication (PACER, court website) or the official reporter (U.S. Reports, Federal Reporter).
  • Legislation: congress.gov, the U.S. Government Publishing Office (GPO), or the Public Law text in the U.S. Statutes at Large.
  • Congressional hearings: GPO's published hearing record, congress.gov, or the committee's own website.
  • MLB-issued documents: MLB.com press releases or official MLB publications.
  • MLBPA-issued documents: MLBPlayers.com, MLBPA press materials.
  • CBAs: jointly published by MLB and MLBPA, usually accessible through MLBPlayers.com.
  • Stadium and franchise agreements: the relevant government body's records (city council, sports commission), or court filings if exhibited.
  • Foundational historical documents: original publication where extant, or the earliest reliable reproduction (e.g., a 19th-century newspaper of record, Spalding Guide).

2.2 Acceptable secondary sources

A secondary source is a faithful reproduction or excerpt by a non-primary publisher: SABR's library, Justia, Cornell LII, FindLaw, businessofbaseball.com (via Wayback), HeinOnline, the Brooklyn Sports Law Blog, ProQuest Congressional, and, for stadium documents, Field of Schemes. Secondary sources are acceptable as the retrieval point, but the document's lineage to a primary source must be traceable.

2.3 Unacceptable sources for content

  • Wikipedia and Wikipedia-derived sites — acceptable as a pointer, never as a substitute.
  • Podcast transcripts and YouTube videos.
  • AI-generated summaries of any kind.
  • News articles describing or paraphrasing a document, when the document itself is available.
  • Pay-to-access summary services (Bloomberg Law summaries, etc.) — use them to find the document, not to be the document.
  • Forum posts and Reddit threads.

2.4 The two-source confirmation rule

Promotion to verified requires two independent sources that present the document's substantive content. "Independent" means:

  • Not derived from one another (Cornell LII and Justia are independent of each other because each draws on the official U.S. Reports; neither is derived from the other).
  • Not co-owned (a publisher's own mirror does not count as a second source).
  • Not relying on a common feed (two news outlets publishing AP wire copy are not two sources).

When in doubt, treat them as dependent and find a third.
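
The three tests above can be sketched as a single check. This is an illustration only: the Source record and its fields (owner, derived_from, feed) are hypothetical and do not appear in the schema, and the owner strings in the usage note are placeholder labels. An unknown owner is treated as dependent, following the "when in doubt" rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a confirmation source."""
    name: str
    owner: Optional[str] = None         # None means the owner is unknown
    derived_from: Optional[str] = None  # name of the source this one copies, if any
    feed: Optional[str] = None          # shared syndication feed (e.g. wire copy), if any


def are_independent(a: Source, b: Source) -> bool:
    """Apply the three independence tests; doubt counts as dependence."""
    # Test 1: neither source is derived from the other.
    if a.derived_from == b.name or b.derived_from == a.name:
        return False
    # Test 2: not co-owned. An unknown owner is treated as dependent.
    if a.owner is None or b.owner is None or a.owner == b.owner:
        return False
    # Test 3: not relying on a common feed.
    if a.feed is not None and a.feed == b.feed:
        return False
    return True
```

For example, are_independent(Source("Cornell LII", owner="Cornell"), Source("Justia", owner="Justia")) passes all three tests, while two outlets that both set feed="AP" fail the third.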

2.5 archive.org snapshots

For any web-sourced URL recorded in source.primary_url or source.confirmation_sources, a Wayback Machine snapshot must be captured (or confirmed to already exist) and recorded in source.archive_url. If the URL cannot be snapshotted (firewalled, paywalled, or blocked by robots.txt), record this in source.snapshot_notes. That URL can then no longer carry the document to verified on its own: the file must come from somewhere else.
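
Confirming that a snapshot already exists can be sketched as below. This assumes the Internet Archive's public availability endpoint at https://archive.org/wayback/available and a JSON response carrying an archived_snapshots.closest object; check both against the Archive's current documentation before relying on them. The function names are hypothetical. The sketch only looks up an existing snapshot and does not capture a new one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload):
    """Return the snapshot URL from an availability response, or None.

    payload is the decoded JSON body. A response with no snapshot is
    assumed to carry an empty "archived_snapshots" object.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup_snapshot(url, timeout=30):
    """Ask the availability endpoint about url (performs a network call)."""
    query = urlencode({"url": url})
    with urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A non-None result is a candidate for source.archive_url. A None result means a capture is still needed, or, if capture is impossible, an entry in source.snapshot_notes.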

3. File integrity

3.1 SHA256 hash required

Every file stored in documents/ has its SHA256 hash computed and recorded in file.sha256. This catches:

  • Silent corruption of files over time.
  • Inadvertent replacement.
  • Mismatches with the source publisher's version.

Compute hashes with shasum -a 256 <filename> (macOS) or sha256sum <filename> (Linux).
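
Where a script is more convenient than the command line, the same digest can be computed with Python's standard hashlib module. This is a minimal sketch; the function names are hypothetical, and the file is read in chunks so that large scans do not have to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_sha256):
    """True if the file on disk still matches the value in file.sha256."""
    return sha256_of_file(path) == recorded_sha256.strip().lower()
```

The hex string returned by sha256_of_file is the same value that shasum -a 256 and sha256sum print in their first column.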

3.2 File replacement

If a file is replaced (e.g., a higher-quality scan is found, OCR is added, a corrected version is published):

  • The new SHA256 is recorded.
  • The old SHA256 is preserved in file.previous_hashes with the date of replacement and reason.
  • The replacement is logged in catalog/PROVENANCE_LOG.md.
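
The three steps above might look like this in a script. It is a sketch under assumptions: the schema names file.previous_hashes, but the keys inside each entry (sha256, replaced_on, reason) and the function name are hypothetical.

```python
def record_replacement(file_meta, new_sha256, replaced_on, reason):
    """Rotate the current hash into previous_hashes and store the new one.

    file_meta is the document's "file" block as a dict and is updated in
    place. Returns an entry suitable for the provenance log.
    """
    if not reason:
        raise ValueError("a replacement needs a reason")
    old_sha256 = file_meta["sha256"]
    # Preserve the old hash with the date and reason for the replacement.
    file_meta.setdefault("previous_hashes", []).append(
        {"sha256": old_sha256, "replaced_on": replaced_on, "reason": reason}
    )
    file_meta["sha256"] = new_sha256
    return {
        "date": replaced_on,
        "event": "file_replaced",
        "old_sha256": old_sha256,
        "new_sha256": new_sha256,
        "reason": reason,
    }
```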

3.3 File format preferences

  • PDF is preferred for legal documents, hearings, formal reports, and any document whose original was paginated.
  • Plain text is preferred for short documents where layout is not load-bearing (e.g., a single press release).
  • HTML is acceptable only if the source is HTML and it cannot be printed to PDF without loss; otherwise save it as PDF via headless print or print-to-file for archival stability.
  • Avoid Word/Pages/proprietary formats unless the original was such.
  • Avoid screenshots unless no other form of the document exists.

3.4 OCR

For scanned PDFs without an embedded text layer, run OCR before storage (e.g., ocrmypdf) and store the OCR'd version. The fact that OCR was applied, and the tool used, must be noted in file.processing_notes.

4. Date precision

ISO 8601 dates only. Partial dates are allowed:

  • 2022-03-10 — full date, date_precision: day
  • 2022-03 — month known, date_precision: month
  • 2022 — year only, date_precision: year
  • For documents with a range (a CBA covering 2022–2026), date is the date of execution and effective_period (separate field) captures the range.

Do not invent precision. "1921 Major League Agreement" is date: "1921", date_precision: "year" unless we have a verified signing date.
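
The mapping from a date string to date_precision is mechanical, so it can be checked rather than typed by hand. The sketch below is illustrative; the function name is hypothetical. It accepts only the three forms listed above and rejects anything else, including impossible calendar values.

```python
import re
from datetime import date

# Most specific pattern first; fullmatch() rules out partial matches.
_PATTERNS = (
    ("day", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")),
    ("month", re.compile(r"[0-9]{4}-[0-9]{2}")),
    ("year", re.compile(r"[0-9]{4}")),
)


def date_precision(value):
    """Return "day", "month", or "year" for an ISO 8601 (partial) date."""
    for precision, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            parts = [int(piece) for piece in value.split("-")]
            # Pad a missing month or day with 1 so the calendar check can
            # run; date() raises ValueError for values such as month 13.
            date(*(parts + [1, 1])[:3])
            return precision
    raise ValueError(f"not an ISO 8601 date or partial date: {value!r}")
```

With this, date_precision("1921") gives "year", while a string such as "March 1921" raises ValueError instead of being silently accepted.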

5. Citation requirements

For documents that have an established citation format, the canonical citation must be recorded:

  • Court rulings: Bluebook citation in citation.bluebook. Example: Flood v. Kuhn, 407 U.S. 258 (1972). Include parallel citations in citation.parallel if known.
  • Federal statutes: Public Law number and Statutes at Large citation. Example: Pub. L. 105-297, 112 Stat. 2824 (1998) for the Curt Flood Act.
  • Congressional hearings: GPO catalog identifier where available. Example: S. Hrg. 107-427.
  • CBAs and league documents: title plus effective date range. Example: 2022–2026 Basic Agreement between MLB and the MLBPA.

If the canonical form is unknown, record unknown rather than guessing.

6. Required metadata fields (summary)

The authoritative schema lives in METADATA_SCHEMA.md §2. This is a short prose summary; if it ever diverges from the schema, the schema wins.

Required at every status level (including placeholder):

  • schema_version, title, date, date_precision, date_type, doc_type, category, status, status_history, source.retrieved_date, source.retrieved_by

Required at needs_review and above (in addition to the above):

  • source.primary_url or source.physical_provenance, file.filename, file.format, file.sha256

Required at verified (in addition to all of the above):

  • source.archive_url (for any web-sourced URL), source.confirmation_sources (≥2 independent), citation block (where applicable to the doc type), key_provisions (≥1 entry), parties where applicable, summary text written from the document itself
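
A per-status check of these lists could be sketched as follows. METADATA_SCHEMA.md remains authoritative; this is only an illustration. It assumes metadata is held as nested dicts keyed by the dotted names above, treats a document as web-sourced when source.primary_url is set, and leaves out the "where applicable" items (citation block, parties) and the summary text, which need human judgment. All function names are hypothetical.

```python
REQUIRED_ALWAYS = (
    "schema_version", "title", "date", "date_precision", "date_type",
    "doc_type", "category", "status", "status_history",
    "source.retrieved_date", "source.retrieved_by",
)
REQUIRED_FROM_NEEDS_REVIEW = ("file.filename", "file.format", "file.sha256")


def lookup(doc, dotted):
    """Follow a dotted path through nested dicts; None if a step is missing."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_blank(value):
    return value is None or value == "" or value == [] or value == {}


def missing_fields(doc):
    """Return the requirements doc fails to meet at its current status."""
    status = lookup(doc, "status")
    problems = [f for f in REQUIRED_ALWAYS if is_blank(lookup(doc, f))]
    if status in ("needs_review", "verified"):
        problems += [f for f in REQUIRED_FROM_NEEDS_REVIEW
                     if is_blank(lookup(doc, f))]
        if (is_blank(lookup(doc, "source.primary_url"))
                and is_blank(lookup(doc, "source.physical_provenance"))):
            problems.append("source.primary_url or source.physical_provenance")
    if status == "verified":
        if len(lookup(doc, "source.confirmation_sources") or []) < 2:
            problems.append("source.confirmation_sources (need >= 2)")
        if len(lookup(doc, "key_provisions") or []) < 1:
            problems.append("key_provisions (need >= 1)")
        # Assumption: a set primary_url means the document is web-sourced.
        if (not is_blank(lookup(doc, "source.primary_url"))
                and is_blank(lookup(doc, "source.archive_url"))):
            problems.append("source.archive_url")
    return problems
```

An empty list means the mechanical requirements are met. It does not replace the independence check on the confirmation sources or the smell test.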

7. Documentation discipline

  • Every retrieval is logged in catalog/PROVENANCE_LOG.md with the date, the retriever (Claude session ID or human), the source, and the status assigned.
  • Every status promotion is logged.
  • Every demotion is logged with reason.
  • Every file deletion is logged with reason and preserved metadata snapshot.
  • Discrepancies between sources are written up in research-logs/discrepancies/.

8. The smell test

Before promoting to verified, run the smell test:

  • Could a serious researcher cite this and not be embarrassed?
  • If the source went down tomorrow, would the local copy + snapshot be sufficient?
  • If a discrepancy were discovered next year, could we trace exactly how this version got here?

If any answer is "no," it stays needs_review.