AfriData

Data Standards

Guidelines for high-quality, interoperable datasets

Supported File Formats

AfriData Commons supports a wide range of file formats to ensure maximum accessibility and interoperability. Here are our supported formats and recommendations:

Format       Extension           Support Level     Use Case                               Max Size
CSV          .csv                Recommended       Structured tabular data                100 MB
JSON         .json               Recommended       Structured data, APIs                  50 MB
Excel        .xlsx, .xls         Supported         Spreadsheet data with multiple sheets  25 MB
XML          .xml                Supported         Hierarchical structured data           25 MB
Parquet      .parquet            Beta              Large-scale analytics                  500 MB
GeoJSON      .geojson            Recommended       Geographic data                        50 MB
Shapefile    .shp + .shx + .dbf  Supported         GIS vector data                        100 MB
Proprietary  .sav, .dta, .mat    Convert Required  Statistical software formats           N/A
Best Practice: For maximum compatibility and longevity, we recommend using open, non-proprietary formats like CSV, JSON, and GeoJSON. These formats are widely supported and accessible across different platforms and tools.
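
As a rough illustration of how the limits in the table could be checked before upload, the sketch below maps each extension to its size limit. The function name `check_file`, the status strings, and the choice of 1 MB = 1,000,000 bytes are assumptions made for this example; it also assumes the Shapefile limit applies to each component file, which the table does not specify. It is not the platform's actual validator.

```python
from pathlib import Path

# Size limits in megabytes, copied from the table above.
MAX_SIZE_MB = {
    ".csv": 100, ".json": 50, ".xlsx": 25, ".xls": 25, ".xml": 25,
    ".parquet": 500, ".geojson": 50, ".shp": 100, ".shx": 100, ".dbf": 100,
}
# Statistical-software formats listed as "Convert Required".
CONVERT_REQUIRED = {".sav", ".dta", ".mat"}


def check_file(filename: str, size_bytes: int) -> str:
    """Return a short status string for one file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext in CONVERT_REQUIRED:
        return "convert required"
    if ext not in MAX_SIZE_MB:
        return "unsupported format"
    # Assumption: 1 MB is counted as 1,000,000 bytes.
    if size_bytes > MAX_SIZE_MB[ext] * 1_000_000:
        return "too large"
    return "ok"
```

For example, `check_file("survey.csv", 45_700_000)` returns `"ok"`, while a `.dta` file of any size returns `"convert required"`.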

Metadata Standards

Comprehensive metadata is crucial for dataset discovery and proper usage. Our metadata schema follows international standards while accommodating African-specific context.

Basic Information

  • Dataset title and description
  • Creator/author information
  • Creation and modification dates
  • Version information
  • License and usage rights

Geographic Context

  • Country/region coverage
  • Coordinate system (if applicable)
  • Administrative boundaries
  • Urban/rural classification
  • Language(s) used

Temporal Coverage

  • Data collection period
  • Temporal resolution
  • Update frequency
  • Historical context
  • Seasonal considerations

Methodology

  • Data collection methods
  • Sampling techniques
  • Quality control measures
  • Processing steps
  • Limitations and biases

Example Metadata Schema (JSON)

// AfriData Commons Metadata Schema
{
  "title": "Kenya Agricultural Survey 2024",
  "description": "Comprehensive survey of agricultural practices...",
  "creator": {
    "name": "Dr. Jane Doe",
    "affiliation": "University of Nairobi",
    "email": "jane.doe@uonbi.ac.ke"
  },
  "geographic_coverage": {
    "country": "Kenya",
    "regions": ["Central", "Eastern", "Western"],
    "coordinate_system": "WGS84"
  },
  "temporal_coverage": {
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "frequency": "Annual"
  },
  "methodology": {
    "collection_method": "Survey",
    "sample_size": 5000,
    "sampling_method": "Stratified random sampling"
  },
  "license": "CC BY 4.0",
  "version": "1.0",
  "created_date": "2024-03-15",
  "file_format": "CSV",
  "file_size": "45.7 MB"
}
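
A metadata record shaped like the example above could be checked locally before submission. The sketch below is only illustrative: the list of required fields is an assumption (this page does not state which fields are mandatory), and `validate_metadata` is a hypothetical helper, not part of any AfriData tooling.

```python
from datetime import date

# Assumption: these top-level fields are treated as required for the sketch.
REQUIRED_FIELDS = [
    "title", "description", "creator", "geographic_coverage",
    "temporal_coverage", "methodology", "license", "version",
]


def validate_metadata(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    coverage = record.get("temporal_coverage", {})
    try:
        # Dates in the example use ISO 8601 (YYYY-MM-DD).
        start = date.fromisoformat(coverage["start_date"])
        end = date.fromisoformat(coverage["end_date"])
        if start > end:
            problems.append("start_date is after end_date")
    except (KeyError, TypeError, ValueError):
        problems.append("temporal_coverage needs ISO start_date and end_date")
    return problems
```

Returning a list of problems, rather than raising on the first one, lets a submitter fix everything in a single pass.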

Data Quality Standards

High-quality data is essential for reliable research and analysis. Our quality standards ensure datasets meet international best practices.

Completeness

Ensure your dataset is as complete as possible:

  • Missing values < 5% per column
  • Clear indication of null/missing data
  • Explanation for missing values
  • Complete geographic coverage
Target: 95%+
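
The "missing values < 5% per column" rule can be estimated on a CSV file with the standard library alone. In this sketch the set of missing-value markers (`""`, `NA`, `N/A`, `null`, `-999`) is an assumption; in practice it should match whatever the data dictionary declares. The function names are hypothetical.

```python
import csv
import io

# Assumption: strings treated as "missing"; adjust to the dataset's own codes.
MISSING_MARKERS = {"", "NA", "N/A", "null", "-999"}


def missing_share(csv_text: str) -> dict[str, float]:
    """Return the fraction of missing values for each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames or []}
    rows = 0
    for row in reader:
        rows += 1
        for name in counts:
            value = row.get(name)
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return {name: (counts[name] / rows if rows else 0.0) for name in counts}


def columns_over_threshold(shares: dict[str, float],
                           threshold: float = 0.05) -> list[str]:
    """Columns whose missing share is not below the 5% threshold."""
    return sorted(name for name, share in shares.items() if share >= threshold)
```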

Accuracy

Verify data accuracy through validation:

  • Cross-validation with external sources
  • Outlier detection and handling
  • Unit consistency checks
  • Temporal consistency validation
Target: 98%+
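
This page does not prescribe a particular outlier-detection method. One common option is the interquartile-range rule (Tukey's fences), sketched below with Python's `statistics` module. The multiplier `k = 1.5` is the conventional default, not an AfriData requirement, and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

Flagged values are candidates for review rather than automatic deletion; how they are handled should be recorded in the methodology document.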

Timeliness

Ensure data is current and relevant:

  • Regular update schedule
  • Clear versioning system
  • Timestamp for data collection
  • Deprecation notices for old data
Target: 90%+

Consistency

Maintain consistent data formats:

  • Standardized naming conventions
  • Uniform data types
  • Consistent units of measurement
  • Harmonized categorical values
Target: 95%+
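
Two of the consistency checks above lend themselves to a short sketch: naming conventions and harmonized categorical values. The snake_case rule and the urban/rural mapping below are assumptions chosen for illustration (the page does not mandate a specific convention), and both helper names are hypothetical.

```python
import re

# Assumption: column names follow snake_case, e.g. "household_income".
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

# Assumption: illustrative mapping of spelling variants to canonical categories.
CANONICAL = {"urban": "urban", "u": "urban", "rural": "rural", "r": "rural"}


def non_snake_case(columns: list[str]) -> list[str]:
    """Return the column names that do not follow snake_case."""
    return [c for c in columns if not SNAKE_CASE.fullmatch(c)]


def harmonize(value: str) -> str:
    """Map a raw category label to its canonical form, or raise ValueError."""
    key = value.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unknown category: {value!r}")
    return CANONICAL[key]
```

Raising on unknown labels, instead of passing them through, keeps unexpected categories from slipping silently into a published dataset.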
Quality Assurance: All datasets undergo automated quality checks upon submission. Datasets not meeting minimum quality thresholds will be flagged for review before publication.

Documentation Standards

Comprehensive documentation ensures your dataset can be understood and used effectively by other researchers and practitioners.

Data Dictionary

  • Column/variable descriptions
  • Data types and formats
  • Valid ranges and constraints
  • Relationships between variables
  • Coding schemes for categorical data

Methodology Document

  • Research objectives and questions
  • Sampling methodology
  • Data collection procedures
  • Quality control measures
  • Known limitations and biases

Technical Documentation

  • Processing scripts and code
  • Software versions and dependencies
  • Hardware specifications
  • Computational environment details
  • Reproducibility instructions

Usage Guidelines

  • Intended use cases
  • Appropriate analysis methods
  • Citation requirements
  • Contact information for questions
  • Update and maintenance schedule

Example Data Dictionary Entry

// Variable: household_income
{
  "name": "household_income",
  "description": "Monthly household income in Kenyan Shillings",
  "type": "numeric",
  "format": "integer",
  "unit": "KES",
  "range": { "min": 0, "max": 500000 },
  "missing_values": "-999",
  "notes": "Self-reported income; may include informal sources"
}
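
A data dictionary entry like the one above can double as a validation rule. The sketch below checks raw cell values against the `household_income` entry; the function name `check_value` and the returned status strings are assumptions made for this example.

```python
# The entry from the example above, reduced to the fields the check uses.
ENTRY = {
    "name": "household_income",
    "format": "integer",
    "range": {"min": 0, "max": 500000},
    "missing_values": "-999",
}


def check_value(raw: str, entry: dict) -> str:
    """Classify one raw cell value against a data dictionary entry."""
    # The missing-value code is compared first so that -999 is not
    # reported as an out-of-range income.
    if raw == entry["missing_values"]:
        return "missing"
    try:
        number = int(raw)
    except ValueError:
        return "not an integer"
    bounds = entry["range"]
    if not bounds["min"] <= number <= bounds["max"]:
        return "out of range"
    return "valid"
```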

Ethics and Privacy Standards

Ethical data sharing is fundamental to AfriData Commons. All datasets must comply with ethical guidelines and privacy regulations.

Privacy Protection

Personal and sensitive data must be properly anonymized or aggregated. Direct identifiers should be removed, and indirect identifiers should be assessed for re-identification risks.

Informed Consent

Data subjects must have provided informed consent for data collection and sharing. If consent was not explicitly obtained for public sharing, data must be sufficiently anonymized.

Fairness and Non-discrimination

Datasets should not perpetuate or amplify existing biases. Consider the potential for discriminatory use and provide appropriate warnings or safeguards.

Cultural Sensitivity

Respect cultural contexts and sensitivities. Engage with local communities and stakeholders when appropriate, especially for data about indigenous or marginalized populations.

Anonymization Requirements

  • Remove direct identifiers (names, IDs, addresses)
  • Assess quasi-identifiers (age, location, profession)
  • Apply k-anonymity (k≥5) for sensitive data
  • Use differential privacy for high-risk datasets
  • Document anonymization methods
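
The k-anonymity requirement (k≥5) means every combination of quasi-identifier values must be shared by at least five records. A minimal check, assuming records are plain dictionaries and using hypothetical column names such as `age_band` and `region`, might look like this:

```python
from collections import Counter


def group_sizes(records: list[dict], quasi_identifiers: list[str]) -> Counter:
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def violating_groups(records: list[dict], quasi_identifiers: list[str],
                     k: int = 5) -> dict:
    """Return the combinations that appear fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


def is_k_anonymous(records: list[dict], quasi_identifiers: list[str],
                   k: int = 5) -> bool:
    """True when no quasi-identifier combination is rarer than k."""
    return not violating_groups(records, quasi_identifiers, k)
```

Groups returned by `violating_groups` are the ones that would need generalization (for example, wider age bands) or suppression before release. This check covers k-anonymity only and does not address differential privacy.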

Ethical Review

  • Institutional Review Board (IRB) approval
  • Ethics committee review documentation
  • Data sharing agreements
  • Consent forms and protocols
  • Risk assessment documentation
Ethics Review Required: All datasets containing human subjects data must undergo ethics review before publication. Contact our ethics committee if you're unsure about requirements.

Submission Process

Follow these steps to ensure your dataset meets our standards and is successfully published on AfriData Commons.

1. Prepare Your Dataset

Ensure your data is in a supported format, properly structured, and cleaned. Remove any sensitive information and create comprehensive documentation.

2. Complete Metadata

Fill out all required metadata fields using our online form. Provide detailed descriptions, geographic coverage, and methodology information.

3. Upload Files

Upload your dataset files, documentation, and any supplementary materials. Ensure file sizes are within limits and formats are supported.

4. Quality Check

Our automated system will run quality checks on your dataset. Review any flagged issues and make necessary corrections.

5. Peer Review

Your dataset will undergo peer review by domain experts. This typically takes 2-4 weeks depending on complexity and reviewer availability.

6. Publication

Once approved, your dataset will be published with a DOI and made available to the research community. You'll receive a notification with the publication details.

Need Help? Our data curation team is available to assist with the submission process. Contact us at data@afridata.org for guidance.