Artha AI Documentation

Everything you need to run multilingual dataset generation end to end.

1. What is Artha AI

Artha AI is a multilingual dataset generation platform for ML teams. It collects, cleans, labels, quality-checks, and exports training-ready data in English and Indian languages.

2. How It Works

  1. scrape: source collection across configured connectors.
  2. clean: language filtering, normalization, deduplication.
  3. label: LLM-backed annotation by selected label type.
  4. quality_check: confidence scoring and distribution checks.
  5. merge: per-language outputs combined into a unified dataset.
  6. export: parallel export to the requested formats, plus metadata.
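
The stage order above can be sketched as a small driver. This is a minimal illustration, not Artha AI's implementation: the function names and signatures are hypothetical, and it assumes the first four stages run once per language before merge. Exports run sequentially here for simplicity, whereas the platform describes them as parallel.

```python
from typing import Callable, Dict, List

# Stage names are taken from the list above; everything else is illustrative.
PER_LANGUAGE_STAGES: List[str] = ["scrape", "clean", "label", "quality_check"]


def run_job(
    languages: List[str],
    stage_fns: Dict[str, Callable[[str, list], list]],
    merge_fn: Callable[[Dict[str, list]], list],
    export_fn: Callable[[list, str], object],
    formats: List[str],
) -> Dict[str, object]:
    """Run the per-language stages in order, then merge once and export per format."""
    per_language: Dict[str, list] = {}
    for lang in languages:
        rows: list = []
        for stage in PER_LANGUAGE_STAGES:
            # Each stage receives the language code and the previous stage's rows.
            rows = stage_fns[stage](lang, rows)
        per_language[lang] = rows

    merged = merge_fn(per_language)  # step 5: one unified dataset
    # Step 6: one export per requested format (sequential in this sketch).
    return {fmt: export_fn(merged, fmt) for fmt in formats}
```

Passing the stages in as callables keeps the ordering logic separate from what each stage actually does.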

3. Supported Languages

Language  Code  Script      Sources
English   en    Latin       Reddit, YouTube, Play Store, News
Hindi     hi    Devanagari  Reddit, YouTube, Play Store, News
Gujarati  gu    Gujarati    Reddit, YouTube, Play Store, News
Marathi   mr    Devanagari  Reddit, YouTube, Play Store, News
Tamil     ta    Tamil       Reddit, YouTube, Play Store, News

4. Label Types

Type       Description             Example Labels
sentiment  Classifies polarity     positive / negative / neutral
topic      Classifies theme        politics / sports / entertainment
ner        Classifies entity type  PERSON / LOCATION / ORGANIZATION
all        Runs all labelers       sentiment + topic + ner
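
As a rough sketch of how the `all` type relates to the other three, the helper below expands a requested label type into the labelers it implies. The function name is hypothetical and only the four type names come from the table.

```python
from typing import Tuple

# The three concrete labelers listed in the table above.
LABELERS: Tuple[str, ...] = ("sentiment", "topic", "ner")


def resolve_label_types(label_type: str) -> Tuple[str, ...]:
    """Expand a requested label type into the concrete labelers to run."""
    if label_type == "all":
        return LABELERS
    if label_type in LABELERS:
        return (label_type,)
    raise ValueError(f"Unknown label type: {label_type!r}")
```

For example, `resolve_label_types("all")` yields all three labelers, while `resolve_label_types("ner")` yields only `ner`.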

5. Export Formats

Format       Best For                             Notes
CSV          Analytics and spreadsheet workflows  UTF-8 SIG for Excel compatibility
JSON         APIs and data interchange            Indented and Unicode-safe
Excel        Business reporting                   Includes Dataset and Quality_Info sheets
Parquet      ML pipelines and lakehouses          Strict numeric/boolean dtypes
HuggingFace  Model training teams                 Saved as Dataset folder with Arrow data
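
To make the CSV and JSON notes concrete, here is a small standard-library sketch. It assumes "UTF-8 SIG" means Python's `utf-8-sig` codec (UTF-8 with a byte order mark, which Excel uses to detect the encoding) and reads "Indented and Unicode-safe" as `indent=2` with `ensure_ascii=False`. The function names are hypothetical, and the platform's own exporters may differ.

```python
import csv
import io
import json
from typing import Dict, List


def to_csv_bytes(rows: List[Dict[str, object]], fieldnames: List[str]) -> bytes:
    """Serialize rows as CSV and encode with a UTF-8 BOM so Excel detects the encoding."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    # "utf-8-sig" prepends the BOM bytes EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


def to_json_text(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(rows, indent=2, ensure_ascii=False)
```

With `ensure_ascii=False`, Devanagari, Gujarati, and Tamil text is written as-is rather than as `\uXXXX` escapes.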

6. Data Schema

Column             Type           Description
id                 int64          Sequential row id
text_original      string         Raw source text
text_clean         string         Normalized text for labeling
language           string         Language code
language_name      string         Human language name
script             string         Writing script
domain             string         Selected domain
source             string         Source provider
source_url         string | null  Direct content URL
source_subreddit   string | null  Subreddit name if applicable
label_sentiment    string | null  Sentiment label
label_topic        string | null  Topic label
label_ner          string | null  NER label
confidence         float64        Label confidence
confidence_reason  string         Model reasoning summary
llm_used           string         Label provider
needs_review       boolean        Manual review flag
app_id             string | null  Play Store app id
star_rating        int | null     App rating value
rating_hint        string | null  Rating context
created_at         ISO datetime   Export creation timestamp
job_id             string         Dataset job id
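
A consumer of the exported data might check rows against this schema before training. The sketch below is one rough way to do that in plain Python. The column names and nullability come from the table, while the Python type mapping and the `schema_problems` helper are assumptions made for illustration (for instance, `created_at` is treated as an ISO-formatted string).

```python
from typing import Dict, List, Tuple

# column -> (expected Python type, nullable?), following the table above.
SCHEMA: Dict[str, Tuple[type, bool]] = {
    "id": (int, False),
    "text_original": (str, False),
    "text_clean": (str, False),
    "language": (str, False),
    "language_name": (str, False),
    "script": (str, False),
    "domain": (str, False),
    "source": (str, False),
    "source_url": (str, True),
    "source_subreddit": (str, True),
    "label_sentiment": (str, True),
    "label_topic": (str, True),
    "label_ner": (str, True),
    "confidence": (float, False),
    "confidence_reason": (str, False),
    "llm_used": (str, False),
    "needs_review": (bool, False),
    "app_id": (str, True),
    "star_rating": (int, True),
    "rating_hint": (str, True),
    "created_at": (str, False),  # assumed to arrive as an ISO datetime string
    "job_id": (str, False),
}


def schema_problems(row: dict) -> List[str]:
    """Return a list of human-readable schema violations for one row."""
    problems: List[str] = []
    for column, (py_type, nullable) in SCHEMA.items():
        if column not in row:
            problems.append(f"{column}: missing")
        elif row[column] is None:
            if not nullable:
                problems.append(f"{column}: null not allowed")
        elif not isinstance(row[column], py_type):
            problems.append(f"{column}: expected {py_type.__name__}")
    return problems
```

An empty list means the row has every column, no nulls in required columns, and values of the expected basic type.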

7. Quality Scoring

Quality is derived from labeling confidence:

quality_score = average(confidence) * 100

  • >= 90: Excellent
  • >= 80 and < 90: Good
  • >= 70 and < 80: Acceptable
  • < 70: Review recommended
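
The formula and bands can be expressed directly in code. This sketch assumes confidence values lie between 0 and 1, so that the score falls between 0 and 100. It treats each band as a half-open interval, so a fractional score such as 89.5 counts as Good. The function names are illustrative.

```python
from statistics import mean
from typing import Iterable


def quality_score(confidences: Iterable[float]) -> float:
    """quality_score = average(confidence) * 100, assuming confidences in [0, 1]."""
    return mean(confidences) * 100


def quality_band(score: float) -> str:
    """Map a quality score to its band, checking thresholds from highest to lowest."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 70:
        return "Acceptable"
    return "Review recommended"
```

For instance, confidences of 0.8 and 0.9 average to 0.85, giving a score of about 85 and the band Good.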

When English is included in a job, it serves as the benchmark language: the quality scores of the other languages can be compared against the English score.

8. API Reference

Method  Path                             Description                            Example Response
POST    /api/generate-dataset            Submit dataset request and get job id  {"job_id":"...","estimated_minutes":4,"message":"Dataset generation queued successfully"}
GET     /api/job-status/{job_id}         Fetch live progress for a job          {"status":"labeling","progress_percent":62,...}
GET     /api/quality-report/{job_id}     Get final quality report               {"overall_quality_score":90.1,...}
GET     /api/download/{job_id}/{format}  Download generated export file         Binary file response
GET     /api/health                      Service health and dependency checks   {"status":"ok","version":"1.0.0",...}
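
A client would typically submit a job, poll its status, and then download the result. The sketch below covers the polling part using only the standard library. Only the endpoint paths come from the table. The base URL, the helper names, and the completion check are assumptions: the documentation does not list the terminal status values, so the caller supplies its own `is_done` predicate.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's address


def job_status_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /api/job-status/{job_id}."""
    return f"{base_url}/api/job-status/{job_id}"


def download_url(job_id: str, fmt: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /api/download/{job_id}/{format}."""
    return f"{base_url}/api/download/{job_id}/{fmt}"


def get_json(url: str) -> dict:
    """Perform a GET request and decode the JSON response body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    is_done: Callable[[dict], bool],
    fetch: Callable[[str], dict] = get_json,
    interval_s: float = 5.0,
    max_polls: int = 120,
) -> dict:
    """Poll the job-status endpoint until is_done(status) is true or polls run out."""
    for _ in range(max_polls):
        status = fetch(job_status_url(job_id))
        if is_done(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

The `fetch` parameter exists so the polling loop can be exercised without a running service. One possible predicate is `lambda s: s.get("progress_percent", 0) >= 100`, though whether that marks completion is itself an assumption to verify against the live API.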