Artha AI Documentation
Everything you need to run multilingual dataset generation end to end.
1. What is Artha AI
Artha AI is a multilingual dataset generation platform that collects, cleans, labels, quality-checks, and exports training-ready data across Indian and English languages for modern ML teams.
2. How It Works
1. `scrape`: source collection across configured connectors.
2. `clean`: language filtering, normalization, deduplication.
3. `label`: LLM-backed annotation by the selected label type.
4. `quality_check`: confidence scoring and distribution checks.
5. `merge`: per-language outputs combined into a unified dataset.
6. `export`: parallel export to requested formats plus metadata.
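The six stages above can be sketched as a chain of record transformations. This is a minimal illustration, not Artha AI internals: the function bodies are simplified stand-ins (the real `label` stage calls an LLM, for example), and the sample inputs are invented.

```python
# Illustrative sketch of the pipeline stages chained over a list of
# record dicts. Stage internals are simplified stand-ins.

def scrape(sources):
    # Pretend each connector yields one raw text.
    return [{"text_original": t, "source": s} for s, t in sources]

def clean(records, languages):
    # Normalize whitespace, drop empties, deduplicate on cleaned text.
    seen, out = set(), []
    for r in records:
        text = " ".join(r["text_original"].split())
        if text and text not in seen:
            seen.add(text)
            out.append({**r, "text_clean": text})
    return out

def label(records, label_type):
    # Stand-in for LLM-backed annotation: tag everything neutral.
    return [{**r, f"label_{label_type}": "neutral", "confidence": 0.9}
            for r in records]

def quality_check(records):
    # Flag low-confidence rows for manual review.
    return [{**r, "needs_review": r["confidence"] < 0.7} for r in records]

records = scrape([("Reddit", "Great   app!"), ("News", "Great app!")])
records = clean(records, ["en"])          # whitespace collapse dedupes the pair
records = label(records, "sentiment")
records = quality_check(records)
print(records)  # one deduplicated, labeled record
```

The real pipeline would then run `merge` across per-language outputs and `export` to the requested formats.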
3. Supported Languages
| Language | Code | Script | Sources |
|---|---|---|---|
| English | en | Latin | Reddit, YouTube, Play Store, News |
| Hindi | hi | Devanagari | Reddit, YouTube, Play Store, News |
| Gujarati | gu | Gujarati | Reddit, YouTube, Play Store, News |
| Marathi | mr | Devanagari | Reddit, YouTube, Play Store, News |
| Tamil | ta | Tamil | Reddit, YouTube, Play Store, News |
4. Label Types
| Type | Description | Example Labels |
|---|---|---|
| sentiment | Classifies polarity | positive / negative / neutral |
| topic | Classifies theme | politics / sports / entertainment |
| ner | Classifies entity type | PERSON / LOCATION / ORGANIZATION |
| all | Runs all labelers | sentiment + topic + ner |
5. Export Formats
| Format | Best For | Notes |
|---|---|---|
| CSV | Analytics and spreadsheet workflows | UTF-8 SIG for Excel compatibility |
| JSON | APIs and data interchange | Indented and Unicode-safe |
| Excel | Business reporting | Includes Dataset and Quality_Info sheets |
| Parquet | ML pipelines and lakehouses | Strict numeric/boolean dtypes |
| HuggingFace | Model training teams | Saved as Dataset folder with Arrow data |
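The CSV and JSON notes above can be demonstrated with the standard library: CSV written with a UTF-8 BOM (`utf-8-sig`) so Excel detects the encoding, and JSON indented with `ensure_ascii=False` so Devanagari or Tamil text stays readable. This is a sketch of the conventions, not the exporter's actual code; the sample row is invented.

```python
import csv
import io
import json

rows = [{"id": 1, "text_clean": "बहुत अच्छा ऐप", "language": "hi"}]

# CSV: encode with a BOM prefix so Excel auto-detects UTF-8.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_bytes = csv_buf.getvalue().encode("utf-8-sig")

# JSON: indented and Unicode-safe (no \uXXXX escapes).
json_text = json.dumps(rows, indent=2, ensure_ascii=False)

print(csv_bytes[:3])       # b'\xef\xbb\xbf' (the UTF-8 BOM)
print("बहुत" in json_text)  # True: Devanagari kept as-is
```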
6. Data Schema
| Column | Type | Description |
|---|---|---|
| id | int64 | Sequential row id |
| text_original | string | Raw source text |
| text_clean | string | Normalized text for labeling |
| language | string | Language code |
| language_name | string | Human language name |
| script | string | Writing script |
| domain | string | Selected domain |
| source | string | Source provider |
| source_url | string or null | Direct content URL |
| source_subreddit | string or null | Subreddit name if applicable |
| label_sentiment | string or null | Sentiment label |
| label_topic | string or null | Topic label |
| label_ner | string or null | NER label |
| confidence | float64 | Label confidence |
| confidence_reason | string | Model reasoning summary |
| llm_used | string | Label provider |
| needs_review | boolean | Manual review flag |
| app_id | string or null | Play Store app id |
| star_rating | int or null | App rating value |
| rating_hint | string or null | Rating context |
| created_at | ISO datetime | Export creation timestamp |
| job_id | string | Dataset job id |
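A hypothetical record matching the schema reads as follows. Every value here is invented for illustration; fields that do not apply to a given source (for example, the Play Store columns on a Reddit row) are null.

```python
# One invented record matching the schema above. Columns that do not
# apply to this source (Play Store fields on a Reddit row) are None.
record = {
    "id": 0,
    "text_original": "Loved the update!!",
    "text_clean": "loved the update",
    "language": "en",
    "language_name": "English",
    "script": "Latin",
    "domain": "apps",
    "source": "Reddit",
    "source_url": "https://example.com/post",
    "source_subreddit": "androidapps",
    "label_sentiment": "positive",
    "label_topic": None,
    "label_ner": None,
    "confidence": 0.93,
    "confidence_reason": "Clear positive phrasing",
    "llm_used": "example-llm",
    "needs_review": False,
    "app_id": None,
    "star_rating": None,
    "rating_hint": None,
    "created_at": "2024-01-01T00:00:00Z",
    "job_id": "job-example",
}
```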
7. Quality Scoring
Quality is derived from labeling confidence:
`quality_score = average(confidence) * 100`
- >= 90: Excellent
- 80-89: Good
- 70-79: Acceptable
- < 70: Review recommended
When English is included, it serves as a benchmark language for comparing relative quality across the other language outputs.
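The scoring rule and bands above translate directly into code. A minimal sketch, using the thresholds exactly as listed:

```python
# Average confidence scaled to 0-100, then bucketed into the bands
# defined above.
def quality_band(confidences):
    score = sum(confidences) / len(confidences) * 100
    if score >= 90:
        return score, "Excellent"
    if score >= 80:
        return score, "Good"
    if score >= 70:
        return score, "Acceptable"
    return score, "Review recommended"

print(quality_band([0.95, 0.88, 0.91])[1])  # Excellent
```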
8. API Reference
| Method | Path | Description | Example Response |
|---|---|---|---|
| POST | /api/generate-dataset | Submit dataset request and get job id | {"job_id":"...","estimated_minutes":4,"message":"Dataset generation queued successfully"} |
| GET | /api/job-status/{job_id} | Fetch live progress for a job | {"status":"labeling","progress_percent":62,...} |
| GET | /api/quality-report/{job_id} | Get final quality report | {"overall_quality_score":90.1,...} |
| GET | /api/download/{job_id}/{format} | Download generated export file | Binary file response |
| GET | /api/health | Service health and dependency checks | {"status":"ok","version":"1.0.0",...} |
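A typical client submits a job, polls its status, then fetches the quality report. The sketch below shows that flow using the endpoints and response fields from the table; the base URL is a placeholder, and the HTTP transport is injected as a `fetch` callable (shown here with a fake that replies like the documented examples) rather than a real network call.

```python
import json
import time

BASE = "https://api.example.com"  # placeholder host, not the real API

def generate_and_wait(fetch, request_body, poll_seconds=0):
    # Submit the dataset request and read back the job id.
    job = json.loads(fetch("POST", f"{BASE}/api/generate-dataset",
                           json.dumps(request_body)))
    job_id = job["job_id"]
    # Poll job status until progress reaches 100%.
    while True:
        status = json.loads(fetch("GET", f"{BASE}/api/job-status/{job_id}"))
        if status.get("progress_percent", 0) >= 100:
            break
        time.sleep(poll_seconds)
    # Fetch the final quality report.
    return json.loads(fetch("GET", f"{BASE}/api/quality-report/{job_id}"))

# Fake transport for illustration, echoing the documented examples.
def fake_fetch(method, url, body=None):
    if url.endswith("/api/generate-dataset"):
        return '{"job_id": "abc123", "estimated_minutes": 4}'
    if "/api/job-status/" in url:
        return '{"status": "completed", "progress_percent": 100}'
    return '{"overall_quality_score": 90.1}'

report = generate_and_wait(fake_fetch, {"languages": ["en", "hi"]})
print(report["overall_quality_score"])  # 90.1
```

In a real client, `fetch` would wrap an HTTP library and the download endpoint (`/api/download/{job_id}/{format}`) would be saved as a binary file rather than parsed as JSON.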