Artha AI Documentation
Everything you need to run multilingual dataset generation end to end.
1. What is Artha AI
Artha AI is a multilingual dataset generation platform that collects, cleans, labels, quality-checks, and exports training-ready data across Indian and English languages for modern ML teams.
2. How It Works
1. `scrape`: source collection across configured connectors.
2. `clean`: language filtering, normalization, deduplication.
3. `label`: LLM-backed annotation by the selected label type.
4. `quality_check`: confidence scoring and distribution checks.
5. `merge`: per-language outputs combined into a unified dataset.
6. `export`: parallel export to requested formats plus metadata.
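The six stages above can be sketched as a chain of record transformations. This is a minimal illustration, not Artha AI internals: the function bodies are simplified stand-ins (the real `label` stage calls an LLM, for example), and the sample inputs are invented.

```python
# Illustrative sketch of the pipeline stages chained over a list of
# record dicts. Stage internals are simplified stand-ins.

def scrape(sources):
    # Pretend each connector yields one raw text.
    return [{"text_original": t, "source": s} for s, t in sources]

def clean(records, languages):
    # Normalize whitespace, drop empties, deduplicate on cleaned text.
    seen, out = set(), []
    for r in records:
        text = " ".join(r["text_original"].split())
        if text and text not in seen:
            seen.add(text)
            out.append({**r, "text_clean": text})
    return out

def label(records, label_type):
    # Stand-in for LLM-backed annotation: tag everything neutral.
    return [{**r, f"label_{label_type}": "neutral", "confidence": 0.9}
            for r in records]

def quality_check(records):
    # Flag low-confidence rows for manual review.
    return [{**r, "needs_review": r["confidence"] < 0.7} for r in records]

records = scrape([("Reddit", "Great   app!"), ("News", "Great app!")])
records = clean(records, ["en"])          # whitespace collapse dedupes the pair
records = label(records, "sentiment")
records = quality_check(records)
print(records)  # one deduplicated, labeled record
```

The real pipeline would then run `merge` across per-language outputs and `export` to the requested formats.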
3. Supported Languages
| Language | Code | Script | Sources |
|---|---|---|---|
| English | en | Latin | Reddit, YouTube, Play Store, News |
| Hindi | hi | Devanagari | Reddit, YouTube, Play Store, News |
| Gujarati | gu | Gujarati | Reddit, YouTube, Play Store, News |
| Marathi | mr | Devanagari | Reddit, YouTube, Play Store, News |
| Tamil | ta | Tamil | Reddit, YouTube, Play Store, News |
4. Label Types
| Type | Description | Example Labels |
|---|---|---|
| sentiment | Classifies polarity | positive / negative / neutral |
| topic | Classifies theme | politics / sports / entertainment |
| ner | Classifies entity type | PERSON / LOCATION / ORGANIZATION |
| all | Runs all labelers | sentiment + topic + ner |
5. Export Formats
| Format | Best For | Notes |
|---|---|---|
| CSV | Analytics and spreadsheet workflows | UTF-8 SIG for Excel compatibility |
| JSON | APIs and data interchange | Indented and Unicode-safe |
| Excel | Business reporting | Includes Dataset and Quality_Info sheets |
| Parquet | ML pipelines and lakehouses | Strict numeric/boolean dtypes |
| HuggingFace | Model training teams | Saved as Dataset folder with Arrow data |
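The CSV and JSON notes above can be demonstrated with the standard library: CSV written with a UTF-8 BOM (`utf-8-sig`) so Excel detects the encoding, and JSON indented with `ensure_ascii=False` so Devanagari or Tamil text stays readable. This is a sketch of the conventions, not the exporter's actual code; the sample row is invented.

```python
import csv
import io
import json

rows = [{"id": 1, "text_clean": "बहुत अच्छा ऐप", "language": "hi"}]

# CSV: encode with a BOM prefix so Excel auto-detects UTF-8.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_bytes = csv_buf.getvalue().encode("utf-8-sig")

# JSON: indented and Unicode-safe (no \uXXXX escapes).
json_text = json.dumps(rows, indent=2, ensure_ascii=False)

print(csv_bytes[:3])       # b'\xef\xbb\xbf' (the UTF-8 BOM)
print("बहुत" in json_text)  # True: Devanagari kept as-is
```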
6. Data Schema
| Column | Type | Description |
|---|---|---|
| id | int64 | Sequential row id |
| text_original | string | Raw source text |
| text_clean | string | Normalized text for labeling |
| language | string | Language code |
| language_name | string | Human language name |
| script | string | Writing script |
| domain | string | Selected domain |
| source | string | Source provider |
| source_url | string or null | Direct content URL |
| source_subreddit | string or null | Subreddit name if applicable |
| label_sentiment | string or null | Sentiment label |
| label_topic | string or null | Topic label |
| label_ner | string or null | NER label |
| confidence | float64 | Label confidence |
| confidence_reason | string | Model reasoning summary |
| llm_used | string | Label provider |
| needs_review | boolean | Manual review flag |
| app_id | string or null | Play Store app id |
| star_rating | int or null | App rating value |
| rating_hint | string or null | Rating context |
| created_at | ISO datetime | Export creation timestamp |
| job_id | string | Dataset job id |
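A hypothetical record matching the schema reads as follows. Every value here is invented for illustration; fields that do not apply to a given source (for example, the Play Store columns on a Reddit row) are null.

```python
# One invented record matching the schema above. Columns that do not
# apply to this source (Play Store fields on a Reddit row) are None.
record = {
    "id": 0,
    "text_original": "Loved the update!!",
    "text_clean": "loved the update",
    "language": "en",
    "language_name": "English",
    "script": "Latin",
    "domain": "apps",
    "source": "Reddit",
    "source_url": "https://example.com/post",
    "source_subreddit": "androidapps",
    "label_sentiment": "positive",
    "label_topic": None,
    "label_ner": None,
    "confidence": 0.93,
    "confidence_reason": "Clear positive phrasing",
    "llm_used": "example-llm",
    "needs_review": False,
    "app_id": None,
    "star_rating": None,
    "rating_hint": None,
    "created_at": "2024-01-01T00:00:00Z",
    "job_id": "job-example",
}
```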
7. Quality Scoring
Quality is derived from labeling confidence:
`quality_score = average(confidence) * 100`
- >= 90: Excellent
- 80-89: Good
- 70-79: Acceptable
- < 70: Review recommended
When English is included, it serves as a benchmark language for comparing relative quality across the other language outputs.
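The scoring rule and bands above translate directly into code. A minimal sketch, using the thresholds exactly as listed:

```python
# Average confidence scaled to 0-100, then bucketed into the bands
# defined above.
def quality_band(confidences):
    score = sum(confidences) / len(confidences) * 100
    if score >= 90:
        return score, "Excellent"
    if score >= 80:
        return score, "Good"
    if score >= 70:
        return score, "Acceptable"
    return score, "Review recommended"

print(quality_band([0.95, 0.88, 0.91])[1])  # Excellent
```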
8. API Reference
| Method | Path | Description | Example Response |
|---|---|---|---|
| POST | /api/generate-dataset | Submit dataset request and get job id | {"job_id":"...","estimated_minutes":4,"message":"Dataset generation queued successfully"} |
| GET | /api/job-status/{job_id} | Fetch live progress for a job | {"status":"labeling","progress_percent":62,...} |
| GET | /api/quality-report/{job_id} | Get final quality report | {"overall_quality_score":90.1,...} |
| GET | /api/download/{job_id}/{format} | Download generated export file | Binary file response |
| GET | /api/health | Service health and dependency checks | {"status":"ok","version":"1.0.0",...} |
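A typical client submits a job, polls its status, then fetches the quality report. The sketch below shows that flow using the endpoints and response fields from the table; the base URL is a placeholder, and the HTTP transport is injected as a `fetch` callable (shown here with a fake that replies like the documented examples) rather than a real network call.

```python
import json
import time

BASE = "https://api.example.com"  # placeholder host, not the real API

def generate_and_wait(fetch, request_body, poll_seconds=0):
    # Submit the dataset request and read back the job id.
    job = json.loads(fetch("POST", f"{BASE}/api/generate-dataset",
                           json.dumps(request_body)))
    job_id = job["job_id"]
    # Poll job status until progress reaches 100%.
    while True:
        status = json.loads(fetch("GET", f"{BASE}/api/job-status/{job_id}"))
        if status.get("progress_percent", 0) >= 100:
            break
        time.sleep(poll_seconds)
    # Fetch the final quality report.
    return json.loads(fetch("GET", f"{BASE}/api/quality-report/{job_id}"))

# Fake transport for illustration, echoing the documented examples.
def fake_fetch(method, url, body=None):
    if url.endswith("/api/generate-dataset"):
        return '{"job_id": "abc123", "estimated_minutes": 4}'
    if "/api/job-status/" in url:
        return '{"status": "completed", "progress_percent": 100}'
    return '{"overall_quality_score": 90.1}'

report = generate_and_wait(fake_fetch, {"languages": ["en", "hi"]})
print(report["overall_quality_score"])  # 90.1
```

In a real client, `fetch` would wrap an HTTP library and the download endpoint (`/api/download/{job_id}/{format}`) would be saved as a binary file rather than parsed as JSON.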