# Datasets

## 📚 Overview

This page collects some of the best datasets for learning and developing machine learning models. The datasets span domains such as computer vision, natural language processing, time series, and tabular data, and suit a range of skill levels.

## 🌐 General Datasets

### **Popular ML Datasets**

Datasets commonly used for learning ML:

#### **1. UCI Machine Learning Repository**

* **Source**: [UCI ML Repository](https://archive.ics.uci.edu/ml/)
* **Content**: 500+ datasets
* **Domains**: Classification, regression, clustering
* **Size**: Small to medium (KB to MB)
* **Format**: CSV, ARFF
* **Best Datasets**:
  * [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/iris) - Classification classic
  * [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) - Regression
  * [Breast Cancer](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+\(diagnostic\)) - Medical classification
* **Pros**: Well-documented, clean data, educational value
* **Cons**: Some datasets outdated, limited size
* **Use Cases**: Learning ML, algorithm testing, education
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
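
Many UCI datasets ship as plain CSV without a header row. The sketch below shows one way to read Iris-style records (four numeric measurements followed by a class label) using only the standard library. The inline sample rows and the `load_rows` helper are illustrative assumptions, not part of the repository.

```python
import csv
import io

# Two sample rows in the Iris layout: sepal length, sepal width,
# petal length, petal width, class label. The trailing blank line
# mimics files that end with empty lines.
SAMPLE = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"


def load_rows(handle):
    """Split header-less CSV rows into numeric features and string labels."""
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


features, labels = load_rows(io.StringIO(SAMPLE))
print(features, labels)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.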

#### **2. Kaggle Datasets**

* **Source**: [Kaggle Datasets](https://www.kaggle.com/datasets)
* **Content**: 100,000+ datasets
* **Domains**: All ML domains
* **Size**: Small to very large (KB to GB)
* **Format**: CSV, JSON, images, audio
* **Best Datasets**:
  * [Titanic](https://www.kaggle.com/c/titanic/data) - Beginner classification
  * [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) - Regression
  * [MNIST](https://www.kaggle.com/c/digit-recognizer) - Image classification
* **Pros**: Large variety, active community, competitions
* **Cons**: Variable quality, some downloads require an account
* **Use Cases**: Competitions, learning, real-world projects
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **3. Google Dataset Search**

* **Source**: [Google Dataset Search](https://datasetsearch.research.google.com/)
* **Content**: Millions of datasets
* **Domains**: All domains
* **Size**: Variable
* **Format**: Various formats
* **Best Features**:
  * Comprehensive search
  * Multiple sources
  * Metadata information
  * Free access
* **Pros**: Large collection, good search, multiple sources
* **Cons**: Variable quality, some datasets have restricted access
* **Use Cases**: Research, discovery, exploration
* **Rating**: ⭐⭐⭐⭐ (4/5)

#### **4. AWS Open Data Registry**

* **Source**: [AWS Open Data](https://registry.opendata.aws/)
* **Content**: 100+ datasets
* **Domains**: Scientific, government, research
* **Size**: Large to very large (GB to TB)
* **Format**: Various formats
* **Best Datasets**:
  * [Common Crawl](https://registry.opendata.aws/commoncrawl/) - Web data
  * [OpenStreetMap](https://registry.opendata.aws/osm/) - Geographic data
  * [1000 Genomes](https://registry.opendata.aws/1000-genomes/) - Genomics
* **Pros**: High quality, well-maintained, cloud-optimized
* **Cons**: Large size, requires AWS knowledge
* **Use Cases**: Research, production, large-scale analysis
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

## 🖼️ Computer Vision Datasets

### **Image Classification**

Datasets for image classification:

#### **5. MNIST**

* **Source**: [MNIST](http://yann.lecun.com/exdb/mnist/)
* **Content**: 70,000 handwritten digits
* **Size**: 11 MB
* **Format**: Binary
* **Classes**: 10 (digits 0-9)
* **Best Features**:
  * Clean, well-structured
  * Perfect for beginners
  * Fast training
  * Good documentation
* **Use Cases**: Learning computer vision, algorithm testing
* **Difficulty**: Beginner
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
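
The "Binary" format above refers to the IDX layout used by the original MNIST files: a big-endian header (two zero bytes, a type code, the number of dimensions, then one 32-bit size per dimension) followed by raw values. The following is a minimal sketch of a reader for unsigned-byte IDX data, assuming the file has already been decompressed. `parse_idx` is a hypothetical helper, and the sample buffer is synthetic rather than real MNIST data.

```python
import struct


def parse_idx(data: bytes):
    """Parse an unsigned-byte IDX buffer into (dims, flat_values)."""
    # Header: 2 zero bytes, 1 type byte (0x08 = unsigned byte), 1 dimension count.
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # One big-endian 32-bit size per dimension follows the header.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    offset = 4 + 4 * ndim
    count = 1
    for dim in dims:
        count *= dim
    values = data[offset:offset + count]
    if len(values) != count:
        raise ValueError("truncated IDX buffer")
    return dims, values


# Synthetic stand-in for an image file: two 2x2 "images" with pixel values 0..7.
sample = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, pixels = parse_idx(sample)
print(dims, list(pixels))
```

In practice most frameworks provide a ready-made MNIST loader, so a hand-written parser like this is mainly useful for understanding the file layout.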

#### **6. CIFAR-10**

* **Source**: [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
* **Content**: 60,000 color images
* **Size**: 170 MB
* **Format**: Binary
* **Classes**: 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
* **Best Features**:
  * Color images
  * Real-world objects
  * Good for CNNs
  * Balanced classes
* **Use Cases**: Learning CNNs, image classification
* **Difficulty**: Beginner to Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
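
In the CIFAR-10 batch files each image is stored as a flat row of 3,072 bytes: the red plane first, then green, then blue, each in row-major order over a 32x32 grid. The sketch below shows how a single pixel can be recovered from such a row. The `pixel_rgb` helper and the tiny 2x2 example are illustrative assumptions chosen so the result is easy to check by hand.

```python
def pixel_rgb(flat, row, col, side=32):
    """Return the (R, G, B) values at (row, col) from a plane-ordered flat image."""
    plane = side * side          # number of values per colour plane
    offset = row * side + col    # row-major position inside one plane
    return flat[offset], flat[plane + offset], flat[2 * plane + offset]


# Synthetic 2x2 image: red plane 0..3, green plane 10..13, blue plane 20..23.
flat = [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
print(pixel_rgb(flat, row=1, col=0, side=2))
```

With real data, `side` stays at its default of 32 and `flat` is one row of a batch's data array.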

#### **7. CIFAR-100**

* **Source**: [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)
* **Content**: 60,000 color images
* **Size**: 170 MB
* **Format**: Binary
* **Classes**: 100 (20 superclasses, 100 fine classes)
* **Best Features**:
  * More challenging than CIFAR-10
  * Hierarchical structure
  * Good for transfer learning
  * Real-world objects
* **Use Cases**: Advanced image classification, transfer learning
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **8. ImageNet**

* **Source**: [ImageNet](http://www.image-net.org/)
* **Content**: 14+ million images
* **Size**: 150+ GB
* **Format**: Various
* **Classes**: 1,000 in the ILSVRC subset (the full hierarchy covers more than 20,000 categories)
* **Best Features**:
  * Large-scale dataset
  * High-quality images
  * Industry standard
  * Good for pre-training
* **Use Cases**: Research, pre-training models, benchmarking
* **Difficulty**: Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

### **Object Detection & Segmentation**

Datasets for object detection and segmentation:

#### **9. COCO (Common Objects in Context)**

* **Source**: [COCO](https://cocodataset.org/)
* **Content**: 330K+ images, 2.5M+ instances
* **Size**: 25+ GB
* **Format**: JSON annotations, images
* **Tasks**: Object detection, segmentation, captioning
* **Best Features**:
  * Industry standard
  * Multiple tasks
  * High-quality annotations
  * Active community
* **Use Cases**: Object detection, segmentation research
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
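
COCO detection annotations are a single JSON document with `images`, `annotations`, and `categories` lists, where each bounding box is stored as `[x, y, width, height]`. A minimal sketch of grouping boxes by image file is shown below. The inline sample and the `boxes_by_image` helper are assumptions for illustration, and real annotation files carry many more fields.

```python
import json
from collections import defaultdict

# Tiny hand-written sample in the COCO detection layout (not real COCO data).
SAMPLE = json.dumps({
    "images": [{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 17, "name": "cat"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 18, "bbox": [10, 20, 30, 40]},
        {"id": 101, "image_id": 1, "category_id": 17, "bbox": [0, 0, 5, 5]},
    ],
})


def boxes_by_image(coco: dict) -> dict:
    """Map each file name to a list of (category_name, (x1, y1, x2, y2)) tuples."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    files = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        # Convert [x, y, width, height] to corner coordinates.
        corners = (x, y, x + w, y + h)
        grouped[files[ann["image_id"]]].append((names[ann["category_id"]], corners))
    return dict(grouped)


result = boxes_by_image(json.loads(SAMPLE))
print(result)
```

For full-featured access, including segmentation masks, the official `pycocotools` package is the usual choice.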

#### **10. Pascal VOC**

* **Source**: [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/)
* **Content**: 20,000+ images
* **Size**: 2 GB
* **Format**: XML annotations, images
* **Classes**: 20 object classes
* **Best Features**:
  * Well-established benchmark
  * Good documentation
  * Multiple tasks
  * Educational value
* **Use Cases**: Learning object detection, benchmarking
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
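
Pascal VOC stores one XML file per image, with an `<object>` element for each labeled instance and a `<bndbox>` holding corner coordinates. The sketch below reads such a file with the standard library. The sample XML is hand-written for illustration, and `parse_voc` is a hypothetical helper that ignores optional fields such as `difficult` and `truncated`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the VOC annotation layout (not a real VOC file).
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>370</ymax></bndbox>
  </object>
</annotation>
""".strip()


def parse_voc(xml_text: str):
    """Return (filename, [(class_name, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Go through float first so values written like "48.0" are also accepted.
        coords = tuple(
            int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"), coords))
    return root.findtext("filename"), objects


filename, objects = parse_voc(SAMPLE_XML)
print(filename, objects)
```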

#### **11. YOLO Datasets**

* **Source**: [Yolo_mark](https://github.com/AlexeyAB/YOLO_mark), a labeling tool for producing YOLO-format annotations
* **Content**: Various annotated datasets
* **Size**: Variable
* **Format**: YOLO format
* **Best Features**:
  * YOLO-specific format
  * Multiple domains
  * Community contributions
  * Good for YOLO training
* **Use Cases**: YOLO training, custom detection
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐ (4/5)
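
The YOLO label format uses one text line per object: a class index followed by the box center and size, all normalized to the image dimensions. A minimal sketch of converting one label line back to pixel corner coordinates follows. `yolo_to_pixels` is a hypothetical helper, and the image size in the example is made up.

```python
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Convert 'class x_center y_center width height' (normalized) to pixel corners."""
    cls, xc, yc, w, h = line.split()
    # Scale normalized values back to pixels.
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    # Move from center/size to (x_min, y_min, x_max, y_max).
    return int(cls), (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


# A box centered in a 100x200 image, covering half of each dimension.
print(yolo_to_pixels("0 0.5 0.5 0.5 0.5", img_w=100, img_h=200))
```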

### **Face Recognition & Analysis**

Datasets for face analysis:

#### **12. LFW (Labeled Faces in the Wild)**

* **Source**: [LFW](http://vis-www.cs.umass.edu/lfw/)
* **Content**: 13,000+ face images
* **Size**: 200 MB
* **Format**: Images, text annotations
* **Tasks**: Face recognition, verification
* **Best Features**:
  * Real-world conditions
  * Good for face recognition
  * Well-established benchmark
  * Free access
* **Use Cases**: Face recognition, verification research
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **13. CelebA (Celebrity Faces)**

* **Source**: [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
* **Content**: 200K+ celebrity face images
* **Size**: 2 GB
* **Format**: Images, attribute annotations
* **Attributes**: 40 binary attributes
* **Best Features**:
  * Large-scale dataset
  * Rich annotations
  * Good for GANs
  * Multiple applications
* **Use Cases**: Face analysis, GAN training, attribute learning
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

## 📝 Natural Language Processing Datasets

### **Text Classification**

Datasets for text classification:

#### **14. AG News**

* **Source**: [AG News](https://huggingface.co/datasets/ag_news)
* **Content**: 120K+ news articles
* **Size**: 12 MB
* **Format**: Text
* **Classes**: 4 (World, Sports, Business, Sci/Tech)
* **Best Features**:
  * Clean text data
  * Balanced classes
  * Good for text classification
  * Easy to use
* **Use Cases**: Text classification, NLP learning
* **Difficulty**: Beginner to Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **15. IMDB Reviews**

* **Source**: [IMDB Reviews](https://huggingface.co/datasets/imdb)
* **Content**: 50K+ movie reviews
* **Size**: 80 MB
* **Format**: Text
* **Classes**: 2 (Positive, Negative)
* **Best Features**:
  * Binary sentiment labels
  * Real-world text
  * Good for sentiment analysis
  * Well-balanced
* **Use Cases**: Sentiment analysis, text classification
* **Difficulty**: Beginner to Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **16. 20 Newsgroups**

* **Source**: [20 Newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset)
* **Content**: About 18,000 newsgroup posts
* **Size**: 20 MB
* **Format**: Text
* **Classes**: 20 newsgroups
* **Best Features**:
  * Topic classification
  * Clean text
  * Good for text analysis
  * Educational value
* **Use Cases**: Topic classification, text analysis
* **Difficulty**: Beginner to Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

### **Language Modeling & Generation**

Datasets for language modeling:

#### **17. WikiText**

* **Source**: [WikiText](https://huggingface.co/datasets/wikitext)
* **Content**: Wikipedia articles
* **Size**: 100+ MB
* **Format**: Text
* **Tasks**: Language modeling, text generation
* **Best Features**:
  * High-quality text
  * Good for language models
  * Multiple versions
  * Clean formatting
* **Use Cases**: Language modeling, text generation
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **18. Common Crawl**

* **Source**: [Common Crawl](https://commoncrawl.org/)
* **Content**: Billions of web pages
* **Size**: TBs
* **Format**: WARC, text
* **Tasks**: Large-scale language modeling
* **Best Features**:
  * Massive scale
  * Real-world text
  * Multiple languages
  * Regular updates
* **Use Cases**: Large language models, research
* **Difficulty**: Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **19. BookCorpus**

* **Source**: [BookCorpus](https://huggingface.co/datasets/bookcorpus)
* **Content**: 11K+ free books
* **Size**: 1+ GB
* **Format**: Text
* **Tasks**: Language modeling, text generation
* **Best Features**:
  * High-quality text
  * Long-form content
  * Good for language models
  * Clean data
* **Use Cases**: Language modeling, text generation
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

### **Machine Translation**

Datasets for machine translation:

#### **20. WMT (Workshop on Machine Translation)**

* **Source**: [WMT](http://www.statmt.org/wmt20/)
* **Content**: Parallel text in multiple languages
* **Size**: GBs
* **Format**: Parallel text
* **Languages**: Multiple language pairs
* **Best Features**:
  * Industry standard
  * Multiple languages
  * High quality
  * Regular competitions
* **Use Cases**: Machine translation, multilingual NLP
* **Difficulty**: Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **21. OPUS (Open Parallel Corpus)**

* **Source**: [OPUS](http://opus.nlpl.eu/)
* **Content**: 1000+ parallel corpora
* **Size**: Variable
* **Format**: Parallel text
* **Languages**: 100+ languages
* **Best Features**:
  * Large collection
  * Multiple languages
  * Open access
  * Good documentation
* **Use Cases**: Machine translation, multilingual research
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

## 📈 Time Series Datasets

### **Financial Data**

Datasets for financial analysis:

#### **22. Yahoo Finance**

* **Source**: [Yahoo Finance](https://finance.yahoo.com/)
* **Content**: Stock prices, financial data
* **Size**: Variable
* **Format**: CSV, JSON
* **Features**: OHLCV, indicators
* **Best Features**:
  * Real-time data
  * Historical data
  * Multiple markets
  * Free access
* **Use Cases**: Financial analysis, time series forecasting
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐ (4/5)

#### **23. Alpha Vantage**

* **Source**: [Alpha Vantage](https://www.alphavantage.co/)
* **Content**: Financial market data
* **Size**: Variable
* **Format**: JSON, CSV
* **Features**: Real-time, historical
* **Best Features**:
  * Real-time data
  * Multiple data types
  * API access
  * Good documentation
* **Use Cases**: Financial ML, algorithmic trading
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐ (4/5)

### **Sensor & IoT Data**

Datasets for sensor and IoT data:

#### **24. UCI HAR (Human Activity Recognition)**

* **Source**: [UCI HAR](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones)
* **Content**: Smartphone sensor data
* **Size**: 60 MB
* **Format**: Text files (whitespace-delimited)
* **Classes**: 6 activities
* **Best Features**:
  * Real sensor data
  * Good for time series
  * Clean structure
  * Educational value
* **Use Cases**: Time series classification, sensor analysis
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)
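
Time series classification on sensor data usually works on fixed-length, overlapping windows rather than on individual readings. The sketch below shows the basic windowing step. The window size and step are illustrative values, not the dataset's actual preprocessing settings, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(signal, size, step):
    """Cut a sequence into fixed-length windows, advancing by `step` each time."""
    # Trailing readings that do not fill a complete window are dropped.
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


readings = list(range(10))  # stand-in for one sensor channel
windows = sliding_windows(readings, size=4, step=2)  # step = size / 2 gives 50% overlap
print(windows)
```

Each window can then be turned into features (mean, variance, spectral energy) or fed directly to a sequence model.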

#### **25. NASA Bearing Dataset**

* **Source**: [NASA Bearing](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/)
* **Content**: Bearing vibration data
* **Size**: 100+ MB
* **Format**: Text
* **Tasks**: Predictive maintenance
* **Best Features**:
  * Real industrial data
  * Good for predictive maintenance
  * Well-documented
  * Free access
* **Use Cases**: Predictive maintenance, time series analysis
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

## 🎵 Audio & Speech Datasets

### **Speech Recognition**

Datasets for speech recognition:

#### **26. LibriSpeech**

* **Source**: [LibriSpeech](http://www.openslr.org/12/)
* **Content**: About 1,000 hours of read English speech
* **Size**: About 60 GB for the full corpus (the `train-clean-100` subset is about 6 GB)
* **Format**: Audio, transcriptions
* **Tasks**: Speech recognition, TTS
* **Best Features**:
  * High-quality audio
  * Good transcriptions
  * Multiple speakers
  * Industry standard
* **Use Cases**: Speech recognition, TTS research
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

#### **27. Common Voice**

* **Source**: [Common Voice](https://commonvoice.mozilla.org/)
* **Content**: Crowdsourced speech data
* **Size**: GBs
* **Format**: Audio, transcriptions
* **Languages**: 100+ languages
* **Best Features**:
  * Multiple languages
  * Crowdsourced
  * Open access
  * Regular updates
* **Use Cases**: Multilingual speech recognition
* **Difficulty**: Intermediate to Advanced
* **Rating**: ⭐⭐⭐⭐⭐ (5/5)

### **Music & Audio**

Datasets for music analysis:

#### **28. GTZAN Genre Collection**

* **Source**: [GTZAN](http://marsyas.info/download/data_sets/)
* **Content**: 1,000 audio clips (30 seconds each)
* **Size**: 1+ GB
* **Format**: Audio
* **Genres**: 10 music genres
* **Best Features**:
  * Well-organized
  * Good for music analysis
  * Balanced genres
  * Educational value
* **Use Cases**: Music genre classification, audio analysis
* **Difficulty**: Intermediate
* **Rating**: ⭐⭐⭐⭐ (4/5)

## 🚀 Getting Started

### **For Beginners**

1. **Start with**: MNIST, Iris, Titanic
2. **Then**: CIFAR-10, AG News
3. **Finally**: Other UCI datasets

### **For Intermediate Learners**

1. **Start with**: CIFAR-100, COCO, IMDB
2. **Then**: LibriSpeech, HAR
3. **Finally**: Domain-specific datasets

### **For Advanced Learners**

1. **Start with**: ImageNet, Common Crawl
2. **Then**: Large-scale datasets
3. **Finally**: Custom dataset creation

## 💡 Best Practices

### **Dataset Selection**

1. **Match your goal**: Choose datasets relevant to your task
2. **Consider size**: Start small, scale up gradually
3. **Check quality**: Look for clean, well-documented data
4. **Verify licensing**: Ensure you can use the data
5. **Assess difficulty**: Match dataset complexity to your level

### **Data Preparation**

1. **Clean data**: Handle missing values, outliers
2. **Preprocess**: Normalize, encode, scale as needed
3. **Split properly**: Train/validation/test splits
4. **Augment if needed**: Data augmentation for small datasets
5. **Document process**: Keep track of preprocessing steps
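
The splitting and normalization steps above interact: scaling statistics should be computed on the training split only and then reused for validation and test data, otherwise information leaks from the held-out sets. A minimal standard-library sketch follows. The 70/15/15 ratio, the fixed seed, and the helper names are assumptions for illustration.

```python
import random
import statistics


def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and return (train, val, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def fit_scaler(values):
    """Compute mean and standard deviation; fall back to 1.0 for constant data."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0
    return mean, std


def scale(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


feature = [float(v) for v in range(20)]  # stand-in for one numeric column
train_idx, val_idx, test_idx = split_indices(len(feature))

# Fit on the training split only, then apply the same statistics everywhere.
mean, std = fit_scaler([feature[i] for i in train_idx])
train_scaled = scale([feature[i] for i in train_idx], mean, std)
test_scaled = scale([feature[i] for i in test_idx], mean, std)
print(len(train_idx), len(val_idx), len(test_idx))
```

For time series data a random shuffle is usually inappropriate; a chronological split is the safer default there.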

### **Ethical Considerations**

1. **Privacy**: Respect data privacy and consent
2. **Bias**: Be aware of dataset biases
3. **Representation**: Ensure diverse representation
4. **Transparency**: Document data sources and limitations
5. **Responsibility**: Use data responsibly and ethically

## 🔍 Finding More Datasets

### **Search Strategies**

1. **Use dataset search engines**: Google Dataset Search, Kaggle
2. **Check research papers**: Papers often include dataset links
3. **Join communities**: Reddit, Discord, forums
4. **Follow researchers**: See what datasets they use
5. **Attend conferences**: Learn about new datasets

### **Dataset Repositories**

1. **Academic**: UCI, Stanford, MIT
2. **Industry**: Google, Microsoft, Facebook
3. **Government**: Data.gov, Eurostat
4. **Research**: Papers With Code, arXiv
5. **Community**: Kaggle, GitHub, Hugging Face

***

*Last updated: December 2024*

**Note**: Dataset availability and access may change. Always check official sources for the most up-to-date information and licensing terms.
