# Catatan Seekor: Python ML

## 📚 Overview

Python is the dominant programming language in Machine Learning and Data Science. The Python ecosystem provides a wide range of powerful libraries and frameworks for developing ML models, analyzing data, and building AI applications.

## 🛠️ Core Libraries

### 📊 Data Manipulation & Analysis

#### NumPy

The fundamental library for numerical computing with multidimensional arrays.

```python
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Mathematical operations
arr_squared = arr ** 2
matrix_transpose = matrix.T

# Random numbers
random_array = np.random.randn(1000)
normal_dist = np.random.normal(0, 1, 1000)

# Linear algebra (eigendecomposition requires a square matrix)
square = np.array([[2, 1], [1, 2]])
eigenvalues, eigenvectors = np.linalg.eig(square)
```

**Key Features:**

* Fast array operations
* Broadcasting capabilities
* Linear algebra functions
* Random number generation

**Use Cases:**

* Data preprocessing
* Mathematical computations
* Scientific computing
* Foundation for other ML libraries
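Broadcasting is listed among the key features above but not shown in the snippet; a minimal sketch of how NumPy automatically stretches compatible shapes:

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (3,) row combine into a (3, 3) grid
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

grid = col + row                    # shapes broadcast to (3, 3)
print(grid)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# A scalar broadcasts against any shape
scaled = row * 2.5                  # array([2.5, 5. , 7.5])
```

This is why elementwise operations rarely need explicit loops: the smaller operand is virtually repeated along the missing dimensions.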

#### Pandas

A library for data manipulation and analysis built around tabular data structures.

```python
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Data exploration
print(df.info())
print(df.describe())
print(df.head())

# Data manipulation
df['bonus'] = df['salary'] * 0.1
filtered_df = df[df['age'] > 28]

# Grouping and aggregation
age_groups = df.groupby('age')['salary'].mean()
```

**Key Features:**

* DataFrame and Series data structures
* Data cleaning and preprocessing
* Time series analysis
* SQL-like operations

**Use Cases:**

* Data exploration
* Data cleaning
* Feature engineering
* Data analysis and reporting
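The SQL-like operations mentioned in the key features can be sketched with `merge` and `groupby`; the tables and column names here are illustrative:

```python
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': [10, 10, 20],
})
departments = pd.DataFrame({
    'dept_id': [10, 20],
    'dept_name': ['Engineering', 'Sales'],
})

# SQL-style INNER JOIN on dept_id
joined = employees.merge(departments, on='dept_id', how='inner')

# GROUP BY dept_name, COUNT(*)
headcount = joined.groupby('dept_name')['emp_id'].count()
print(headcount)
```

`how='left'`, `'right'`, and `'outer'` cover the other join types, mirroring their SQL counterparts.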

### 🤖 Machine Learning

#### Scikit-learn

A comprehensive machine learning library offering a wide range of algorithms and tools.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocess data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
```

**Key Features:**

* Supervised and unsupervised learning algorithms
* Model selection and evaluation tools
* Data preprocessing utilities
* Pipeline functionality

**Use Cases:**

* Traditional ML algorithms
* Model evaluation and selection
* Feature engineering
* Production ML pipelines
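The pipeline functionality listed above chains preprocessing and the model into a single estimator, so the scaler is fit only on training data and applied consistently at prediction time. A minimal sketch on the same iris data (classifier choice here is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits both steps in order; predict()/score() apply both
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

A pipeline can also be passed directly to `cross_val_score` or `GridSearchCV`, which prevents preprocessing from leaking test-fold statistics into training.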

### 🧠 Deep Learning

#### TensorFlow

A deep learning library developed by Google, with a focus on production deployment.

```python
import tensorflow as tf
from tensorflow import keras

# Create simple neural network
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train model (X_train/y_train are assumed to be flattened 28x28 images and labels)
history = model.fit(
    X_train, y_train,
    epochs=10,
    validation_data=(X_test, y_test),
    batch_size=32
)

# Save model (native Keras format; pass a .h5 path for the legacy HDF5 format)
model.save('my_model.keras')
```

**Key Features:**

* High-level Keras API
* TensorFlow Serving for production
* TensorBoard for visualization
* Multi-platform support

**Use Cases:**

* Neural network development
* Production ML systems
* Research and experimentation
* Large-scale training

#### PyTorch

A deep learning library developed by Meta (formerly Facebook), with a focus on research and flexibility.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.dropout = nn.Dropout(0.2)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Create model and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop (train_loader is assumed to be a torch.utils.data.DataLoader)
for epoch in range(10):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
```

**Key Features:**

* Dynamic computational graphs
* Python-first approach
* Excellent research support
* Strong community

**Use Cases:**

* Research and experimentation
* Custom model architectures
* Academic projects
* Prototyping
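After training, inference runs in `eval()` mode with gradient tracking disabled. A minimal sketch reusing the `SimpleNN` architecture above (dropout omitted for brevity) with a random batch standing in for real data:

```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

model = SimpleNN()
model.eval()                      # switch dropout/batch-norm to inference behavior

batch = torch.randn(8, 784)       # stand-in for a batch of flattened images
with torch.no_grad():             # no gradient bookkeeping during inference
    logits = model(batch)
    preds = logits.argmax(dim=1)  # predicted class index per sample
```

Skipping `eval()` or `no_grad()` is a common source of inflated memory use and inconsistent predictions at inference time.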

### 📈 Visualization

#### Matplotlib

The foundational plotting library for Python, supporting a wide variety of chart types.

```python
import matplotlib.pyplot as plt
import numpy as np

# Create data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', label='sin(x)', linewidth=2)
plt.plot(x, np.cos(x), 'r--', label='cos(x)', linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.hist(np.random.normal(0, 1, 1000), bins=30, alpha=0.7)
ax1.set_title('Normal Distribution')
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')

ax2.scatter(np.random.randn(100), np.random.randn(100), alpha=0.6)
ax2.set_title('Random Scatter Plot')
ax2.set_xlabel('X')
ax2.set_ylabel('Y')

plt.tight_layout()
plt.show()
```

#### Seaborn

A visualization library built on top of matplotlib, focused on statistical graphics.

```python
import seaborn as sns
import matplotlib.pyplot as plt  # plt.subplots is used below

# Load sample data
iris = sns.load_dataset('iris')

# Create various plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution plot
sns.histplot(data=iris, x='sepal_length', hue='species', ax=axes[0,0])
axes[0,0].set_title('Sepal Length Distribution')

# Box plot
sns.boxplot(data=iris, x='species', y='petal_length', ax=axes[0,1])
axes[0,1].set_title('Petal Length by Species')

# Scatter plot
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', 
                hue='species', ax=axes[1,0])
axes[1,0].set_title('Sepal Length vs Width')

# Correlation heatmap
correlation_matrix = iris.drop('species', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Feature Correlation')

plt.tight_layout()
plt.show()
```

## 🔧 Development Tools

### Jupyter Notebooks

An interactive development environment for data science and ML.

```python
# Cell 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Cell 2: Load data
df = pd.read_csv('data.csv')
print(f"Dataset shape: {df.shape}")
df.head()

# Cell 3: Data exploration
df.info()
df.describe()

# Cell 4: Visualization
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='target_column', bins=30)
plt.title('Target Distribution')
plt.show()

# Cell 5: Model training
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = df.drop('target_column', axis=1)
y = df['target_column']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)
```

### Virtual Environments

Isolate dependencies for each ML project.

```bash
# Create virtual environment
python -m venv ml_env

# Activate (Windows)
ml_env\Scripts\activate

# Activate (Linux/Mac)
source ml_env/bin/activate

# Install packages
pip install numpy pandas scikit-learn tensorflow torch matplotlib seaborn jupyter

# Deactivate
deactivate
```

## 📊 Data Science Workflow

### 1. Data Loading & Exploration

```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('dataset.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")

# Statistical summary
print(df.describe())

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")
```

### 2. Data Preprocessing

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Handle missing values
imputer = SimpleImputer(strategy='mean')
df['numeric_column'] = imputer.fit_transform(df[['numeric_column']])

# Encode categorical variables
le = LabelEncoder()
df['categorical_column'] = le.fit_transform(df['categorical_column'])

# Scale numerical features
scaler = StandardScaler()
df[['feature1', 'feature2']] = scaler.fit_transform(df[['feature1', 'feature2']])
```
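`LabelEncoder` above works for a single column, though it is intended for targets; for feature matrices, scikit-learn's `ColumnTransformer` applies the appropriate step to each column type in one object. A sketch with illustrative column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [40000, 52000, 81000, 90000],
    'city': ['Jakarta', 'Bandung', 'Jakarta', 'Surabaya'],
})

# Scale the numeric columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['city']),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 5): 2 scaled columns + 3 one-hot city columns
```

Because the transformer is a single fitted object, it can be dropped into a `Pipeline` and reused unchanged on new data.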

### 3. Feature Engineering

```python
# Create new features
df['feature_ratio'] = df['feature1'] / df['feature2']
df['feature_squared'] = df['feature1'] ** 2

# Extract datetime features
df['date'] = pd.to_datetime(df['date_column'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek

# Binning numerical features
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 50, 75, 100], 
                         labels=['Young', 'Adult', 'Senior', 'Elderly'])
```

### 4. Model Training & Evaluation

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Evaluate on test set
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

## 🚀 Best Practices

### 1. Code Organization

* Use virtual environments
* Organize code into functions and classes
* Document your code with docstrings
* Use version control (Git)

### 2. Performance Optimization

* Use vectorized operations (NumPy/Pandas)
* Profile your code for bottlenecks
* Use appropriate data structures
* Consider using Cython for critical sections

### 3. Reproducibility

* Set random seeds
* Save model artifacts
* Document data preprocessing steps
* Use requirements.txt for dependencies
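The seed-setting point above can be collected into one small helper; extend it with `torch.manual_seed` or `tf.random.set_seed` if those frameworks are in use:

```python
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed Python's and NumPy's RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    # If using deep learning frameworks, also seed them here, e.g.:
    # torch.manual_seed(seed); tf.random.set_seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
print(np.array_equal(a, b))  # True: same seed, same draws
```

Call the helper once at the top of every script or notebook so experiments can be re-run bit-for-bit.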

### 4. Testing

* Write unit tests for critical functions
* Test with different datasets
* Validate model assumptions
* Monitor model performance over time
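A unit test for a critical function, as recommended above; a minimal pytest-style sketch for a hypothetical feature-ratio helper (function name and columns are illustrative):

```python
import numpy as np
import pandas as pd

def add_feature_ratio(df: pd.DataFrame, num: str, den: str) -> pd.DataFrame:
    """Add num/den as a new column, guarding against division by zero."""
    out = df.copy()
    out[f'{num}_per_{den}'] = out[num] / out[den].replace(0, np.nan)
    return out

# pytest-style test: discovered by `pytest`, or callable directly
def test_add_feature_ratio_handles_zero_denominator():
    df = pd.DataFrame({'a': [10, 20], 'b': [2, 0]})
    result = add_feature_ratio(df, 'a', 'b')
    assert result.loc[0, 'a_per_b'] == 5.0
    assert np.isnan(result.loc[1, 'a_per_b'])
    assert 'a_per_b' not in df.columns  # original frame untouched

test_add_feature_ratio_handles_zero_denominator()
```

Tests like this catch silent edge cases (zero denominators, mutated inputs) long before they corrupt a training set.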

## 📚 References & Resources

### 📖 Books

* [**"Python for Data Analysis"**](https://wesmckinney.com/book/) by Wes McKinney
* [**"Hands-On Machine Learning"**](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) by Aurélien Géron
* [**"Python Machine Learning"**](https://sebastianraschka.com/books.html) by Sebastian Raschka

### 🎓 Courses

* [**DataCamp Python Track**](https://www.datacamp.com/tracks/python-programming)
* [**Coursera Python for Everybody**](https://www.coursera.org/specializations/python)
* [**Real Python Tutorials**](https://realpython.com/)

### 📰 Documentation

* [**NumPy Documentation**](https://numpy.org/doc/)
* [**Pandas Documentation**](https://pandas.pydata.org/docs/)
* [**Scikit-learn Documentation**](https://scikit-learn.org/stable/)
* [**TensorFlow Documentation**](https://www.tensorflow.org/guide)
* [**PyTorch Documentation**](https://pytorch.org/docs/)

### 🐙 GitHub Repositories

* [**Awesome Python**](https://github.com/vinta/awesome-python)
* [**Python Data Science Handbook**](https://github.com/jakevdp/PythonDataScienceHandbook)
* [**Scikit-learn Examples**](https://github.com/scikit-learn/scikit-learn/tree/main/examples)

### 📊 Datasets

* [**Kaggle Datasets**](https://www.kaggle.com/datasets)
* [**UCI Machine Learning Repository**](https://archive.ics.uci.edu/ml/)
* [**Hugging Face Datasets**](https://huggingface.co/datasets)

## 🔗 Related Topics

* [🧠 ML Fundamentals](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/fundamentals)
* [🔢 Supervised Learning](https://mahbubzulkarnain.gitbook.io/catatan-seekor-the-series/machine-learning/fundamentals/supervised-learning)
* [🤖 OpenAI Integration](https://github.com/mahbubzulkarnain/catatan-seekor-the-series/blob/master/machine_learning/catatan-seekor-open-ai/README.md)
* [🔍 RAG Systems](https://github.com/mahbubzulkarnain/catatan-seekor-the-series/blob/master/machine_learning/catatan-seekor-rag/README.md)

***

*Last updated: December 2024* *Contributors: \[Your Name]*
