This project implements drug classification using several machine learning models. We analyze chemical and pharmacological features to categorize drugs based on their properties. The goal is to improve the accuracy of drug class prediction using four algorithms: Support Vector Machine (SVM), Naive Bayes, k-Nearest Neighbors (k-NN), and Weighted k-Nearest Neighbors.

We apply these algorithms to a curated dataset and compare train/test split ratios (80:20, 70:30, 60:40, and 50:50) to identify which algorithm and split ratio yield the highest accuracy in classifying drug types.

Drug Type (Drug x, Drug y, Drug z)

| Algorithm | 80:20 Accuracy | 70:30 Accuracy | 60:40 Accuracy | 50:50 Accuracy |
|---|---|---|---|---|
| Naive Bayes | 81.67% | 81.67% | 81.67% | 81.67% |
| Simple K-NN | 65.00% | 65.00% | 65.00% | 65.00% |
| Weighted K-NN | 70.00% | 70.00% | 70.00% | 70.00% |
| SVM | 85.00% | 85.00% | 85.00% | 85.00% |

Overall highest-accuracy algorithm: SVM, with 85.00% accuracy.

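A comparison across algorithms and split ratios like the one in the table could be set up roughly as follows. This is a minimal sketch on synthetic stand-in data, not the project's actual script, so its numbers will not match the table. Several choices here are assumptions rather than details taken from the project: the synthetic features and labels, `weights="distance"` as the weighted k-NN variant, default `SVC` settings, `min_categories=3` for `CategoricalNB`, and the stratified split.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 rows, 5 categorical features
# coded 0..2, and a 3-class label derived from the first two features.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))
y = (X[:, 0] + X[:, 1]) % 3


def make_models():
    """Return fresh, unfitted models so each split starts from scratch."""
    return {
        "Naive Bayes": CategoricalNB(min_categories=3),
        "Simple K-NN": KNeighborsClassifier(n_neighbors=20),
        "Weighted K-NN": KNeighborsClassifier(n_neighbors=20, weights="distance"),
        "SVM": SVC(),
    }


# Map each train:test ratio to the test_size fraction used by train_test_split
test_sizes = {"80:20": 0.2, "70:30": 0.3, "60:40": 0.4, "50:50": 0.5}

results = {}
for split_name, test_size in test_sizes.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    for model_name, model in make_models().items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[(model_name, split_name)] = accuracy_score(y_test, y_pred)

for (model_name, split_name), acc in results.items():
    print(f"{model_name:14s} {split_name}: {acc * 100:.2f}%")
```

Because each split ratio produces a different test set, the accuracies would normally vary from column to column, which makes a loop like this a quick sanity check on a results table.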
To run this project locally, follow these steps:

Clone the repository, create and activate a virtual environment, and install the dependencies:

```bash
git clone https://github.com/your-username/drug-classification.git
cd drug-classification
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt
```

The script reads the dataset from `../input/drug-classification/drug200.csv`.

The following Python libraries are required to run this project:

- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- imbalanced-learn (for SMOTE)

You can install all dependencies using:

```bash
pip install -r requirements.txt
```

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

# Load dataset
df_drug = pd.read_csv("../input/drug-classification/drug200.csv")
```
```python
# Data preprocessing: bin the numeric columns into labeled categories.
# pd.cut uses right-inclusive intervals, so (0, 19] maps to '<20s'.
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
df_drug['Age_binned'] = pd.cut(df_drug['Age'], bins=bin_age, labels=category_age)
df_drug = df_drug.drop(['Age'], axis=1)

bin_NatoK = [0, 9, 19, 29, 50]
category_NatoK = ['<10', '10-20', '20-30', '>30']
df_drug['Na_to_K_binned'] = pd.cut(df_drug['Na_to_K'], bins=bin_NatoK, labels=category_NatoK)
df_drug = df_drug.drop(['Na_to_K'], axis=1)
```
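
To see how the age bins above behave at their edges, the following small sketch applies the same `bin_age` and `category_age` to a handful of hypothetical ages (the sample values are made up for illustration and are not taken from the dataset). It relies on `pd.cut` treating intervals as right-inclusive by default, and it shows that a value outside the outermost edges is not assigned to any bin.

```python
import pandas as pd

bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']

# Hypothetical sample ages chosen to sit on or near the bin edges
ages = pd.Series([15, 19, 20, 29, 30, 74, 80, 85])
binned = pd.cut(ages, bins=bin_age, labels=category_age)

for age, label in zip(ages, binned):
    print(age, label)

# 85 lies outside (0, 80], so it is left unbinned (NaN)
print("unbinned rows:", int(binned.isna().sum()))
```

If the real data could contain values outside the chosen edges, checking `binned.isna().sum()` after binning is a cheap way to catch rows that would otherwise carry missing values into the encoding step.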
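```python
# Splitting the dataset (70:30 train/test)
X = df_drug.drop(["Drug"], axis=1)
y = df_drug["Drug"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One-hot encode, then align the test columns to the training columns so both
# sets have identical features even if a category is missing from one split
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)

# Oversampling using SMOTE (applied to the training data only)
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
```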
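
Encoding the training and test sets separately with `pd.get_dummies` can yield different columns when a category is absent from one of the splits. The sketch below illustrates the problem and one possible fix on toy data; the `level` column and its values are hypothetical and only stand in for a categorical feature.

```python
import pandas as pd

# Hypothetical toy frames: the "LOW" category appears only in the training split
train = pd.DataFrame({"level": ["HIGH", "LOW", "NORMAL"]})
test = pd.DataFrame({"level": ["HIGH", "NORMAL"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(list(train_enc.columns))  # three indicator columns
print(list(test_enc.columns))   # only two: no column for "LOW"

# Align the test columns to the training columns; missing indicators become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

An alternative with the same effect is to one-hot encode the full feature table once, before calling `train_test_split`.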
```python
# K-Nearest Neighbors
KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(X_train, y_train)
y_pred = KNclassifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# accuracy_score expects (y_true, y_pred)
KNAcc = accuracy_score(y_test, y_pred)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc * 100))
```
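
The snippet above covers simple k-NN only. The README does not say how the weighted variant is configured, so the following is a sketch under the assumption that it corresponds to scikit-learn's `weights="distance"` option, where closer neighbors get a larger vote. The toy points and labels are hypothetical and chosen so that the two voting schemes disagree.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: one "A" point close to the query,
# two "B" points farther away
X_toy = [[0.0], [3.0], [4.0]]
y_toy = ["A", "B", "B"]
query = [[0.5]]

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(query)[0]
weighted_pred = weighted_knn.predict(query)[0]

print("uniform:", uniform_pred)    # majority vote: two "B" against one "A"
print("distance:", weighted_pred)  # inverse-distance vote favors the nearby "A"
```

With uniform weights the two distant "B" points outvote the single "A". With inverse-distance weights the "A" point contributes 1/0.5 = 2, against roughly 0.69 for the two "B" points combined, so the prediction flips to "A".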

Contributions are welcome! Follow these steps:

This project is licensed under the MIT License. See the LICENSE file for details.