Data Scientist / Data Analyst
Master of Data Science (Monash) graduate proficient in deep learning and machine learning, with hands-on experience training models and building interactive dashboards. I am a collaborative team player who is continually learning new tools, driven by a passion for using data to build innovative solutions that serve communities.
Projects
Illicit Content Detection — LLM vs Classical
Unified text-classification pipeline comparing BERT/Llama/Gemma against SVM/NB baselines. Reproducible CLI, YAML configs, tests, CI.
- Binary & 40-class setups with class weights & robust splits
- BERT fine-tuning + PEFT stubs for Llama/Gemma
- Packaging, unit tests, GitHub Actions
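For readers unfamiliar with class weighting, here is a minimal sketch of the common "balanced" heuristic, where each class is weighted by n_samples / (n_classes * class_count). This is an illustration only: the function name and the `licit`/`illicit` labels are hypothetical, and the project itself may compute its weights differently.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    Uses the 'balanced' heuristic: n_samples / (n_classes * class_count).
    This is an assumed scheme for illustration, not the project's exact code.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced binary label set: 8 licit vs 2 illicit listings.
weights = balanced_class_weights(["licit"] * 8 + ["illicit"] * 2)
print(weights)
```

The rarer class receives the larger weight, so misclassifying it costs more during training.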
Mental Health & Happiness — Kaggle (Reg + 5-class)
Reframed competition work: OOF stacking, F1-based selection, CatBoost/XGB/LGBM baselines, tidy notebooks.
- Ordinal-aware classification with threshold tuning
- OOF stacks + Optuna-ready configs
- Reproducible environment & CI
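The ordinal-aware idea can be sketched roughly as follows: predict a continuous score, map it to one of five ordered classes with sorted cut points, then nudge each cut point to improve macro-F1. This is a simplified illustration under my own assumptions; the helper names, the single-pass coordinate search, and the toy data are hypothetical and not taken from the competition code.

```python
from bisect import bisect_right


def to_classes(scores, cuts):
    """Map each continuous score to an ordinal class index via sorted cut points."""
    return [bisect_right(cuts, s) for s in scores]


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (0.0 for classes with no support or predictions)."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes


def tune_cuts(scores, y_true, cuts, grid):
    """One pass of coordinate search: move each cut to the grid value that most improves macro-F1."""
    cuts = list(cuts)
    n_classes = len(cuts) + 1
    for i in range(len(cuts)):
        # Keep cut points strictly ordered by bounding each one by its neighbours.
        lo = cuts[i - 1] if i > 0 else float("-inf")
        hi = cuts[i + 1] if i + 1 < len(cuts) else float("inf")
        best_cut = cuts[i]
        best_score = macro_f1(y_true, to_classes(scores, cuts), n_classes)
        for g in grid:
            if not lo < g < hi:
                continue
            trial = cuts[:i] + [g] + cuts[i + 1:]
            score = macro_f1(y_true, to_classes(scores, trial), n_classes)
            if score > best_score:
                best_cut, best_score = g, score
        cuts[i] = best_cut
    return cuts


# Toy example: six hypothetical out-of-fold scores with 5-class labels (0..4).
scores = [0.2, 0.9, 1.4, 2.6, 3.1, 3.9]
y_true = [0, 0, 1, 2, 3, 4]
initial = [0.5, 1.5, 2.5, 3.5]
grid = [x / 10 for x in range(0, 45)]
tuned = tune_cuts(scores, y_true, initial, grid)
```

In practice the thresholds would be tuned on out-of-fold predictions rather than on the data used for the final evaluation, to avoid an optimistic score.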
Youth Offender Dashboard (R Shiny)
Interactive choropleth + time-series + gender/age bubble + screen-use heatmap. Clean state-name harmonisation & guardrails.
- sf + leaflet + plotly, pre-simplified ASGS shapes
- Data-health banner & causality disclaimers
- Fast reactivity via leafletProxy
Big Data Fraud Detection — Spark
PySpark pipelines: explicit schemas, feature engineering (L1/L2/L3 actions), RF vs GBT, ROC/AUC, KMeans with silhouette.
- Spec-compliant SparkConf + ≤16MB partition bytes
- No Pandas in core ETL; MLlib-only modelling
- Model persistence for data streaming
Big Data ETL & EDA
Exploratory analysis + schema-first ingestion for retail data. Clean joins, QA checks, and reproducible visuals.
- Typed schemas, null-safety, profiling
- EDA figures and automated summaries
- Notebook as report; code modules for reuse
Experience
- Face-to-face interviews & surveys across Melbourne; high-integrity primary data collection
- Maintained clean datasets with market research tools; collaborated to meet collection targets
- SQL/Python analysis to inform initiatives; KPI design and Power BI tracking
- P&L initiative cut team costs by ~45% (> $500K) and increased profit by > $1M in 2 months; top 5% performance in 2022
- Owned revenue growth for Books & Automotive; subcategory targeting via weekly/monthly data
- Books category: average daily orders grew 38% in 6 months
- Customer discovery, volume analysis, and solution scoping across logistics opportunities
- Supported wins including Nestlé Central distribution and Yeah1 Group transportation
Education
GPA 3.6/4; Dean’s List S1 2024; Research: LLMs for illicit content classification
GPA 3.8/4; Dean’s List S1 2018; Scholarship for excellent academic performance
Foundational coursework in international economics
Certifications
Publications
Research project fine-tuning large language models such as Llama 3.2 and Gemma 3 for detecting illicit product listings in online marketplaces. Currently under review for publication.
About
Hi, I'm Quoc Khoa Tran (Kevin Tran), a data scientist who loves working with numbers and turning data into stories that make sense. Outside of work, I spend a lot of time at the gym and play tennis as a beginner who is always trying to improve my swing. I also enjoy traveling whenever I get the chance.
Interests: AI & ML systems, LLMs for text, Spark pipelines, and human-friendly analytics.
- Python (pandas, PySpark, scikit-learn; PyTorch, TensorFlow)
- NLP: Transformers (Hugging Face), tokenizers, PEFT
- Data engineering: Apache Kafka, Spark Streaming, Snowflake (basics)
- Visualisation: Power BI, Tableau, R Shiny (leaflet, sf, plotly)
- SQL, dbt basics, data modeling
Seeking Data Science / ML Engineer internships and graduate roles. Open to collaborations on applied NLP and analytics.