Data Scientist / Data Analyst
Master of Data Science (Monash) graduate proficient in machine learning and deep learning, with hands-on experience training models and building interactive dashboards. A collaborative team player who continually learns new tools, driven by a passion for leveraging data to build innovative solutions that serve communities.
Projects
Illicit Content Detection — LLM vs Classical
Unified text-classification pipeline comparing BERT/Llama/Gemma against SVM/NB baselines. Reproducible CLI, YAML configs, tests, CI.
- Binary & 40-class setups with class weights & robust splits (class-weighted fine-tuning sketched below)
- BERT fine-tuning + PEFT stubs for Llama/Gemma
- Packaging, unit tests, GitHub Actions
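A minimal sketch of the class-weighted fine-tuning pattern, assuming a Hugging Face Trainer setup; the WeightedLossTrainer name, checkpoint, and weight handling are illustrative rather than the project's actual code.

```python
from torch import nn
from transformers import AutoModelForSequenceClassification, Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies per-class weights to counter label imbalance."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # e.g. inverse class frequencies as a torch tensor

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# 40-class head; the binary setup uses the same pattern with num_labels=2.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=40)
```

The Llama/Gemma PEFT stubs would slot into the same Trainer interface, typically by attaching LoRA-style adapters to the base model before training.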
Mental Health & Happiness — Kaggle (Reg + 5-class)
Reframed competition work: out-of-fold (OOF) stacking, F1-based model selection, CatBoost/XGB/LGBM baselines, tidy notebooks.
- Ordinal-aware classification with threshold tuning (threshold tuner sketched below)
- OOF stacks + Optuna-ready configs
- Reproducible environment & CI
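A minimal sketch of the ordinal threshold tuning, assuming out-of-fold regression-style scores that are cut into the five classes at points chosen to maximise macro-F1; the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def apply_thresholds(scores, thresholds):
    """Map continuous scores to ordinal classes 0..K-1 via sorted cut points."""
    return np.digitize(scores, np.sort(thresholds))

def tune_thresholds(oof_scores, y_true, n_classes=5):
    """Search for cut points that maximise macro-F1 on out-of-fold predictions."""
    init = np.linspace(oof_scores.min(), oof_scores.max(), n_classes + 1)[1:-1]
    neg_f1 = lambda t: -f1_score(y_true, apply_thresholds(oof_scores, t), average="macro")
    result = minimize(neg_f1, init, method="Nelder-Mead")  # derivative-free; F1 is not smooth in the cut points
    return np.sort(result.x)
```

The tuned cut points would then be reused to convert the stacked model's test predictions into class labels.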
Youth Offender Dashboard (R Shiny)
Interactive choropleth + time-series + gender/age bubble + screen-use heatmap. Clean state-name harmonisation & guardrails.
- sf + leaflet + plotly, pre-simplified ASGS shapes
- Data-health banner & causality disclaimers
- Fast reactivity via leafletProxy
Big Data Fraud Detection — Spark
PySpark pipelines: explicit schemas, feature engineering (L1/L2/L3 actions), RF vs GBT, ROC/AUC, KMeans with silhouette.
- Spec-compliant SparkConf + ≤16MB partition bytes
- No Pandas in core ETL; MLlib-only modelling
- Model persistence for reuse in data streaming (pipeline sketched below)
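A minimal sketch of the schema-first RF vs GBT comparison, assuming a binary label column; the file path, column names, and app name are placeholders, not the assignment's actual spec.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("fraud-detection").getOrCreate()

# Explicit schema: no type inference at read time (column names are illustrative).
schema = StructType([
    StructField("txn_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("n_actions", DoubleType(), True),
    StructField("label", DoubleType(), True),
])
df = spark.read.csv("transactions.csv", header=True, schema=schema)

assembler = VectorAssembler(inputCols=["amount", "n_actions"], outputCol="features")
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
train, test = df.randomSplit([0.8, 0.2], seed=42)

for clf in (RandomForestClassifier(labelCol="label"), GBTClassifier(labelCol="label")):
    fitted = Pipeline(stages=[assembler, clf]).fit(train)
    print(type(clf).__name__, "AUC:", evaluator.evaluate(fitted.transform(test)))

# The last fitted pipeline is persisted so it can be reloaded for streaming scoring.
fitted.write().overwrite().save("models/fraud_gbt")
```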
Big Data ETL & EDA
Exploratory analysis + schema-first ingestion for retail data. Clean joins, QA checks, and reproducible visuals.
- Typed schemas, null-safety, profiling (ingestion sketched below)
- EDA figures and automated summaries
- Notebook as report; code modules for reuse
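A minimal sketch of the typed ingestion and QA checks, assuming retail orders arrive as CSV; the table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

spark = SparkSession.builder.appName("retail-etl").getOrCreate()

# Typed schema so malformed values surface at load time instead of silently becoming strings.
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("store_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("revenue", DoubleType(), True),
])
orders = spark.read.csv("retail/orders.csv", header=True, schema=schema)

# Profiling: per-column null counts plus basic numeric stats.
orders.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in orders.columns]).show()
orders.describe("revenue").show()

# Null-safe, duplicate-free join key before joining to dimension tables.
clean = orders.dropna(subset=["order_id"]).dropDuplicates(["order_id"])
```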
Experience
- Conduct face-to-face interviews and surveys across Melbourne, collecting high-quality primary data, directly supporting client delivery and insight generation.
- Ensure data integrity and accuracy in a fast-paced environment; manage data with market research tools to prepare clean datasets for analysis.
- Collaborate with cross-functional teams to meet data collection targets, demonstrating a consulting mindset in meeting client needs.
- Executed end-to-end analytics, from data sourcing and data-warehouse query optimization (SQL) to advanced quantitative analysis (Python), providing data-driven decision support for new initiatives and business problems.
- Developed performance reporting and predictive analytics (e.g., forecasting) using Power BI dashboards for senior stakeholders, supporting continuous monitoring and optimisation of product performance.
- Acted as a key analytical partner, managing stakeholder relationships across multiple teams to roll out projects, manage timelines, and recommend strategic, system-related solutions based on data insights.
- Achievement: Led a P&L initiative that saved 45% of total team cost (more than 500K USD) and increased profit by more than 1M USD in 2 months; awarded A+ (top 5% of company) in 2022
- Communicated with customers to build a deep understanding of their problems.
- Analyzed customer volume data to identify missing information and clarify ambiguities, upholding high standards of data review.
- Achievements: Won Nestlé distribution project (3M USD value); won Unilever warehouse project (1M USD value).
Education
GPA 3.6/4; Dean’s List S1 2024; Research: Fine-tuning LLMs for illicit content detection on online marketplaces
GPA 3.8/4; Dean’s List S1 2018; Scholarship for excellent academic performance
Certifications
Publications
Research project fine-tuning large language models such as Llama 3.2 and Gemma 3 to detect illicit product listings in online marketplaces; accepted for oral presentation and publication at the International Conference of Natural Language Processing 2026.
About
Hi, I'm Quoc Khoa Tran (Kevin Tran), a data scientist who loves working with numbers and turning data into stories that make sense. Outside of work, I spend a lot of time at the gym, play tennis as a beginner always trying to improve my swing, and enjoy traveling whenever I get the chance.
Interests: AI & ML systems, LLMs for text, Spark pipelines, and human-friendly analytics.
- Python (pandas, PySpark, scikit-learn; PyTorch, TensorFlow)
- NLP: Transformers (Hugging Face), tokenizers, PEFT
- Data engineering: Apache Kafka, Spark Streaming, Snowflake (basics)
- Visualisation: Power BI, Tableau, R Shiny (leaflet, sf, plotly)
- SQL, dbt basics, data modeling
Seeking Data Science / ML Engineer internships and graduate roles. Open to collaborations on applied NLP and analytics.