
ABOUT THE PROJECT
This project involved preparing a dataset of 50,000+ restaurant reviews plagued by missing values, inconsistent data types, duplicates, and multi-valued fields. The cleaned data was uploaded to a MySQL database, providing a solid foundation for data management and enabling data-driven business insights in the restaurant sector.
PROJECT TOOLS, SKILLS AND ACTIVITIES
- Defined clear project goals and success criteria to ensure data quality and readiness for analysis
- Conducted a thorough review of raw datasets to understand their structure, completeness, and consistency
- Standardized the dataset by normalizing column names, fixing formats, and enforcing consistent data types
- Handled missing data using targeted imputation where possible and excluded records that could not be reliably recovered
- Removed duplicate records and ensured entity consistency using rule-based matching and deduplication
- Identified and addressed outliers using statistical methods and domain knowledge to improve analysis accuracy
- Engineered new features and aggregated metrics to enhance data understanding and management
- Split and restructured multi-value fields into usable formats while maintaining relational integrity using indexing
- Validated the final dataset through summary statistics and spot checks, delivering comprehensive documentation for easy handoff
- Restructured and exploded the multi-value columns Dish_liked, Rest_type, and Cuisines into usable formats, maintaining relationships with the primary table by using the pandas index and created columns as primary and foreign keys
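
The standardization, deduplication, and multi-value steps above can be illustrated with a minimal pandas sketch. It is not the project's actual code: the sample rows, the helper names `clean` and `explode_column`, the comma delimiter, and the key names `restaurant_id` and `<column>_id` are all assumptions made for illustration. Only the column names Rest_type and Cuisines come from the list above.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 50,000+ records.
raw = pd.DataFrame({
    "Name ": ["Spice Hub", "Spice Hub", "Cafe Uno"],
    "Rest_type": ["Casual Dining, Bar", "Casual Dining, Bar", "Cafe"],
    "Cuisines": ["North Indian, Chinese", "North Indian, Chinese", None],
})


def clean(df):
    """Normalize column names, drop exact duplicates, and set a primary key."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    out = out.drop_duplicates().reset_index(drop=True)
    # The fresh pandas index serves as the primary key (assumed name).
    out.index.name = "restaurant_id"
    return out


def explode_column(df, column):
    """Split a comma-separated column into a child table, one value per row.

    The parent's index is carried along as a foreign key column, and the
    child's own index becomes its primary key.
    """
    values = df[column].dropna().str.split(",").explode().str.strip()
    values = values[(values != "").to_numpy()]  # drop empty fragments
    child = values.reset_index()                # restaurant_id becomes a column
    child.index.name = f"{column}_id"
    return child


restaurants = clean(raw)
cuisines = explode_column(restaurants, "cuisines")
rest_types = explode_column(restaurants, "rest_type")

# Referential integrity check: every foreign key exists in the parent table.
assert cuisines["restaurant_id"].isin(restaurants.index).all()
assert rest_types["restaurant_id"].isin(restaurants.index).all()
```

Keeping each multi-value field in its own child table leaves the primary table with one row per restaurant, while the shared `restaurant_id` lets the tables be rejoined later, for example with `cuisines.merge(restaurants, left_on="restaurant_id", right_index=True)`.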
PROJECT LINKS
- GitHub Repository: View Project on GitHub
- Cleaned Datasets: View Datasets