Data Validation Strategy
Arun Sharma
Data Validation Strategies

Company: Verisk Innovation Analytics
Project: POLITICAL EVENTS LIKELIHOOD PREDICTION

Data for this project will be collected from articles published in magazines, so the first step is to select authors and magazines with a history of publishing articles on the political situation. Once the data collection plan is finalized, the collected data must be checked for sanity to confirm that it meets our requirements. Problems such as missing data, unsuitable article length, and invalid authors will be corrected in this step.

Once the sanity step is done, we can move on to data exploration to check for inconsistencies in the data. Removing common English words, names, and other words irrelevant to our analysis will keep the data compact and limited to what is useful for model building. Further exploratory data analysis (EDA) will be carried out to treat outliers, verify that values fall within expected ranges, and normalize the data.

After EDA, we will focus on drawing an appropriate sample from the data. Samples will be thoroughly checked for biases. Since the data may have a large number of features, it is essential to keep only the most informative ones. Dimensionality-reduction techniques such as principal component analysis (PCA) and singular value decomposition (SVD) will be used to reduce the data set to its most important components.

Company: IronPlanet
Project: PRICING RECOMMENDER SYSTEM

Data for this project is supplied by IronPlanet. The first step in data validation is to check the sanity of the data. Since the data is maintained in-house by IronPlanet, it is essential to understand how IronPlanet obtains this data and how it is stored. The data will be thoroughly checked for inconsistencies such as data type mismatches, missing data, outliers, and demographics mismatches.
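The sanity checks named above (missing data, data type mismatch, outliers) could be sketched as follows. This is a minimal illustration only: the field names, the expected types, the sample records, and the 1.5 x IQR outlier rule are all assumptions made for the example, not details of the actual IronPlanet data.

```python
from statistics import quantiles

# Hypothetical schema (an assumption for illustration): field -> expected type.
EXPECTED_TYPES = {"item_id": str, "year": int, "price": float}


def find_issues(records, expected_types=EXPECTED_TYPES):
    """Return (row_index, field, problem) tuples for missing or mistyped values."""
    issues = []
    for i, row in enumerate(records):
        for field, expected in expected_types.items():
            value = row.get(field)
            if value is None:
                issues.append((i, field, "missing"))
            elif not isinstance(value, expected):
                issues.append((i, field, "type mismatch"))
    return issues


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (one common outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up records, used only to exercise the checks.
records = [
    {"item_id": "A1", "year": 2012, "price": 41000.0},
    {"item_id": "A2", "year": "2014", "price": 39500.0},  # year stored as text
    {"item_id": "A3", "year": 2010, "price": None},       # missing price
    {"item_id": "A4", "year": 2013, "price": 40250.0},
]
issues = find_issues(records)

# Made-up prices with one value far from the rest.
prices = [40000.0, 41000.0, 39500.0, 40250.0, 42000.0, 38750.0, 250000.0]
outliers = iqr_outliers(prices)

print(issues)
print(outliers)
```

Reporting issues as (row, field, problem) tuples keeps the check separate from the correction, so each flagged value can be reviewed before it is fixed or dropped.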
Once we have sanitized data, we can carry out exploratory data analysis to summarize its main characteristics. This will help us remove any remaining inconsistencies and normalize the data. Once the data is normalized, we will employ sampling techniques to draw an appropriate sample. Since IronPlanet will have a high number of features in a very large data set, it is necessary to use only the most important features in our analysis. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) will be used to extract the most important components. If the features are highly correlated, they will be binned and combined.
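As a rough sketch of the correlation check mentioned above, the following flags feature pairs whose absolute Pearson correlation exceeds a threshold, which marks them as candidates for combining. The feature names, the sample values, and the 0.9 threshold are assumptions chosen for illustration; the PCA and SVD steps themselves are not shown here.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical equipment features with made-up values.
columns = {
    "engine_hours": [1200, 3400, 560, 7800, 4100],
    "odometer_km": [24000, 69000, 11000, 155000, 83000],
    "model_year": [2015, 2011, 2016, 2012, 2009],
}
pairs = correlated_pairs(columns)
print(pairs)
```

In this made-up data, engine hours and odometer readings move almost in lockstep, so only that pair is flagged; one of the two could be dropped or the two merged into a single usage feature before modeling.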
Latest Update: June 28, 2021
Data Collection And Sanity Of The Collected Data. (June 28, 2021). Retrieved from https://www.freeessays.education/data-collection-and-sanity-of-the-collected-data-essay/