First, we import the libraries we'll be using:

```python
import pandas as pd
import html
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
```

Secondly, we need to import the Twitter data. We're taking advantage of the pandas library here to import the data. In this case, I use CSV Twitter data; you may need to adjust the code if your file has another extension.

```python
pd.set_option('display.max_colwidth', None)  # show the full tweet text instead of truncating it
data = pd.read_csv('your_sample.csv')
data.head()
```

Once we have imported the data, we're ready for the data cleaning process. The first things we're going to clean are duplicates. Most of the time we don't need duplicate rows, because in further use (i.e. analysis) they could skew the results by distorting the measurements.

```python
# Drop the duplicates, keeping the first occurrence, and store the result in a new variable
new_data = data.drop_duplicates('Tweet Content', keep='first')
new_data.head()
```

If your dataframe has indices included in it, once you drop the duplicates you need to store the new dataframe in a new file. We're assuming here that we're only going to use the tweet text, so we're going to extract the tweet column out of the file. Don't forget to store the new dataframe in a new file without including the index, so that we can explore the data more freely later on.
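The extract-and-save step can be sketched as follows. This is a minimal sketch, not the article's exact code: the column name `'Tweet Content'` comes from the deduplication call shown earlier, while the small sample dataframe and the filename `tweets_only.csv` are assumptions made here for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the deduplicated Twitter data
new_data = pd.DataFrame({
    'Tweet Content': ['hello world', 'cleaning tweets'],
    'Username': ['user_a', 'user_b'],
})

# Keep only the tweet text column
tweets = new_data[['Tweet Content']]

# Write it out WITHOUT the index, so re-reading the file later
# doesn't produce a spurious 'Unnamed: 0' column
tweets.to_csv('tweets_only.csv', index=False)

reloaded = pd.read_csv('tweets_only.csv')
print(reloaded.columns.tolist())  # ['Tweet Content']
```

Passing `index=False` matters here: by default `to_csv` also writes the row index as an unnamed first column, which then shows up as an extra column every time you reload the file.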