Understanding the structure of the data in Twitter streams for sentiment analysis applications

In the previous post we outlined the basic scientific method we use and formalized the problem statement we are solving: “Based on the tweets of the English-speaking population of the United States related to selected new movie releases, can we identify patterns in the public’s sentiments towards these movies in real time and track the progression of these sentiments over time?” In this post we address the first step in the process: understanding the data.

Our goal in the earliest stage of the project is to understand as much as we can about the data: what data sources are available; how much data is being produced; how it is captured and transmitted, with what latencies and on what channels; how long it stays available; how secure it is; how accurate it is; and so on. In our case, we need the following types of data:

  • Information about trending movies, from which to create a short list of movies to analyze. We will use the IMDB database to select several recently released US movies of various genres, along with their user ratings.
  • The actual tweet stream. We will use the official Twitter streaming API to get a tweet stream filtered by a set of keywords.
  • Dictionaries of positive and negative words. Dictionaries serve two purposes: they are the basis for dictionary-based classification, and they support dimensionality reduction (removing irrelevant words, such as “an” and “the,” that don’t affect sentiment) in approaches that employ machine learning. In our demo application, we prefer reusing existing dictionaries over creating our own, for the sake of cost and efficiency. On a long-term commercial project we might decide to invest in a custom dictionary. A number of dictionaries are available under open licenses as good starting points, e.g. a dictionary of root words based on the one created by Finn Årup Nielsen, the Jeffrey Breen unigram dictionary, and the MPQA Subjectivity Lexicon, so we used those.
  • Training and test datasets. The quality of our models will largely depend on the size of our training datasets and the quality of our testing datasets. For our purposes, we are limited to sources that are open and freely available to the research community. From experience, we know that the best datasets for our type of analysis consist of tweets manually labeled by people as carrying a “positive” or “negative” sentiment. We will use two datasets: the IMDB Large Movie Review Dataset (a dataset topical to our subject) and 5K manually labeled tweets from Niek Sanders.
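
The two roles of a dictionary described above can be sketched in a few lines of Python. This is a minimal illustration only: the word lists are tiny, hypothetical stand-ins rather than any of the published lexicons, and the function names are our own.

```python
import re

# Hypothetical miniature word lists; a real run would load a published lexicon instead.
POSITIVE = {"good", "great", "love", "worth"}
NEGATIVE = {"bad", "boring", "hate", "awful"}
STOP_WORDS = {"a", "an", "the", "is", "my", "i", "it", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def reduce_tokens(tokens):
    """Dimensionality reduction: drop stop words that carry no sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]

def dictionary_score(text):
    """Dictionary-based classification: positive word count minus negative word count."""
    tokens = reduce_tokens(tokenize(text))
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
```

A score above zero would be read as positive and a score below zero as negative; how to treat a score of exactly zero is a design choice that this sketch leaves open.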

Once the data scientists have built a basic understanding of the data, they can begin formulating hypotheses about the insights that might be mined from it and about the approach to use to gain these insights.
The first task is to get the stream of tweets related to a specific movie. We will employ the filtering capability of the Twitter streaming API. Every tweet containing words similar to a movie name is considered movie-related. For example, for the movie “Lights Out” both of the tweets below will match. These examples quickly reveal the data quality issues we’ll have to deal with:

  • Tweet 1: “i hope light’s out is worth my time” (relevant)
  • Tweet 2: “WOW! DNC Turned Lights Out on Bernie Hecklers in Audience to Control Optics.” (irrelevant)
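
To see why both tweets match, here is a small local matcher in Python. It is only an approximation written for illustration: the `normalize` and `mentions_title` helpers are hypothetical, and the streaming API applies its own matching rules on the server side.

```python
import re

def normalize(text):
    """Lowercase, drop apostrophes, and collapse other punctuation to single spaces."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return " ".join(re.findall(r"[a-z0-9]+", text))

def mentions_title(tweet_text, title):
    """True if the normalized title occurs as a whole phrase in the normalized tweet."""
    return f" {normalize(title)} " in f" {normalize(tweet_text)} "

tweets = [
    "i hope light's out is worth my time",
    "WOW! DNC Turned Lights Out on Bernie Hecklers in Audience to Control Optics.",
]
matches = [mentions_title(t, "Lights Out") for t in tweets]
```

Both entries of `matches` are true, so keyword matching alone cannot separate the relevant tweet from the irrelevant one; relevance has to be handled in a later step.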

The data received with each tweet looks like this once captured in JSON format:

{
	"created_at" : "Fri Jul 22 21:34:48 +0000 2016",
	"id" : 756603425161261057,
	"id_str" : "756603425161261057",
	"text" : "i hope light's out is worth my time. \ud83d\ude34",
	"source" : "\u003ca href=\"http://twitter.com/download/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c/a\u003e",
	"truncated" : false,
	"in_reply_to_status_id" : null,
	"in_reply_to_status_id_str" : null,
	"in_reply_to_user_id" : null,
	"in_reply_to_user_id_str" : null,
	"in_reply_to_screen_name" : null,
	"user" : {
		"id" : 2989278494,
		"id_str" : "2989278494",
		"name" : "- q\u03c5een\u0455\u043day \u2728",
		"screen_name" : "yungbarbiex0",
		"location" : "Queens, NY",
		"url" : "http://Instagram.com/xobabyshay",
		"description" : "\u03b9'\u043c \u0442\u043da\u0442 \u0432\u03b9\u0442c\u043d yo\u03c5 \u043doe\u0455 \u043da\u0442e. $ | \u2651\ufe0f |",
		"protected" : false,
		"verified" : false,
		"followers_count" : 1226,
		"friends_count" : 551,
		"listed_count" : 7,
		"favourites_count" : 26466,
		"statuses_count" : 40548,
		"created_at" : "Mon Jan 19 04:01:54 +0000 2015",
		"utc_offset" : null,
		"time_zone" : null,
		"geo_enabled" : true,
		"lang" : "en",
		"contributors_enabled" : false,
		"is_translator" : false,
		"profile_background_color" : "C0DEED",
		"profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png",
		"profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png",
		"profile_background_tile" : true,
		"profile_link_color" : "DD2E44",
		"profile_sidebar_border_color" : "000000",
		"profile_sidebar_fill_color" : "000000",
		"profile_text_color" : "000000",
		"profile_use_background_image" : true,
		"profile_image_url" : "http://pbs.twimg.com/profile_images/755988257163317248/SRdOYbJA_normal.jpg",
		"profile_image_url_https" : "https://pbs.twimg.com/profile_images/755988257163317248/SRdOYbJA_normal.jpg",
		"profile_banner_url" : "https://pbs.twimg.com/profile_banners/2989278494/1468891291",
		"default_profile" : false,
		"default_profile_image" : false,
		"following" : null,
		"follow_request_sent" : null,
		"notifications" : null
	},
	"geo" : null,
	"coordinates" : null,
	"place" : null,
	"contributors" : null,
	"is_quote_status" : false,
	"retweet_count" : 0,
	"favorite_count" : 0,
	"entities" : {
		"hashtags" : [],
		"urls" : [],
		"user_mentions" : [],
		"symbols" : []
	},
	"favorited" : false,
	"retweeted" : false,
	"filter_level" : "low",
	"lang" : "en",
	"timestamp_ms" : "1469223288227"
}
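
As a rough sketch of how such a record might be consumed, the Python snippet below parses a trimmed copy of the sample and pulls out the fields discussed in this post. The trimmed payload and the `record` layout are our own illustrative choices, not part of the original pipeline.

```python
import json
import unicodedata

# Trimmed copy of the sample tweet; only the fields discussed here are kept.
raw = r'''{
  "created_at": "Fri Jul 22 21:34:48 +0000 2016",
  "text": "i hope light's out is worth my time. \ud83d\ude34",
  "user": {"location": "Queens, NY", "followers_count": 1226},
  "coordinates": null,
  "place": null,
  "lang": "en"
}'''

tweet = json.loads(raw)

# Flatten the nested structure into the handful of values we care about.
record = {
    "text": tweet["text"],
    "followers": tweet["user"]["followers_count"],
    "profile_location": tweet["user"].get("location"),
    "has_geo": tweet.get("coordinates") is not None or tweet.get("place") is not None,
}

# json.loads joins the surrogate pair \ud83d\ude34 into a single code point.
last_char = record["text"][-1]
emoji_name = unicodedata.name(last_char)  # official Unicode name of the trailing emoji
```

Note that the profile-level “location” is free text typed by the user, while “coordinates” and “place” describe the tweet itself; in this sample only the former is filled in.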

[Figure: screenshot of the example tweet]

What potentially valuable data can we see here?
What potential issues and challenges does the data present?

The most important data, naturally, is the “text” field, which contains the tweet itself. There is also location information in the fields “location”, “coordinates”, and “place”. Information reflecting the tweeter’s social reach, such as “followers_count”, could also be interesting. Let’s take a quick look at the distribution of followers, based on a data sample of about 220K tweets collected during July 22-27, 2016 for several movies.

[Figure: Quantiles of the “followers number”]
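
A quantile summary of this kind can be computed with the Python standard library. The follower counts below are invented purely for illustration and are not taken from the 220K-tweet sample.

```python
import statistics

# Hypothetical follower counts, invented for illustration (not the article's sample).
followers = [0, 12, 85, 140, 310, 551, 1226, 4800, 26000, 410000]

# Deciles: nine cut points that split the data into ten equal-sized groups.
deciles = statistics.quantiles(followers, n=10, method="inclusive")
median = statistics.median(followers)
mean = statistics.fmean(followers)
```

With a distribution this skewed, the mean sits far above the median, which is why quantiles are a more informative summary than a single average.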

An ANOVA (analysis of variance) test indicates that the mean number of followers differs significantly between movies.

Analysis of Variance Table

Response: followers_count
              Df     Sum Sq    Mean Sq F value    Pr(>F)
movie          4 1.2078e+12 3.0195e+11  10.464 1.789e-08 ***
Residuals 219302 6.3280e+15 2.8855e+10

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
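
For readers less familiar with ANOVA output, the columns of the table are related by simple arithmetic, which the short Python check below reproduces from the reported sums of squares and degrees of freedom. The variable names are ours.

```python
# Values copied from the ANOVA table above.
df_movie, ss_movie = 4, 1.2078e12
df_resid, ss_resid = 219302, 6.3280e15

ms_movie = ss_movie / df_movie      # mean square between movies
ms_resid = ss_resid / df_resid      # mean square within movies (residual)
f_value = ms_movie / ms_resid       # F statistic: between-group over within-group variance

n_movies = df_movie + 1             # number of groups = Df + 1
n_tweets = df_movie + df_resid + 1  # number of observations = total Df + 1
```

The degrees of freedom imply five movies and 219,307 tweets, consistent with the sample of about 220K tweets mentioned earlier.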

The next important aspect of data understanding is the amount of data. Working with the Twitter Streaming API, we get several dozen tweets per second for a stream filtered by one movie name.

Once our data science team looked at enough data samples, they could summarize the initial findings:

  • Tweet text may contain emoticons and other Unicode-encoded characters. In this tweet example, the text (“i hope light’s out is worth my time. \ud83d\ude34”) ends with a Unicode-encoded “sleeping face” emoji. This will affect data cleansing and feature extraction.
  • Location data is empty for more than 90% of tweets, which makes this information almost useless for analysis.
  • The range of the “followers” number is extremely wide, starting at zero on the low side and reaching hundreds of thousands on the high side. Tracking the number of followers per movie over time could be an interesting place to look for patterns in sentiment distribution.
  • To be on the safe side, we should be prepared to process 100 tweets per second per movie. Depending on the number of movies we want to monitor, we will need to size the processing infrastructure accordingly.
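
The sizing estimate in the last point is easy to turn into a back-of-the-envelope calculation. The Python helpers below are a sketch: the default rate of 100 tweets per second per movie comes from the text, while the average tweet size is an assumption the caller must supply.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_tweet_volume(movies, tweets_per_second_per_movie=100):
    """Upper-bound number of tweets per day for the given number of movies."""
    return movies * tweets_per_second_per_movie * SECONDS_PER_DAY

def daily_raw_bytes(movies, avg_tweet_bytes, tweets_per_second_per_movie=100):
    """Upper-bound raw payload per day; avg_tweet_bytes is an assumed value."""
    return daily_tweet_volume(movies, tweets_per_second_per_movie) * avg_tweet_bytes
```

For five movies, `daily_tweet_volume(5)` gives 43,200,000 tweets per day. If we assume, hypothetically, an average JSON payload of 3,000 bytes, that is roughly 130 GB of raw data per day before any filtering or compression.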

All of this gives us insight into what kind of data we have and stimulates our thinking about hypotheses and directions for further data exploration. It also gives us the necessary grounding to proceed with selecting the right dictionary, which is the subject of the next blog post.
