City, University of London
100 days from_RDT figshare version - Archive.xls (166 kB)

100 Days of Tweet IDs and Most Frequent Terms in Tweets from_user_id_str 25073877

posted on 2017-04-29, 12:04 authored by Ernesto Priego
This is an Excel workbook containing two sheets. The first sheet contains 503 rows corresponding to 503 Tweet id strings from_user_id_str 25073877 and the following corresponding metadata:

created_at time
in_reply_to_user_id_str f

Tweet texts, URLs and other metadata such as profile_image_url, status_url and entities_str have not been included.

An attempt to remove duplicated entries was made but duplicates might have remained so further data refining might be required prior to analyses.

The second sheet contains 400 rows corresponding to the most frequent terms in the texts of the dataset's Tweets. The text analysis was performed with the Terms Tool from Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (2017). An edited English stop words list was applied to remove terms specific to Twitter data, such as https, user names, etc. The analysed Tweets contained emojis and other special characters; due to character encoding these will be reflected in the terms list as character combinations.
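For readers who want to reproduce a comparable terms list outside Voyant Tools, the following is a minimal Python sketch of term counting with a stop words list. It is not the Voyant Terms Tool's algorithm; the function name top_terms, the tokenisation rule and the short STOP_WORDS set are illustrative assumptions, and the edited stop words list actually used is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop words list; the edited list used
# for the deposited workbook is not part of this record.
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "in", "https", "t", "co", "rt", "amp"}


def top_terms(texts, stop_words=STOP_WORDS, limit=400):
    """Return up to `limit` (term, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        # Drop URLs and @user names before tokenising.
        text = re.sub(r"https?://\S+", " ", text)
        text = re.sub(r"@\w+", " ", text)
        # Keep lower-case alphabetic tokens only (an assumed simplification;
        # emojis and other special characters are discarded here).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = ["Great jobs! https://t.co/abc @user", "Jobs, jobs and the economy"]
    for term, count in top_terms(sample):
        print(term, count)
```

Because this sketch discards non-alphabetic characters, it will not reproduce the character combinations that emojis produce in the deposited terms list.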

Motivations to Share this Data

Archived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies.

Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes. Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale.

So far Twitter data analysis and visualisation has been done without researchers providing access to the source data that would allow reproducibility. It is appreciated that an Excel workbook is far from ideal as a file format, but due to the small scale the intention is to make this data human readable and available to researchers in a variety of non-technical fields.

Methodology and Limitations

The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour; this means that any Tweets deleted immediately after publication have not been collected.

The originally harvested data was refined to delete duplications, to comply with Twitter's Terms and Conditions, and to sort the data in chronological order.

Duplication of data due to the automated collection is possible so further data refining might be required.
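As one way of carrying out further refining, the following Python sketch removes rows that repeat a Tweet id and sorts the remainder chronologically. It is an illustration under stated assumptions, not the script used to prepare this file: the column names id_str and created_at, the timestamp format, and the function name dedupe_and_sort are all assumed and should be checked against the workbook before use.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "Sat Apr 29 12:04:00 +0000 2017".
# Verify against the created_at column before relying on it.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def dedupe_and_sort(rows):
    """Keep the first row seen for each Tweet id, then sort oldest first.

    `rows` is a list of dicts with (assumed) keys "id_str" and "created_at".
    """
    seen = set()
    unique = []
    for row in rows:
        if row["id_str"] in seen:
            continue  # a repeated id is treated as a duplicate entry
        seen.add(row["id_str"])
        unique.append(row)
    return sorted(
        unique,
        key=lambda r: datetime.strptime(r["created_at"], TWITTER_TIME_FORMAT),
    )


if __name__ == "__main__":
    sample = [
        {"id_str": "2", "created_at": "Sat Apr 29 12:00:00 +0000 2017"},
        {"id_str": "1", "created_at": "Fri Apr 28 09:00:00 +0000 2017"},
        {"id_str": "2", "created_at": "Sat Apr 29 12:00:00 +0000 2017"},
    ]
    for row in dedupe_and_sort(sample):
        print(row["id_str"], row["created_at"])
```

Matching on the id string rather than on whole rows means that two rows for the same Tweet are still treated as duplicates even if other metadata differs between hourly collections.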

The file may not contain data from Tweets deleted by the queried user account immediately after original publication.

Both research and experience show that the Twitter Search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al. 2012).

Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This dataset is shared for archival, comparative and indicative educational research purposes only.

The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need for a Twitter account.

The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyrighted by their individual authors but please check individually. The license on this output applies to the data collection; third-party content should be attributed to the original authors and copyright owners.

Please note that usernames, user profile pictures and full text of the Tweets collected have not been included in this file. No private personal information is shared in this dataset. As indicated above this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road.

This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations

All Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy and, though work has been done to develop best practices, there is as yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social media specific research ethics guidelines for reproducible research.

It is understood that public figures Tweet publicly with the conscious intention to have their Tweets publicly accessed and discussed. It is assumed that a public figure Tweeting publicly is of public interest and that such figure, as a Twitter user, has given implicit consent, by agreeing explicitly to Twitter's Terms and Conditions, for their Tweets to be publicly accessed and discussed, including critical analysis, without the need for prior written permission. There is therefore no difference between collecting data and performing data analysis from a public printed or online publication and collecting data and performing data analysis of a dataset containing Twitter data from a public account from a public user in a public role. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research.

Various scholarly professional associations, such as the Modern Language Association, recognise Tweets, datasets and other online and digital resources as citable scholarly outputs.

The data contained in the deposited file is otherwise available elsewhere through different methods.