Which articles will receive the most attention on social media sites? (Facebook)


Predict how much attention a paper will gain on a social media platform, based on the features described below, using Python and machine learning algorithms.

Social media can be defined as a “web-based communication tool” that enables users to share and consume information. Social media usage has increased over the last few years and now affects several fields of study, including politics, healthcare, education, and research. It is also an excellent platform for publicizing one’s research. In this paper, we demonstrate how the popularity of an article can be predicted from its Facebook social media content. Using the number of likes as the target variable in the dataset, we predict which articles gain the most attention. We applied linear regression and neural network algorithms to the dataset and drew conclusions about which types of articles become popular.

 

PROJECT PLANNING AND ANALYSIS:

 

Introduction:

 

Nowadays social media plays a critical role in the reach of scholarly papers. According to recent social media statistics, the networks with the most penetration among social media users in 2018 so far are Facebook, Instagram, and Snapchat. In this paper, we analyze which articles receive the most attention on one social media platform: Facebook. The features considered are author rank, number of papers published by the author, number of citations for the author, team size, publication venue, reference count, and number of fields citing a paper. The most challenging part of our study was data extraction: the data extracted from online sources arrives as numerous .json files, amounting to millions of records. To keep the problem tractable, we selected around 70,000 data fields. Two algorithms, linear regression and neural networks, are used to predict which papers gain the most attention.

 

Problem significance: Why should we care? What is the need? Who will benefit?

 

When constructing a research paper, it is important to include reliable sources. Academic research papers are typically based on scholarly and primary sources, and citing highly ranked authors strengthens a paper. The groups that benefit from this work are organizations, authors, and students.

 

Research hypotheses: What questions are you trying to answer?

 

Given a newly published paper, can we predict how much attention it will gain on a social media platform based on the features mentioned above?

 

Related work: What other work has been done before? Make sure that you cite appropriate related work and provide a list of references at the conclusion of your proposal (at least 20 citations).

 

https://arxiv.org/ftp/arxiv/papers/1801/1801.02383.pdf

We have gone through many papers related to our project; the citations listed here are the most relevant. This paper retrieved its data from Plum Analytics, which has been integrated into Scopus, and performed a correlation analysis between Facebook and Twitter scores to measure social attention, whereas our project focuses only on Facebook.

https://www.researchgate.net/publication/263612647_Use_of_social_networks_for_academic_purposes_A_case_study

 

The results are based on a single case study. This study provides new insights into the impact of social media in academic contexts by analyzing the user profiles and benefits of a social network service that specifically targets the academic community, whereas our project focuses on an article's popularity on social media.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4363625/

 

Social and mainstream media metrics analyzed in this paper include scientific blogs, Twitter, Facebook, Google+, and mainstream media and newspaper mentions, as covered by Altmetric.com. By combining these various social media sources with traditional bibliometric indicators, this paper aims to perform the first large-scale characterization of the drivers of social media metrics and to contrast them with the patterns observed for citations.

 

 

Data:

The dataset used for this project is the Altmetric dataset provided to us in the big data course at NIU. It consists of articles and citations, with features including author rank, number of papers published by the author, number of citations for the author, team size, publication venue, reference count, and number of fields citing a paper.

 

------------------------------------------------------------------

Methods:

 

Linear Regression:

 

Linear regression was the first type of regression analysis to be studied rigorously and used extensively in practical applications. It is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables: the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. We use linear regression because our target is continuous, and we rely on the scikit-learn, TensorFlow, Theano, and Keras machine learning libraries for our project.
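As a concrete illustration of unknown parameters being estimated from data, here is a minimal scikit-learn sketch on synthetic data (the weights and array sizes are arbitrary choices, not from our dataset):

```python
# Minimal sketch: fitting a linear model with scikit-learn on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # three explanatory variables
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X, y)                           # estimate the unknown parameters
print(model.coef_, model.intercept_)      # recovered weights and intercept
```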

 

Artificial neural networks:

 

Neural networks are a framework for many machine learning algorithms to work together and process complex data inputs. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. In image recognition, for instance, they might learn to identify images that contain cats by analyzing example images that have been manually labeled, automatically generating identifying characteristics from the learning material they process. In our project, the artificial neural networks are first trained on large amounts of data: training consists of providing input and telling the network what the output should be.
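A minimal Keras sketch of that idea, training by example on random placeholder data (the layer sizes and data shapes here are our own illustrative choices, not the network we ultimately use):

```python
# Illustrative sketch: a network learns by being shown inputs together
# with the outputs it should produce; no task-specific rules are coded.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(500, 11)          # 500 examples, 11 input features
y = np.random.rand(500)              # the "correct" output for each example

model = Sequential([
    Dense(32, activation='relu', input_dim=11),
    Dense(1)                         # single continuous output
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, verbose=0)   # learn from examples
```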

 

 

Innovation: What differentiates your proposal from earlier work? How is it new?

 

In this project, we implement artificial neural networks alongside linear regression for better analysis (data evaluation, processing, and prediction of the target values).

 

Evaluation: How will you evaluate your project? What metrics will validate your results?

 

We will evaluate the project with several metrics, such as the confusion matrix, AUC-ROC, Gini coefficient, root mean squared error, and cross-validation. These checks are intended to make the model efficient, with reduced error and more accurate output values.
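A sketch of how the regression-oriented checks (mean squared error, root mean squared error, and cross-validation) could be computed with scikit-learn, using synthetic data in place of ours:

```python
# Sketch of MSE, RMSE, and k-fold cross-validation on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 11))
y = X @ rng.normal(size=11) + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
rmse = np.sqrt(mse)                               # root mean squared error

# 5-fold cross-validation as an extra guard against overfitting
cv_mse = -cross_val_score(model, X, y, cv=5,
                          scoring='neg_mean_squared_error').mean()
print(f"RMSE: {rmse:.3f}, CV MSE: {cv_mse:.3f}")
```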

  

Expected results:

 

I will use many features taken directly from the dataset, along with features we have created, to obtain a better and more precise result. The research will likely use artificial neural networks; there are many ways to build a prediction model, but given our dataset, I believe this is the best approach to the problem. The plan is to build prediction models using the methods above, maximize each model's efficiency through feature analysis, and create visualizations that represent the results and compare the training model against the test model.

 

PROJECT IMPLEMENTATION 

 

  • INTRODUCTION

Social media gives people the opportunity to be content creators, controllers, and transparent users to a great extent. In other words, social media is an act of engagement where users share their points of view on miscellaneous topics. Because of its ease of use, speed, and reach, social media is fast changing public discourse in society and setting trends and agendas on topics that range from agriculture to the research industry. Numerous websites exist for showcasing work, discovering research data, and collaborating online; examples include Facebook, Mendeley, CiteULike, blogs, news outlets, Wikipedia, Google+, Q&A sites, Reddit, and Altmetric. Since social media can also be construed as a form of collective wisdom, we decided to predict real-world outcomes from social media data, and our paper reports one such study. Surprisingly, we discovered that the chatter of a community can indeed be used to make quantitative predictions that outperform those of artificial markets.

Those who benefit from our study are:

  • Organizations that buy copyrights from authors based on their popularity.
  • Students, who can strengthen their research papers by citing sources from highly ranked authors.
  • Anyone who wants to know which fields people are really interested in.

 

We considered the task of predicting the popularity of an article on social media; in other words, which types of papers will get more citation counts in the future. We used the Altmetric dataset provided by NIU, in which millions of JSON files each represent a tuple of the Altmetric data; these were processed to create the final dataset. Because of the dataset's size, we extracted around 300,000 tuples by random sampling. The features in our dataset are mendeley_count, citeulike_count, connotea_count, blogs_count, news_count, wikipedia_count, facebook_count, googleplus_count, qna_count, policy_count, reddit_count, and altmetric_score. We then cleaned the dataset using the cleaning techniques discussed in a later section, converted the JSON files to a CSV file, separated out the target variable facebook_count, and treated the rest of the data as input features (i.e., dropped the facebook_count column and kept all other features). We chose facebook_count as the target variable because Facebook is one of the largest social media platforms.
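A hypothetical sketch of this extraction step follows; the flat JSON field layout, the directory name, and the sampling cap assumed here are illustrative, not a description of the actual Altmetric files:

```python
# Hypothetical sketch: read a random sample of per-article JSON files,
# keep the count fields, write one CSV, and split off the target.
import glob
import json
import random

import pandas as pd

FIELDS = ["mendeley_count", "citeulike_count", "connotea_count",
          "blogs_count", "news_count", "wikipedia_count",
          "facebook_count", "googleplus_count", "qna_count",
          "policy_count", "reddit_count", "altmetric_score"]

paths = glob.glob("altmetric_json/*.json")        # assumed directory
sample = random.sample(paths, min(300_000, len(paths)))

rows = []
for path in sample:
    with open(path) as f:
        record = json.load(f)
    rows.append({field: record.get(field) for field in FIELDS})

df = pd.DataFrame(rows, columns=FIELDS)
df.to_csv("altmetric_sample.csv", index=False)

# Separate the target variable from the remaining features
y = df["facebook_count"]
X = df.drop(columns=["facebook_count"])
```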

Our goals in this paper are as follows. First, we split the data using train_test_split from sklearn and then applied a linear regression algorithm, which is explained in detail in the sections below. Next, we drew a scatter plot to see how similar the features in the dataset are. We then calculated the mean squared error to measure how close the fitted line is to the data points. Finally, we repeated the above steps with altmetric_score as the target variable and the rest of the data as input features.
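The following sketch mirrors those steps, assuming the CSV produced by the extraction sketch above (the file name and the 80/20 split are our own choices):

```python
# Sketch: split the data, fit linear regression, plot, and score.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("altmetric_sample.csv").dropna()
y = df["facebook_count"]
X = df.drop(columns=["facebook_count"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Scatter plot of predicted vs. observed values on the test set
plt.scatter(y_test, y_pred, s=4)
plt.xlabel("observed facebook_count")
plt.ylabel("predicted facebook_count")
plt.show()

print("MSE:", mean_squared_error(y_test, y_pred))
```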

Next, we trained neural networks on the data with facebook_count as the target variable and calculated the mean squared error. We observed that the neural networks performed better than linear regression.

This paper is organized as follows. Next, we survey recent related work. We then provide a short introduction to the dataset we collected and discuss our study using two algorithms, linear regression and neural networks. We conclude in the last section.

 

  • RELATED WORK

 

Tarek A. El-Badawy and Yasmin Hashem [1] studied the impact of social media on the academic development of students. They conducted a chi-square analysis between the use of social media and the number of hours spent studying, and found no significant relationship between social media use and students' academic performance.

Xianwen Wang, Yunxue Cui, and others [2] studied how social media attention increases article visits. Of the many papers we reviewed, this work is among the most closely related to our project. Their data was retrieved from Plum Analytics, which has been integrated into Scopus, and they performed a correlation analysis between Facebook and Twitter scores to measure social media attention.

Gemma Nandez and Angel Borrego [3] studied the use of social networks for academic purposes. Their results are based on a single case study, which provides new insights into the impact of social media in academic contexts by analyzing the user profiles and benefits of a social network service that specifically targets the academic community.

Stefanie Haustein and others [4] studied the characterization of social media metrics of scholarly papers. The social and mainstream media metrics analyzed in their paper include scientific blogs, Twitter, Facebook, Google+, and mainstream media and newspaper mentions, as covered by Altmetric.com. By combining these various social media sources with traditional bibliometric indicators, their paper performs the first large-scale characterization of the drivers of social media metrics and contrasts them with the patterns observed for citations.

 

  • DATASET CHARACTERISTICS

 

The Altmetric dataset, 27 GB in size, was obtained from the NIU big data course. The data was in JSON (JavaScript Object Notation) file format: several million JSON files, each representing one tuple of the Altmetric data. The JSON format was difficult to process and analyze directly, so we converted it into CSV (comma-separated values) format and extracted a dataset of 12 columns, where each column represents an attribute, and 235,771 rows, where each row holds a tuple of values for those attributes. Given the dataset's size, cleaning it was challenging. We used the following techniques to clean the data (a sketch of these steps appears after the list):

  • Remove outliers
  • dropna() to remove null values using pandas
  • An imputer to fill in the missing values using sklearn
  • drop_duplicates() to remove duplicate values
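A sketch of these cleaning steps: dropna() and drop_duplicates() are standard pandas calls, while the mean-imputation strategy and the 99th-percentile outlier cutoff are our own illustrative assumptions (sklearn's old Imputer class is now called SimpleImputer):

```python
# Sketch of the cleaning pipeline on the assumed CSV from earlier.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("altmetric_sample.csv")

df = df.drop_duplicates()                 # remove duplicate tuples
df = df.dropna(how="all")                 # drop rows that are entirely null

# Fill the remaining missing values with the column mean
imputer = SimpleImputer(strategy="mean")
df[df.columns] = imputer.fit_transform(df)

# Crude outlier removal: keep rows at or below the 99th percentile
for col in df.columns:
    df = df[df[col] <= df[col].quantile(0.99)]
```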

Each attribute in the dataset is a count for each paper on a particular social media platform. The platforms considered are Mendeley, CiteULike, Connotea, news, Wikipedia, Facebook, Google+, Q&A sites, policy documents, Reddit, and Altmetric. Our main goal was to analyze the data from eleven attributes and predict popularity through the target variable, facebook_count. We chose facebook_count as the target because Facebook has the largest number of active users. The features are explained in Table 1.

 

 

Attribute          Meaning
mendeley_count     Number of Mendeley users who have added the document to a Mendeley library
citeulike_count    Number of CiteULike users who have bookmarked the paper
connotea_count     Number of views a paper gets on Connotea
blogs_count        Number of blog posts that mention the paper
news_count         Number of times a paper is mentioned in the news
wikipedia_count    Number of times a paper is referenced on Wikipedia
facebook_count     Number of unique users who saw the paper's post in the Facebook News Feed
googleplus_count   Number of views of the paper in any Google+ stream
qna_count          Number of mentions on Q&A platforms
policy_count       Number of mentions a paper gets in policy sources
reddit_count       Number of mentions a paper gets on Reddit
altmetric_score    A weighted measure of the online attention that research outputs such as scholarly articles and datasets receive

Table 1: Attributes and what they mean

 

We then plotted each feature against facebook_count (the number of likes) to see how similar they are. Next, we plotted histograms for the listed features and calculated the Pearson coefficient for each; based on our calculations, the Pearson coefficient between connotea_count and facebook_count is very low. We then took a few sample sizes (50, 100, 150, and 200 data points) and generated a heatmap of the distance matrix to see which data points are similar. The next section explains the algorithms applied to the dataset in detail.
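A sketch of these correlation and distance-matrix computations; seaborn for the heatmap and the 50-row sample are our own choices, and the CSV name carries over from the earlier assumed extraction step:

```python
# Sketch: Pearson coefficients against the target, plus a distance-matrix heatmap.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.spatial import distance_matrix

df = pd.read_csv("altmetric_sample.csv").dropna()

# Pearson coefficient of every feature against facebook_count
print(df.corr(method="pearson")["facebook_count"])

# Heatmap of pairwise distances between the first 50 data points
sample = df.head(50)
dist = distance_matrix(sample.values, sample.values)
sns.heatmap(dist)
plt.title("Distance matrix of 50 data points")
plt.show()
```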

 

Fig1: Heatmap

  •  ALGORITHMS

 

The algorithms we used to process, analyze, and predict from the data are linear regression and artificial neural networks. Since our target is continuous, we first performed linear regression; because the dataset is huge, we did not achieve a high rate of accuracy, which is why we also ran a neural network algorithm on the dataset. Our results and calculations are explained in detail in the later sections.

 

  • LINEAR REGRESSION

Linear regression is one of the most efficient regression analysis algorithms for fitting a predictive model to observed data, and it is the most commonly used type of predictive analysis. Regression estimates explain the relationship between one dependent variable and one or more independent variables. We split the dataset into training and test sets before running the regression, with facebook_count as the target variable, and imported the linear regression algorithm from the sklearn library. This approach is used because it models relationships through linear functions with unknown parameters estimated from the data. The regression is fit on the training set and then used to predict the target values for the test set. When we applied linear regression to the dataset, we obtained fair predictions, whose quality we measured using one of the standard ways to evaluate linear regression: the mean squared error (MSE). The root mean squared error [5] represents the sample standard deviation of the differences between predicted and observed values and is calculated with the following formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

The MSE was around 3.20, indicating that the predictions were fair, with little error or variation in the resulting predictions.

Fig2: Scatter plot with Facebook as Target.

We also ran linear regression on the dataset with the Altmetric score as the target variable, to check which target would give the best linear regression predictions.

Fig3: Scatter plot with Altmetric score as Target.

  • NEURAL NETWORKS

A neural network is a series of algorithms that recognizes underlying relationships in a dataset through a process that loosely mimics the human brain. Though complex, it can adapt to changing input so that the network generates the best outcome without redesigning the output criteria.

We implemented the neural network with Keras [6] on the data frame, with facebook_count as the target variable. In deep learning, an epoch [7] is a hyperparameter defined before training a model: one epoch is one full forward and backward pass of the entire dataset through the network. We used 50 epochs, meaning the model sees the entire training set 50 times; over this process the loss/error is considerably lowered.
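A sketch of such a network in Keras, assuming the cleaned CSV from the earlier steps; the two hidden layers and their sizes are our own illustrative choices, since the architecture is not specified above:

```python
# Sketch: a small Keras regression network trained for 50 epochs.
import matplotlib.pyplot as plt
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

df = pd.read_csv("altmetric_sample.csv").dropna()
y = df["facebook_count"].values
X = df.drop(columns=["facebook_count"]).values

model = Sequential([
    Dense(64, activation="relu", input_dim=X.shape[1]),
    Dense(32, activation="relu"),
    Dense(1),                                  # single continuous output
])
model.compile(optimizer="adam", loss="mse")

# One epoch = one full forward and backward pass over the whole dataset
history = model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)

# Plot loss against epoch (cf. the figure below)
plt.plot(history.history["loss"])
plt.xlabel("epoch")
plt.ylabel("loss (MSE)")
plt.show()
```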

Fig 4. A plot between epoch and loss

When we measured the mean squared error of the neural network, the error rate came down from 3.2 to 2.0. Thus, the neural network was an improvement, predicting the values with a reduced error rate.

 

  • CONCLUSION

 

In this article, we have focused on how social media data can be used to predict the popularity of a paper on a specific social media platform. Starting from the Altmetric data provided by NIU, we cleaned the data and converted it from JSON into CSV format for efficient processing and analysis. We then fit a linear regression model to predict the required values; the outcome was fairly convincing, as the predictions came out with good accuracy. This model was nevertheless outperformed by the artificial neural network model, trained with facebook_count as the target variable, which predicted the values with a considerably reduced mean squared error compared to the previous model.

Our main focus was to analyze the data efficiently and produce more accurate predictions using the linear regression and artificial neural network models. One of the problems we faced was making efficient predictions from semi-structured data, which we overcame with Python code that converts the data into the desired format. Further methods can be applied to this dataset to verify and check the accuracy of the predictions.

This paper describes how social media platforms can be used as a basis for predicting the popularity of papers across several other platforms.

 

  • REFERENCES

 

Project Overview Link:

https://github.com/KrithinKV/SocialMediaPopularity
