Word count, sentiment analysis, ngram analysis, X posts by this ID, word differences by days, and graphics.

Lucky ea2428f45b Delete Aug 25 2023 10:51:42.csv		2023-08-25 14:36:43 -03:00
4ChanScraperv2.R	Update 4ChanScraperv2.R	2023-08-24 16:33:37 -03:00
Aug 24 2023 18:11:19.csv	August 24th Scrape	2023-08-25 14:36:34 -03:00
Difference Between Data Frame Observations by Day.R	Rewrote entire script to show top positive, and negative changes. See PDF	2023-08-25 14:27:46 -03:00
Difference-between-Days-X,-and-Y.pdf	PDF Example	2023-08-25 14:34:14 -03:00
Pol-Scrape-Example-Version2.pdf	Add files via upload	2023-08-24 01:28:07 -03:00
README.md	Update README.md	2023-08-24 18:58:04 -03:00
pol sentiment scores.png	Sentiment Scores Example	2023-08-24 18:29:44 -03:00

4chan Web Scraper, Version 2

Consider doing your own data analysis. If you save your CSV and open a pull request, I can add it to this repository to plot changes in word usage over time.

Highlights:

Written in R.
Objective: Mine text and display word frequencies.
Uses the following libraries: rvest, tidyverse, tidytext, ggplot2, wordcloud, tinytex, syuzhet, scales, reshape2, and dplyr.
If these packages are not installed in your R library, install them first.
After installing them and running this script in the RStudio IDE, you can download all posts.
Downloaded posts are then manipulated to show word frequencies.
Differs from V1 by scraping all replies to the OP and applying a much larger noise filter.
Sentiment analysis is also performed.
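
The package requirement above can be checked before running the script. This is a minimal sketch and not part of the repository's script: the helper name `missing_packages` is hypothetical, and the install call is left commented out so that nothing is installed unless you choose to.

```r
# Packages named in the Highlights list above.
required <- c("rvest", "tidyverse", "tidytext", "ggplot2", "wordcloud",
              "tinytex", "syuzhet", "scales", "reshape2", "dplyr")

# Hypothetical helper: return the entries of `pkgs` whose namespace
# cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means every listed package is already available.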

html_text vs html_text2 from rvest

I ran an experiment comparing the number of observations in tidy_pol_fixed2 when the text is extracted with each function.

html_text = 21776 observations

html_text2 = 20004 observations

I will continue using html_text because it yields more observations, and I can filter out the noise later as needed. I noticed no substantial differences in the graphs, so retaining more observations seems better than retaining fewer.

From the rvest html_text() documentation:

There are two ways to retrieve text from a element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() which returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.

html_text2() is usually what you want, but it is much slower than html_text() so for simple applications where performance is important you may want to use html_text() instead.
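
The documented difference can be seen on a small inline fragment. This is a minimal sketch, assuming rvest 1.0.0 or later is installed. The HTML string is made up for illustration and is not taken from the scraped data.

```r
library(rvest)

# Made-up HTML fragment (an assumption for illustration, not scraped data).
page <- minimal_html("<p>First line<br>second line</p><p>Another paragraph</p>")

# Select the first <p> element only.
first_p <- html_element(page, "p")

raw_text     <- html_text(first_p)   # raw underlying text, no layout applied
browser_text <- html_text2(first_p)  # browser-like text: <br> becomes "\n"

print(raw_text)
print(browser_text)

# Check whether each result contains a line break.
print(grepl("\n", raw_text, fixed = TRUE))
print(grepl("\n", browser_text, fixed = TRUE))
```

Because html_text() joins the text on either side of a `<br>` with no separator, adjacent words can end up glued together. That is worth keeping in mind when comparing token counts between the two functions.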