Word count, sentiment analysis, ngram analysis, X posts by this ID, word differences by days, and graphics.

4 Chan Webscraper, Version 2

Consider doing your own data analysis. If you save your CSV and open a pull request, I can add it to this repository for plotting word usage changes over time.

Highlights:

  • Written in R.
  • Objective: data mining text and displaying word frequencies.
  • Uses the following libraries: rvest, tidyverse, tidytext, ggplot2, wordcloud, tinytex, syuzhet, scales, reshape2, and dplyr.
  • If you don't have these installed, install them before running the script.
  • After installing them and running this script in the RStudio IDE, you can download all posts.
  • Downloaded posts are then processed to show word frequencies.
  • Differs from V1 by scraping all replies to the OP and by using a much larger noise filter.
  • Sentiment analysis is also performed.
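
As a rough illustration of the word-frequency step, the sketch below tokenizes a few made-up posts with tidytext and counts the words that survive a stop-word filter. The posts data frame and its column names are hypothetical, and the built-in stop_words list only stands in for the script's own, much larger noise filter.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for scraped posts; the real script pulls these from the board.
posts <- tibble(
  post_id = 1:3,
  text = c(
    "The market is up today",
    "Market news: the market fell",
    "Nothing new today"
  )
)

word_counts <- posts %>%
  unnest_tokens(word, text) %>%           # one lowercase token per row
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  count(word, sort = TRUE)                # frequency table, most common first

print(word_counts)
```

The same word_counts table could then feed ggplot2 or wordcloud for plotting.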

html_text vs html_text2 from rvest

I ran an experiment comparing the text output in tidy_pol_fixed2 for each function.

html_text = 21776 observations

html_text2 = 20004 observations

I will continue using html_text because it yields more observations, and I can filter out the noise later as needed. I noticed no substantial differences in the graphs, so retaining more observations seems better than retaining fewer.

From the rvest::html_text documentation:

There are two ways to retrieve text from a element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() which returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.

html_text2() is usually what you want, but it is much slower than html_text() so for simple applications where performance is important you may want to use html_text() instead.
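
The difference between the two functions can be seen on a small, made-up HTML fragment. The sketch below builds the document with rvest::minimal_html(); the fragment and variable names are illustrative and are not taken from the scraper.

```r
library(rvest)

# Made-up fragment for illustration; not real scraped content.
doc  <- minimal_html("<p>first line<br>second line</p>")
node <- html_element(doc, "p")

raw_text     <- html_text(node)   # raw underlying text; the <br> contributes nothing
browser_text <- html_text2(node)  # browser-like text; the <br> becomes a newline

print(raw_text)
print(browser_text)
```

In this fragment html_text() joins the words on either side of the line break with no separator, while html_text2() keeps them apart with a newline. Whether this accounts for any of the difference in observation counts above has not been checked; it is only something to keep in mind when tokenizing.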