From bc279e01f15a0bee0f4b2de7c63fcd4eabe4c932 Mon Sep 17 00:00:00 2001
From: Lucky <66523959+l-ucky@users.noreply.github.com>
Date: Thu, 24 Aug 2023 18:20:37 -0300
Subject: [PATCH] Update html_text vs html_text2.md

---
 html_text vs html_text2.md | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/html_text vs html_text2.md b/html_text vs html_text2.md
index ee3fd2a..139597f 100644
--- a/html_text vs html_text2.md
+++ b/html_text vs html_text2.md
@@ -1,16 +1,2 @@
-# html_text vs html_text2 from rvest
-I did an experiment comparing the `tidy_pol_fixed2` output of text.
-
-html_text = 21776 observations
-html_text2 = 20004 observations
-
-I will continue using html_text because it contains more observations, which I can later filter out the noise as needed.
-There were no substantial differences that I noticed in the graphs, so retaining a greater number of observations seems better than less.
-
-From the rvest::html_text website:
-
-There are two ways to retrieve text from a element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() which returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.
-
-html_text2() is usually what you want, but it is much slower than html_text() so for simple applications where performance is important you may want to use html_text() instead.