Homeworld3.RU Forums


Text Mining With R

```r
# Count how often each word occurs, most frequent first
word_counts <- cleaned_austen %>%
  count(word, sort = TRUE)

word_counts %>% head(10)
```

```mermaid
graph LR
    A[Raw Text] --> B[Preprocessing] --> C[Tokenization] --> D[Stop Word Removal] --> E[Analysis] --> F[Visualization]
```

```r
library(tidyverse)
library(tidytext)
library(janeaustenr)

# Load sample text (Jane Austen's books)
austen_books <- austen_books()
head(austen_books)
```

3.2. Preprocessing & Tokenization

Tokenization splits text into meaningful units (words, sentences, n-grams). `tidytext` uses `unnest_tokens()`.
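The later snippets use a `tidy_austen` table that is not defined anywhere in this write-up. Below is a minimal sketch of how it could be built. It assumes the `text` and `book` columns that `austen_books()` returns, and it assumes the one-word-per-row output column is named `word`, which is what the stop word join expects.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Assumption: tidy_austen is simply the novels tokenized into one word per row.
tidy_austen <- austen_books() %>%
  unnest_tokens(word, text)  # output column `word`, input column `text`

# unnest_tokens() lowercases the tokens and strips punctuation by default,
# and it keeps the remaining columns (here: book).
head(tidy_austen)
```

Keeping `book` alongside each word is what lets later steps count and score words per novel.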

```r
sentiment_scores
```

```r
library(wordcloud)

# Word cloud of the 100 most frequent words
word_counts %>%
  with(wordcloud(word, n, max.words = 100, colors = brewer.pal(8, "Dark2")))
```

3.7. Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF highlights words that are frequent in one document but rare across the rest of the corpus, so they characterize that document.
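To make the weighting concrete, here is a small base R sketch on made-up counts. It assumes the definitions documented for `bind_tf_idf()`: term frequency is a word's count divided by the document's total word count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The `toy` data frame and its values are hypothetical.

```r
# Hypothetical word counts for two tiny "documents".
toy <- data.frame(
  doc  = c("A", "A", "B", "B"),
  word = c("ship", "the", "fleet", "the"),
  n    = c(2, 8, 5, 5),
  stringsAsFactors = FALSE
)

# tf: share of the document's words taken up by this word.
toy$tf <- toy$n / ave(toy$n, toy$doc, FUN = sum)

# idf: ln(number of documents / number of documents containing the word).
# Each (doc, word) pair appears once, so table() gives the document frequency.
n_docs   <- length(unique(toy$doc))
doc_freq <- table(toy$word)
toy$idf  <- log(n_docs / as.numeric(doc_freq[toy$word]))

toy$tf_idf <- toy$tf * toy$idf
toy[order(-toy$tf_idf), ]
```

Because "the" occurs in both documents, its idf is log(2/2) = 0, so its tf-idf is zero no matter how often it appears. Words found in only one document keep a positive score.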

1. Introduction

In the age of big data, most information exists as unstructured text: emails, social media posts, reviews, news articles, and research papers. Unlike numerical data, text cannot be directly fed into a statistical model. Text mining (or text analytics) is the process of transforming this free-form text into structured, quantifiable data for analysis, pattern discovery, and prediction.

```r
# TF-IDF per word within each book, highest scores first
tf_idf <- cleaned_austen %>%
  count(book, word) %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))

# Three most characteristic words per book
tf_idf %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 3)
```

4.1. N-grams (Pairs of Words)

```r
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))  # drop lines too short to form a bigram

# Count common bigrams
bigram_counts <- austen_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```

4.2. Topic Modeling (Latent Dirichlet Allocation)

Using `tidytext` + `topicmodels` to discover hidden themes.
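The write-up does not include code for this step, so the following is only a sketch of one way it might look. It assumes each novel is treated as one document and picks `k = 2` topics arbitrarily. The names `book_words`, `austen_dtm`, and `austen_lda` are made up for illustration.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(topicmodels)

# Word counts per book, with stop words removed.
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word)

# LDA() expects a document-term matrix rather than a tidy table.
austen_dtm <- book_words %>%
  cast_dtm(document = book, term = word, value = n)

# k = 2 is an arbitrary choice for illustration; the seed makes the fit repeatable.
austen_lda <- LDA(austen_dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta): top 5 words for each topic.
tidy(austen_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup()
```

With only six novels as documents the topics will be coarse. Splitting the books into chapters before counting would give the model more documents to work with.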

```r
# Using the bing lexicon (positive/negative)
bing_sent <- get_sentiments("bing")

sentiment_scores <- cleaned_austen %>%
  inner_join(bing_sent, by = "word") %>%
  count(book, sentiment) %>%  # cleaned_austen already carries a book column
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)
```

R is well suited to text mining. With a rich ecosystem of packages, most notably the `tidytext`, `quanteda`, and `tm` frameworks, R allows analysts to clean, tokenize, analyze sentiment, model topics, and visualize textual patterns efficiently.

This write-up outlines a reproducible workflow for text mining using R, emphasizing tidy data principles.

| Package | Purpose |
| :--- | :--- |
| `tidytext` | Converts text to tidy data frames (one token per row). Integrates with `dplyr`, `ggplot2`. |
| `dplyr` | Data manipulation (filter, group, mutate). |
| `ggplot2` | Visualization of text metrics (word frequencies, sentiment scores). |
| `janeaustenr` | Sample texts for practice. |
| `tidyverse` | Meta-package for data science. |
| `wordcloud` | Generates word clouds. |
| `quanteda` | Advanced text analysis (DFM, keywords-in-context). |
| `tm` | Classic text mining (corpus, term-document matrix). |

Installation: `install.packages(c("tidytext", "tidyverse", "janeaustenr", "wordcloud", "quanteda"))`

3. The Text Mining Workflow

A standard text mining pipeline in R runs from raw text through preprocessing, tokenization, and stop word removal to analysis and visualization.

```r
# Bar chart of words that occur more than 500 times
word_counts %>%
  filter(n > 500) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Frequent Words in Jane Austen's Novels", x = "Word", y = "Count") +
  theme_minimal()
```

Sentiment lexicons (e.g., `AFINN`, `bing`, `nrc`) assign emotional valence to words.
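As a rough illustration of how a lexicon is applied, the base R sketch below joins made-up tokens against a hypothetical four-word lexicon with the same two columns (`word`, `sentiment`) as the `bing` lexicon. The data, the `mini_lexicon` and `tokens` names, and the scoring rule (positive count minus negative count) are all assumptions for the example.

```r
# Hypothetical mini lexicon: one row per word with its label.
mini_lexicon <- data.frame(
  word      = c("happy", "love", "miserable", "angry"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Made-up tokens, one word per row.
tokens <- data.frame(
  word = c("she", "was", "happy", "and", "in", "love", "but", "he", "was", "angry"),
  stringsAsFactors = FALSE
)

# An inner join keeps only the words that appear in the lexicon.
scored <- merge(tokens, mini_lexicon, by = "word")

# Count each label (fixing the levels so a missing label counts as 0).
counts <- table(factor(scored$sentiment, levels = c("positive", "negative")))
net_sentiment <- counts[["positive"]] - counts[["negative"]]
net_sentiment
```

Words that are not in the lexicon simply drop out of the join, so the score reflects only the words the lexicon covers.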

```r
# Remove common stop words ("the", "of", "and", ...)
data(stop_words)

cleaned_austen <- tidy_austen %>%
  anti_join(stop_words, by = "word")
```

Count most common words:
