Lesson: Visualization of information in text documents

In the Russian-speaking part of the Internet there are very few practical tutorials on analyzing text messages in Russian (and even fewer with code examples). So I decided to gather some data and walk through an example of clustering, which conveniently requires no labeled training data.

Most of the libraries used are already included in the Anaconda 3 distribution, so I advise using it. Missing modules/libraries can be installed in the usual way via pip install <package name>.
We import the following libraries:

import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
import matplotlib.pyplot as plt
import matplotlib as mpl
Any data can be taken for analysis. I happened to come across this task: statistics of search queries from the State Expenditures project. The data needed to be broken down into three groups: private, government and commercial. I did not want to invent anything extraordinary, so I decided to check how clustering would behave in this case (looking ahead: not very well). Alternatively, you can download data from the wall of some VK public page:

import vk
# pass the access token to the session
session = vk.Session(access_token="")
# URL to get access_token; instead of tvoi_id insert the id of the created VK application:
# https://oauth.vk.com/authorize?client_id=tvoi_id&scope=friends,pages,groups,offline&redirect_uri=https://oauth.vk.com/blank.html&display=page&v=5.21&response_type=token
api = vk.API(session)

poss = []
id_pab = -59229916   # public page ids start with a minus, a user wall id has no minus
info = api.wall.get(owner_id=id_pab, offset=0, count=1)
kolvo = (info[0] // 100) + 1   # the first element of the response holds the total number of posts
shag = 100
sdvig = 0
h = 0

import time
while h < kolvo:
    if h > 70:
        print(h)   # not required, just to track roughly how close the process is to the end
    pubpost = api.wall.get(owner_id=id_pab, offset=sdvig, count=100)
    i = 1
    while i < len(pubpost):
        b = pubpost[i]["text"]
        poss.append(b)
        i = i + 1
    h = h + 1
    sdvig = sdvig + shag
    time.sleep(1)

len(poss)

import io
with io.open("public.txt", "w", encoding="utf-8", errors="ignore") as file:
    for line in poss:
        file.write("%s\n" % line)

titles = open("public.txt", encoding="utf-8", errors="ignore").read().split("\n")
print(str(len(titles)) + " posts read")

import re
posti = []
# remove HTML tags, punctuation marks and digits
for line in titles:
    chis = re.sub(r"(\<(/?[^>]+)>)", " ", line)
    chis = re.sub("[^a-zA-Zа-яА-Я ]", " ", chis)   # keep only letters; for Russian text the Cyrillic range must be included
    posti.append(chis)
I will use the search query data to show how poorly short text data clusters. I cleaned the text of special characters and punctuation in advance and replaced abbreviations with their full forms (for example, the abbreviation for "individual entrepreneur" was expanded). The result is a text with one search query per line.
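As a rough sketch of that preprocessing step (the file names and the abbreviation dictionary below are made up for illustration; the real cleaning was done in advance):

import re

# hypothetical abbreviation dictionary - extend it for your own data
abbreviations = {"ип": "индивидуальный предприниматель"}

cleaned = []
with open("raw_queries.txt", encoding="utf-8") as f:       # hypothetical input file, one raw query per line
    for line in f:
        q = line.strip().lower()
        q = re.sub(r"[^a-zA-Zа-яА-ЯёЁ ]", " ", q)          # drop punctuation, digits and other special characters
        q = " ".join(abbreviations.get(w, w) for w in q.split())   # expand known abbreviations
        if q:
            cleaned.append(q)

with open("material4.csv", "w", encoding="utf-8") as f:    # one possible arrangement: the file read in the next step
    f.write("\n".join(cleaned))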

We read the data into an array and proceed to normalization - reducing each word to its initial form. This can be done in several ways: with the Porter (Snowball) stemmer, the MyStem analyzer or PyMorphy2. A word of warning: MyStem works through a wrapper, so it is very slow. We will settle on the Porter stemmer, although nothing prevents you from using the others and combining them (for example, running PyMorphy2 first and then the Porter stemmer); a sketch of such a combination follows below.
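For illustration, a combined normalizer might lemmatize each token with PyMorphy2 and then stem the lemma with the Snowball ("Russian Porter") stemmer. This is only a sketch of the combination mentioned above, not the pipeline actually used later:

import pymorphy2
from nltk.stem.snowball import SnowballStemmer

morph = pymorphy2.MorphAnalyzer()
stemmer = SnowballStemmer("russian")

def normalize(text):
    # lemmatize with PyMorphy2, then stem the lemma with the Snowball stemmer
    result = []
    for word in text.split():
        lemma = morph.parse(word)[0].normal_form
        result.append(stemmer.stem(lemma))
    return result

print(normalize("государственные закупки оборудования"))
# roughly: ['государствен', 'закупк', 'оборудован']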

Titles = open ("material4.csv", "r", encoding = "utf-8", errors = "ignore"). Read (). Split ("\ n") print (str (len (titles)) + "requests read") from nltk.stem.snowball import SnowballStemmer stemmer = SnowballStemmer ("russian") def token_and_stem (text): tokens = filtered_tokens = for token in tokens: if re.search ("[a-za-z]" , token): filtered_tokens.append (token) stems = return stems def token_only (text): tokens = filtered_tokens = for token in tokens: if re.search ("[a-za-z]", token): filtered_tokens.append (token) return filtered_tokens # Create dictionaries (arrays) from the resulting basics totalvocab_stem = totalvocab_token = for i in titles: allwords_stemmed = token_and_stem (i) #print (allwords_stemmed) totalvocab_stem.extend (allwords_stemmed) allwords_tokenized = towords_tokenized = towords_tokenized = allwords_tokenized)

Pymorphy2

import pymorphy2
morph = pymorphy2.MorphAnalyzer()

G = []
for i in titles:
    h = i.split(" ")
    s = ""
    for k in h:
        p = morph.parse(k)[0].normal_form   # normal form of the most probable parse
        s += " "
        s += p
    G.append(s)

pymof = open("pymof_pod.txt", "w", encoding="utf-8", errors="ignore")
pymofcsv = open("pymofcsv_pod.csv", "w", encoding="utf-8", errors="ignore")
for item in G:
    pymof.write("%s\n" % item)
    pymofcsv.write("%s\n" % item)
pymof.close()
pymofcsv.close()


pymystem3

The analyzer executable files for the current operating system will be automatically downloaded and installed when the library is used for the first time.

from pymystem3 import Mystem
m = Mystem()

A = []
for i in titles:
    lemmas = m.lemmatize(i)
    A.append(lemmas)

# This array can be saved to a file, i.e. "pickled"
import pickle
with open("mystem.pkl", "wb") as handle:
    pickle.dump(A, handle)


Let's create a matrix of TF-IDF weights. We treat each search query as a document (the same is done when analyzing Twitter posts, where each tweet is a document). We take TfidfVectorizer from the sklearn package and the stop words from the nltk corpus (the first time you will have to download them via nltk.download()). The parameters can be adjusted as you see fit - from the upper and lower document-frequency bounds to the size of the n-grams (here we use up to trigrams).
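To get a feel for what the vectorizer produces, here is a tiny standalone sketch on three made-up queries (not the real data):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["ремонт дорог", "ремонт школы", "строительство школы"]   # made-up queries
vec = TfidfVectorizer()
m = vec.fit_transform(toy)

print(vec.get_feature_names_out())   # the learned vocabulary (get_feature_names() on older sklearn versions)
print(m.shape)                       # (3 documents, number of distinct terms)
print(m.toarray())                   # TF-IDF weight of every term in every document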

stopwords = nltk.corpus.stopwords.words("russian")
# the list of stop words can be extended
stopwords.extend(["что", "это", "так", "вот", "быть", "как", "в", "к", "на"])

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

n_featur = 200000
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000,
                                   min_df=0.01, stop_words=stopwords,
                                   tokenizer=token_and_stem, ngram_range=(1, 3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(titles)
print(tfidf_matrix.shape)
On the resulting matrix, we begin to apply various clustering methods:

num_clusters = 5

# K-Means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=num_clusters)
%time km.fit(tfidf_matrix)
idx = km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
print(km.labels_)

# MiniBatchKMeans
from sklearn.cluster import MiniBatchKMeans
mbk = MiniBatchKMeans(init="random", n_clusters=num_clusters)   # init = "k-means++", "random" or an ndarray
mbk.fit_transform(tfidf_matrix)
%time mbk.fit(tfidf_matrix)
miniclusters = mbk.labels_.tolist()
print(mbk.labels_)

# DBSCAN
from sklearn.cluster import DBSCAN
%time db = DBSCAN(eps=0.3, min_samples=10).fit(tfidf_matrix)
labels = db.labels_
labels.shape
print(labels)

# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
agglo1 = AgglomerativeClustering(n_clusters=num_clusters, affinity="euclidean")
# affinity: you can choose any or try them in turn: cosine, l1, l2, manhattan
%time answer = agglo1.fit_predict(tfidf_matrix.toarray())
answer.shape
The resulting labels can be gathered into a dataframe, after which it is easy to count how many queries fell into each cluster.

# k-means
clusterkm = km.labels_.tolist()
# minikmeans
clustermbk = mbk.labels_.tolist()
# dbscan
clusters3 = labels
# agglo
# clusters4 = answer.tolist()

frame = pd.DataFrame(titles, index=clusterkm, columns=["title"])

# k-means
out = {"title": titles, "cluster": clusterkm}
frame1 = pd.DataFrame(out, index=clusterkm, columns=["title", "cluster"])

# mini
out = {"title": titles, "cluster": clustermbk}
frame_minik = pd.DataFrame(out, index=clustermbk, columns=["title", "cluster"])

frame1["cluster"].value_counts()
frame_minik["cluster"].value_counts()
Because of the large number of queries, tables are not very convenient to inspect, and some interactivity would help understanding. Therefore we will plot the positions of the queries relative to each other.

First we need to calculate the distances between the vectors; for this, the cosine distance is used. Tutorials suggest subtracting the similarity from one so that there are no negative values and everything stays in the range from 0 to 1, so we will do the same:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
dist.shape
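A quick check of that claim on two toy non-negative vectors (TF-IDF vectors are non-negative, so their cosine similarity lies in [0, 1] and the subtraction keeps the distances in [0, 1] as well):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 0.0],
              [1.0, 1.0]])        # two toy non-negative vectors
sim = cosine_similarity(a)        # similarities: 1.0 on the diagonal, ~0.71 off-diagonal
print(1 - sim)                    # distances: 0.0 on the diagonal, ~0.29 off-diagonal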
Since the plots will be two- or three-dimensional while the original distance matrix is n-dimensional, we have to apply a dimensionality reduction algorithm. There are many algorithms to choose from (MDS, PCA, t-SNE), but let's opt for Incremental PCA. This choice came from practical experience: I tried MDS and PCA, but 8 GB of RAM was not enough, and once the paging file came into play the computer might as well have been sent for a reboot.

The Incremental PCA algorithm is used as a replacement for principal component analysis (PCA) when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation of the input data using an amount of memory that does not depend on the number of input samples.
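A sketch of that property (assuming the dense distance matrix dist built above): feeding the data in chunks via partial_fit means only one chunk has to be processed at a time; the chunk size here is arbitrary:

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
batch = 1000                                        # arbitrary chunk size (must be >= n_components)
for start in range(0, dist.shape[0], batch):
    ipca.partial_fit(dist[start:start + batch])     # learn the components chunk by chunk
reduced = ipca.transform(dist)                      # project all rows onto the 2 components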

# Principal component analysis - PCA
from sklearn.decomposition import IncrementalPCA
icpa = IncrementalPCA(n_components=2, batch_size=16)
%time icpa.fit(dist)
%time demo2 = icpa.transform(dist)
xs, ys = demo2[:, 0], demo2[:, 1]

# PCA 3D
from sklearn.decomposition import IncrementalPCA
icpa = IncrementalPCA(n_components=3, batch_size=16)
%time icpa.fit(dist)
%time ddd = icpa.transform(dist)
xs, ys, zs = ddd[:, 0], ddd[:, 1], ddd[:, 2]

# You can immediately get a rough idea of what the result will look like
# from mpl_toolkits.mplot3d import Axes3D
# fig = plt.figure()
# ax = fig.add_subplot(111, projection="3d")
# ax.scatter(xs, ys, zs)
# ax.set_xlabel("X")
# ax.set_ylabel("Y")
# ax.set_zlabel("Z")
# plt.show()
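To estimate how much information survives the reduction to two or three components, you can look at the fitted model's explained variance (low values mean the 2D/3D picture is only a rough projection):

print(icpa.explained_variance_ratio_)         # fraction of variance carried by each retained component
print(icpa.explained_variance_ratio_.sum())   # total fraction retained by the projection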
Let's go directly to the visualization itself:

from matplotlib import rc
# enable Russian characters on the plot
font = {"family": "Verdana"}   # , "weight": "normal"
rc("font", **font)

# you can generate random colors for the clusters
import random
def generate_colors(n):
    color_list = []
    for c in range(0, n):
        r = lambda: random.randint(0, 255)
        color_list.append("#%02X%02X%02X" % (r(), r(), r()))
    return color_list

# set the colors
cluster_colors = {0: "#ff0000", 1: "#ff0066", 2: "#ff0099", 3: "#ff00cc", 4: "#ff00ff"}

# give names to the clusters; because of the randomness, let them be just 0-4
cluster_names = {0: "0", 1: "1", 2: "2", 3: "3", 4: "4"}

# %matplotlib inline

# create a data frame that contains the coordinates (from PCA), the cluster numbers and the queries themselves
df = pd.DataFrame(dict(x=xs, y=ys, label=clusterkm, title=titles))

# group by clusters
groups = df.groupby("label")

fig, ax = plt.subplots(figsize=(72, 36))   # choose figsize to taste
for name, group in groups:
    ax.plot(group.x, group.y, marker="o", linestyle="", ms=12,
            label=cluster_names[name], color=cluster_colors[name], mec="none")
ax.set_aspect("auto")
ax.tick_params(axis="x", which="both", bottom="off", top="off", labelbottom="off")
ax.tick_params(axis="y", which="both", left="off", top="off", labelleft="off")
ax.legend(numpoints=1)   # show only one point in the legend

# add labels with the search query at each x, y position
# for i in range(len(df)):
#     ax.text(df.ix[i]["x"], df.ix[i]["y"], df.ix[i]["title"], size=6)

# show the plot
plt.show()
plt.close()
If you uncomment the line with the addition of names, then it will look something like this:

Example with 10 clusters


Not exactly what one would expect. Let's use mpld3 to translate the figure into an interactive graph.

# Interactive plot for the MiniBatchKMeans result
# groups_mbk mirrors the "groups" variable above, but is built from the MiniBatchKMeans labels
df_mbk = pd.DataFrame(dict(x=xs, y=ys, label=clustermbk, title=titles))
groups_mbk = df_mbk.groupby("label")

fig, ax = plt.subplots(figsize=(25, 27))
ax.margins(0.03)
for name, group in groups_mbk:
    points = ax.plot(group.x, group.y, marker="o", linestyle="", ms=12,   # ms=18
                     label=cluster_names[name], mec="none", color=cluster_colors[name])
    ax.set_aspect("auto")
    labels = [i for i in group.title]
    tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels, voffset=10, hoffset=10)   # , css=css
    mpld3.plugins.connect(fig, tooltip)   # , TopToolbar()
ax.axes.get_xaxis().set_ticks([])
ax.axes.get_yaxis().set_ticks([])
# ax.axes.get_xaxis().set_visible(False)
# ax.axes.get_yaxis().set_visible(False)
ax.set_title("Mini K-Means", size=20)
ax.legend(numpoints=1)
mpld3.disable_notebook()
# mpld3.display()
mpld3.save_html(fig, "mbk.html")
mpld3.show()
# mpld3.save_json(fig, "vivod.json")
# mpld3.fig_to_html(fig)

# A standalone mpld3 demo with random points
N = 100   # N was not defined in the original listing; any number of demo points will do
fig, ax = plt.subplots(figsize=(51, 25))
scatter = ax.scatter(np.random.normal(size=N), np.random.normal(size=N),
                     c=np.random.random(size=N), s=1000 * np.random.random(size=N),
                     alpha=0.3, cmap=plt.cm.jet)
ax.grid(color="white", linestyle="solid")
ax.set_title("Clusters", size=20)
labels = ["point {0}".format(i + 1) for i in range(N)]
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
mpld3.plugins.connect(fig, tooltip)
mpld3.show()

# The same kind of interactive plot for the K-Means result
fig, ax = plt.subplots(figsize=(72, 36))
for name, group in groups:
    points = ax.plot(group.x, group.y, marker="o", linestyle="", ms=18,
                     label=cluster_names[name], mec="none", color=cluster_colors[name])
    ax.set_aspect("auto")
    labels = [i for i in group.title]
    tooltip = mpld3.plugins.PointLabelTooltip(points[0], labels=labels)
    mpld3.plugins.connect(fig, tooltip)
ax.set_title("K-means", size=20)
mpld3.display()
Now, when you hover over any point on the chart, a text pops up with the corresponding search query. An example of a finished html file can be viewed here: Mini K-Means

If you want 3D and an adjustable scale, there is the Plotly service, which has a Python library.

Plotly 3D

# for example, just a 3D plot of the obtained values
import plotly
plotly.__version__
import plotly.plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter3d(x=xs, y=ys, z=zs, mode="markers",
                      marker=dict(size=12,
                                  line=dict(color="rgba(217, 217, 217, 0.14)", width=0.5),
                                  opacity=0.8))
data = [trace1]
layout = go.Layout(margin=dict(l=0, r=0, b=0, t=0))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="cluster-3d-plot")


The results can be seen here: Example

And the final step is to perform hierarchical (agglomerative) clustering by the Ward method and build a dendrogram.

from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(15, 20))
ax = dendrogram(linkage_matrix, orientation="right", labels=titles)

plt.tick_params(axis="x", which="both", bottom="off", top="off", labelbottom="off")
plt.tight_layout()

# save the picture
plt.savefig("ward_clusters2.png", dpi=200)
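The walkthrough stops at the picture, but if you also want flat clusters from the same linkage matrix, scipy's fcluster can cut the dendrogram. A small sketch, cutting into 5 groups to match num_clusters above:

from scipy.cluster.hierarchy import fcluster

ward_labels = fcluster(linkage_matrix, t=5, criterion="maxclust")   # cut the tree into 5 flat clusters
print(pd.Series(ward_labels).value_counts())                        # cluster sizes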
Conclusions

Unfortunately, there are still many open problems in natural language processing, and far from all data can be easily and neatly separated into groups. But I hope this walkthrough raises interest in the topic and provides a basis for further experiments.

Keywords:

  • numbered lists
  • bulleted lists
  • multilevel lists
  • table
  • graphic images

It is known that textual information is perceived by a person better if it is visualized - organized in the form of lists, tables, diagrams, provided with illustrations (photographs, drawings, diagrams). Modern word processors provide users with ample opportunities to visualize information in the documents they create.

4.4.1. Lists

All kinds of enumerations in documents are formatted as lists. Each item of a list is treated as a paragraph formatted according to a single pattern.

According to the design method, numbered and bulleted lists are distinguished.

Items of a numbered list are designated by sequential numbers, which can be written in Arabic or Roman numerals. List items can also be numbered with letters - Russian or Latin (Fig. 4.14).

Fig. 4.14.
Examples of numbered lists

It is customary to use a numbered list when the order of the items is important. These lists are especially often used to describe a sequence of actions. You regularly create numbered lists by filling out the lesson schedule for each school day in your diary.

When you create new, delete, or move existing numbered list items in the word processor, the entire list is renumbered automatically.

Items of a bulleted list are marked with bullet symbols. The user can choose any character of the computer alphabet as the bullet, or even a small graphic image (Fig. 4.15). A bulleted list is used, for example, to format the keywords at the beginning of each section of your textbook.

Fig. 4.15.
Examples of bulleted lists

A bulleted list is used when the order of the elements in it is not important. For example, in the form of a bulleted list, you can arrange a list of subjects you study in grade 8.

By structure, single-level and multi-level lists are distinguished.

The lists in the examples discussed above have a single-level structure.

A list whose element is itself a list is called multilevel. So, the table of contents of your computer science textbook is a multilevel (three-level) list.

Lists are created in a word processor using menu commands or buttons on the formatting toolbar (Fig. 4.16).

Fig. 4.16.
Tools for creating lists

4.4.2. Tables

To describe a number of objects that have the same sets of properties, tables consisting of columns and rows are most often used. You are well familiar with the tabular form of the lesson timetable; bus, airplane and train schedules and much more are also presented in tabular form.

The information presented in a table is clear, compact and easy to survey.

A properly formatted table has the structure shown in Fig. 4.17.

Fig. 4.17.
Table structure

The following rules for table design must be observed:

  1. The heading of the table should give an idea of ​​the information it contains.
  2. Column and row headings should be short and free of unnecessary words and, if possible, abbreviations.
  3. Units of measurement should be indicated in the table. If they are common for the entire table, then they are indicated in the table heading (either in brackets or separated by a comma after the name). If the units of measure differ, they are indicated in the heading of the corresponding row or column.
  4. It is desirable that all table cells are filled. If necessary, the following symbols are entered into them:

      ? - the data is unknown;

      × - the data is impossible;

      ↓ - the data should be taken from the cell above.

The cells of tables can contain texts, numbers, images. An example of a table is shown in Fig. 4.18.

Fig. 4.18.
Example of a table

You can create a table using the appropriate menu command or toolbar button, specifying the required number of columns and rows; in some word processors a table can also be "drawn". The created table can be edited by changing column widths and row heights, adding and removing columns and rows, and merging and splitting cells. Information can be entered into the cells by typing from the keyboard or by copying and pasting previously prepared fragments. Word processors can also automatically convert existing text into a table.

You can customize the appearance of the table yourself by choosing the type, width and color of the cell borders, the background color of the cells, and formatting the contents of the cells. In addition, you can format the table automatically.

4.4.3. Graphic images

Modern word processors allow you to include in documents various graphic images created by the user in other programs or found on the Internet. Ready-made graphic images can be edited by changing their size, primary colors, brightness and contrast, as well as by rotating and overlapping them, and so on.

Many word processors have the ability to directly create graphic images from sets of autoshapes (graphic primitives). It is also possible to create colorful inscriptions using built-in text effects.

You can visualize the numerical information contained in a table using charts, the creation tools for which are also included in word processors.

The most powerful word processors allow you to build different types of graphic schemes (Fig. 4.19), which provide visualization of text information.

Fig. 4.19. Types of graphic schemes in the word processor Microsoft Word

The most important thing

It is known that textual information is perceived by a person better if it is visualized - organized in the form of lists, tables, diagrams, provided with illustrations (photographs, drawings, diagrams).

All kinds of enumerations in documents are formatted as lists. By formatting, lists are divided into numbered and bulleted. A numbered list is customarily used when the order of the items is important; a bulleted list - when the order of the items does not matter. By structure, lists are divided into single-level and multilevel.

To describe a number of objects with the same set of properties, tables consisting of columns and rows are most often used. The information presented in a table is clear, compact and easy to survey.

Modern word processors provide the ability to include, process, and create graphic objects.

Questions and tasks

  1. For what purpose do developers include lists, tables, graphics in text documents?
  2. What are lists used for? Give examples.
  3. Compare numbered and bulleted lists. What do they have in common? What is the difference?
  4. What is a multilevel list? Give an example of such a list.
  5. What information can be organized in tabular form? What are the benefits of tabular presentation of information?
  6. What rules should be followed when designing tables?
  7. What graphics can be included in a text document?
  8. List the main features of word processors for working with graphic objects.
