# pip install PyPDF2  (install PyPDF2; this listing uses the PdfReader API of PyPDF2 3.x)
from PyPDF2 import PdfReader

# Open the PDF file in binary mode.
with open("test.pdf", "rb") as pdf:
    # Create a PDF reader object.
    pdf_reader = PdfReader(pdf)

    # Check the total number of pages in the PDF file.
    print("Total number of Pages:", len(pdf_reader.pages))

    # Create a page object (zero-based index; it must be smaller than the page count).
    page = pdf_reader.pages[200]

    # Extract the text of that page.
    print(page.extract_text())
# pip install python-docx  (install python-docx)
import docx


def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word reader object.
        full_text = [para.text for para in doc.paragraphs]
        # Join the paragraphs with newlines.
        data = '\n'.join(full_text)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()
# pip install bs4  (install bs4)
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parse the downloaded HTML
soup = BeautifulSoup(webpage, 'html.parser')

# Format the parsed HTML
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract the title and a meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values
for x in soup.find_all('p'):
    print(x.text)
import json

import requests

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump data as string
data = json.dumps(res)
print(data)
import csv

with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the first (header) row
    for row in reader:
        print(row)
import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!"

# Method 1: Regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!']

# Remove stop words (a separate name avoids shadowing the imported module)
stop_words = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stop_words:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)
from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

# Correct the spelling mistakes in the text
output = TextBlob(data).correct()
print(output)
import nltk
from textblob import TextBlob

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

# Tokenize with NLTK (keeps punctuation) and with TextBlob (drops punctuation)
nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

# Stem every whitespace-separated word of each sentence
output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

# Lemmatize every whitespace-separated word (default part of speech: noun)
output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

# Lemmatize with an explicit part of speech: 'n' noun, 'v' verb
print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))

# 'a' adjective
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

# Plot the 25 most frequent of the remaining words.
data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
Output:
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
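The filter in the listing above tests word length, not how often a word occurs. A minimal sketch of a frequency-based filter follows, using only `collections.Counter` instead of NLTK. The function name `frequent_words`, the `min_count` threshold, and the sample tokens are made up for illustration and are not taken from `testing.txt`.

```python
from collections import Counter


def frequent_words(words, min_count=3):
    """Return {word: count} for words that occur more than min_count times."""
    counts = Counter(words)
    return {w: n for w, n in counts.items() if n > min_count}


# Hypothetical sample tokens (not taken from testing.txt).
sample = ["data"] * 5 + ["science"] * 4 + ["python"] * 2 + ["text"]

result = frequent_words(sample)
for word in sorted(result):
    print("%s: %s" % (word, result[word]))
```

The returned dictionary has the same shape as `filter_words`, so it could be passed to `nltk.FreqDist` in the same way.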
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plotting the wordcloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import webtext

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

# One point per occurrence: x is the word offset in the text, y the index of the target word.
points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

# Note: the 'Go' column is built from data2, so it has the same counts as 'Python'
# in the output; pass data3 here to vectorize the Go text instead.
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame (get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2)
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 2 2 2
application 0 1 0
are 1 0 1
bytecode 0 1 0
can 0 1 0
code 0 1 0
comes 1 0 1
compiled 0 1 0
derived 0 1 0
develops 0 1 0
for 0 2 0
from 0 1 0
functional 1 0 1
imperative 1 0 1
...
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

# Note: the 'Go' column is built from data2, so it has the same weights as 'Python'
# in the output; pass data3 here to vectorize the Go text instead.
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame (get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2)
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 0.323751 0.137553 0.323751
application 0.000000 0.116449 0.000000
are 0.208444 0.000000 0.208444
bytecode 0.000000 0.116449 0.000000
can 0.000000 0.116449 0.000000
code 0.000000 0.116449 0.000000
comes 0.208444 0.000000 0.208444
compiled 0.000000 0.116449 0.000000
derived 0.000000 0.116449 0.000000
develops 0.000000 0.116449 0.000000
for 0.000000 0.232898 0.000000
...
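To make the weights in the table above less opaque, here is a minimal pure-Python sketch of the weighting scheme that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document. The function name `tfidf` and the two short sample documents are assumptions for illustration. The whitespace tokenisation is also a simplification, since scikit-learn lowercases text and drops single-character tokens by default, so the numbers will not match the table.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalisation per document.

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # Scale the vector to unit Euclidean length (empty documents stay empty).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical sample documents, already tokenised on whitespace.
docs = [
    "java is a compiled language".split(),
    "python is a dynamic language".split(),
]

for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

Terms that appear in every document, such as "is", get the minimum IDF of 1, so terms unique to one document end up with the larger weights.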
Natural Language Toolkit: NLTK
import nltk
from nltk.util import ngrams


# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]


data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Text processing tool: TextBlob
from textblob import TextBlob


# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]


data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Output:
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
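The same sliding-window idea can be sketched without any library by zipping a token list against shifted copies of itself. This is only an illustration of the technique, not how NLTK or TextBlob implement it. The tokenisation here is a plain whitespace split with the trailing period stripped, which is an assumption that real tokenizers handle far more carefully.

```python
def extract_ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with single spaces."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


# Simplified tokenisation: strip the final period, then split on whitespace.
tokens = "A class is a blueprint for the object.".rstrip(".").split()

print("2-gram:", extract_ngrams(tokens, 2))
print("3-gram:", extract_ngrams(tokens, 3))
```

When `n` is larger than the number of tokens, the function returns an empty list.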
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize: ngram_range=(2, 2) counts bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame (get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2)
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Assembly Machine
also either 0 1
and or 0 1
are also 0 1
are readable 1 0
are still 1 0
assembly language 5 0
because each 1 0
but difficult 0 1
by computers 0 1
by people 0 1
can execute 0 1
...
from textblob import TextBlob

# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")

for nouns in blob.noun_phrases:
    print(nouns)
Output:
canada
northern part
america
import itertools

import nltk
import numpy as np
import pandas as pd
from nltk import bigrams


def generate_co_occurrence_matrix(corpus):
    vocab = list(set(corpus))
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # return the matrix and the index
    return co_occurrence_matrix, vocab_index


# Note: 'Python' 'used' in the second sentence has no comma, so Python joins the two
# literals into the single token 'Pythonused' (visible in the output). Add a comma
# to treat them as two words.
text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python' 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index, columns=vocab_index)
print(data_matrix)
Output:
best use What Where ... in is Python used
best 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Where 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Pythonused 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
Why 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
Python 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
[11 rows x 11 columns]
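For comparison, a dependency-free sketch of the same bigram counting can be written with `collections.Counter`. It uses the same four sample sentences with a comma between 'Python' and 'used', so the vocabulary has 10 words instead of 11 and the result is not identical to the output above. The helper name `bigram_counts` and the sorted vocabulary order are choices made for this illustration. As in the listing, the sentences are flattened into one list first, so pairs that span a sentence boundary are counted too.

```python
from collections import Counter
from itertools import chain


def bigram_counts(tokens):
    """Count (previous, current) pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Same sample sentences, with a comma between 'Python' and 'used'.
text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

tokens = list(chain.from_iterable(text_data))
counts = bigram_counts(tokens)
vocab = sorted(set(tokens))  # sorted so the row and column order is reproducible

# Rows are the current word and columns the previous word.
print(" " * 10 + " ".join(f"{w:>9}" for w in vocab))
for current in vocab:
    row = " ".join(f"{counts[(prev, current)]:>9}" for prev in vocab)
    print(f"{current:<10}{row}")
```

A `Counter` returns 0 for pairs that never occur, so no matrix needs to be initialised up front.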
from textblob import TextBlob


def sentiment(polarity):
    # Classify using the polarity that was passed in, not a global variable.
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")


blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
import goslate

text = "Comment vas-tu?"

gs = goslate.Goslate()

# Translate the French text into English, Chinese and German
translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)
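from textblob import TextBlob

# Note: detect_language() and translate() call the Google Translate service and
# have been deprecated and later removed in newer TextBlob releases, so this
# listing assumes an older TextBlob version.
blob = TextBlob("Comment vas-tu?")

print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))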
Output:
fr
¿Como estas tu?
How are you?
你好嗎?
from textblob import Word

text_word = Word('safe')

# Dictionary definitions of the word
print(text_word.definitions)

# Collect the lemma names of every synset as synonyms
synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
from textblob import Word

text_word = Word('safe')

# Collect the first antonym of every lemma that has one
antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
Output:
{'dangerous', 'out'}