Posts

Scraping Real-Time Stock Prices with Python

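How to get real-time stock prices

Go to the basic market information site provided by the Taiwan Stock Exchange (TWSE) and enter a stock code in the upper right, for example 2330. The page shows the day's high, low, traded price and volume, the best five bid/ask levels, and so on. Right-click the page, choose Inspect to open DevTools, and switch to the Network tab. Watching the page, you will see it repeatedly GET a URL whose name starts with getStockInfo, which should be the information we want.

    import requests

    url = "https://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=tse_2330.tw"
    res = requests.get(url)
    res.json()

This returns neatly structured JSON:

    {'queryTime': {'stockInfoItem': 4329, 'sessionKey': 'tse_2330.tw_20200908|', 'sessionStr': 'UserSession', 'sysDate': '20200908', 'sessionFromTime': -1, 'stockInfo': 2084673, 'showChart': False, 'sessionLatestTime': -1, 'sysTime': '12:05:35'},
     'referer': '', 'rtmessage': 'OK', 'exKey': 'if_tse_2330.tw_zh-tw.null',
     'msgArray': [{'n': '台積電', 'g': '281_174_260_166_385_', 'u': '468.5000', 'mt': '060262', 'o': '428.0000', 'ps': '593', 'tk0': '2330.tw_tse_20200908_B_9998775018', 'a': '430.5000_431.0000_431.5000_432.0000_432.5000_', 'tlong': '1599537930000', 't': '12:05:30', 'it': '12', 'ch': '2330.tw', 'b': '430.0000_429.5000_429.0000_428.5000_428.0000_', 'f': '143_239_162_400_391_', 'w': '383.5000', 'pz': '428.0000', 'l': '427.5000', 'c': '2330', 'v': '16843', 'd': '20200908', 'tv': '-', 'tk1': '2330.tw_tse_20200908_B_9998774678', 'ts': '0', 'nf': '台灣積體電路製造股份有限公司', 'y': '426.0000', 'p': '0', 'i': '24', 'ip': '0', 'z': '-', 's': '-', 'h': '433.0000', 'ex': 'tse'}],
     'userDelay': 5000, 'rtcode': '0000', 'cachedAlive': 7891}

When scraping a URL, do not remove query parameters unless you have confirmed what difference they make. If the request fails, the Request Headers are the first thing to check. With some understanding and experimentation, compare the fields to find the ones we need.

u: limit-up price
w: limit-down price (v, by contrast, is the cumulative volume, not a price)
z: traded price for the current tick; sometimes missing
s: traded volume for the current tick; sometimes missing as well. When cleaning the data, you can filter on whether z and s are present.
a: best five ask prices
f: best five ask volumes
l: day's low
h: day's high
…..

You can explore the other fields yourself. If you only want to track the state of one stock, for example whether there is heavy volume during the session, just GET the URL repeatedly and inspect the JSON. If you want current information for more stocks and want to store it, you will need a dataframe. Below is one way to fetch several stock prices during the session. ...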
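As a minimal sketch of the filtering idea above, the snippet below parses one `msgArray` entry and treats `'-'` in `z` or `s` as "no trade in this tick". The helper names (`to_number`, `split_levels`, `parse_quote`) are hypothetical, and the sample dict is a subset of the response shown above so that the code runs without a network call.

```python
def to_number(value):
    """Convert a field to float; the API uses '-' when a value is missing."""
    return None if value in ("-", "", None) else float(value)


def split_levels(raw):
    """Split an underscore-delimited string such as '430.5000_431.0000_'."""
    return [float(part) for part in raw.split("_") if part]


def parse_quote(msg):
    """Pick out a few fields from one msgArray entry (hypothetical helper)."""
    return {
        "code": msg["c"],
        "name": msg["n"],
        "last_price": to_number(msg.get("z")),
        "last_volume": to_number(msg.get("s")),
        "high": to_number(msg.get("h")),
        "low": to_number(msg.get("l")),
        "ask_prices": split_levels(msg.get("a", "")),
        "ask_volumes": [int(v) for v in split_levels(msg.get("f", ""))],
    }


# Subset of the response shown above, copied as-is.
sample = {
    "n": "台積電", "c": "2330", "z": "-", "s": "-",
    "h": "433.0000", "l": "427.5000",
    "a": "430.5000_431.0000_431.5000_432.0000_432.5000_",
    "f": "143_239_162_400_391_",
}

quote = parse_quote(sample)
# Keep a record only when both the tick price and tick volume are present.
has_trade = quote["last_price"] is not None and quote["last_volume"] is not None
print(quote["code"], quote["high"], quote["low"], has_trade)
```

In a polling loop, records where `has_trade` is false could simply be skipped before they are stored.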

Google NLP API parsing

This post uses the API provided by Google for syntactic analysis. Syntactic analysis extracts linguistic information: it splits an article into sentences, then splits the sentences into smaller tokens for further analysis. The Google NLP API returns the part of speech of each token and the relationships between tokens.

Analyzing syntax

In GCP, create an API key and confirm that the NLP API status is enabled. For detailed GCP setup steps, see the official documentation (or perhaps I will write about it later).

Figure: API Enabled

Since this is an introduction, I use Google Cloud Shell; in everyday use you can replace some steps with your preferred language and IDE. Add an environment variable:

    export API_KEY=<YOUR_KEY>

After confirming it is set, create the JSON file containing the text to send to the API.

text.json

    {
      "document":{
        "type":"PLAIN_TEXT",
        "content": "Beirut rescuers search the site for possible survivor 30 days after the explosion."
      },
      "encodingType": "UTF8"
    }

Details on the standard JSON input: https://cloud.google.com/natural-language/docs

POST the data with curl:

    curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
      -s -X POST -H "Content-Type: application/json" --data-binary @text.json

This returns the parsed information:

    {
      "sentences": [
        {
          "text": {
            "content": "Beirut rescuers search the site for possible survivor 30 days after the explosion.",
            "beginOffset": 0
          }
        }
      ],
      "tokens": [
        {
          "text": { "content": "Beirut", "beginOffset": 0 },
          "partOfSpeech": {
            "tag": "NOUN", "aspect": "ASPECT_UNKNOWN", "case": "CASE_UNKNOWN",
            "form": "FORM_UNKNOWN", "gender": "GENDER_UNKNOWN", "mood": "MOOD_UNKNOWN",
            "number": "SINGULAR", "person": "PERSON_UNKNOWN", "proper": "PROPER",
            "reciprocity": "RECIPROCITY_UNKNOWN", "tense": "TENSE_UNKNOWN", "voice": "VOICE_UNKNOWN"
          },
          "dependencyEdge": { "headTokenIndex": 1, "label": "NN" },
          "lemma": "Beirut"
        },
        {
          "text": { "content": "rescuers", "beginOffset": 7 },
          "partOfSpeech": {
            "tag": "NOUN", "aspect": "ASPECT_UNKNOWN", "case": "CASE_UNKNOWN",
            "form": "FORM_UNKNOWN", "gender": "GENDER_UNKNOWN", "mood": "MOOD_UNKNOWN",
            "number": "PLURAL", "person": "PERSON_UNKNOWN", "proper": "PROPER_UNKNOWN",
            "reciprocity": "RECIPROCITY_UNKNOWN", "tense": "TENSE_UNKNOWN", "voice": "VOICE_UNKNOWN"
          },
          "dependencyEdge": { "headTokenIndex": 2, "label": "NSUBJ" },
          "lemma": "rescuer"
        },
        {
          "text": { "content": "search", "beginOffset": 16 },
          "partOfSpeech": {
            "tag": "VERB", "aspect": "ASPECT_UNKNOWN", "case": "CASE_UNKNOWN",
            "form": "FORM_UNKNOWN", "gender": "GENDER_UNKNOWN", "mood": "INDICATIVE",
            "number": "NUMBER_UNKNOWN", "person": "PERSON_UNKNOWN", "proper": "PROPER_UNKNOWN",
            "reciprocity": "RECIPROCITY_UNKNOWN", "tense": "PRESENT", "voice": "VOICE_UNKNOWN"
          }
        }
        ......
      ],
      "language": "en"
    }

Let's look at the result above ...
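One way to read a response like this is to follow each token's `dependencyEdge.headTokenIndex` back into the `tokens` list. The sketch below does that on a trimmed copy of the response above; `summarize_tokens` is a hypothetical helper, and the third token has no `dependencyEdge` only because the output above is cut off at that point.

```python
# Trimmed copy of the response above: only the fields used below are kept.
response = {
    "tokens": [
        {"text": {"content": "Beirut", "beginOffset": 0},
         "partOfSpeech": {"tag": "NOUN"},
         "dependencyEdge": {"headTokenIndex": 1, "label": "NN"},
         "lemma": "Beirut"},
        {"text": {"content": "rescuers", "beginOffset": 7},
         "partOfSpeech": {"tag": "NOUN"},
         "dependencyEdge": {"headTokenIndex": 2, "label": "NSUBJ"},
         "lemma": "rescuer"},
        {"text": {"content": "search", "beginOffset": 16},
         "partOfSpeech": {"tag": "VERB"}},
    ],
    "language": "en",
}


def summarize_tokens(tokens):
    """Return (word, POS tag, dependency label, head word) for each token."""
    rows = []
    for token in tokens:
        edge = token.get("dependencyEdge")
        label = head_word = None
        if edge is not None:
            label = edge["label"]
            # headTokenIndex is an index into the same tokens list.
            head_word = tokens[edge["headTokenIndex"]]["text"]["content"]
        rows.append((token["text"]["content"],
                     token["partOfSpeech"]["tag"], label, head_word))
    return rows


rows = summarize_tokens(response["tokens"])
for word, tag, label, head in rows:
    print(f"{word:<10} {tag:<5} {label} -> {head}")
```

Read this way, "Beirut" modifies "rescuers" (label NN), and "rescuers" is the subject (NSUBJ) of "search".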

Before Data processing: ELT

Before ELT: ETL

ETL stands for Extract, Transform, and Load. Historically, ETL has been the established, reliable way to migrate data from one database to another. In addition to moving data between databases, it converts the source data into a single format that can be used at the destination.

Extract: Collect data from different databases, sometimes via a staging table.
Transform: The critical step. Convert the newly extracted data into the correct form so that it can be placed into another database. Other kinds of transformation may also happen in this step.
Load: Load the data into the target database or storage.
...
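The three steps can be sketched with two in-memory SQLite databases standing in for the source and the target. The table names, columns, and sample rows here are made up for illustration; a real pipeline would have its own schema and far more involved transformations.

```python
import sqlite3


def extract(conn):
    """Extract: read raw rows from the source database."""
    return conn.execute("SELECT name, amount FROM raw_orders").fetchall()


def transform(rows):
    """Transform: normalize names and convert text amounts to numbers."""
    return [(name.strip().title(), round(float(amount), 2))
            for name, amount in rows]


def load(conn, rows):
    """Load: write the cleaned rows into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical source data with inconsistent formatting.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_orders (name TEXT, amount TEXT)")
source.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                   [(" alice ", "10.50"), ("BOB", "3")])

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))

result = target.execute(
    "SELECT customer, amount FROM orders ORDER BY customer").fetchall()
print(result)
```

The transformation happens before anything reaches the target, which is the ordering that distinguishes ETL from ELT.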

Python Comments

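Adding comments during development helps describe your thought process and helps you and others understand the intent. This makes it easier to spot bugs, improve the program, and reuse the code elsewhere.

Single-line comments

A comment starts with #:

    # defining the start code
    startCode = 50

It can also follow code on the same line; everything after the # is ignored:

    startCode = 50 # defining the start code

Take care not to add useless descriptions, just as you should not give variables meaningless names.

Multi-line comments

Use this approach when there is a lot to comment, or when writing documentation or describing features. PEP 8 recommends that a line not exceed 79 characters (72 for comments and docstrings); in practice this usually follows the conventions of the company or team.

Start each line with #:

    # PythonComments version 1.0.3
    # -a (--all): show all features
    # -h (--help): show the help
    # .....

Or wrap the text in """:

    """
    PythonComments version 1.0.3
    -a (--all): show all features
    -h (--help): show the help
    .....
    """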
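Strictly speaking, text wrapped in `"""` is a string literal rather than a comment. When it is the first statement of a function, class, or module, Python keeps it as the docstring. The sketch below shows the difference; `start_code` is a made-up function for illustration.

```python
def start_code():
    """Return the default start code used in the examples above."""
    # A regular comment: ignored by the interpreter and not stored anywhere.
    return 50


# The triple-quoted string is available at runtime; the # comment is not.
doc = start_code.__doc__
print(doc)
```

This is why triple-quoted blocks suit documentation, while `#` lines suit notes about the implementation.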

Python f-string

Since Python 3.6, strings have gained another formatting method: PEP 498 – Literal String Interpolation.

The examples below compare f-strings with the %-formatting and str.format() syntax we commonly used before.

    >>> # %-formatting
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("%s, test numbers are %s and %s" % (text, number1, number2))
    Hello, test numbers are 10 and 20

    >>> # str.format()
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("{}, test numbers are {} and {}".format(text, number1, number2))
    Hello, test numbers are 10 and 20
    >>> print("{0}, test numbers are {2} and {1}".format(text, number1, number2))
    Hello, test numbers are 20 and 10

    >>> # f-string
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print(f"{text}, test numbers are {number1} and {number2}")
    Hello, test numbers are 10 and 20

f-strings look more Pythonic and also solve problems we used to run into, such as the argument limitations of % formatting. With many variables they are easier to read and easier to change.

Trying a few more operations:

    >>> f"{3 + 8}"
    '11'
    >>> text = "Literal String Interpolation"
    >>> f"{text.upper()}"
    'LITERAL STRING INTERPOLATION'
    >>> f"{1/3:.2f}"
    '0.33'

A lambda expression can also be placed inside. ...
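A few more f-string forms, as an illustrative sketch with made-up values: format specifications after the colon, the `!r` conversion, and a lambda expression, which has to be wrapped in parentheses so its colon is not read as the start of a format specification.

```python
value = 1234.5678
name = "pdb"

examples = {
    # Thousands separator with two decimal places.
    "thousands": f"{value:,.2f}",
    # Right-align within a field six characters wide.
    "padded": f"{name:>6}",
    # !r applies repr() to the value.
    "repr": f"{name!r}",
    # A lambda must be parenthesized inside the braces.
    "lambda": f"{(lambda x: x * 2)(3)}",
}

for label, rendered in examples.items():
    print(label, rendered)
```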

Python lambda

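lambda function (anonymous function)

Basic syntax

    lambda arg1, arg2, ...: expression

    fun = lambda x: x + 1
    print(fun(5))
    6

A lambda function can be seen as a simple function: it can take several arguments but can contain only a single expression.

When to use it

There are a few situations where a lambda function fits well.

One-off use: "don't repeat yourself" means reusable logic deserves a named function, so if the functionality is simple and will not be reused in similar places, a lambda is a good fit.
Not wanting to think up a name: when implementing a feature, we want names that reveal what a thing is, rather than x, y, i, j and other names that carry no meaning or are easy to confuse. Judge case by case; most of the time it is better to come up with a proper name.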
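A typical one-off use is passing a lambda as the `key` argument of `sorted`, where the function is needed in exactly one place. The word list below is made up for illustration.

```python
words = ["iterator", "pdb", "lambda"]

# Sort by length; the key function is used only here, so it stays anonymous.
by_length = sorted(words, key=lambda word: len(word))

# A lambda can take several arguments but holds a single expression.
add = lambda a, b: a + b
total = add(2, 3)

print(by_length, total)
```

If the same key function were needed in several places, a named `def` would be the better choice.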

Python context manager

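Context manager

The Python with statement makes resource management easier, for example for data sources, open files, or anything that takes a lock. The goal is to guarantee that once the related work is done, the resource is released.

For simple cases, we might open a file like this:

    test_file = open('test.txt', 'w')
    try:
        test_file.write('line one')
    finally:
        test_file.close()

Besides being non-idiomatic, this becomes hard to maintain when the logic inside try-finally is complex. Here is the simple version using with:

    with open('test.txt', 'w') as test_file:
        test_file.write('line one')

In the code above, the file is closed automatically when the statements inside with finish. The name test_file remains bound afterwards, but it refers to a closed file.

Implementing a context manager

To implement a context manager, define the two methods __enter__ and __exit__, which handle entering and leaving the with block respectively.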
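A minimal sketch of the two methods follows. `ManagedResource` is a hypothetical class that only records the order of events, so the behavior is visible without touching real files or locks.

```python
class ManagedResource:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs whether the block finished normally or raised.
        self.events.append("exit")
        return False  # False means any exception keeps propagating


resource = ManagedResource()
with resource as r:
    r.events.append("body")

failing = ManagedResource()
try:
    with failing:
        raise ValueError("boom")
except ValueError:
    pass

print(resource.events, failing.events)
```

`__exit__` runs in both cases, which is the guarantee that try-finally provided in the first example.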

Python Iterable

To understand which Python objects can be iterated over, start with two similar terms:

Iterable
Iterator

Iterable

An object that can be iterated over (looped through) is called an iterable. According to the official documentation, it only needs to implement __iter__ or __getitem__. This includes the common list, tuple, set, dict, str, and range:

    >>> dir(str())
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', ......]

However, if you check with collections.abc.Iterable, an object that implements only __getitem__ can be iterated over but will not be reported as an Iterable.

Iterator

https://docs.python.org/3.7/c-api/iter.html

From the official documentation, an object that has both __iter__ and __next__ is called an iterator. Iterators are a subset of iterables. The types mentioned above are iterable, but none of them is an iterator. You can confirm this with isinstance or dir:

    >>> from collections.abc import Iterable, Iterator
    >>> for i in ([1,2,3], "123", (1,2,3)):
    ...     print(f"{i} is iterable: {isinstance(i, Iterable)}")
    ...     print(f"{i} is iterator: {isinstance(i, Iterator)}")
    ...
    [1, 2, 3] is iterable: True
    [1, 2, 3] is iterator: False
    123 is iterable: True
    123 is iterator: False
    (1, 2, 3) is iterable: True
    (1, 2, 3) is iterator: False

A file object, on the other hand, is an iterator:

    >>> import os
    >>> file_path = os.path.abspath("test.py")
    >>> with open(file_path) as ifile:
    ...     isinstance(ifile, Iterator)
    ...
    True

Conclusion

Now that we understand iterables and iterators, we know which basic methods an object must provide when we want to create something that can be iterated over, or an iterator, during development. So what about generators?
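Both cases can be sketched with two small hypothetical classes: `Squares` implements only `__getitem__`, and `Countdown` implements `__iter__` and `__next__`.

```python
from collections.abc import Iterable, Iterator


class Squares:
    """Implements only __getitem__; a for loop falls back to indexing from 0."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)  # signals the end of the sequence
        return index * index


class Countdown:
    """Implements __iter__ and __next__, so instances are iterators."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


squares = list(Squares())
countdown = list(Countdown(3))

# Squares can be looped over, yet the ABC check does not recognize it.
squares_is_iterable = isinstance(Squares(), Iterable)
countdown_is_iterator = isinstance(Countdown(3), Iterator)

print(squares, squares_is_iterable)
print(countdown, countdown_is_iterator)
```

The `isinstance` check only looks for `__iter__`, which is why it misses objects that rely on `__getitem__` alone.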

python pdb

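pdb — The Python Debugger

A simple piece of code:

    print(f'file = {__file__}')

Common ways to use pdb

1. Run it directly

    python -m pdb file.py

Running the command above executes the whole file under pdb. The debugger stops at the first line of the module, reported as (1)<module>(), and waits at the (Pdb) prompt.

2. Set a breakpoint

Change the code above to ...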

OpenCV Face Recognition

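This post covers initial face detection, mainly as a preprocessing step for face recognition. We use Haar features, with AdaBoost used during training. In other words, weak classifiers do the judging: each stage takes a feature value and decides whether the region is a face, and only if so moves on to the next stage, step by step. Broadly, it is like letting all the weak classifiers vote and weighting the votes by accuracy to reach a result. The resulting classifier has a cascade form that looks like a simple decision tree.

In practice, the main problem with the Haar cascade is getting the parameters right, especially scaleFactor and minNeighbors. The first controls the scale step: increasing it reduces the number of scale levels examined, so fewer objects are found. The second decides based on the number of neighboring detections. The parameters therefore have to be tuned for different image sizes and types, which is impractical in real use; I will try other approaches in the future.

Original article, from which this face and eye detection is adapted. The image is a group photo of the USA volleyball national team downloaded from the web.

My Github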
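The effect of scaleFactor on the number of levels can be estimated with a rough model: count how many times a detection window can grow by that factor before it no longer fits in the image. This is only an approximation for intuition, not OpenCV's actual implementation, and the image size (640x480) and window size (24x24) are assumed values.

```python
import math


def pyramid_levels(image_size, window_size, scale_factor):
    """Approximate number of scales at which the window still fits the image."""
    max_ratio = min(image_size[0] / window_size[0],
                    image_size[1] / window_size[1])
    if max_ratio < 1:
        return 0  # the window is already larger than the image
    return int(math.floor(math.log(max_ratio) / math.log(scale_factor))) + 1


image = (640, 480)   # assumed image size in pixels
window = (24, 24)    # assumed base detection window

levels = {factor: pyramid_levels(image, window, factor)
          for factor in (1.05, 1.1, 1.3)}
print(levels)
```

Under this model a larger scaleFactor means fewer levels to scan, which is faster but more likely to skip the scale at which a face would match.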