Effective Python

Python Chunks

Several ways to split a list into chunks.

yield

    def chunks1(input_list, n):
        # Yield successive slices of length n; the last one may be shorter.
        for i in range(0, len(input_list), n):
            yield input_list[i:i + n]

    input_list = [i for i in range(0, 15)]
    print(list(chunks1(input_list, 4)))
    ## [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14]]

One-line for loop

    input_list = [i for i in range(0, 15)]
    n = 3
    output_list = [input_list[i:i + n] for i in range(0, len(input_list), n)]
    print(output_list)
    ## [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]]

iterable

This version works for any iterable.

    from itertools import islice

    def chunks2(input_iter, n):
        iterator = iter(input_iter)
        # iter(callable, sentinel) stops once islice returns an empty tuple.
        return iter(lambda: tuple(islice(iterator, n)), ())

    input_list = [i for i in range(0, 15)]
    n = 4
    print(list(chunks2(input_list, n)))
    ## [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14)]

Numpy

Note that the second argument of np.array_split is the number of sections, not the chunk size.

    import numpy as np

    input_list = [i for i in range(0, 15)]
    np.array_split(input_list, 5)
    ## [array([0, 1, 2]),
    ##  array([3, 4, 5]),
    ##  array([6, 7, 8]),
    ##  array([ 9, 10, 11]),
    ##  array([12, 13, 14])]

Any of the simple approaches above gets the job done ...
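To illustrate why the islice version accepts "any iterable", here is a minimal sketch that feeds it a generator, which has no len() and cannot be sliced. The helper name chunks_any is hypothetical; it follows the same iter(callable, sentinel) idea as chunks2.

```python
from itertools import islice


def chunks_any(iterable, n):
    """Return an iterator of tuples holding up to n items each (hypothetical helper)."""
    iterator = iter(iterable)
    # Keep calling the lambda until it returns the sentinel, an empty tuple.
    return iter(lambda: tuple(islice(iterator, n)), ())


# A generator expression: it supports neither len() nor slicing,
# so the range/len based versions would fail on it.
squares = (i * i for i in range(7))
result = list(chunks_any(squares, 3))
print(result)
```

The slice-based versions need a sequence, so this form is the one to reach for when the input is a stream, a file, or another lazy source.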

Before Data processing: ELT

Before ELT: ETL

ETL stands for Extract, Transform, and Load. Historically, ETL has been the standard, reliable way to migrate data from one database to another. In addition to moving data between databases, it converts data from different sources into a single format that the destination can use.

Extract: Collect data from different databases, sometimes by way of a staging table.
Transform: The critical step. Convert the freshly extracted data into the correct form so that it can be placed into another database. Other kinds of transformation are sometimes involved in this step as well.
Load: Load the data into the target database or storage. ...
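The three steps can be sketched in a few lines. This is only an illustration under assumptions: the two in-memory SQLite databases, the users table, its columns, and the cleaning rules are all made up for the example and do not come from any real pipeline.

```python
import sqlite3

# Hypothetical source and target databases, both in memory for illustration.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE users (name TEXT, signup TEXT)")
source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(" alice ", "2020/01/05"), ("BOB", "2020/02/17")],
)
target.execute("CREATE TABLE users (name TEXT, signup_date TEXT)")


def extract(conn):
    """Extract: pull the raw rows out of the source database."""
    return conn.execute("SELECT name, signup FROM users").fetchall()


def transform(rows):
    """Transform: trim and title-case names, rewrite dates with dashes."""
    return [(name.strip().title(), signup.replace("/", "-")) for name, signup in rows]


def load(conn, rows):
    """Load: write the cleaned rows into the target database."""
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    conn.commit()


load(target, transform(extract(source)))
loaded = target.execute(
    "SELECT name, signup_date FROM users ORDER BY name"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easy to see that, in ETL, the transformation happens before the data reaches the target.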

Python f-string

Since Python 3.6, strings have a new formatting mechanism: PEP 498 – Literal String Interpolation. The examples below compare f-strings with the %-formatting and str.format() syntax we commonly used before.

    >>> # %-formatting
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("%s, test numbers are %s and %s" % (text, number1, number2))
    Hello, test numbers are 10 and 20

    >>> # str.format()
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("{}, test numbers are {} and {}".format(text, number1, number2))
    Hello, test numbers are 10 and 20
    >>> print("{0}, test numbers are {2} and {1}".format(text, number1, number2))
    Hello, test numbers are 20 and 10

    >>> # f-string
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print(f"{text}, test numbers are {number1} and {number2}")
    Hello, test numbers are 10 and 20

f-strings look more Pythonic, and they also solve problems we used to run into, such as the argument restrictions of %. As the number of variables grows, they are easier to read and easier to change. Let's try a few more operations:

    >>> f"{3 + 8}"
    '11'
    >>> text = "Literal String Interpolation"
    >>> f"{text.upper()}"
    'LITERAL STRING INTERPOLATION'
    >>> f"{1/3:.2f}"
    '0.33'

A lambda expression can also be placed inside. ...
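As a small sketch of the lambda case: inside the braces a bare colon starts the format spec, so the lambda has to be wrapped in parentheses and then called. The variable names below are only for illustration.

```python
value = 21

# Parentheses are required: without them the ":" after "lambda x" would be
# parsed as the beginning of a format specification.
doubled = f"{(lambda x: x * 2)(value)}"
print(doubled)

# A format spec can still follow the expression after its own colon.
amount = f"{1234567.891:,.2f}"
print(amount)
```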

Python lambda

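A lambda function (anonymous function)

Basic syntax

    lambda arg1, arg2, ...: expression

    fun = lambda x: x + 1
    print(fun(5))
    6

A lambda function can be thought of as a simple function: it can take several inputs, but it can contain only a single expression.

When to use it

There are a few situations where a lambda function fits.

Not reused: following "don't repeat yourself", if you know the functionality is simple and will not be needed again in similar places, a lambda is a good choice.
You don't want to come up with a name: when implementing something, we want a name to tell the reader what the thing probably is, rather than x, y, i, j and other names that carry no meaning or are easy to mix up. Judge case by case; most of the time it is still better to think of a proper name.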
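A typical "used once, not worth naming" case is a sort key. The sketch below uses made-up data and compares the lambda with an equivalent named function (count_of is a hypothetical name).

```python
pairs = [("banana", 3), ("apple", 5), ("cherry", 1)]

# One-off key function: sort the pairs by their second element.
by_count = sorted(pairs, key=lambda pair: pair[1])
print(by_count)


# The equivalent named function, worth writing once the logic is reused.
def count_of(pair):
    return pair[1]


same_result = sorted(pairs, key=count_of)
```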

Python context manager

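Context manager

Python's with statement makes resource management easier, for example when working with data, opening files, or doing anything that takes a lock. The goal is to guarantee that once the work is done, the resource is released. In the simplest form, we would open a file like this:

    test_file = open('test.txt', 'w')
    try:
        test_file.write('line one')
    finally:
        test_file.close()

Besides being non-idiomatic, this becomes hard to maintain when the logic inside try-finally gets complicated. Here is the simple version using with:

    with open('test.txt', 'w') as test_file:
        test_file.write('line one')

In the code above, once the statements inside with finish, the resource is closed automatically. The name test_file is still bound afterwards, but it refers to a closed file.

Implementing a context manager

To implement a context manager yourself, define the two methods __enter__ and __exit__, which handle entering and leaving the with block respectively.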
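A minimal sketch of such a class, assuming we only want to wrap open() and close(). The class name ManagedFile and the temporary path are made up for the example.

```python
import os
import tempfile


class ManagedFile:
    """Hypothetical context manager that opens a file and always closes it."""

    def __init__(self, path, mode):
        self.path = path
        self.mode = mode
        self.file = None

    def __enter__(self):
        # Runs when the with block is entered; the return value is bound by "as".
        self.file = open(self.path, self.mode)
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the block is left, whether or not an exception occurred.
        self.file.close()
        return False  # do not swallow exceptions


path = os.path.join(tempfile.mkdtemp(), "test.txt")
with ManagedFile(path, "w") as test_file:
    test_file.write("line one")

print(test_file.closed)  # the name is still bound, the file is closed

with open(path) as check:
    content = check.read()
```

Returning False from __exit__ lets any exception raised inside the block propagate, which is usually the behaviour you want for a resource wrapper.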

Python Iterable

To understand which Python objects can be iterated over, start with two similar terms:

Iterable
Iterator

Iterable

An object that can be iterated over (looped, traversed) is called an iterable. According to the official documentation, implementing either __iter__ or __getitem__ is enough. This includes the common list, tuple, set, dict, str, and range.

    >>> dir(str())
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', ......]

However, when collections.abc is used for the check, an object that implements only __getitem__ can still be iterated over but is not reported as an Iterable.

Iterator

https://docs.python.org/3.7/c-api/iter.html

According to the official documentation, an object that has both __iter__ and __next__ is called an iterator. Iterators are a subset of iterables: the types mentioned above are all iterable, but none of them is an iterator. This can be confirmed with isinstance or dir.

    >>> from collections.abc import Iterable, Iterator
    >>> for i in ([1,2,3], "123", (1,2,3)):
    ...     print(f"{i} is iterable: {isinstance(i, Iterable)}")
    ...     print(f"{i} is iterator: {isinstance(i, Iterator)}")
    ...
    [1, 2, 3] is iterable: True
    [1, 2, 3] is iterator: False
    123 is iterable: True
    123 is iterator: False
    (1, 2, 3) is iterable: True
    (1, 2, 3) is iterator: False

A file object, on the other hand, is an iterator:

    >>> import os
    >>> file_path = os.path.abspath("test.py")
    >>> with open(file_path) as ifile:
    ...     isinstance(ifile, Iterator)
    ...
    True

Conclusion

Now that iterable and iterator are clear, we know which basic methods an object must provide when we want to create something that can be iterated over, or an iterator of our own. And what about generators?
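The __getitem__-only case mentioned above can be checked with a small sketch. The Countdown class is hypothetical and exists only to show the behaviour.

```python
from collections.abc import Iterable, Iterator


class Countdown:
    """Hypothetical sequence-like class that defines only __getitem__."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        # A for loop asks for index 0, 1, 2, ... until IndexError is raised.
        if index >= self.start:
            raise IndexError(index)
        return self.start - index


countdown = Countdown(3)
values = [value for value in countdown]
print(values)

# The abstract base class only looks for __iter__, so this check fails
# even though the loop above worked.
is_iterable = isinstance(countdown, Iterable)
print(is_iterable)

# iter() falls back to the __getitem__ protocol and returns a real iterator.
is_iterator = isinstance(iter(countdown), Iterator)
print(is_iterator)
```

Calling iter() on an object is therefore a more reliable test of "can this be looped over" than isinstance(obj, Iterable).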

Python pdb

pdb — The Python Debugger

A simple piece of code:

    print(f'file = {__file__}')

Common ways to use pdb

1. Run it directly

    python -m pdb file.py

Running the command above puts the whole file under pdb control:

    > /home/src/test.py(1)<module>()
    -> print(f'file = {__file__}')
    (Pdb)

2. Set a breakpoint

Change the code above to ...

OpenCV Face Recognition

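This post does basic face detection, mainly as a preprocessing step for face recognition. We use Haar features, trained with AdaBoost, which means the decision is made by weak classifiers: each stage takes one feature and judges whether the region is a face, and only if it is does the region move on to the next stage, step by step. Broadly speaking, it is like letting all the weak classifiers vote and weighting each vote by its accuracy to reach the result. The resulting classifier has the form of a cascade, which looks like a simple decision tree.

In practice, the main problem with the Haar cascade is getting the parameters right, especially scaleFactor and minNeighbors. The first controls how much the scale changes between levels: if it is increased, fewer levels are examined, so fewer objects are found. The second decides whether to keep a candidate based on how many neighboring detections it has. It follows that the parameters need to be tuned for each image size and type, which is impractical in real use; I will try other approaches in the future.

Original article, from which I adapted the face and eye detection. The picture is a group photo of the USA volleyball national team downloaded from the internet.

My Github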
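The effect of scaleFactor on the number of levels can be seen with a little arithmetic. The function below is a simplified model written for this note, not OpenCV's actual implementation: it assumes a square image, a square detection window, and that the window grows by scale_factor at every level until it no longer fits.

```python
def pyramid_levels(image_size, window_size, scale_factor):
    """Count the scales at which a window still fits inside the image.

    Simplified model for illustration only; all parameter names are assumptions.
    """
    if scale_factor <= 1:
        raise ValueError("scale_factor must be greater than 1")
    levels = 0
    scale = 1.0
    while window_size * scale <= image_size:
        levels += 1
        scale *= scale_factor
    return levels


fine = pyramid_levels(480, 24, 1.05)    # small steps, many levels
coarse = pyramid_levels(480, 24, 1.3)   # large steps, few levels
print(fine, coarse)
```

With fewer levels the detector runs faster but skips over face sizes that fall between two scales, which matches the observation that a larger scaleFactor finds fewer objects.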

ML KNN

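k-nearest neighbors, KNN for short, is a kind of supervised learning. The English name says it plainly: the K closest neighbors. When implemented, the algorithm finds the K nearest points and uses them to decide which class a sample belongs to. Although it is a supervised method, it does not train any parameters; it stores all the data and classifies against it. Increasing K makes the algorithm more tolerant of noise. The algorithm needs a lot of storage, which means high space complexity, and it is easily affected by imbalanced data.

Implementing it comes down to computing point-to-point distances, so we use a SciPy function for that. For convenience K is set to 1, and the result is compared with scikit-learn's KNN. The idea is to loop over the test data, find the closest training point for each test point, and take that training point's label as the prediction.

Accuracy

sklearn knn : 0.973333333
hand-written knn : 0.9466666

My Github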
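The K = 1 idea can be sketched as follows. To keep the example self-contained it computes Euclidean distance with the standard library instead of the SciPy function used in the post, and the toy points and labels are made up; it does not reproduce the accuracy figures above.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(train_points, train_labels, query):
    """Return the label of the training point closest to query (K = 1)."""
    best_label, best_distance = None, math.inf
    for point, label in zip(train_points, train_labels):
        distance = euclidean(point, query)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Made-up toy data: two well separated groups.
train_points = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_labels = ["a", "a", "b", "b"]
test_points = [(1, 0), (5, 6)]

predictions = [nearest_label(train_points, train_labels, p) for p in test_points]
print(predictions)
```

Every prediction scans the whole training set, which is where the storage and cost concerns mentioned above come from.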

Python Machine Learning Foundations LS-PLA

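Perceptron Learning Algorithm (PLA)

Following Professor Hsuan-Tien Lin's Machine Learning Foundations course, this post implements this basic machine learning algorithm. The overall learning framework is supervised learning, and the task is a so-called YES/NO problem.

Perceptron ⇔ linear (binary) classifier

We have a training set D that contains the data Xn and the corresponding Yn (here 1 or -1). The hypothesis set H represents all candidate solutions (infinitely many lines). Algorithm A picks from H a g that is likely to be close to our target function f. The algorithm has two main steps: find a misclassified point, then correct the weight vector with it. See the professor's lectures for the detailed explanation. The naive cycle is the common way to do this, and it only works for linearly separable data. Beyond that, the method also cannot be used when the data contains noise; for linear problems, the better current solution is Pocket PLA.

Linearly separable PLA

First tidy the data, turning [‘x0\ty0\tz0\nx1\ty1\tz1\nx2\ty2\tz2\n….’] into the format array([[(x0, y0), z0],[(x1, y1), z1],[(x2, y2), z2]…..]), like the (x1, y1), (x2, y2) layout in the earlier figures.

Naive PLA: the line is drawn using ax + by = 0.

Final result

Pocket PLA

Pocket PLA is a greedy algorithm: it keeps the best solution so far in hand and continues, comparing each new candidate against the one in hand. It stops after a fixed number of iterations, or after no improvement for some time, and so on; it is not written up here.

My GitHub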
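The two steps (find a mistake, correct the weights with it) can be sketched as a naive-cycle loop. This is a minimal version under assumptions, not the code from the post: it adds a bias term w0 with x0 = 1, treats a score of 0 as -1, uses made-up linearly separable points, and stops after max_passes as a safety net.

```python
def sign(value):
    """Return 1 for positive scores and -1 otherwise (0 counts as -1 here)."""
    return 1 if value > 0 else -1


def pla(points, labels, max_passes=1000):
    """Naive-cycle PLA: sweep the data in order and fix each mistake found."""
    weights = [0.0, 0.0, 0.0]  # (w0, w1, w2), with w0 acting as the bias
    for _ in range(max_passes):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            x = (1.0, x1, x2)
            score = sum(w * xi for w, xi in zip(weights, x))
            if sign(score) != y:
                # Correction step: w <- w + y * x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            return weights  # a full pass without mistakes: done
    raise RuntimeError("no separating line found; data may not be linearly separable")


# Made-up, linearly separable toy data.
points = [(2, 3), (3, 4), (4, 3), (-1, -2), (-2, -1), (-3, -3)]
labels = [1, 1, 1, -1, -1, -1]

weights = pla(points, labels)
predicted = [
    sign(weights[0] + weights[1] * x1 + weights[2] * x2) for x1, x2 in points
]
print(weights, predicted)
```

On noisy or non-separable data this loop would never reach a mistake-free pass, which is the situation Pocket PLA is meant to handle.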