Python

Python lambda

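A lambda function (anonymous function) has the following basic syntax:

    lambda arg1, arg2, ...: expression

For example:

    fun = lambda x: x + 1
    print(fun(5))  # prints 6

A lambda function can be seen as a simple function: it may take several inputs, but it can contain only one expression.

A few situations suit a lambda function:

Not going to be reused: the rule is "don't repeat yourself", so if you know the piece of logic is simple and will not be reused in similar places, it is a good candidate.

Not wanting to think of a name: when implementing a feature, a name should tell you what the thing is, instead of x, y, i, j, and other names that carry no meaning or are easy to mix up. Judge case by case; most of the time it is still better to think of a proper name.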
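As a rough illustration of the "simple and not reused" case, here is a minimal sketch that passes a lambda as a sort key. The `scores` data and the `score_of` helper are invented for this example and do not come from the post.

```python
# Hypothetical sample data: (name, score) pairs invented for illustration.
scores = [("amy", 82), ("bob", 95), ("cat", 78)]

# The key function is trivial and used only in this call, so a lambda fits.
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # ('bob', 95)

# Once the logic is reused or deserves a name, a regular def reads better.
def score_of(pair):
    """Return the score part of a (name, score) pair."""
    return pair[1]

print(sorted(scores, key=score_of, reverse=True) == ranked)  # True
```

The two calls behave the same; the difference is only whether the key function earns a name of its own.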

Python context manager

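Context managers. Python's with statement makes resource management easier, for example when working with data, opening files, or doing anything that takes a lock. It guarantees that the resource is released once the related work is finished.

In the simple case, a file could be opened like this:

    test_file = open('test.txt', 'w')
    try:
        test_file.write('line one')
    finally:
        test_file.close()

Besides being non-idiomatic, this approach becomes hard to maintain when the logic inside try-finally gets complicated. Here is the simple form using with:

    with open('test.txt', 'w') as test_file:
        test_file.write('line one')

In the code above, the file is closed automatically when the statements inside the with block finish. The name test_file stays bound after the block, but it refers to a closed file.

Implementing a context manager

To implement context manager behaviour yourself, define the two methods __enter__ and __exit__, which handle entering and leaving the with block respectively.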
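A minimal sketch of a class with `__enter__` and `__exit__` might look like the following. `ManagedResource` and its `log` attribute are hypothetical names made up for this example; the point is only that `__exit__` runs even when the block raises.

```python
class ManagedResource:
    """Hypothetical resource that records when it is acquired and released."""

    def __init__(self, name):
        self.name = name
        self.is_open = False
        self.log = []

    def __enter__(self):
        # Runs when the with block is entered; the return value is bound by "as".
        self.is_open = True
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block is left, whether or not an exception occurred.
        self.is_open = False
        self.log.append("exit")
        return False  # False means: do not suppress the exception


resource = ManagedResource("demo")
try:
    with resource as r:
        r.log.append("work")
        raise ValueError("something went wrong")
except ValueError:
    pass

print(resource.log)      # ['enter', 'work', 'exit']
print(resource.is_open)  # False
```

Returning False from `__exit__` lets the exception propagate, which is usually the safer default for a resource wrapper.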

Python Iterable

To understand which Python objects can be iterated over, start with two similar terms: Iterable and Iterator.

Iterable

An object that can be iterated or looped over is called an iterable. According to the official documentation, it is enough to implement either the __iter__ or the __getitem__ method. This covers the common list, tuple, set, dict, str, and range types:

    >>> dir(str())
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', ......]

However, if you check with collections.abc.Iterable, an object that implements only __getitem__ can be iterated over but is not reported as an Iterable.

Iterator

https://docs.python.org/3.7/c-api/iter.html

According to the official documentation, an object that has both __iter__ and __next__ is an iterator. Iterators are a subset of iterables. The types mentioned above are iterables, but none of them is an iterator, which can be confirmed with isinstance or dir:

    >>> from collections.abc import Iterable, Iterator
    >>> for i in ([1,2,3], "123", (1,2,3)):
    ...     print(f"{i} is iterable: {isinstance(i, Iterable)}")
    ...     print(f"{i} is iterator: {isinstance(i, Iterator)}")
    ...
    [1, 2, 3] is iterable: True
    [1, 2, 3] is iterator: False
    123 is iterable: True
    123 is iterator: False
    (1, 2, 3) is iterable: True
    (1, 2, 3) is iterator: False

A file object, on the other hand, is an iterator:

    >>> import os
    >>> file_path = os.path.abspath("test.py")
    >>> with open(file_path) as ifile:
    ...     isinstance(ifile, Iterator)
    ...
    True

Conclusion

With iterable and iterator understood, you know which basic methods an object must provide when you want to build something that can be iterated over, or an iterator of your own. So what about generators?
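The two protocols can be sketched with a pair of small classes. `OnlyGetItem` and `CountDown` are hypothetical names invented for this example. The first relies only on `__getitem__`, so a for loop works but the `collections.abc.Iterable` check fails; the second defines both `__iter__` and `__next__`.

```python
from collections.abc import Iterable, Iterator


class OnlyGetItem:
    """Iterable only through the legacy __getitem__ protocol."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        # Python calls this with 0, 1, 2, ... until IndexError is raised.
        return self._items[index]


class CountDown:
    """An iterator: defines both __iter__ and __next__."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


legacy = OnlyGetItem("abc")
print(list(legacy))                      # ['a', 'b', 'c']
print(isinstance(legacy, Iterable))      # False

countdown = CountDown(3)
print(isinstance(countdown, Iterator))   # True
print(list(countdown))                   # [3, 2, 1]
```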

Python pdb

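pdb: The Python Debugger

Take a simple piece of code:

    print(f'file = {__file__}')

Common ways to use pdb:

1. Run it directly

    python -m pdb file.py

Running the command above puts the whole file under pdb control. The debugger stops at line 1 of the module and prints a location line ending in .py(1)<module>(), followed by the (Pdb) prompt.

2. Set a breakpoint

Change the code above to ...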
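To see a few pdb commands in action without typing at the prompt, the debugger can be driven from a script. This is only a sketch: `add_one` and the scripted command list are hypothetical, and in interactive work one would normally place `breakpoint()` or `pdb.set_trace()` in the code instead.

```python
import io
import pdb


def add_one(x):
    """Hypothetical function to step through."""
    y = x + 1
    return y


# Commands that would normally be typed at the (Pdb) prompt:
#   next      run the current line and stop at the following one
#   p y       print the value of y
#   continue  resume normal execution
commands = io.StringIO("next\np y\ncontinue\n")
output = io.StringIO()

# readrc=False keeps any local .pdbrc file from affecting the session.
debugger = pdb.Pdb(stdin=commands, stdout=output, readrc=False)
result = debugger.runcall(add_one, 5)

print(result)             # 6
print(output.getvalue())  # the recorded debugger session
```

`runcall` stops at the first line of the function, so the scripted `next` executes `y = x + 1` before `p y` inspects the value.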

Keras IMDb

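IMDb is an online database of film-related information. This time the goal is to use the text of IMDb reviews to predict whether a review is positive or negative.

A deep learning model only accepts numbers. Keras provides a Tokenizer module that ranks English words by how often they occur and then assigns each word an index: Keras Tokenizer. Word embedding then turns each list of integers into a list of vectors, which is finally fed into an LSTM for training. (Using an RNN LSTM model in Keras is convenient; one line is enough.)

Keras also makes it easy to turn English text into numbers. According to the model summary, each list of integers is converted into a list of 64-dimensional vectors, and three hidden layers are used for training.

Accuracy: 0.8543

Using it in practice: reviews of Spider-Man: Homecoming were taken from the IMDb site to check whether the predictions are correct. A positive review was also predicted as positive (1: positive, 0: negative).

My GitHub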
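The idea of ranking words by frequency and giving each one an index can be sketched in plain Python. This is not the Keras Tokenizer implementation, only an imitation of the idea; `build_word_index`, `texts_to_sequences`, and the sample `reviews` are names and data invented for this example.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked_words = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked_words, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]


# Hypothetical mini corpus standing in for the IMDb reviews.
reviews = ["great movie great cast", "bad movie", "great fun"]
word_index = build_word_index(reviews)
sequences = texts_to_sequences(reviews, word_index)

print(word_index["great"], word_index["movie"])  # 1 2
print(sequences)
```

The resulting integer lists are what an embedding layer would then map to dense vectors.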

Keras Cifar-10

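This time Keras is used to build a CNN that classifies the CIFAR-10 image data. CIFAR-10 consists of 32*32 RGB images in 10 classes, such as airplane, dog, and cat, and can be seen as a harder version of MNIST. The preprocessing is therefore the same as for MNIST, followed by one-hot encoding.

The network uses two convolution layers with 3*3 kernels and same padding, 2*2 max pooling, and then a fully connected network going 4096-1024-10 (the final output). It is worth noting that some parameters are expressed differently in Keras and TensorFlow.

Sample CIFAR-10 images are shown in the original post.

A confusion matrix built with pandas shows whether certain classes get mixed up. Class 3 (cat) and class 5 (dog) are easily confused, while animal classes and vehicle classes are rarely confused with each other.

Accuracy with two CNN layers: 0.732

My GitHub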
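The 4096 inputs of the fully connected part can be checked with a little shape arithmetic. This sketch assumes each of the two convolution layers is followed by a 2*2 max pooling and that the second convolution has 64 filters; neither detail is stated in the post, they are simply one layout that yields 4096.

```python
def conv_same(size):
    """Same padding with stride 1 keeps the spatial size unchanged."""
    return size


def max_pool(size, pool=2):
    """A pool*pool max pooling divides the spatial size by pool."""
    return size // pool


side = 32  # CIFAR-10 images are 32*32
for _ in range(2):  # assumed: two conv + pool stages
    side = max_pool(conv_same(side))

filters = 64  # assumption, not given in the post
flattened = side * side * filters
print(side, flattened)  # 8 4096
```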

ML KNN

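k-nearest neighbors, KNN for short, is a kind of supervised learning. The English name says it plainly: the K closest neighbors. In practice the algorithm finds the K nearest points and uses them to decide which class a sample belongs to. Although it counts as supervised learning, it does not train any parameters; it stores all the data and classifies against it. Increasing K increases the algorithm's noise margin. The algorithm needs a lot of storage, which means high space complexity, and it is easily affected by imbalanced data.

Implementing it comes down to computing point-to-point distances, so SciPy functions are used for that. For convenience K is set to 1, and the result is compared with scikit-learn's KNN. The idea is to use a for loop to find, for each test sample, the closest training sample, and to assign that training sample's label.

Accuracy
sklearn knn: 0.973333333
handwritten knn: 0.9466666

My GitHub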
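The K = 1 loop described above can be sketched as follows. The post uses SciPy for the distances; this sketch swaps in the standard library's `math.dist` (Python 3.8+) to stay self-contained, and the tiny data set and the name `nearest_neighbor_predict` are invented for illustration.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, test_points):
    """Label each test point with the label of its closest training point (K = 1)."""
    predictions = []
    for test_point in test_points:
        best_label, best_distance = None, math.inf
        for train_point, label in zip(train_points, train_labels):
            distance = math.dist(test_point, train_point)  # Euclidean distance
            if distance < best_distance:
                best_label, best_distance = label, distance
        predictions.append(best_label)
    return predictions


# Hypothetical two-class toy data.
train_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["a", "a", "b", "b"]
test_points = [(0.2, 0.4), (5.5, 4.0)]

predictions = nearest_neighbor_predict(train_points, train_labels, test_points)
print(predictions)  # ['a', 'b']
```

Every prediction scans the whole training set, which is where the storage and lookup cost mentioned above comes from.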

Kaggle PM2.5 Prediction

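An attempt at the analysis with sklearn. The data are the observation records of the Fengyuan (豐原) station, split into a train set and a test set. The train set is all data from the first 20 days of each month at the Fengyuan station. The test set is sampled from the remaining data.

train.csv: the complete data for the first 20 days of each month.

test_X.csv: consecutive 10-hour windows sampled from the remaining 10 days, one window per record. All observations from the first nine hours are the features, and the PM2.5 value of the tenth hour is the answer. A total of 240 distinct test records are drawn; predict the PM2.5 of these 240 records from the features.

sklearn looks very direct to use, so the features are built in the dumbest way: take all values from the previous nine hours and look at the result without doing anything else. The features are neither inspected nor simplified. On the Private leaderboard the result ranks around the middle, slightly above the Baseline.

Because this is linear regression, there is no iterative gradient descent: the slope is computed once and that is it. The solution is found directly.

My GitHub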
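The "previous nine hours as features, tenth hour as answer" layout can be sketched as a sliding window. This simplified version uses a single PM2.5 series, whereas the post feeds in all observed quantities; `make_windows` and the `readings` values are hypothetical.

```python
def make_windows(series, history=9):
    """Split an hourly series into (previous `history` hours, next hour) pairs."""
    features, targets = [], []
    for start in range(len(series) - history):
        features.append(series[start:start + history])
        targets.append(series[start + history])
    return features, targets


# Twelve hypothetical hourly PM2.5 readings.
readings = list(range(20, 32))
features, targets = make_windows(readings)

print(len(features))  # 3
print(features[0])    # [20, 21, 22, 23, 24, 25, 26, 27, 28]
print(targets[0])     # 29
```

Each row of `features` with its entry in `targets` is one training example for the regression.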

Kaggle Titanic

Kaggle

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

From the problem statement, this is a binary classification task. SVM and the perceptron came to mind first. Given the data provided, Decision Tree or Random Forest is probably the more reasonable choice, but here I want to try Logistic Regression (sigmoid + cross entropy).

All of the training data is converted into numbers between 0 and 1, and the rest is left to the NN. Because the activation function of the last layer is sigmoid, its output can saturate at 0 or 1, so when computing the cross entropy the predictions are clipped to a minimum of 0.00001 and a maximum of 0.99999. This way log(0) never occurs and training does not run into NaN. ...
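The clipping step can be sketched in plain Python. The bounds 0.00001 and 0.99999 come from the post; the function names, labels, and input values are invented for this example, and the post's own code is presumably written with TensorFlow, not like this.

```python
import math

EPS = 0.00001  # lower bound from the post; the upper bound is 1 - EPS = 0.99999


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=EPS):
    """Mean cross entropy with predictions clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keeps log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# sigmoid(40.0) rounds to exactly 1.0 in floating point, so for a label of 0
# the unclipped loss would need log(0).
labels = [0, 0, 1]
predictions = [sigmoid(40.0), sigmoid(-2.0), sigmoid(0.5)]
loss = binary_cross_entropy(labels, predictions)

print(predictions[0] == 1.0)  # True
print(math.isfinite(loss))    # True
```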

TensorFlow Exercise 4: word2vec

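Turning words into word embeddings means finding relationships between words, instead of representing them as scattered, meaningless symbols. The working assumption is that if two words in different sentences share the same context, the two words have the same meaning. Today's exercise uses the skip-gram model, which works somewhat like binary classification (similar or not similar).

As in the previous exercises, the first step is data processing: count how often each word occurs, in the form [(most count word1, n1),(second word2, n2)], and then turn the words into vectors.

Take the sentence: The actual code for this tutorial is very short

Context pairs: ([the, code], actual), ([actual, for], code), …

skip-gram pairs: (actual, the), (actual, code), (code, actual), …

Along the way every word is given a number, so the pairs become something like (10,20),(10,30),(30,10),(30,40),..

The model uses NCE loss, which I am not yet familiar with. Roughly, the probability of the target is pushed as high as possible while the probabilities of K other words, the negative samples, are kept low.

king - queen = man - woman ==> king - queen + woman = man

Queen gets a minus sign and is treated as the value we do not want; I think that is the general feeling of it?

The result places similar words closer together. The original TensorFlow version uses sklearn's TSNE for dimensionality reduction, which is better than PCA in many respects and worth trying after reading up on it.

My GitHub ...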
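Generating skip-gram pairs with a window of one word on each side can be sketched like this. `skip_gram_pairs` is a name made up for this example, and the word ids 10, 20, 30, 40 are the illustrative numbers from the text, not real vocabulary indices.

```python
def skip_gram_pairs(tokens, window=1):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the actual code for this tutorial is very short".split()
pairs = skip_gram_pairs(sentence)
print(pairs[1:5])
# [('actual', 'the'), ('actual', 'code'), ('code', 'actual'), ('code', 'for')]

# Illustrative ids matching the numbered pairs in the text.
word_to_id = {"actual": 10, "the": 20, "code": 30, "for": 40}
id_pairs = [(word_to_id[a], word_to_id[b]) for a, b in pairs[1:5]]
print(id_pairs)  # [(10, 20), (10, 30), (30, 10), (30, 40)]
```

Each (center, neighbour) pair is one positive training example; the negative samples for NCE loss would be drawn separately.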