Kaggle PM2.5 Prediction

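An attempt at this analysis with sklearn, using the observation records from the Fengyuan (豐原) station, split into a train set and a test set. The train set is all of the station's data from the first 20 days of each month; the test set is sampled from the station's remaining data.

train.csv: the complete data for the first 20 days of each month.
test_X.csv: sampled from the remaining 10 days of data. Each record is 10 consecutive hours: all observations from the first nine hours are the features, and the PM2.5 value of the tenth hour is the answer. 240 distinct test records are drawn in total; the task is to predict PM2.5 for these 240 records from the features.

sklearn looks very straightforward to use, so the features are built in the most naive way possible: take every value from the previous nine hours and look at the result without any further processing. The features are neither inspected nor simplified.

On the Private leaderboard this ranked around the middle, slightly above the Baseline.

Because this is linear regression, no iterative gradient descent is needed: the slope is computed once and the solution is found directly.

My Github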
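The "nine hours in, tenth hour out" setup and the one-step solve can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the post: it uses NumPy's least-squares solver as a stand-in for sklearn's `LinearRegression`, the data is synthetic, and the helper names (`make_windows`, `fit_linear`, `predict`) and the choice of three observation types are hypothetical.

```python
import numpy as np

def make_windows(series, target_row, hours=9):
    """Turn a (n_obs, n_hours) array into (X, y) pairs.

    Each sample uses every observation from `hours` consecutive hours as
    features and the target observation of the following hour as the answer.
    """
    n_obs, n_hours = series.shape
    X, y = [], []
    for start in range(n_hours - hours):
        X.append(series[:, start:start + hours].ravel())
        y.append(series[target_row, start + hours])
    return np.array(X), np.array(y)

def fit_linear(X, y):
    """Solve the least-squares problem in one step (no gradient-descent loop)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(X, w):
    """Apply the fitted weights (last entry is the bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Synthetic stand-in data (assumption): 3 observation types over 200 hours,
# with row 0 playing the role of PM2.5.
rng = np.random.default_rng(0)
series = rng.normal(size=(3, 200))
X, y = make_windows(series, target_row=0)
w = fit_linear(X, y)
train_mse = np.mean((predict(X, w) - y) ** 2)
```

With real data, `series` would be built from train.csv, and each row of test_X.csv would already be one flattened nine-hour window.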

2017-06-13 · 1 min · 19 words · KbWen

Kaggle Titanic

Kaggle

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

From the problem statement, this is a binary classification task. SVM and the perceptron were the first things that came to mind. Given the data provided, a Decision Tree or Random Forest would probably be the more reasonable choice, but here I want to try Logistic Regression (sigmoid + cross entropy).

All of the training data is converted to numbers between 0 and 1, and the rest is left to the NN. Because the activation function of our last layer is a sigmoid, its output can saturate at 0 or 1. To guard against this and the vanishing gradients that come with it, the values fed into the cross entropy are clipped to a minimum of 0.00001 and a maximum of 0.99999, so that no training step produces NaN. ...
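The sigmoid + clipped cross entropy idea can be sketched in plain NumPy as below. This is a hedged illustration rather than the post's actual network: the toy data, the learning rate, the epoch count, and the function names are all assumptions; only the clip bounds 0.00001 and 0.99999 come from the text.

```python
import numpy as np

EPS = 1e-5  # clip bounds 0.00001 and 0.99999, as described in the text

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clipped_cross_entropy(y_true, y_prob):
    """Binary cross entropy with probabilities clipped away from 0 and 1."""
    p = np.clip(y_prob, EPS, 1.0 - EPS)  # keeps log() finite, so no NaN
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def train(X, y, lr=0.5, epochs=500):
    """Plain gradient descent on logistic regression (hypothetical settings)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - y  # gradient of the unclipped loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy data already scaled to the 0-1 range (an assumption, not Titanic data).
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(float)

loss_before = clipped_cross_entropy(y, sigmoid(X @ np.zeros(3)))
w, b = train(X, y)
loss_after = clipped_cross_entropy(y, sigmoid(X @ w + b))

# Even a completely wrong, saturated prediction gives a finite loss.
worst_case = clipped_cross_entropy(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Without the clip, the `worst_case` inputs would evaluate `log(0)` and the loss would no longer be a usable number.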

2017-06-09 · 1 min · 191 words · KbWen

Kaggle Digit Recognizer

My first Kaggle problem. Kaggle Digit Recognizer is the MNIST problem stored as CSV, so I use a CNN to solve it.

Visually, if we omit the “pixel” prefix, the pixels make up the image like this:

000 001 002 003 … 026 027
028 029 030 031 … 054 055
056 057 058 059 … 082 083
 |   |   |   |  …  |   |
728 729 730 731 … 754 755
756 757 758 759 … 782 783

The test data set, (test.csv), is the same as the training set, except that it does not contain the “label” column. Your submission file should be in the following format: For each of the 28000 images in the test set, output a single line containing the ImageId and the digit you predict. For example, if you predict that the first image is of a 3, the second image is of a 7, and the third image is of a 8, then your submission file would look like: ...
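As a small aside on the pixel layout quoted above, the mapping from a flat pixel index to its position in the 28x28 image can be sketched as follows. The helper names are hypothetical, and the stand-in row simply holds the pixel indices instead of real intensities.

```python
import numpy as np

SIDE = 28  # each image is 28 x 28 = 784 pixels

def pixel_to_row_col(x):
    """Map flat pixel index x (0..783) to its (row, column) in the image."""
    return divmod(x, SIDE)

def row_to_image(pixels):
    """Reshape one flat CSV row of 784 pixel values into a 28x28 array."""
    return np.asarray(pixels).reshape(SIDE, SIDE)

# Stand-in row whose values are the pixel indices themselves (assumption).
flat = np.arange(SIDE * SIDE)
image = row_to_image(flat)

# A CNN usually expects an explicit channel axis as well.
batch = image.reshape(-1, SIDE, SIDE, 1)
```

Reading the grid row by row is what makes a plain `reshape` sufficient here: pixel 028 starts the second row and pixel 783 is the bottom-right corner.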

2017-06-05 · 1 min · 211 words · KbWen