
Problem on importing data #2

Open
yuanzhouIR opened this issue Mar 5, 2018 · 6 comments

Comments

yuanzhouIR commented Mar 5, 2018

Dear Dr. Watanabe,

I'm using your tutorial on Japanese text analysis, but I ran into a problem at the first step. I successfully imported your sample data "asahi.csv" using the code
data <- read.csv('data/asahi.csv', sep = "\t", stringsAsFactors = FALSE, encoding = 'UTF-8')
but when I call head(data), something goes wrong.

> date    edition section      page length
1 text592027 2016-01-01    朝刊 3総合\t3   1288
2 text592028 2016-01-01    朝刊 3総合\t3    595
3 text592029 2016-01-01    朝刊 1経済\t4   3214
4 text592030 2016-01-01    朝刊 1経済\t4    983
5 text592031 2016-01-01    朝刊 1外報\t7    375
6 text592032 2016-01-01    朝刊 1外報\t7    497
                                                                                            head                             hash year
1 解散時期、政権見極め 支持率<U+30FB>株価も考慮か 同日選視野\t8b94af77cf10b662e4728e89257d252b                             2016    1
2           大統領府が談話、世論沈静化図る 日韓合意受け2度目\t2c974c3cdb7a2e995fda5316d1bf6961                             2016    1
3           (新発想で挑む 地方の現場から:1)常識を打ち破ろう 水田にトウモロコシ、農の救世主 845f04b7f5bf7b8a3641a959bcbb93ba 2016
4             「中国はいずれTPPに参加」 「『Gゼロ』後の世界」著者、イアン<U+30FB>ブレマー氏 ffbb56d8ce52e4faa11e2b4cc2eec56f 2016
5                                                                 国産空母の建造、中国政府認める e3b5c2a6947176d8c9645db599ab8daa 2016
6                                             韓国、大学生30人を検挙 慰安婦問題合意で抗議行動 5412a90ca34a10016410bea6a15c0d41 2016
  month
1    NA
2    NA
3     1
4     1
5     1
6     1

Some of the data are not in the right columns. Can you help me fix this problem?
Here is my environment information:

> R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3    yaml_2.1.17  
koheiw (Owner) commented Mar 5, 2018

Hi @SyuuGenn , there are two possibilities:

  1. R failed to import the CSV file
  2. R console cannot print Japanese characters properly

In order to check what is really happening, try View(data) in RStudio (see the sketch below).

Sorry to say, but doing text analysis on Windows is really hard because of its poor Unicode support. I use a Windows machine, but I do text analysis on Linux in VirtualBox. That is what you should do if you seriously want to analyze non-Chinese texts.
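
Something like this minimal sketch (assuming the same file path as in your code and the head column from your output) should tell the two cases apart:

data <- read.csv('data/asahi.csv', sep = "\t",
                 stringsAsFactors = FALSE, encoding = 'UTF-8')
str(data)                  # wrong column count or types here means the import itself failed
table(Encoding(data$head)) # non-ASCII strings should be marked "UTF-8" if the import worked
View(data)                 # the RStudio viewer can render UTF-8 even when the console cannot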

koheiw (Owner) commented Mar 13, 2018

@SyuuGenn I forgot to mention that the problem might not be about importing but about displaying a dfm on Windows (it is a known bug). On my non-Japanese Windows,

data <- read.csv('C:/Users/Kohei/Desktop/asahi.csv', sep = "\t", 
                 stringsAsFactors = FALSE, encoding = 'UTF-8')
head(data, 2)
#                 date          edition           section page length
# text592027 2016-01-01 <U+671D><U+520A> 3<U+7DCF><U+5408>    3   1288
# text592028 2016-01-01 <U+671D><U+520A> 3<U+7DCF><U+5408>    3    595
#                                                                                                               head
# text592027 <U+89E3><U+6563><U+6642><U+671F><U+3001><U+653F><U+6A29><U+898B><U+6975><U+3081> <U+652F><U+6301><U+7387>·<U+682A><U+4FA1><U+3082><U+8003><U+616E><U+304B> <U+540C><U+65E5><U+9078><U+8996><U+91CE>
# text592028          <U+5927><U+7D71><U+9818><U+5E9C><U+304C><U+8AC7><U+8A71><U+3001><U+4E16><U+8AD6><U+6C88><U+9759><U+5316><U+56F3><U+308B> <U+65E5><U+97D3><U+5408><U+610F><U+53D7><U+3051>2<U+5EA6><U+76EE>
#                                       hash year month
# text592027 8b94af77cf10b662e4728e89257d252b 2016     1
# text592028 2c974c3cdb7a2e995fda5316d1bf6961 2016     1

as.matrix(head(data, 2))

#            date         edition section  page length head                                                    
# text592027 "2016-01-01" "朝刊"  "3総合" "3"  "1288" "解散時期、政権見極め 支持率・株価も考慮か 同日選視野"
# text592028 "2016-01-01" "朝刊"  "3総合" "3"  " 595" "大統領府が談話、世論沈静化図る 日韓合意受け2度目"    
#            hash                               year   month
# text592027 "8b94af77cf10b662e4728e89257d252b" "2016" "1"  
# text592028 "2c974c3cdb7a2e995fda5316d1bf6961" "2016" "1" 

In short, as.matrix() could be a workaround.

yuanzhouIR (Author) commented

@koheiw Thank you very much! Shortly after I posted this issue, I bought a MacBook, and the problem does not occur on macOS.
I also have some questions about collecting data from the newspaper database and hope you can help me. I searched "一帯一路" in the Asahi Shimbun database "聞蔵Ⅱビジュアル" and found 235 results. I want to save them like your sample data, but I don't know how to do that. Of course I could copy them one by one, but I don't think that is an efficient way. Your sample data include all the news in the Asahi Shimbun in 2016, so I think you may have a method to scrape the data automatically. I'm wondering if you could tell me how you collected the data.
Best wishes!

koheiw (Owner) commented Mar 13, 2018

Nice, but can you check whether as.matrix() solves the problem on your Windows machine, if you still have it?

yuanzhouIR (Author) commented

I tried the as.matrix() code on my Windows machine, but it made no difference. I think it is not a matrix problem but a decoding problem. My Windows may not recognize the tab positions correctly, because some \t appear where the fields should have been separated. I really appreciate your perseverance in solving problems. Thanks again.

koheiw (Owner) commented Mar 13, 2018

Thanks. It seems like an importing problem.
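
If so, an importer that decodes the file as UTF-8 before splitting the fields might behave better on Windows. A quick sketch with readr, untested on your file:

library(readr)
data <- read_tsv('data/asahi.csv', locale = locale(encoding = 'UTF-8'))
head(data, 2)

read_tsv() also keeps strings as characters by default, so stringsAsFactors is not needed.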
