
Problem on importing data #2

Open
yuanzhouIR opened this issue Mar 5, 2018 · 6 comments

Comments

yuanzhouIR commented Mar 5, 2018

Dear Dr. Watanabe,

I'm using your tutorial on Japanese text analysis, but I ran into a problem at the first step. I successfully imported your sample data "asahi.csv" using the code
data <- read.csv('data/asahi.csv', sep = "\t", stringsAsFactors = FALSE, encoding = 'UTF-8')
but when I call head(data), something goes wrong.

> date    edition section      page length
1 text592027 2016-01-01    朝刊 3総合\t3   1288
2 text592028 2016-01-01    朝刊 3総合\t3    595
3 text592029 2016-01-01    朝刊 1経済\t4   3214
4 text592030 2016-01-01    朝刊 1経済\t4    983
5 text592031 2016-01-01    朝刊 1外報\t7    375
6 text592032 2016-01-01    朝刊 1外報\t7    497
                                                                                            head                             hash year
1 解散時期、政権見極め 支持率<U+30FB>株価も考慮か 同日選視野\t8b94af77cf10b662e4728e89257d252b                             2016    1
2           大統領府が談話、世論沈静化図る 日韓合意受け2度目\t2c974c3cdb7a2e995fda5316d1bf6961                             2016    1
3           (新発想で挑む 地方の現場から:1)常識を打ち破ろう 水田にトウモロコシ、農の救世主 845f04b7f5bf7b8a3641a959bcbb93ba 2016
4             「中国はいずれTPPに参加」 「『Gゼロ』後の世界」著者、イアン<U+30FB>ブレマー氏 ffbb56d8ce52e4faa11e2b4cc2eec56f 2016
5                                                                 国産空母の建造、中国政府認める e3b5c2a6947176d8c9645db599ab8daa 2016
6                                             韓国、大学生30人を検挙 慰安婦問題合意で抗議行動 5412a90ca34a10016410bea6a15c0d41 2016
  month
1    NA
2    NA
3     1
4     1
5     1
6     1

Some of the data are not in the right columns. Can you help me fix this problem?
Here is my environment information:

> R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3    yaml_2.1.17  
koheiw (Owner) commented Mar 5, 2018

Hi @SyuuGenn , there are two possibilities:

  1. R failed to import the CSV file
  2. R console cannot print Japanese characters properly

In order to check what is really happening, try View(data) in RStudio (see the sketch below).

Sorry to say, but doing text analysis on Windows is really hard because of its poor Unicode support. I use a Windows machine, but I do text analysis on Linux in VirtualBox. That is what you should do if you seriously want to analyze non-Chinese texts.
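
Something like this minimal sketch (assuming the same file path as in your code and the head column from your output) should tell the two cases apart:

data <- read.csv('data/asahi.csv', sep = "\t",
                 stringsAsFactors = FALSE, encoding = 'UTF-8')
str(data)                  # wrong column count or types here means the import itself failed
table(Encoding(data$head)) # non-ASCII strings should be marked "UTF-8" if the import worked
View(data)                 # the RStudio viewer can render UTF-8 even when the console cannot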

koheiw (Owner) commented Mar 13, 2018

@SyuuGenn I forgot to mention that the problem might not be about importing but about displaying a dfm on Windows (it is a known bug). On my non-Japanese Windows,

data <- read.csv('C:/Users/Kohei/Desktop/asahi.csv', sep = "\t", 
                 stringsAsFactors = FALSE, encoding = 'UTF-8')
head(data, 2)
#                 date          edition           section page length
# text592027 2016-01-01 <U+671D><U+520A> 3<U+7DCF><U+5408>    3   1288
# text592028 2016-01-01 <U+671D><U+520A> 3<U+7DCF><U+5408>    3    595
#                                                                                                               head
# text592027 <U+89E3><U+6563><U+6642><U+671F><U+3001><U+653F><U+6A29><U+898B><U+6975><U+3081> <U+652F><U+6301><U+7387>·<U+682A><U+4FA1><U+3082><U+8003><U+616E><U+304B> <U+540C><U+65E5><U+9078><U+8996><U+91CE>
# text592028          <U+5927><U+7D71><U+9818><U+5E9C><U+304C><U+8AC7><U+8A71><U+3001><U+4E16><U+8AD6><U+6C88><U+9759><U+5316><U+56F3><U+308B> <U+65E5><U+97D3><U+5408><U+610F><U+53D7><U+3051>2<U+5EA6><U+76EE>
#                                       hash year month
# text592027 8b94af77cf10b662e4728e89257d252b 2016     1
# text592028 2c974c3cdb7a2e995fda5316d1bf6961 2016     1

as.matrix(head(data, 2))

#            date         edition section  page length head                                                    
# text592027 "2016-01-01" "朝刊"  "3総合" "3"  "1288" "解散時期、政権見極め 支持率・株価も考慮か 同日選視野"
# text592028 "2016-01-01" "朝刊"  "3総合" "3"  " 595" "大統領府が談話、世論沈静化図る 日韓合意受け2度目"    
#            hash                               year   month
# text592027 "8b94af77cf10b662e4728e89257d252b" "2016" "1"  
# text592028 "2c974c3cdb7a2e995fda5316d1bf6961" "2016" "1" 

In short, as.matrix() could be a workaround.

yuanzhouIR (Author) commented

@koheiw Thank you very much! Shortly after I posted this issue, I bought a MacBook, and the problem does not occur on macOS.
I also have some questions about collecting data from the newspaper database and hope you can help me. I searched "一帯一路" in the Asahi Shimbun database "聞蔵Ⅱビジュアル" and found 235 results. I want to save them like your sample data, but I don't know how to do that. Of course I could copy them one by one, but I don't think that is an efficient way. Your sample data include all the news in the Asahi Shimbun in 2016, so I think you may have a method to scrape the data automatically. I'm wondering if you could tell me how you collected the data.
Best wishes!

koheiw (Owner) commented Mar 13, 2018

Nice, but can you check whether as.matrix() solves the problem on your Windows machine, if you still have it?

yuanzhouIR (Author) commented

I tried the as.matrix() code on my Windows machine, but it made no difference. I think it is not a matrix problem but a decoding problem. My Windows may not recognize the tab positions correctly, because some \t appear where the fields should have been separated. I really appreciate your perseverance in solving problems. Thanks again.

koheiw (Owner) commented Mar 13, 2018

Thanks. It seems like an importing problem.
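
If so, an importer that decodes the file as UTF-8 before splitting the fields might behave better on Windows. A quick sketch with readr, untested on your file:

library(readr)
data <- read_tsv('data/asahi.csv', locale = locale(encoding = 'UTF-8'))
head(data, 2)

read_tsv() also keeps strings as characters by default, so stringsAsFactors is not needed.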
