Data Loading, Storage, and File Formats

Reading and Writing Data in Text Format

pandas provides a range of functions for loading tabular data into a DataFrame. The table below lists some of them.

Function	Description
`read_csv`	Load delimited data from a file, URL, or file-like object; use comma as default delimiter
`read_table`	Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
`read_fwf`	Read data in fixed-width column format (i.e., no delimiters)
`read_clipboard`	Version of read_table that reads data from the clipboard; useful for converting tables from web pages
`read_excel`	Read tabular data from an Excel XLS or XLSX file
`read_hdf`	Read HDF5 files written by pandas
`read_html`	Read all tables found in the given HTML document
`read_json`	Read data from a JSON (JavaScript Object Notation) string representation
`read_msgpack`	Read pandas data encoded using the MessagePack binary format (deprecated in pandas 0.25 and removed in 1.0)
`read_pickle`	Read an arbitrary object stored in Python pickle format
`read_sas`	Read a SAS dataset stored in one of the SAS system’s custom storage formats
`read_sql`	Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
`read_stata`	Read a dataset from Stata file format
`read_feather`	Read the Feather binary file format

The file ex1.csv contains the following.

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Because the file is comma-delimited, it can be loaded into a DataFrame with read_csv.

In [28]:

import pandas as pd
import numpy as np

root_url = 'http://compmath.korea.ac.kr/appmath/data/'

In [3]:

df = pd.read_csv(root_url+'ex1.csv')
df

Out[3]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

The same file can also be read with read_table by specifying the delimiter.

In [4]:

pd.read_table(root_url + 'ex1.csv', sep=',')

Out[4]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

Not every file has a header row. Consider the file ex2.csv.

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

There are two ways to read this file: let pandas assign default column names, or supply the column names yourself.

In [6]:

pd.read_csv(root_url + 'ex2.csv', header=None)

Out[6]:

	0	1	2	3	4
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

In [7]:

pd.read_csv(root_url + 'ex2.csv',  names=['a', 'b', 'c', 'd', 'message'])

Out[7]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
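The difference between the two options, and what happens when neither is used, can be sketched on in-memory text. The string below simply repeats the contents of ex2.csv so that no download is needed; the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# Same content as ex2.csv, held in memory.
text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Option 1: pandas assigns integer column names 0..4.
default_names = pd.read_csv(StringIO(text), header=None)

# Option 2: we supply the names; passing names= implies there is no header row.
custom_names = pd.read_csv(StringIO(text), names=['a', 'b', 'c', 'd', 'message'])

# With neither option, the first data row is consumed as the header.
wrong = pd.read_csv(StringIO(text))

print(list(default_names.columns))
print(list(custom_names.columns))
print(list(wrong.columns), len(wrong))
```

Note that in the last call one row of data is silently lost to the header, which is why a header-less file should always be read with header=None or names=.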

To use one of the columns as the row index, pass its name or position to the index_col argument.

In [8]:

names=['a', 'b', 'c', 'd', 'message']

pd.read_csv(root_url + 'ex2.csv', names=names, index_col='message')

Out[8]:

	a	b	c	d
message
hello	1	2	3	4
world	5	6	7	8
foo	9	10	11	12

To build a hierarchical index from several columns, pass a list of column numbers or names. Consider the file csv_mindex.csv.

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [10]:

parsed = pd.read_csv(root_url + 'csv_mindex.csv', index_col=['key1', 'key2'])
parsed

Out[10]:

		value1	value2
key1	key2
one	a	1	2
	b	3	4
	c	5	6
	d	7	8
two	a	9	10
	b	11	12
	c	13	14
	d	15	16
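Once the two key columns form a hierarchical index, rows can be selected by the outer level alone or by a full (key1, key2) pair. The following is a minimal sketch on an in-memory copy of csv_mindex.csv; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
"""

parsed = pd.read_csv(StringIO(text), index_col=['key1', 'key2'])

# Select every row under the outer key 'one'; the result is indexed by key2.
one_rows = parsed.loc['one']

# Select a single cell with a full (key1, key2) pair and a column name.
single = parsed.loc[('two', 'b'), 'value1']

# reset_index turns the index levels back into ordinary columns.
flat = parsed.reset_index()

print(one_rows)
print(single)
print(list(flat.columns))
```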

Some files separate fields not with a fixed delimiter but with a variable amount of whitespace (spaces or tabs). Consider the file ex3.txt.

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

The fields here are separated by runs of whitespace. In this case a regular expression can be passed as the sep argument of read_table: sep=r'\s+'. The pattern \s+ matches one or more whitespace characters.

In [3]:

pd.read_table(root_url + 'ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning

Out[3]:

	A	B	C
aaa	-0.264438	-1.026059	-0.619500
bbb	0.927272	0.302904	-0.032399
ccc	-0.264273	-0.386314	-0.217601
ddd	-0.871858	-0.348382	1.100491

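Because the header row has one fewer field than the data rows, read_table inferred that the first column should be used as the DataFrame's index. The parsing functions take many optional arguments to cope with irregular file layouts. The following table lists frequently used arguments of read_csv and read_table.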

Argument	Description
path	String indicating filesystem location, URL, or file-like object
sep or delimiter	Character sequence or regular expression to use to split fields in each row
header	Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row
index_col	Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index
names	List of column names for result, combine with header=None
skiprows	Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip.
na_values	Sequence of values to replace with NA.
comment	Character(s) to split comments off the end of lines.
parse_dates	Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise can specify a list of column numbers or names to parse. If element of list is tuple or list, will combine multiple columns together and parse to date (e.g., if date/time split across two columns).
keep_date_col	If joining columns to parse date, keep the joined columns; False by default.
converters	Dict containing column number or name mapping to functions (e.g., {'foo': f} would apply the function f to all values in the 'foo' column).
dayfirst	When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default.
date_parser	Function to use to parse dates.
nrows	Number of rows to read from beginning of file.
iterator	Return a TextParser object for reading file piecemeal.
chunksize	For iteration, size of file chunks.
skipfooter	Number of lines to ignore at end of file (spelled skip_footer in older pandas releases).
verbose	Print various parser output information, like the number of missing values placed in non-numeric columns.
encoding	Text encoding for Unicode (e.g., ‘utf-8’ for UTF-8 encoded text).
squeeze	If the parsed data only contains one column, return a Series.
thousands	Separator for thousands (e.g., ‘,’ or ‘.’).
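Several of these arguments are often combined in a single call. The sketch below uses a small made-up semicolon-delimited text (the column names and values are hypothetical) to show sep, thousands, parse_dates, and nrows working together.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ';' separates fields, ',' is a thousands separator.
text = (
    "date;city;sales\n"
    "2012-06-07;Seoul;1,234\n"
    "2012-06-08;Busan;12,500\n"
    "2012-06-09;Daegu;980\n"
)

df = pd.read_csv(
    StringIO(text),
    sep=';',                # field delimiter
    thousands=',',          # strip ',' inside numbers before converting
    parse_dates=['date'],   # convert this column to datetime64
    nrows=2,                # read only the first two data rows
)

print(df)
print(df.dtypes)
```

Without thousands=',' the sales column would be read as strings, because "1,234" is not a valid number on its own.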

In the file ex4.csv, the first, third, and fourth lines are comments.

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

The skiprows argument makes the parser skip specific lines.

In [5]:

pd.read_csv(root_url + 'ex4.csv', skiprows=[0, 2, 3])

Out[5]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
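When every unwanted line starts with the same marker, the comment argument from the table above is an alternative to listing row numbers. The sketch below compares the two approaches on an in-memory copy of ex4.csv; it is an illustration, not the only way to handle such files.

```python
from io import StringIO

import pandas as pd

text = """# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# Skip lines by position (0-based line numbers in the file).
by_skiprows = pd.read_csv(StringIO(text), skiprows=[0, 2, 3])

# Skip lines by content: anything from '#' to the end of a line is ignored,
# and lines that are entirely comments are dropped.
by_comment = pd.read_csv(StringIO(text), comment='#')

print(by_comment)
```

One caveat: comment='#' also truncates any data field that happens to contain '#', so skiprows is the safer choice when the marker can appear inside values.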

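Handling missing values is an important and often fiddly part of reading a file. Missing data is usually either absent (an empty string) or marked by a sentinel value; by default pandas recognizes sentinels such as NA and NULL. Consider the file ex5.csv.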

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [7]:

res = pd.read_csv(root_url + 'ex5.csv')
res

Out[7]:

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

In [8]:

pd.isnull(res)

Out[8]:

	something	a	b	c	d	message
0	False	False	False	False	False	True
1	False	False	False	True	False	False
2	False	False	False	False	False	False

The na_values argument specifies additional strings to treat as missing.

In [11]:

pd.read_csv(root_url + 'ex5.csv', na_values=['one'])

Out[11]:

	something	a	b	c	d	message
0	NaN	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

Different missing-value strings can be specified for each column by passing a dict.

In [9]:

소실 = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(root_url + 'ex5.csv', na_values=소실)

Out[9]:

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	NaN	5	6	NaN	8	world
2	three	9	10	11.0	12	NaN
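After loading, it is often useful to check how many values ended up missing in each column. A minimal sketch on an in-memory copy of ex5.csv follows; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

res = pd.read_csv(StringIO(text))

# Count missing values per column.
missing_per_column = res.isnull().sum()

# Keep only the rows that contain at least one missing value.
rows_with_missing = res[res.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```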

Reading Text Files in Pieces

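When processing very large files, or working out the right set of arguments, you may want to read only a small piece of a file or iterate through it in chunks. Before looking at a large file, change the pandas display settings so that output stays compact.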

In [12]:

pd.options.display.max_rows = 10

Now read ex6.csv, a file with 10,000 rows.

In [13]:

res = pd.read_csv(root_url + 'ex6.csv')
res

Out[13]:

	one	two	three	four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q
...	...	...	...	...	...
9995	2.311896	-0.417070	-1.409599	-0.515821	L
9996	-0.479893	-0.650419	0.745152	-0.646038	E
9997	0.523331	0.787112	0.486066	1.093156	K
9998	-0.362559	0.598894	-1.843201	0.887292	G
9999	-0.096376	-1.012999	-0.657431	-0.573315	0

10000 rows × 5 columns

To read only the first few rows of a file, give the number of rows with nrows.

In [14]:

pd.read_csv(root_url + 'ex6.csv', nrows=5)

Out[14]:

	one	two	three	four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q

To read a file iteratively in pieces of a fixed number of rows, use chunksize.

In [15]:

부분 = pd.read_csv(root_url + 'ex6.csv', chunksize=1000)
부분

Out[15]:

<pandas.io.parsers.TextFileReader at 0x1ed4f2c0780>

The TextFileReader object returned by read_csv with chunksize=1000 lets you iterate over the file 1,000 rows at a time. For example, the value counts of the key column can be accumulated chunk by chunk.

In [17]:

부분 = pd.read_csv(root_url + 'ex6.csv', chunksize=1000)

총 = pd.Series([], dtype='int64')  # explicit dtype avoids the empty-Series dtype warning
for 데 in 부분:
    총 = 총.add(데['key'].value_counts(), fill_value=0)

총 = 총.sort_values(ascending=False)
총

Out[17]:

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
     ...
5    157.0
2    152.0
0    151.0
9    150.0
1    146.0
Length: 36, dtype: float64

The get_chunk method of TextFileReader reads a piece of whatever size you request.
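As a rough illustration of get_chunk, the sketch below builds a ten-row CSV text in memory (the columns x and y are made up) and pulls pieces of different sizes from the same reader. Each call continues where the previous one stopped.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ten rows with x = 0..9 and y = x squared.
text = "x,y\n" + "".join(f"{i},{i * i}\n" for i in range(10))

# iterator=True returns a TextFileReader instead of a DataFrame.
reader = pd.read_csv(StringIO(text), iterator=True)

first = reader.get_chunk(3)    # rows 0, 1, 2
second = reader.get_chunk(2)   # rows 3, 4
reader.close()

print(first)
print(second)
```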

Writing Data to Text Format

Data can also be written out to a delimited file. Take the following file as an example.

In [18]:

데 = pd.read_csv(root_url + 'ex5.csv')
데

Out[18]:

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

A DataFrame can be written to a comma-separated file with its to_csv method.

In [21]:

데.to_csv('data/out.csv')
!type data\out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters can be used as well. Writing to sys.stdout prints the text to the console instead of a file.

In [23]:

import sys

데.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values are written as empty strings by default; they can be written as some other string instead.

In [24]:

데.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

By default both the row labels and the column labels are written. To leave them out, pass index=False (row labels) and header=False (column labels).

In [25]:

데.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, in an order of your choosing.

In [26]:

데.to_csv(sys.stdout, index=False, columns=['c', 'b', 'a'])

c,b,a
3.0,2,1
,6,5
11.0,10,9

A Series also has a to_csv method.

In [29]:

날 = pd.date_range('2018-01-01', periods=7)

시 = pd.Series(np.arange(7), index=날)
시.to_csv('data/ts.csv', header=False)  # newer pandas writes a header line unless told not to
!type data\ts.csv

2018-01-01,0
2018-01-02,1
2018-01-03,2
2018-01-04,3
2018-01-05,4
2018-01-06,5
2018-01-07,6
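A file written this way can be read back with read_csv. The sketch below does the round trip through an in-memory buffer instead of data/ts.csv; the column names date and value are chosen here for illustration, since the file itself has no header.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('2018-01-01', periods=7)
ts = pd.Series(np.arange(7), index=dates)

# Write without a header line, as in the output shown above.
buf = io.StringIO()
ts.to_csv(buf, header=False)
lines = buf.getvalue().splitlines()

# Read it back: name the two columns, use the first as the index,
# and parse that index as dates.
restored = pd.read_csv(
    io.StringIO(buf.getvalue()),
    header=None,
    names=['date', 'value'],
    index_col='date',
    parse_dates=True,
)

print(lines[0])
print(restored)
```

The result is a one-column DataFrame with a DatetimeIndex; selecting restored['value'] gives back a Series.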

Working with Delimited Formats

JSON Data

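JSON (JavaScript Object Notation) has become one of the standard formats for sending data over HTTP between web browsers and other applications. It is a much more free-form data format than a tabular text format like CSV. Here is an example.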

In [30]:

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

The basic JSON types are object (dict), array (list), string, number, boolean, and null. An object starts and ends with curly braces, and all of its keys must be strings. An array starts and ends with square brackets. Several Python libraries can read and write JSON; here we use json from the Python standard library. To convert a JSON string to a Python object, use json.loads (or json.load to read from a file object).

In [31]:

import json

res = json.loads(obj)
res

Out[31]:

{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
  {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}

Conversely, json.dumps (or json.dump, to write to a file object) converts a Python object back to JSON.

In [34]:

asjson = json.dumps(res)
asjson

Out[34]:

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
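The string-based functions (loads/dumps) and the file-based ones (load/dump) are easy to confuse. The sketch below uses both on a shortened version of the record above, with an in-memory StringIO standing in for a real file.

```python
import io
import json

record = {
    "name": "Wes",
    "pet": None,
    "places_lived": ["United States", "Spain", "Germany"],
}

# dumps/loads work on strings. Python None becomes JSON null.
text = json.dumps(record, indent=2, sort_keys=True)
from_text = json.loads(text)

# dump/load work on file objects; StringIO stands in for an open file.
buf = io.StringIO()
json.dump(record, buf)
buf.seek(0)
from_file = json.load(buf)

print(text)
```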

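How you convert a JSON object or list of objects to a DataFrame or some other data structure is up to you. Conveniently, you can pass a list of dicts to the DataFrame constructor and select a subset of the fields.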

In [35]:

자녀 = pd.DataFrame(res['siblings'], columns=['age', 'name'])
자녀

Out[35]:

	age	name
0	30	Scott
1	38	Katie

The pandas read_json function can automatically convert JSON data in certain arrangements into a Series or DataFrame. Consider the following file.

In [37]:

!type data\example.json

[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

By default, read_json assumes that each object in the JSON array is a row in the table.

In [38]:

data = pd.read_json('data/example.json')
data

Out[38]:

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

To export a pandas object to JSON, use the to_json method.

In [39]:

data.to_json()

Out[39]:

'{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'

In [40]:

data.to_json(orient='records')

Out[40]:

'[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]'
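The orient argument controls the JSON layout on both the writing and the reading side, so the same value should be used for a round trip. A minimal sketch, rebuilding the small frame above in memory, looks like this. Wrapping the JSON text in StringIO is a choice made here because recent pandas releases discourage passing a literal JSON string to read_json.

```python
from io import StringIO

import pandas as pd

data = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# One JSON object per row.
as_records = data.to_json(orient='records')

# Separate lists for columns, index, and data.
as_split = data.to_json(orient='split')

# Read each string back with the matching orient.
from_records = pd.read_json(StringIO(as_records), orient='records')
from_split = pd.read_json(StringIO(as_split), orient='split')

print(as_records)
print(as_split)
```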

XML and HTML: Web Scraping

Python has many libraries for reading data in XML and HTML formats, such as lxml, Beautiful Soup, and html5lib. lxml is generally much faster than the alternatives, while the others are more forgiving of malformed HTML or XML.

pandas has a function, read_html, that converts the tables in an HTML file into a list of DataFrame objects. To show how it works, an HTML file listing bank failures was downloaded from the United States FDIC and saved as fdic_failed_bank_list.html.

In [41]:

tables = pd.read_html('data/fdic_failed_bank_list.html')
len(tables)

Out[41]:

In [42]:

failures = tables[0]
failures.head()

Out[42]:

	Bank Name	City	ST	CERT	Acquiring Institution	Closing Date	Updated Date
0	Allied Bank	Mulberry	AR	91	Today's Bank	September 23, 2016	November 17, 2016
1	The Woodbury Banking Company	Woodbury	GA	11297	United Bank	August 19, 2016	November 17, 2016
2	First CornerStone Bank	King of Prussia	PA	35312	First-Citizens Bank & Trust Company	May 6, 2016	September 6, 2016
3	Trust Company Bank	Memphis	TN	9956	The Bank of Fayette County	April 29, 2016	September 6, 2016
4	North Milwaukee State Bank	Milwaukee	WI	20364	First-Citizens Bank & Trust Company	March 11, 2016	June 16, 2016

Next, build a Series of the closing dates.

In [43]:

closing_timestamps = pd.to_datetime(failures['Closing Date'])
closing_timestamps

Out[43]:

0     2016-09-23
1     2016-08-19
2     2016-05-06
3     2016-04-29
4     2016-03-11
         ...
542   2001-07-27
543   2001-05-03
544   2001-02-02
545   2000-12-14
546   2000-10-13
Name: Closing Date, Length: 547, dtype: datetime64[ns]

Then use the Series's dt accessor to count the bank failures by year.

In [45]:

closing_timestamps.dt.year.value_counts()

Out[45]:

  157
  140
   92
   51
   25
       ...
    4
    4
    3
    3
    2
Name: Closing Date, Length: 15, dtype: int64
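The date conversion and the per-year count can be tried without the FDIC file. The sketch below uses three closing dates copied from the output above and, as an optional extra, passes an explicit format string so that pandas does not have to guess the layout.

```python
import pandas as pd

# Three closing dates taken from the table above.
closing = pd.Series(
    ['September 23, 2016', 'August 19, 2016', 'July 27, 2001'],
    name='Closing Date',
)

# %B = full month name, %d = day of month, %Y = four-digit year.
stamps = pd.to_datetime(closing, format='%B %d, %Y')

# Extract the year of each timestamp and count occurrences.
per_year = stamps.dt.year.value_counts()

print(stamps)
print(per_year)
```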

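Exercises

Print the neighborhood forecast (동네예보) from the Korea Meteorological Administration weather site http://www.weather.go.kr/weather/main.jsp. (Use the get method of the requests module.)
Use the set_axis method to set the first row as the column names.
Use the filter method to print the columns whose names contain 날.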

Parsing XML with lxml.objectify

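XML (eXtensible Markup Language) is a structured data format that supports hierarchical, nested data with metadata. XML and HTML are structurally similar, but XML is more general. This section shows an example of parsing XML data. The following is part of a dataset published by the New York Metropolitan Transportation Authority (MTA) on the performance of its bus and train services.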

<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational
  systemwide. The availability rate is based on physical observations performed
  the morning of regular business days only. This is a new indicator the agency
  began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>

Parse the file with lxml.objectify.parse and get a reference to the root node with getroot.

In [46]:

from lxml import objectify

parsed = objectify.parse(open('data/Performance_MNR.xml'))
root = parsed.getroot()

root.INDICATOR refers to the first <INDICATOR> element; iterating over it yields every <INDICATOR> element under the root.

In [47]:

root.INDICATOR

Out[47]:

<Element INDICATOR at 0x1ed53485408>

For each record, the tag names and data values can be collected into a dict.

In [48]:

data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

Finally, convert this list of dicts into a DataFrame.

In [50]:

perf = pd.DataFrame(data)
perf.head()

Out[50]:

	AGENCY_NAME	CATEGORY	DESCRIPTION	FREQUENCY	INDICATOR_NAME	INDICATOR_UNIT	MONTHLY_ACTUAL	MONTHLY_TARGET	PERIOD_MONTH	PERIOD_YEAR	YTD_ACTUAL	YTD_TARGET
0	Metro-North Railroad	Service Indicators	Percent of commuter trains that arrive at thei...	M	On-Time Performance (West of Hudson)	%	96.9	95	1	2008	96.9	95
1	Metro-North Railroad	Service Indicators	Percent of commuter trains that arrive at thei...	M	On-Time Performance (West of Hudson)	%	95	95	2	2008	96	95
2	Metro-North Railroad	Service Indicators	Percent of commuter trains that arrive at thei...	M	On-Time Performance (West of Hudson)	%	96.9	95	3	2008	96.3	95
3	Metro-North Railroad	Service Indicators	Percent of commuter trains that arrive at thei...	M	On-Time Performance (West of Hudson)	%	98.3	95	4	2008	96.8	95
4	Metro-North Railroad	Service Indicators	Percent of commuter trains that arrive at thei...	M	On-Time Performance (West of Hudson)	%	95.8	95	5	2008	96.6	95
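If lxml is not installed, the same record-to-dict idea can be sketched with xml.etree.ElementTree from the standard library. This is a swapped-in technique, not the lxml.objectify approach used above. The XML text is a shortened, made-up sample in the same shape as the MTA data, and the root tag PERFORMANCE is an assumption. Unlike objectify's pyval, ElementTree returns every value as a string (or None for an empty tag), so numeric columns are converted explicitly.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample in the same shape as the MTA file; root tag is assumed.
xml_text = """<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>1</INDICATOR_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <MONTHLY_ACTUAL>96.9</MONTHLY_ACTUAL>
  </INDICATOR>
  <INDICATOR>
    <INDICATOR_SEQ>2</INDICATOR_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
  </INDICATOR>
</PERFORMANCE>"""

root = ET.fromstring(xml_text)
skip_fields = {'INDICATOR_SEQ'}

# One dict per <INDICATOR> record, mapping tag name to text content.
records = []
for elt in root.findall('INDICATOR'):
    records.append({child.tag: child.text
                    for child in elt
                    if child.tag not in skip_fields})

perf = pd.DataFrame(records)

# ElementTree yields strings (None for empty tags), so convert explicitly.
perf['PERIOD_YEAR'] = perf['PERIOD_YEAR'].astype(int)
perf['MONTHLY_ACTUAL'] = pd.to_numeric(perf['MONTHLY_ACTUAL'])

print(perf)
```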

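XML data can get much more complicated than this; each tag can carry metadata (attributes) as well. A well-formed HTML fragment, such as the link tag below, is also valid XML and can be parsed the same way.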

In [51]:

from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

In [52]:

root

Out[52]:

<Element a at 0x1ed538e8e48>

In [53]:

root.get('href')

Out[53]:

'http://www.google.com'

In [54]:

root.text

Out[54]:

'Google'

Binary Data Formats

One of the easiest ways to store data in binary format is Python's built-in pickle serialization. pandas objects have a to_pickle method that writes the data to disk in pickle format.

In [55]:

프 = pd.read_csv(root_url + 'ex1.csv')
프

Out[55]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

In [56]:

프.to_pickle('data/frame_pickle')

Pickled objects can be read directly with the built-in pickle module, or more conveniently with the pandas read_pickle function.

In [57]:

pd.read_pickle('data/frame_pickle')

Out[57]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

pickle is recommended only as a short-term storage format: an object pickled today may not unpickle with a later version of a library.
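Both ways of reading a pickle file can be compared in a short round trip. The sketch below writes to a temporary directory rather than data/frame_pickle, and the small frame is made up for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame_pickle')

    # Write with pandas.
    frame.to_pickle(path)

    # Read back with pandas ...
    via_pandas = pd.read_pickle(path)

    # ... or with the standard pickle module.
    with open(path, 'rb') as f:
        via_pickle = pickle.load(f)

print(via_pandas.equals(frame), via_pickle.equals(frame))
```

Reading with the plain pickle module works here because the path has no compression extension, so to_pickle writes an uncompressed file.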

HDF5 Format

HDF5 (Hierarchical Data Format) is well suited to storing large quantities of scientific array data. Each HDF5 file can store multiple datasets along with supporting metadata.

Excel Files

Files from Excel 2003 and later can be read with either the ExcelFile class or the read_excel function.

In [58]:

엑 = pd.ExcelFile(root_url + 'ex1.xlsx')

Data stored in a sheet can then be read with read_excel.

In [60]:

pd.read_excel(엑, 'Sheet1')

Out[60]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

When reading several sheets from one file, creating an ExcelFile first is faster because the file is opened only once. Alternatively, simply pass the filename to read_excel.

In [61]:

pd.read_excel(root_url + 'ex1.xlsx')

Out[61]:

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

To write pandas data to Excel format, first create an ExcelWriter, then write to it with the object's to_excel method.

In [62]:

writer = pd.ExcelWriter('data/ex2.xlsx')

In [63]:

프.to_excel(writer, 'Sheet1')

In [64]:

writer.close()  # writes the file; ExcelWriter.save() was removed in pandas 2.0

Alternatively, pass a file path to to_excel and skip the ExcelWriter.

In [65]:

프.to_excel('data/ex2.xlsx')

Interacting with Web APIs

Many websites have public APIs (Application Programming Interfaces) providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; here we use the requests package.

To find the most recent issues for pandas on GitHub, make a GET HTTP request with requests.

In [66]:

import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

resp = requests.get(url)
resp

Out[66]:

<Response [200]>

The response object's json method returns the JSON content parsed into native Python objects, here a list of dicts.

In [68]:

data = resp.json()
data[0]['title']

Out[68]:

'Serialization / Deserialization of ExtensionArrays'

data is a list of dicts, one for each of the 30 most recent issues on the GitHub issues page. The fields of interest can be passed to the DataFrame constructor.

In [69]:

issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues

Out[69]:

	number	title	labels	state
0	20612	Serialization / Deserialization of ExtensionAr...	[{'id': 2301354, 'url': 'https://api.github.co...	open
1	20611	REF: IntervalIndex[IntervalArray]	[{'id': 849023693, 'url': 'https://api.github....	open
2	20608	read_json reads large integers as strings inco...	[]	open
3	20607	Calling pandas.cut with series of timedelta an...	[]	open
4	20604	Is it necessary to unify pandas and pd in docs...	[]	open
...	...	...	...	...
25	20575	Plotting 128Hz timeseries crashes	[{'id': 2413328, 'url': 'https://api.github.co...	open
26	20572	Feat/scatter by size	[]	open
27	20565	concat handles MultiIndex differently when ind...	[{'id': 76811, 'url': 'https://api.github.com/...	open
28	20564	DOC: intersphinx to pandas-gbq	[{'id': 134699, 'url': 'https://api.github.com...	open
29	20562	[WIP] Complete offset prefix mapping	[{'id': 53181044, 'url': 'https://api.github.c...	open

30 rows × 4 columns
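The labels column holds a list of dicts per issue, which is awkward to read. One way to flatten it is sketched below on a few hand-made records, so no network access is needed. The records are hypothetical, and the sketch assumes each label dict has a name key, which does not appear in the truncated output above.

```python
import pandas as pd

# Hypothetical records shaped like the API response; 'name' key is assumed.
data = [
    {'number': 3, 'title': 'First sample issue', 'state': 'open',
     'labels': [{'id': 1, 'name': 'Bug'}, {'id': 2, 'name': 'IO CSV'}]},
    {'number': 2, 'title': 'Second sample issue', 'state': 'open',
     'labels': []},
    {'number': 1, 'title': 'Third sample issue', 'state': 'open',
     'labels': [{'id': 3, 'name': 'Docs'}]},
]

issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])

# Reduce each list of label dicts to a list of label names.
issues['label_names'] = issues['labels'].apply(
    lambda labels: [label['name'] for label in labels]
)

# Issues that carry no label at all.
unlabeled = issues[issues['label_names'].apply(len) == 0]

print(issues[['number', 'label_names']])
print(unlabeled['number'].tolist())
```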

Interacting with Databases

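In a business setting, most data is stored not in text or Excel files but in SQL-based relational databases (such as PostgreSQL or MySQL). The choice of database usually depends on the performance, data integrity, and scalability needs of the application. pandas has functions for loading the results of a SQL query into a DataFrame. As an example, create a SQLite database with sqlite3, the driver built into Python.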

In [70]:

import sqlite3

In [71]:

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);
"""

In [72]:

con = sqlite3.connect('data/mydata.sqlite')

In [73]:

con.execute(query)

Out[73]:

<sqlite3.Cursor at 0x1ed530eef80>

In [74]:

con.commit()

Then insert a few rows of data.

In [75]:

data = [('Atlanta', 'Georgia', 1.25, 6),
   ('Tallahassee', 'Florida', 2.6, 3),
   ('Sacramento', 'California', 1.7, 5)]

In [76]:

stmt = 'INSERT INTO test VALUES(?, ?, ?, ?)'

In [77]:

con.executemany(stmt, data)

Out[77]:

<sqlite3.Cursor at 0x1ed530eee30>

In [78]:

con.commit()

Most Python SQL drivers return a list of tuples when selecting data from a table.

In [79]:

cursor = con.execute('SELECT * FROM test')

In [80]:

rows = cursor.fetchall()

In [81]:

rows

Out[81]:

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

The list of tuples can be passed to the DataFrame constructor, but the column names are needed as well; they are contained in the cursor's description attribute.

In [82]:

cursor.description

Out[82]:

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [83]:

pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Out[83]:

	a	b	c	d
0	Atlanta	Georgia	1.25	6
1	Tallahassee	Florida	2.60	3
2	Sacramento	California	1.70	5

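Because SQL syntax differs slightly between databases, there are Python SQL toolkits that abstract away those differences. SQLAlchemy is a widely used one. pandas has a read_sql function that reads data easily from a SQLAlchemy connection. The following reads the database created above with read_sql.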

In [84]:

import sqlalchemy as sqla

In [85]:

db = sqla.create_engine('sqlite:///data/mydata.sqlite')

In [86]:

pd.read_sql('SELECT * FROM test', db)

Out[86]:

	a	b	c	d
0	Atlanta	Georgia	1.25	6
1	Tallahassee	Florida	2.60	3
2	Sacramento	California	1.70	5

db only needs to be a database connection that read_sql understands, so the con object created earlier with sqlite3.connect('data/mydata.sqlite') could be passed instead.
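A self-contained sketch of that last point follows: it passes a plain sqlite3 connection to read_sql, using an in-memory database so that no file is created. The params argument is an addition made here to show how values can be bound to ? placeholders instead of being pasted into the SQL string.

```python
import sqlite3

import pandas as pd

# In-memory database; nothing is written to disk.
con = sqlite3.connect(':memory:')
con.execute(
    'CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)'
)

rows = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)', rows)
con.commit()

# Pass the sqlite3 connection directly; bind the threshold via params.
big = pd.read_sql(
    'SELECT a, d FROM test WHERE d > ? ORDER BY d',
    con,
    params=(4,),
)
con.close()

print(big)
```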