jupyter themes: changing the Jupyter Notebook theme

Server/Python | 2020. 6. 24. 15:57 | Posted by youGom

It's very simple.

> pip install jupyterthemes

> jt -l

> jt -t {theme name}

> jt -t chesterish

Available theme names:
   chesterish
   grade3
   gruvboxd
   gruvboxl
   monokai
   oceans16
   onedork
   solarizedd
   solarizedl

 

-----------------------------------------------

I did this because the default is too bright and I wanted to ease the eye strain a bit. If I had to recommend one, monokai seems to be the most comfortable on the eyes.

> jt -t monokai

-----------------------------------------------

If you use conda, activate your conda environment first and then run the installation steps above.

If even that is too much trouble (and you're a Windows user),

press the Windows key, search for "Anaconda Prompt", and run it; a command window opens in the (base) environment.

------------------------------------------------

It probably won't happen, but the jt command may fail to run, for example if its location was never added to the PATH environment variable.

The jt executable can be found in the Scripts folder inside the Anaconda folder under ProgramData.

It's usually installed on the C drive, but if you're on Linux or installed to another drive it's worth checking there as well. If you really can't find it,

searching for jt with find (or F3 in a file manager) is probably the fastest way.
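If Python itself runs fine, a quick way to locate jt from Python using only the standard library (the Scripts path below is just the typical Windows location, not a guarantee):

import os
import shutil
import sys

print(shutil.which("jt"))  # full path to jt if it is already on PATH, otherwise None
# Typical location of console scripts for the current Python/Anaconda install:
print(os.path.join(os.path.dirname(sys.executable), "Scripts"))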

-------------------------------------------------

It would be nice if it worked on the first try, but errors do come up.

If you're behind a proxy or need a certificate, something like this should work:

pip --cert="path/authfile" --proxy 0.0.0.0:8080 install jupyterthemes

-------------------------------------------------------

 

In my case, two errors came up, so I reinstalled the affected modules.

error : no module named pywin32_bootstrap

> pip install --ignore-installed pywin32 --user

error : no module named 'numpy.core._ufunc_config'

> pip install --upgrade numpy

---------------------------------------------------

If you want more detail, such as the full set of command-line options

or how to use it from Python code,

the link below covers it thoroughly (in English).

https://github.com/dunovank/jupyter-themes/blob/master/README.md
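As a small illustration of the Python-side usage mentioned above, here is a hedged sketch; it assumes the jupyterthemes package exposes get_themes and install_theme as described in its README (check your installed version):

# Minimal sketch: switch themes from Python instead of the jt CLI.
# Assumes jupyterthemes exposes get_themes() and install_theme() (per its README).
from jupyterthemes import get_themes, install_theme

print(get_themes())        # e.g. ['chesterish', 'grade3', ..., 'monokai', ...]
install_theme('monokai')   # roughly equivalent to: jt -t monokai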


10 Minutes from pandas to Koalas on Apache Spark

Server/Python | 2020. 5. 7. 10:49 | Posted by youGom

pandas is a great tool to analyze small datasets on a single machine. When the need for bigger datasets arises, users often choose PySpark. However, converting code from pandas to PySpark is not easy, as PySpark APIs are considerably different from pandas APIs. Koalas makes the learning curve significantly easier by providing pandas-like APIs on top of PySpark. With Koalas, users can take advantage of the benefits of PySpark with minimal effort, and thus get to value much faster.

A number of blog posts such as Koalas: Easy Transition from pandas to Apache Spark, How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas, and 10 minutes to Koalas in Koalas official docs have demonstrated the ease of conversion between pandas and Koalas. However, despite having the same APIs, there are subtleties when working in a distributed environment that may not be obvious to pandas users. In addition, only about 70% of pandas APIs are implemented in Koalas. While the open-source community is actively implementing the remaining pandas APIs in Koalas, users would need to use PySpark to work around them. Finally, Koalas also offers its own APIs such as to_spark(), DataFrame.map_in_pandas(), ks.sql(), etc. that can significantly improve user productivity.

Therefore, Koalas is not meant to completely replace the needs for learning PySpark. Instead, Koalas makes learning PySpark much easier by offering pandas-like functions. To be proficient in Koalas, users would need to understand the basics of Spark and some PySpark APIs. In fact, we find that users using Koalas and PySpark interchangeably tend to extract the most value from Koalas.

In particular, two types of users benefit the most from Koalas:

  • pandas users who want to scale out using PySpark and potentially migrate codebase to PySpark. Koalas is scalable and makes learning PySpark much easier
  • Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark

This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to use PySpark to work around pandas APIs that are not yet available in Koalas, and when to apply Koalas-specific APIs to improve productivity. The example notebook in this blog can be found here.

Distributed and Partitioned Koalas DataFrame

Even though you can apply the same APIs in Koalas as in pandas, under the hood a Koalas DataFrame is very different from a pandas DataFrame. A Koalas DataFrame is distributed, which means the data is partitioned and computed across different workers. On the other hand, all the data in a pandas DataFrame fits in a single machine. As you will see, this difference leads to different behaviors.

Migration from pandas to Koalas

This section will describe how Koalas supports easy migration from pandas to Koalas with various code examples.

Object Creation

The packages below are customarily imported in order to use Koalas. Technically those packages like numpy or pandas are not necessary, but allow users to utilize Koalas more flexibly.

import numpy as np
import pandas as pd
import databricks.koalas as ks

A Koalas Series can be created by passing a list of values, the same way as a pandas Series. A Koalas Series can also be created by passing a pandas Series.

# Create a pandas Series
pser = pd.Series([1, 3, 5, np.nan, 6, 8]) 
# Create a Koalas Series
kser = ks.Series([1, 3, 5, np.nan, 6, 8])
# Create a Koalas Series by passing a pandas Series
kser = ks.Series(pser)
kser = ks.from_pandas(pser)

Best Practice: As shown below, Koalas does not guarantee the order of indices unlike pandas. This is because almost all operations in Koalas run in a distributed manner. You can use Series.sort_index() if you want ordered indices.

>>> pser
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
>>> kser
3    NaN
2    5.0
1    3.0
5    8.0
0    1.0
4    6.0
Name: 0, dtype: float64
# Apply sort_index() to a Koalas series
>>> kser.sort_index() 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

A Koalas DataFrame can also be created by passing a NumPy array, the same way as a pandas DataFrame. Unlike a PySpark DataFrame, a Koalas DataFrame has an index. Therefore, the index of a pandas DataFrame is preserved when a Koalas DataFrame is created from it.

# Create a pandas DataFrame
pdf = pd.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame
kdf = ks.DataFrame({'A': np.random.rand(5),
                    'B': np.random.rand(5)})
# Create a Koalas DataFrame by passing a pandas DataFrame
kdf = ks.DataFrame(pdf)
kdf = ks.from_pandas(pdf)

Likewise, the order of indices can be sorted by DataFrame.sort_index().

>>> pdf
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712
>>> kdf.sort_index()
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Viewing Data

As with a pandas DataFrame, the top rows of a Koalas DataFrame can be displayed using DataFrame.head(). Generally, confusion can occur when converting from pandas to PySpark due to the different behavior of head() between pandas and PySpark, but Koalas supports this in the same way as pandas by using limit() of PySpark under the hood.

>>> kdf.head(2)
          A         B
0  0.015869  0.584455
1  0.224340  0.632132

A quick statistical summary of a Koalas DataFrame can be displayed using DataFrame.describe().

>>> kdf.describe()
              A         B
count  5.000000  5.000000
mean   0.344998  0.660481
std    0.360486  0.195485
min    0.015869  0.388611
25%    0.037077  0.584455
50%    0.224340  0.632132
75%    0.637126  0.820495
max    0.810577  0.876712

Sorting a Koalas DataFrame can be done using DataFrame.sort_values().

>>> kdf.sort_values(by='B')
          A         B
3  0.810577  0.388611
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
4  0.037077  0.876712

Transposing a Koalas DataFrame can be done using DataFrame.transpose().

>>> kdf.transpose()
          0         1         2         3         4
A  0.015869  0.224340  0.637126  0.810577  0.037077
B  0.584455  0.632132  0.820495  0.388611  0.876712

Best Practice: DataFrame.transpose() will fail when the number of rows is more than the value of compute.max_rows, which is set to 1000 by default. This is to prevent users from unknowingly executing expensive operations. In Koalas, you can easily reset the default compute.max_rows. See the official docs for DataFrame.transpose() for more details.

>>> from databricks.koalas.config import set_option, get_option
>>> ks.get_option('compute.max_rows')
1000
>>> ks.set_option('compute.max_rows', 2000)
>>> ks.get_option('compute.max_rows')
2000

Selecting or Accessing Data

As with a pandas DataFrame, selecting a single column from a Koalas DataFrame returns a Series.

>>> kdf['A']  # or kdf.A
0    0.015869
1    0.224340
2    0.637126
3    0.810577
4    0.037077
Name: A, dtype: float64

Selecting multiple columns from a Koalas DataFrame returns a Koalas DataFrame.

>>> kdf[['A', 'B']]
          A         B
0  0.015869  0.584455
1  0.224340  0.632132
2  0.637126  0.820495
3  0.810577  0.388611
4  0.037077  0.876712

Slicing is available for selecting rows from a Koalas DataFrame.

>>> kdf.loc[1:2]
          A         B
1  0.224340  0.632132
2  0.637126  0.820495

Slicing rows and columns is also available.

>>> kdf.iloc[:3, 1:2]
          B
0  0.584455
1  0.632132
2  0.820495

Best Practice: By default, Koalas disallows adding columns coming from different DataFrames or Series to a Koalas DataFrame as adding columns requires join operations which are generally expensive. This operation can be enabled by setting compute.ops_on_diff_frames to True. See Available options in the docs for more detail.

>>> kser = ks.Series([100, 200, 300, 400, 500], index=[0, 1, 2, 3, 4])
>>> kdf['C'] = kser


...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
# Those are needed for managing options
>>> from databricks.koalas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> kdf['C'] = kser
# Reset to default to avoid potential expensive operation in the future
>>> reset_option("compute.ops_on_diff_frames")
>>> kdf
          A         B    C
0  0.015869  0.584455  100
1  0.224340  0.632132  200
3  0.810577  0.388611  400
2  0.637126  0.820495  300
4  0.037077  0.876712  500

Applying a Python Function to Koalas DataFrame

DataFrame.apply() is a very powerful function favored by many pandas users. Koalas DataFrames also support this function.

>>> kdf.apply(np.cumsum)
          A         B     C
0  0.015869  0.584455   100
1  0.240210  1.216587   300
3  1.050786  1.605198   700
2  1.687913  2.425693  1000
4  1.724990  3.302404  1500

DataFrame.apply() also works for axis = 1 or ‘columns’ (0 or ‘index’ is the default).

>>> kdf.apply(np.cumsum, axis=1)
          A         B           C
0  0.015869  0.600324  100.600324
1  0.224340  0.856472  200.856472
3  0.810577  1.199187  401.199187
2  0.637126  1.457621  301.457621
4  0.037077  0.913788  500.913788

Also, a Python native function can be applied to a Koalas DataFrame.

>>> kdf.apply(lambda x: x ** 2)
          A         B       C
0  0.000252  0.341588   10000
1  0.050329  0.399591   40000
3  0.657035  0.151018  160000
2  0.405930  0.673212   90000
4  0.001375  0.768623  250000

Best Practice: While it works fine as it is, it is recommended to specify the return type hint for Spark’s return type internally when applying user defined functions to a Koalas DataFrame. If the return type hint is not specified, Koalas runs the function once for a small sample to infer the Spark return type which can be fairly expensive.

>>> def square(x) -> ks.Series[np.float64]:
...     return x ** 2
>>> kdf.apply(square)
          A         B         C
0  0.405930  0.673212   90000.0
1  0.001375  0.768623  250000.0
2  0.000252  0.341588   10000.0
3  0.657035  0.151018  160000.0
4  0.050329  0.399591   40000.0

Note that DataFrame.apply() in Koalas does not support global aggregations by its design. However, if the size of the data is lower than compute.shortcut_limit, it might work because it uses pandas as a shortcut execution.

# Working properly since size of data <= compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1000)}).apply(lambda col: col.max())
A    999
Name: 0, dtype: int64
# Not working properly since size of data > compute.shortcut_limit (1000)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A     165
A     580
A     331
A     497
A     829
A     414
A     746
A     663
A     912
A    1000
A     248
A      82
Name: 0, dtype: int64

Best Practice: In Koalas, compute.shortcut_limit (default = 1000) computes a specified number of rows in pandas as a shortcut when operating on a small dataset. Koalas uses the pandas API directly in some cases when the size of input data is below this threshold. Therefore, setting this limit too high could slow down the execution or even lead to out-of-memory errors. The following code example sets a higher compute.shortcut_limit, which then allows the previous code to work properly. See the Available options for more details.

>>> ks.set_option('compute.shortcut_limit', 1001)
>>> ks.DataFrame({'A': range(1001)}).apply(lambda col: col.max())
A    1000
Name: 0, dtype: int64

Grouping Data

Grouping data by columns is one of the common APIs in pandas. DataFrame.groupby() is available in Koalas as well.

>>> kdf.groupby('A').sum()
                 B    C
A                      
0.224340  0.632132  200
0.637126  0.820495  300
0.015869  0.584455  100
0.810577  0.388611  400
0.037077  0.876712  500

See also grouping data by multiple columns below.

>>> kdf.groupby(['A', 'B']).sum()
                     C
A        B            
0.224340 0.632132  200
0.015869 0.584455  100
0.037077 0.876712  500
0.810577 0.388611  400
0.637126 0.820495  300

Plotting and Visualizing Data

In pandas, DataFrame.plot is a good solution for visualizing data. It can be used in the same way in Koalas.

Note that Koalas leverages approximation for faster rendering. Therefore, the results could be slightly different when the number of rows is larger than plotting.max_rows.

See the example below that plots a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

>>> speed = [0.1, 17.5, 40, 48, 52, 69, 88]
>>> lifespan = [2, 8, 70, 1.5, 25, 12, 28]
>>> index = ['snail', 'pig', 'elephant',
...          'rabbit', 'giraffe', 'coyote', 'horse']
>>> kdf = ks.DataFrame({'speed': speed,
...                     'lifespan': lifespan}, index=index)
>>> kdf.plot.bar()

Example visualization plotting a Koalas DataFrame as a bar chart with DataFrame.plot.bar().

Also, a horizontal bar plot is supported with DataFrame.plot.barh().

>>> kdf.plot.barh()

Example visualization plotting a Koalas DataFrame as a horizontal bar chart

Make a pie plot using DataFrame.plot.pie().

>>> kdf = ks.DataFrame({'mass': [0.330, 4.87, 5.97],
...                     'radius': [2439.7, 6051.8, 6378.1]},
...                    index=['Mercury', 'Venus', 'Earth'])
>>> kdf.plot.pie(y='mass')

Example pie chart visualization using a Koalas DataFrame

Best Practice: For bar and pie plots, only the top n rows are displayed to render more efficiently; the limit can be set with the plotting.max_rows option.
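For instance, a hedged one-liner in the same ks.set_option style used earlier (assuming plotting.max_rows can be set like the other options shown above):

>>> ks.set_option('plotting.max_rows', 2000)  # plot the top 2000 rows instead of the default 1000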

Make a stacked area plot using DataFrame.plot.area().

>>> kdf = ks.DataFrame({
...     'sales': [3, 2, 3, 9, 10, 6, 3],
...     'signups': [5, 5, 6, 12, 14, 13, 9],
...     'visits': [20, 42, 28, 62, 81, 50, 90],
... }, index=pd.date_range(start='2019/08/15', end='2020/03/09',
...                        freq='M'))
>>> kdf.plot.area()

Example stacked area plot visualization using a Koalas DataFrame

Make line charts using DataFrame.plot.line().

>>> kdf = ks.DataFrame({'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]},
...                    index=[1990, 1997, 2003, 2009, 2014])
>>> kdf.plot.line()

Example line chart visualization using a Koalas DataFrame

Best Practice: For area and line plots, the proportion of data that will be plotted can be set by plotting.sample_ratio. By default it follows plotting.max_rows (1000 rows). See Available options for details.
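A hedged example in the same pattern, assuming the option accepts a fraction between 0 and 1:

>>> ks.set_option('plotting.sample_ratio', 0.5)  # plot roughly half of the rows in area/line charts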

Make a histogram using DataFrame.plot.hist()

>>> kdf = pd.DataFrame(
...     np.random.randint(1, 7, 6000),
...     columns=['one'])
>>> kdf['two'] = kdf['one'] + np.random.randint(1, 7, 6000)
>>> kdf = ks.from_pandas(kdf)
>>> kdf.plot.hist(bins=12, alpha=0.5)

Example histogram visualization using a Koalas DataFrame

Make a scatter plot using DataFrame.plot.scatter()

>>> kdf = ks.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
...                     [6.4, 3.2, 1], [5.9, 3.0, 2]],
...                    columns=['length', 'width', 'species'])
>>> kdf.plot.scatter(x='length', y='width', c='species', colormap='viridis')

Example scatter plot visualization using a Koalas DataFrame

Missing Functionalities and Workarounds in Koalas

When working with Koalas, there are a few things to look out for. First, not all pandas APIs are currently available in Koalas; about 70% of pandas APIs are covered. In addition, there are subtle behavioral differences between Koalas and pandas, even when the same APIs are applied. Due to these differences, it would not make sense to implement certain pandas APIs in Koalas. This section discusses common workarounds.

Using pandas APIs via Conversion

When dealing with missing pandas APIs in Koalas, a common workaround is to convert Koalas DataFrames to pandas or PySpark DataFrames, and then apply either pandas or PySpark APIs. Converting between Koalas DataFrames and pandas/PySpark DataFrames is pretty straightforward: DataFrame.to_pandas() and koalas.from_pandas() for conversion to/from pandas; DataFrame.to_spark() and DataFrame.to_koalas() for conversion to/from PySpark. However, if the Koalas DataFrame is too large to fit in one single machine, converting to pandas can cause an out-of-memory error.

The following code snippet shows a simple usage of DataFrame.to_pandas().

>>> kidx = kdf.index
>>> kidx.to_list()

...
PandasNotImplementedError: The method `pd.Index.to_list()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

Best Practice: Index.to_list() raises PandasNotImplementedError. Koalas does not support this because it requires collecting all data into the client (driver node) side. A simple workaround is to convert to pandas using to_pandas().

>>> kidx.to_pandas().to_list()
[0, 1, 2, 3, 4]

Native Support for pandas Objects

Koalas also provides native support for pandas objects; Koalas can directly leverage pandas objects as shown below.

>>> kdf = ks.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'F': 'foo'})
>>> kdf
     A          B    C  D    F
0  1.0 2013-01-02  1.0  3  foo
1  1.0 2013-01-02  1.0  3  foo
2  1.0 2013-01-02  1.0  3  foo
3  1.0 2013-01-02  1.0  3  foo

ks.Timestamp() is not implemented yet, and ks.Series() cannot be used in the creation of a Koalas DataFrame. In these cases, the pandas-native objects pd.Timestamp() and pd.Series() can be used instead.

Distributing a pandas Function in Koalas

In addition, Koalas offers Koalas-specific APIs such as DataFrame.map_in_pandas(), which natively support distributing a given pandas function in Koalas.

>>> i = pd.date_range('2018-04-09', periods=2000, freq='1D1min')
>>> ts = ks.DataFrame({'A': ['timestamp']}, index=i)
>>> ts.between_time('0:15', '0:16')


...
PandasNotImplementedError: The method `pd.DataFrame.between_time()` is not implemented yet.

DataFrame.between_time() is not yet implemented in Koalas. As shown below, a simple workaround is to convert to a pandas DataFrame using to_pandas() and then apply the function.

>>> ts.to_pandas().between_time('0:15', '0:16')
                             A
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp

However, DataFrame.map_in_pandas() is a better workaround because it does not require moving the data to a single client node, which could cause an out-of-memory error.

>>> ts.map_in_pandas(func=lambda pdf: pdf.between_time('0:15', '0:16'))
                             A
2022-04-04 00:15:00  timestamp
2022-04-05 00:16:00  timestamp
2018-04-24 00:15:00  timestamp
2018-04-25 00:16:00  timestamp

Best Practice: In this way, DataFrame.between_time(), which is a pandas function, can be performed on a distributed Koalas DataFrame because DataFrame.map_in_pandas() executes the given function across multiple nodes. See DataFrame.map_in_pandas().

Using SQL in Koalas

Koalas supports standard SQL syntax with ks.sql(), which executes a Spark SQL query and returns the result as a Koalas DataFrame.

>>> kdf = ks.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'pig': [20, 18, 489, 675, 1776],
...                     'horse': [4, 25, 281, 600, 1900]})
>>> ks.sql("SELECT * FROM {kdf} WHERE pig > 100")
   year   pig  horse
0  2003   489    281
1  2009   675    600
2  2014  1776   1900

Also, mixing Koalas DataFrame and pandas DataFrame is supported in a join operation.

>>> pdf = pd.DataFrame({'year': [1990, 1997, 2003, 2009, 2014],
...                     'sheep': [22, 50, 121, 445, 791],
...                     'chicken': [250, 326, 589, 1241, 2118]})
>>> ks.sql('''
...     SELECT ks.pig, pd.chicken
...     FROM {kdf} ks INNER JOIN {pdf} pd
...     ON ks.year = pd.year
...     ORDER BY ks.pig, pd.chicken''')
    pig  chicken
0    18      326
1    20      250
2   489      589
3   675     1241
4  1776     2118

Working with PySpark

You can also apply several PySpark APIs to Koalas DataFrames. A PySpark background can make you more productive when working with Koalas: if you know PySpark, you can use PySpark APIs as workarounds when the pandas-equivalent APIs are not available in Koalas, and if you feel comfortable with PySpark, you can also use its rich features such as the Spark UI, history server, etc.

Conversion from and to PySpark DataFrame

A Koalas DataFrame can be easily converted to a PySpark DataFrame using DataFrame.to_spark(), similar to DataFrame.to_pandas(). On the other hand, a PySpark DataFrame can be easily converted to a Koalas DataFrame using DataFrame.to_koalas(), which extends the Spark DataFrame class.

>>> kdf = ks.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
>>> sdf = kdf.to_spark()
>>> type(sdf)
pyspark.sql.dataframe.DataFrame
>>> sdf.show()
+---+---+
|  A|  B|
+---+---+
|  1| 10|
|  2| 20|
|  3| 30|
|  4| 40|
|  5| 50|
+---+---+

Note that converting from PySpark to Koalas can cause an out-of-memory error when the default index type is sequence. Default index type can be set by compute.default_index_type (default = sequence). If the default index must be the sequence in a large dataset, distributed-sequence should be used.

>>> from databricks.koalas import option_context
>>> with option_context(
...         "compute.default_index_type", "distributed-sequence"):
...     kdf = sdf.to_koalas()
>>> type(kdf)
databricks.koalas.frame.DataFrame
>>> kdf
   A   B
3  4  40
1  2  20
2  3  30
4  5  50
0  1  10

Best Practice: Converting from a PySpark DataFrame to Koalas DataFrame can have some overhead because it requires creating a new default index internally – PySpark DataFrames do not have indices. You can avoid this overhead by specifying the column that can be used as an index column. See the Default Index type for more detail.

>>> sdf.to_koalas(index_col='A')
    B
A    
1  10
2  20
3  30
4  40
5  50

Checking Spark’s Execution Plans

DataFrame.explain() is a useful PySpark API and is also available in Koalas. It shows the Spark execution plans before the actual execution, which helps you understand and predict the actual execution and avoid critical performance degradation.

from databricks.koalas import option_context

with option_context(
        "compute.ops_on_diff_frames", True,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10) + ks.range(10)
    df.explain()

The command above simply adds two DataFrames with the same values. The result is shown below.

== Physical Plan ==
*(5) Project [...]
+- SortMergeJoin [...], FullOuter
   :- *(2) Sort [...], false, 0
   :  +- Exchange hashpartitioning(...), [id=#]
   :     +- *(1) Project [...]
   :        +- *(1) Range (0, 10, step=1, splits=12)
   +- *(4) Sort [...], false, 0
      +- ReusedExchange [...], Exchange hashpartitioning(...), [id=#]

As shown in the physical plan, the execution will be fairly expensive because it will perform the sort merge join to combine DataFrames. To improve the execution performance, you can reuse the same DataFrame to avoid the merge. See Physical Plans in Spark SQL to learn more.

with option_context(
        "compute.ops_on_diff_frames", False,
        "compute.default_index_type", 'distributed'):
    df = ks.range(10)
    df = df + df
    df.explain()

Now it uses the same DataFrame for the operations and avoids combining different DataFrames and triggering a sort merge join, which is enabled by compute.ops_on_diff_frames.

== Physical Plan ==
*(1) Project [...]
+- *(1) Project [...]
   +- *(1) Range (0, 10, step=1, splits=12)

This operation is much cheaper than the previous one while producing the same output. Examining DataFrame.explain() like this can help you improve your code's efficiency.

Caching DataFrame

DataFrame.cache() is a useful PySpark API and is available in Koalas as well. It is used to cache the output from a Koalas operation so that it would not need to be computed again in the subsequent execution. This would significantly improve the execution speed when the output needs to be accessed repeatedly.

with option_context("compute.default_index_type", 'distributed'):
    df = ks.range(10)
    new_df = (df + df).cache()  # `(df + df)` is cached here as `df`
    new_df.explain()

As the physical plan shows below, new_df will be cached once it is executed.

== Physical Plan ==
*(1) InMemoryTableScan [...]
   +- InMemoryRelation [...], StorageLevel(...)
      +- *(1) Project [...]
         +- *(1) Project [...]
            +- *(1) Project [...]
               +- *(1) Range (0, 10, step=1, splits=12)

InMemoryTableScan and InMemoryRelation mean the new_df will be cached – it does not need to perform the same (df + df) operation when it is executed the next time.

A cached DataFrame can be uncached by DataFrame.unpersist().

new_df.unpersist()

Best Practice: A cached DataFrame can be used in a context manager to limit the caching to a well-defined scope: the DataFrame is cached on entering the with block and uncached again when the block exits.

with (df + df).cache() as df:
    df.explain()

Conclusion

The examples in this blog demonstrate how easily you can migrate your pandas codebase to Koalas when working with large datasets. Koalas is built on top of PySpark and provides the same API interface as pandas. While there are subtle differences between pandas and Koalas, Koalas provides additional Koalas-specific functions to make working in a distributed setting easier. Finally, this blog shows common workarounds and best practices when working with Koalas. For pandas users who need to scale out, Koalas fits their needs nicely.

Get Started with Koalas on Apache Spark

You can get started with trying examples in this blog in this notebook, visit the Koalas documentation and peruse examples, and contribute at Koalas GitHub. Also, join the koalas-dev mailing list for discussions and new release announcements.

References

Source:

https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html

 


couchdb basic tutorial

Server/Python | 2020. 4. 26. 16:42 | Posted by youGom

This is an unofficial manual for the couchdb Python module I wish I had had.

Installation

pip install couchdb

Importing the module

import couchdb

Connection

If you only need read access, use an anonymous connection:

couchserver = couchdb.Server("http://couchdb:5984/")

To write to the database, create an authenticated connection:

user = "admin"
password = "my secret password"
couchserver = couchdb.Server("http://%s:%s@couchdb:5984/" % (user, password))

Listing databases

Simply iterate over the server object like this:

for dbname in couchserver:
    print(dbname)

Selecting/creating a database to work with

dbname = "mydb"
if dbname in couchserver:
    db = couchserver[dbname]
else:
    db = couchserver.create(dbname)

Deleting a database

del couchserver[dbname]

Writing a document to a database

Storing a document with an auto-generated ID:

doc_id, doc_rev = db.save({'key': 'value'})

doc_id is the generated document ID, doc_rev is the revision identifier.

Setting a specific ID:

db["my_document_id"] = {'key': 'value'}

Writing multiple documents in one call is done via the update() method of the database object. This can either create new documents (when no _id field is present per document) or update existing ones.

docs = [{'key': 'value1'}, {'key': 'value2'}]
for (success, doc_id, revision_or_exception) in db.update(docs):
    print(success, doc_id, revision_or_exception)

Retrieving documents by ID

doc_id = "my_document_id"
doc = db[doc_id]  # or db.get(doc_id)
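One difference worth knowing, as a hedged note based on the usual behavior of the couchdb module and using the db object from above: indexing raises an exception for a missing ID, while get() returns None.

doc = db.get("missing_id")              # returns None if the document does not exist
if doc is None:
    print("document not found")

try:
    doc = db["missing_id"]              # raises an exception for a missing ID instead
except couchdb.http.ResourceNotFound:   # assumes the couchdb module's usual error type
    print("document not found")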

Querying documents from views

If your database has a design document and view under the path /db_name/_design/design_doc/_view/view_name, you can iterate this view using this syntax:

for item in db.view('design_doc/view_name'):
    print(item.key, item.id, item.value)

Limiting the output to a certain number of items:

for item in db.view('design_doc/view_name', limit=100):
    print(item.key, item.id, item.value)

Skipping the first n items:

for item in db.view('design_doc/view_name', skip=100):
    print(item.key, item.id, item.value)

Reverse sorting:

for item in db.view('design_doc/view_name', descending=True):
    print(item.key, item.id, item.value)

Including source documents in result entries:

for item in db.view('design_doc/view_name', include_docs=True):
    print(item.key, item.id, item.value)

Allow outdated data to be returned, prevent updating the view before returning results:

for item in db.view('design_doc/view_name', stale="ok"):
    print(item.key, item.id, item.value)

Update the view after returning the results:

for item in db.view('design_doc/view_name', stale="update_after"):
    print(item.key, item.id, item.value)

Grouping results

Grouping the results by key, using the Reduce function, must be activated explicitly:

for item in db.view('design_doc/view_name', group=True):
    print(item.key, item.value)

If the Map function emits a structured key (an array with multiple elements), the grouping level can be determined:

for item in db.view('design_doc/view_name', group=True, group_level=1):
    print(item.key, item.value)

Filtering

Return only entries from the view matching a certain key:

for item in db.view('design_doc/view_name', key="my_key"):
    print(item.key, item.id, item.value)

Return entries with keys in a certain range:

for item in db.view('design_doc/view_name', startkey="startkey", endkey="endkey"):
    print(item.key, item.id, item.value)

The key, startkey and endkey parameters also accept arrays, e.g.

for item in db.view('design_doc/view_name', startkey=["foo", "a"], endkey=["foo", "z"]):
    print(item.key, item.id, item.value)

 

 

Source: https://gist.github.com/marians/8e41fc817f04de7c4a70


how to get a tistory api access token

Server/Python | 2020. 4. 25. 14:49 | Posted by youGom
 
To use the Tistory API, you first need to be issued an access_token. Getting the access_token is a rather tedious process, but if you follow the steps below one by one, you should be able to get a token without much trouble.

First, go to the client registration page. Think of it simply as the place where you enter the information for the blog you want to access through the Tistory API.

1. Register a client

Click the client registration button.

Agree to the Open API terms of service,

then enter the service name, description, service URL, and callback path.

If you don't have a separate callback path, just use the same value as the service URL.

Save, and client registration is done.

2. Get a code

Before getting the access_token, you first need to be issued a code.

Click 'Manage clients' next to the client registration button.

On that screen you can see your Client ID and service URL; put them into the URL below.

https://www.tistory.com/oauth/authorize?client_id=Client ID&redirect_uri=http://Service URL&response_type=code&state=someValue

Fill in the Client ID and service URL parts of the URL above.

Now enter that URL in the Internet Explorer address bar (use Internet Explorer, not Chrome!).

An authorization page appears; click the 'Allow' (허가하기) button.

After clicking 'Allow', note that the address bar now contains a string after ?code=.

That string is what you need to get the access token, so copy it carefully (everything up to, but not including, &state!).

3. Get the Access Token (final step!)

Now for the long-awaited access token.

In the URL below, fill in the Client ID, Secret Key, and service URL. This information is shown in 'Manage clients' as seen earlier.

Finally, in the code part, paste the string after code= that you received in step 2 (up to, but not including, &state).

https://www.tistory.com/oauth/access_token?client_id=Client ID&client_secret=Secret Key&redirect_uri=http://Service URL&code=Code from step 2&grant_type=authorization_code

Then copy and paste that URL into the Internet Explorer address bar.

Importantly, press F12 to open the developer tools and click the 'Debugger' tab; there you will see the access_token that was issued.

This access_token is essential for using the Tistory API and is also security-sensitive, so keep it to yourself.

If step 3 fails, redo step 2. A code can only be used once.
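As a hedged illustration of step 3 in Python (instead of pasting the URL into a browser), using the requests library; the parameter names mirror the URL above, and all values are placeholders you must replace with your own:

import requests

# Placeholders: use the values from 'Manage clients' and the code from step 2.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_SECRET_KEY",
    "redirect_uri": "http://YOUR_SERVICE_URL",
    "code": "CODE_FROM_STEP_2",          # the string after ?code=, without &state
    "grant_type": "authorization_code",
}
resp = requests.get("https://www.tistory.com/oauth/access_token", params=params)
print(resp.text)  # the response body contains the access_token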
 
 


gsm bts using raspberry pi 3

Server/Python | 2019. 3. 4. 14:18 | Posted by youGom


ref : 

https://www.evilsocket.net/2016/03/31/how-to-build-your-own-rogue-gsm-bts-for-fun-and-profit/


HOW TO BUILD YOUR OWN ROGUE GSM BTS FOR FUN AND PROFIT


The last week I’ve been visiting my friend and colleque Ziggy in Tel Aviv which gave me something I’ve been waiting for almost a year, a brand new BladeRF x40, a low-cost USB 3.0 Software Defined Radio working in full-duplex, meaning that it can transmit and receive at the same time ( while for instance the HackRF is only half-duplex ).

In this blog post I’m going to explain how to create a portable GSM BTS which can be used either to create a private ( and vendor free! ) GSM network or for GSM active tapping/interception/hijacking … yes, with some (relatively) cheap electronic equipment you can basically build something very similar to what the governments are using from years to perform GSM interception.

I’m not writing this post to help script kiddies breaking the law, my point is that GSM is broken by design and it’s about time vendors do something about it considering how much we’re paying for their services.

my bts

Hardware Requirements

In order to build your BTS you’ll need the following hardware:

Software

Let’s start by installing the latest Raspbian image to the micrsd card ( use the “lite” one, no need for UI ;) ), boot the RPI, configure either the WiFi or ethernet and so forth, at the end of this process you should be able to SSH into the RPI.

Next, install a few dependencies we're gonna need soon:

sudo apt-get install git apache2 php5 bladerf libbladerf-dev libbladerf0 automake

At this point, you should already be able to interact with the BladeRF, plug it into one of the USB ports of the RPI, dmesg should be telling you something like:

[ 2332.071675] usb 1-1.3: New USB device found, idVendor=1d50, idProduct=6066
[ 2332.071694] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 2332.071707] usb 1-1.3: Product: bladeRF
[ 2332.071720] usb 1-1.3: Manufacturer: Nuand
[ 2332.071732] usb 1-1.3: SerialNumber: b4ef330e19b718f752759b4c14020742

Start the bladeRF-cli utility and issue the version command:

pi@raspberrypi:~ $ sudo bladeRF-cli -i
bladeRF> version

  bladeRF-cli version:        0.11.1-git
  libbladeRF version:         0.16.2-git

  Firmware version:           1.6.1-git-053fb13-buildomatic
  FPGA version:               0.1.2

bladeRF>

IMPORTANT Make sure you have these exact versions of the firmware and the FPGA, other versions might not work in our setup.

Download the correct firmware and FPGA image.

Now we’re going to install Yate and YateBTS, two open source softwares that will make us able to create the BTS itself.

Since I spent a lot of time trying to figure out which specific version of each was compatible with the bladeRF, I’ve created a github repository with correct versions of both, so in your RPI home folder just do:

git clone https://github.com/evilsocket/evilbts.git
cd evilbts

Let’s start building both of them:

cd yate
./autogen.sh
./configure --prefix=/usr/local
make -j4
sudo make install
sudo ldconfig
cd ..

cd yatebts
./autogen.sh
./configure --prefix=/usr/local
make -j4
sudo make install
sudo ldconfig

This will take a few minutes, but eventually you’ll have everything installed in your system.

Next, we’ll symlink the NIB web ui into our apache www folder:

cd /var/www/html/
sudo ln -s /usr/local/share/yate/nib_web nib

And grant write permission to the configuration files:

sudo chmod -R a+w /usr/local/etc/yate

You can now access your BTS web ui from your browser:

http://ip-of-your-rpi/nib

Time for some configuration now!

Configuration

Open the /usr/local/etc/yate/ybts.conf file either with nano or vi and update the following values:

Radio.Band=900
Radio.C0=1000
Identity.MCC=YOUR_COUNTRY_MCC
Identity.MNC=YOUR_OPERATOR_MNC
Identity.ShortName=MyEvilBTS
Radio.PowerManager.MaxAttenDB=35
Radio.PowerManager.MinAttenDB=35

You can find valid MCC and MNC values here.

Now, edit the /usr/local/etc/yate/subscribers.conf:

country_code=YOUR_CONTRY_CODE
regexp=.*

WARNING Using the .* regular expression will make EVERY GSM phone in your area connect to your BTS.

In your NIB web ui you’ll see something like this:

NIB

Enable GSM-Tapping

In the "Tapping" panel, you can enable it for both GSM and GPRS. This will basically "bounce" every GSM packet to the loopback interface; since we haven't configured any encryption, you'll be able to see all the GSM traffic simply by tcpdump-ing your loopback interface :D

tapping

Start It!

Finally, you can start your new BTS by executing the command ( with the BladeRF plugged in! ) :

sudo yate -s

If everything was configured correctly, you’ll see a bunch of messages and the line:

Starting MBTS...
Yate engine is initialized and starting up on raspberrypi
RTNETLINK answers: File exists
MBTS ready

At this point, the middle LED for your bladeRF should start blinking.

Test It!

Now, phones will start to connect automatically; this happens because of the GSM implementation itself:

  • You can set whatever MCC, MNC and LAC you like, effectively spoofing any legit GSM BTS.
  • Each phone will search for BTS of its operator and select the one with the strongest signal … guess which one will be the strongest? Yep … ours :D

Here’s a picture taken from my Samsung Galaxy S6 ( using the Network Cell Info Lite app ) which automatically connected to my BTS after 3 minutes:

MyEvilBTS

From now on, you can configure the BTS to do whatever you want … either act as a “proxy” to a legit SMC ( with a GSM/3g USB dongle ) and sniff the unencrypted GSM traffic of each phone, or to create a private GSM network where users can communicate for free using SIP, refer to the YateBTS Wiki for specific configurations.

Oh and of course, if you plug the USB battery, the whole system becomes completely portable :)

References and Further Readings



pwn basic in python

Server/Python | 2018. 12. 11. 13:37 | Posted by youGom

ref : http://docs.pwntools.com/en/stable/intro.html
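Since the post is just a pointer, here is a hedged minimal sketch of the pwntools basics the linked intro covers; the binary name ./vuln and the remote host/port are placeholders, not anything from the original post:

from pwn import *

# Start a local process; swap in remote('example.com', 1337) for a network target.
io = process('./vuln')      # './vuln' is a placeholder binary
io.sendline(b'hello')       # send one line of input
print(io.recvline())        # read one line of output
io.interactive()            # hand the session over to you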




python file to exe as one file

Server/Python | 2018. 11. 1. 19:18 | Posted by youGom

pip install pyinstaller

python your_script.py
pyinstaller --onefile <your_script_name>.py


pyinstaller has many options, but only a few are commonly used:

  • -F, --onefile: produce a single exe file as the final output.
  • -w, --windowed: do not show a console window.
  • -i, --icon: set the icon of the exe file.

For more detailed descriptions and the full list of options, see the image below.
pyinstaller options (image)


If you need to add an extra path:



pyinstaller --onefile --windowed --icon=heart.ico --clean -p C:\test main.py


I plan to use the one below:

pyinstaller --onefile --strip <your_script_name>.py
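For reference, a hedged sketch of driving the same build from Python instead of the shell; it assumes PyInstaller exposes the PyInstaller.__main__.run entry point (check your installed version):

import PyInstaller.__main__

# Roughly equivalent to: pyinstaller --onefile --strip your_script.py
PyInstaller.__main__.run([
    '--onefile',
    '--strip',
    'your_script.py',
])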



Source: https://medium.com/dreamcatcher-its-blog/making-an-stand-alone-executable-from-a-python-script-using-pyinstaller-d1df9170e263


Source: https://winterj.me/pyinstaller/

